Call variants with samtools 1.0
Call chr21 variants from newly mapped BAM data using samtools (htslib) and bcftools (htslib) OR varscan2
- 1 Introduction
- 2 Workflow
- 2.1 Map reads to the reference genome
- 2.2 Clean the mapping data from mate errors
- 2.3 Call variants with 'samtools mpileup' & 'bcftools'
- 2.4 Call with 'samtools mpileup' & 'Varscan2'
- 3 Comparing BWA & Bowtie2 results
- 4 Conclusion
This page content is largely inspired from the official htslib page and is dedicated to one single step in the variant analysis workflow and to the recently released new version of Samtols based on the htslib. Commands presented here will differ from former commands using samtools until version 0.19 which was based on other C libraries. When calling several genomes together (replicates, family members, or patient control cohorts), the commands can be adapted by providing several BAM files. The call from several genomes should be done simultaneously in order to include full call information across all input genomes (no-calls, ref-calls, half-calls).
Map reads to the reference genome
We chose to take only chr21 reads for this experiment in order to speed up the process while retaining full genome as a reference.
This part is not detailed here in order to come to the point. Both BWA and Bowtie2 were used with standard parameters to generate a SAM file for each mapper. The mapping was performed on a Centos computer with 8 threads and a maximum of 4GB RAM to provide material reproducible by the average user.
|BWA monitoring||Bowtie2 monitoring|
Time recording of each mapping returned values reproduced here as reference and showing no speed advantage for Bowtie with ~15min for bwa and 17 min for bowtie2. This is in contradiction to what has been reported by others using older versions of both BWA and Bowtie.
Clean the mapping data from mate errors
clean up read pairing information and flags and sort
Samtools mpileup needs the BAM file be position-sorted instead of unsorted or read-name-sorted as obtained from some processing tools like Picard. The following command will sort a BAM files to comply. Furthermore, data obtained from bowtie or BWA may have issues with mated reads. It is best to fix these potential issues now.
# sort reads by name required by the next command samtools sort -n -m 12G -@ 8 -O bam -T </tmp/sort1> -o <name-sorted.bam> <mapped-reads.sam> # fix mate information and output to bam samtools fixmate -O bam <name-sorted.bam> <fixmate.bam> # sort the bam data by position samtools sort -m 12G -@ 8 -O bam -T </tmp/sort2> -o <pos-sorted.bam> <fixmate.bam>
Evaluate the data with 'samtools flagstat' during the cleaning process
Nothing changed during the fixmate process which is a good thing.
Mark duplicates using Picard tools
Read duplicates (PCR duplicates) can arise from excessive PCR amplification of the sequencing library. Similarly, optical duplicates can result from re-scanning the same read cluster on the Illumina chip and create distinct read entries from the same physical spot. Both types of duplicates contribute to the amplification of sequencing error and create false positive calls and should best be 'marked' (but not removed) before proceeding with the analysis. Duplicate marking can be done using several existing tools but most often with Picard (here Picard version 119).
# requirements: input is coordinate-sorted java -jar $PICARD/MarkDuplicates.jar \ I=<pos-sorted.bam> \ O=<pos-sorted_mdup.bam> \ M=<MarkDuplicate_metrics.txt> \ REMOVE_DUPLICATES=false \ AS=true
|BWA duplicates||Bowtie2 duplicates|
LIBRARY lib-NA18507 UNPAIRED_READS_EXAMINED 17268 READ_PAIRS_EXAMINED 7477196 UNMAPPED_READS 18534 UNPAIRED_READ_DUPLICATES 4726 READ_PAIR_DUPLICATES 5467 READ_PAIR_OPTICAL_DUPLICATES 1311 PERCENT_DUPLICATION 0.001046 ESTIMATED_LIBRARY_SIZE 6721383713
LIBRARY Unknown-Library UNPAIRED_READS_EXAMINED 83869 READ_PAIRS_EXAMINED 7387318 UNMAPPED_READS 131689 UNPAIRED_READ_DUPLICATES 24272 READ_PAIR_DUPLICATES 3371 READ_PAIR_OPTICAL_DUPLICATES 1110 PERCENT_DUPLICATION 0.002087 ESTIMATED_LIBRARY_SIZE 12062126176
As seen above, BWA results in less unmapped reads with 18'534 against 131'689 for bowtie2. Other metrics differ and are left to the reader to evaluate. More BAM QC can be performed using different tools and is not detailed further.
Call variants with 'samtools mpileup' & 'bcftools'
Perform local re-alignment of reads and output to BCF and VCF
## cryptic parameters below read as follows: ## samtools mpileup # -u, --uncompressed generate uncompressed VCF/BCF output # -g, --BCF generate genotype likelihoods in BCF format # -f, --fasta-ref FILE faidx indexed reference sequence file ## bcftools # -v, --variants-only output variant sites only # -m, --multiallelic-caller alternative model for multiallelic and rare-variant calling (conflicts with -c) # -O, --output-type <b|u|z|v> output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v] # -o, --output <file> write output to a file [standard output] samtools mpileup -ugf <ref.fa> <pos-sorted_mdup.bam> | bcftools call -vmO z -o <results.vcf.gz> # the same command split in two steps would be # -o, --output FILE write output to FILE [standard output] # samtools mpileup -go <results.bcf> -f <ref.fa> <pos-sorted_mdup.bam> # bcftools call -vmO z -o <results.vcf.gz> <results.bcf> # index the results for fast query tabix -p vcf <results.vcf.gz>
review BCF and VCF results
plot statistics from the bcftools calls
bcftools stats -F <ref.fa> -s - <results.vcf.gz> > <results.vcf.gz.stats> mkdir bcftools_plots plot-vcfstats -p bcftools_plots/ <results.vcf.gz.stats>
Add filter field to flag lower quality data
bcftools filter -O z -o <results-filtered.vcf.gz> -s LOWQUAL -i'%QUAL>10' <results.vcf.gz>
Call with 'samtools mpileup' & 'Varscan2'
Varscan2 can be used to call variants from a samtools pileup in heuristic/statistic manner rather than using a probabilistic methods like bcftools. Reports claim that varscan2 can be as efficient as bcftools for this last step. We therefore provide the varscan2 command with default settings to allow users make their own mind. Please read more about this software on the varscan official site.
Call Germline variants from a mpileup
# lines starting with '--' show here default values but can be altered to fit user needs # samples can be renamed using the '--vcf-sample-list' parameter ## process with pipes to save disk space samtools mpileup -f <ref.fa> <pos-sorted_mdup.bam> | \ java -Xmx8G -jar $BIOTOOLS/varscan2/VarScan.v2.3.7.jar mpileup2cns \ --min-coverage 8 \ --min-reads2 2 \ --min-avg-qual 15 \ --min-var-freq 0.01 \ --min-freq-for-hom 0.75 \ --p-value 99e-02 \ --strand-filter 1 \ --variants \ --output-vcf 1 | \ bgzip -c > varscan-results.vcf.gz # same commands in sequence # note that we do not repeat the default options here to keep it short #samtools mpileup -f <ref.fa> <pos-sorted_mdup.bam> > myData.pileup #cat myData.pileup | java -Xmx8G -jar $BIOTOOLS/varscan2/VarScan.v2.3.7.jar mpileup2cns \ # --variants \ # --output-vcf 1 | \ # bgzip -c > <varscan-results.vcf.gz> # index results tabix -p vcf <varscan-results.vcf.gz>
plot statistics from the varscan2 calls
bcftools stats -F <ref.fa> -s - <varscan-results.vcf.gz> > <varscan-results.vcf.gz.stats> mkdir varscan_plots plot-vcfstats -p varscan_plots/ <varscan-results.vcf.gz.stats>
Comparing BWA & Bowtie2 results
The main question motivating this analysis was to quantify the impact of the choice of the mapper on the final result being the list of predicted variants in the NA18507 genome. The data selected for this experiment is ideal in some aspects and biased in others as detailed below.
- this SRA experiment is of high coverage
- the reads are paired which facilitates their unambiguous mapping
- the read length is sufficient to favor unique mapping
- all steps were performed in one go using the same software tools and identical reference genome (hg19)
- bowtie2 is used here and is more compared to bwa than bowtie1 usually reported in similar comparison studies
- the reads used here represent the chr21 subset of all reads mapped using different tools in a former mapping (against hg18) and therefore lack unmapped reads or ambiguous reads that could slightly change the results
- additional cleanup using GATK would be good to improve the indel part of the calls and reduce FPR in the final results (but should only slightlty change the global results)
intersecting results with VCFTools
We fist extract the chr21 calls from each vcf file to compare apples with apples
We then intersect the 4 files using vcftools to get data for a Venn plot
comparing results to the validated data for NA18507
Intersecting calls should be complemented by the validation of the randomly selected variant calls. We present here only the intersection with HapMap3 data obtained from the Broad ftp site.
Comparing bowtie and bwa results to HapMap3 validated variants
Several conclusion can be drawn from these results:
- the recall between both Bowtie2 ands Bwa mapping data is quite high and this is comforting for users prefering one maper tothe other. Keep however in mind that this is an ideal dataset with high amount of reads and that comes from pre-mapped data and therefore does not contain many difficult reads.
- the variants validated by HapMap are found in very similar wether the data is obtained from bowtie or bwa.
- the choice of the caller does not seem to cause large differences in count or recall by HapMap, suggesting that users may choose between bcftools or varscan2
- Varscan2 calls less private variants and is 'seems' less noisy. It is likely that filtering out low quality mappings before calling could reverse this.
- one additional control should be comparing all these calls with those obtained from the GATK post-processed workflow. This goes far beyond the simple aim of this experiment and some elements of answer can be found in our separate hands-on results presented in Hands-on_introduction_to_NGS_variant_analysis