Call variants with samtools 1.0
Call chr21 variants from newly mapped BAM data using samtools (htslib) and bcftools (htslib) OR varscan2
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis ]
Contents
Introduction
This page content is largely inspired from the official htslib page[1] and is dedicated to one single step in the variant analysis workflow and to the recently released new version of Samtols based on the htslib. Commands presented here will differ from former commands using samtools until version 0.19 which was based on other C libraries. When calling several genomes together (replicates, family members, or patient control cohorts), the commands can be adapted by providing several BAM files. The call from several genomes should be done simultaneously in order to include full call information across all input genomes (no-calls, ref-calls, half-calls).
Samtools mpileup was designed to process DIPLOID genomes, if your genome is not diploid, please consider using better adapted tools (eg. Broad GATK)
file names shown between '<>' can be replaced by your own. If you copy the code, please remove the '<' and <>' delimiters
Workflow
Map reads to the reference genome
We chose to take only chr21 reads for this experiment in order to speed up the process while retaining full genome as a reference.
This part is not detailed here in order to come to the point. Both BWA and Bowtie2 were used with standard parameters to generate a SAM file for each mapper. The mapping was performed on a Centos computer with 8 threads and a maximum of 4GB RAM to provide material reproducible by the average user.
mapping commands
## Bowtie2 mapping command bowtie2 -p 8 -x /data/biodata/bowtie-index/HiSeq_UCSC_hg19 -q \ -1 PE_NA18507_GAIIx_100_1.fq.gz -2 PE_NA18507_GAIIx_100_2.fq.gz --phred33 --fr -S bowtie_mapped-reads.sam # Real time: 991 sec ## BWA mapping command bwa mem -R @RG\tID:NA18507\tLB:lib-NA18507\tPU:unkn-0.0\tPL:ILLUMINA\tSM:GAIIx-chr21-BWA.mem \ -M -t 8 ref/HiSeq_UCSC_hg19.fa PE_NA18507_GAIIx_100_1.fq.gz PE_NA18507_GAIIx_100_2.fq.gz # Real time: 922 sec
BWA monitoring | Bowtie2 monitoring |
---|---|
Time recording of each mapping returned values reproduced here as reference and showing no speed advantage for Bowtie with ~15min for bwa and 17 min for bowtie2. This is in contradiction to what has been reported by others using older versions of both BWA and Bowtie.
Clean the mapping data from mate errors
We use here NA18507 chr21 data introduced in the Hands-on_introduction_to_NGS_variant_analysis NGS_Exercise.5 and mapped using bowtie2
clean up read pairing information and flags and sort
Samtools mpileup needs the BAM file be position-sorted instead of unsorted or read-name-sorted as obtained from some processing tools like Picard. The following command will sort a BAM files to comply. Furthermore, data obtained from bowtie or BWA may have issues with mated reads. It is best to fix these potential issues now.
# sort reads by name required by the next command samtools sort -n -m 12G -@ 8 -O bam -T </tmp/sort1> -o <name-sorted.bam> <mapped-reads.sam> # fix mate information and output to bam samtools fixmate -O bam <name-sorted.bam> <fixmate.bam> # sort the bam data by position samtools sort -m 12G -@ 8 -O bam -T </tmp/sort2> -o <pos-sorted.bam> <fixmate.bam>
samtools sort command
Usage: samtools sort [options...] [in.bam]
Options:
-l INT Set compression level, from 0 to 9 [-1]
-m INT Set maximum memory per thread; suffix K/M/G recognized [768M]
-n Sort by read name
-o FILE Write final output to FILE rather than standard output
-O FORMAT Write output as FORMAT ('sam'/'bam'/'cram') (either -O or
-T PREFIX Write temporary files to PREFIX.nnnn.bam -T is required)
-@ INT Set number of sorting and compression threads [1]
Legacy usage: samtools sort [options...] <in.bam> <out.prefix>
Options:
-f Use <out.prefix> as full final filename rather than prefix
-o Write final output to stdout rather than <out.prefix>.bam
-l,m,n,@ Similar to corresponding options above
samtools fixmate command
Usage: samtools fixmate <in.nameSrt.bam> <out.nameSrt.bam>
Options:
-r Remove unmapped reads and secondary alignments
-p Disable FR proper pair check
-O FORMAT Write output as FORMAT ('sam'/'bam'/'cram')
As elsewhere in samtools, use '-' as the filename for stdin/stdout. The input
file must be grouped by read name (e.g. sorted by name). Coordinated sorted
input is not accepted.
Evaluate the data with 'samtools flagstat' during the cleaning process
samtools flagstat output before fixing mates
14990194 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
14858505 + 0 mapped (99.12%:-nan%)
14990194 + 0 paired in sequencing
7495097 + 0 read1
7495097 + 0 read2
14725068 + 0 properly paired (98.23%:-nan%)
14774636 + 0 with itself and mate mapped
83869 + 0 singletons (0.56%:-nan%)
38992 + 0 with mate mapped to a different chr
26405 + 0 with mate mapped to a different chr (mapQ>=5)
samtools flagstat output after fixing mates
14990194 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
14858505 + 0 mapped (99.12%:-nan%)
14990194 + 0 paired in sequencing
7495097 + 0 read1
7495097 + 0 read2
14725068 + 0 properly paired (98.23%:-nan%)
14774636 + 0 with itself and mate mapped
83869 + 0 singletons (0.56%:-nan%)
38992 + 0 with mate mapped to a different chr
26405 + 0 with mate mapped to a different chr (mapQ>=5)
Nothing changed during the fixmate process which is a good thing.
Mark duplicates using Picard tools
Read duplicates (PCR duplicates) can arise from excessive PCR amplification of the sequencing library. Similarly, optical duplicates can result from re-scanning the same read cluster on the Illumina chip and create distinct read entries from the same physical spot. Both types of duplicates contribute to the amplification of sequencing error and create false positive calls and should best be 'marked' (but not removed) before proceeding with the analysis. Duplicate marking can be done using several existing tools but most often with Picard (here Picard version 119).
# requirements: input is coordinate-sorted java -jar $PICARD/MarkDuplicates.jar \ I=<pos-sorted.bam> \ O=<pos-sorted_mdup.bam> \ M=<MarkDuplicate_metrics.txt> \ REMOVE_DUPLICATES=false \ AS=true
Picard markduplicates command
Documentation: http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates
Examines aligned records in the supplied SAM or BAM file to locate duplicate molecules.
All records are then written to the output file with the duplicate records flagged.
Version: 1.119(d44cdb51745f5e8075c826430a39d8a61f1dd832_1408991805)
Options:
--help
-h Displays options specific to this tool.
--stdhelp
-H Displays options specific to this tool AND options common to all Picard command line
tools.
--version Displays program version.
INPUT=File
I=File One or more input SAM or BAM files to analyze. Must be coordinate sorted.
May not be a stream because file is read twice. This option may be specified 0 or more times.
OUTPUT=File
O=File The output file to write marked records to Required.
METRICS_FILE=File
M=File File to write duplication metrics to Required.
PROGRAM_RECORD_ID=String
PG=String The program record ID for the @PG record(s) created by this program. Set to null to
disable PG record creation. This string may have a suffix appended to avoid collision
with other program record IDs. Default value: MarkDuplicates. This option can be set to
'null' to clear the default value.
PROGRAM_GROUP_VERSION=String
PG_VERSION=String Value of VN tag of PG record to be created. If not specified, the version will be
detected automatically. Default value: null.
PROGRAM_GROUP_COMMAND_LINE=String
PG_COMMAND=String Value of CL tag of PG record to be created. If not supplied the command line will be
detected automatically. Default value: null.
PROGRAM_GROUP_NAME=String
PG_NAME=String Value of PN tag of PG record to be created. Default value: MarkDuplicates. This option
can be set to 'null' to clear the default value.
COMMENT=String
CO=String Comment(s) to include in the output file's header. This option may be specified 0 or
more times.
REMOVE_DUPLICATES=Boolean If true do not write duplicates to the output file instead of writing them with
appropriate flags set. Default value: false. This option can be set to 'null' to clear
the default value. Possible values: {true, false}
ASSUME_SORTED=Boolean
AS=Boolean If true, assume that the input file is coordinate sorted even if the header says
otherwise. Default value: false. This option can be set to 'null' to clear the default
value. Possible values: {true, false}
MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=Integer
MAX_SEQS=Integer This option is obsolete. ReadEnds will always be spilled to disk. Default value: 50000.
This option can be set to 'null' to clear the default value.
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=Integer
MAX_FILE_HANDLES=Integer Maximum number of file handles to keep open when spilling read ends to disk. Set this
number a little lower than the per-process maximum number of file that may be open. This
number can be found by executing the 'ulimit -n' command on a Unix system. Default
value: 8000. This option can be set to 'null' to clear the default value.
SORTING_COLLECTION_SIZE_RATIO=Double
This number, plus the maximum RAM available to the JVM, determine the memory footprint
used by some of the sorting collections. If you are running out of memory, try reducing
this number. Default value: 0.25. This option can be set to 'null' to clear the default
value.
READ_NAME_REGEX=String Regular expression that can be used to parse read names in the incoming SAM file. Read
names are parsed to extract three variables: tile/region, x coordinate and y coordinate.
These values are used to estimate the rate of optical duplication in order to give a more
accurate estimated library size. Set this option to null to disable optical duplicate
detection. The regular expression should contain three capture groups for the three
variables, in order. It must match the entire read name. Note that if the default regex
is specified, a regex match is not actually done, but instead the read name is split on
colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be
tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements
are assumed to be tile, x and y values. Default value:
[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*. This option can be set to 'null' to
clear the default value.
OPTICAL_DUPLICATE_PIXEL_DISTANCE=Integer
The maximum offset between two duplicte clusters in order to consider them optical
duplicates. This should usually be set to some fairly small number (e.g. 5-10 pixels)
unless using later versions of the Illumina pipeline that multiply pixel values by 10, in
which case 50-100 is more normal. Default value: 100. This option can be set to 'null'
to clear the default value.
BWA duplicates | Bowtie2 duplicates |
---|---|
LIBRARY lib-NA18507 UNPAIRED_READS_EXAMINED 17268 READ_PAIRS_EXAMINED 7477196 UNMAPPED_READS 18534 UNPAIRED_READ_DUPLICATES 4726 READ_PAIR_DUPLICATES 5467 READ_PAIR_OPTICAL_DUPLICATES 1311 PERCENT_DUPLICATION 0.001046 ESTIMATED_LIBRARY_SIZE 6721383713 |
LIBRARY Unknown-Library UNPAIRED_READS_EXAMINED 83869 READ_PAIRS_EXAMINED 7387318 UNMAPPED_READS 131689 UNPAIRED_READ_DUPLICATES 24272 READ_PAIR_DUPLICATES 3371 READ_PAIR_OPTICAL_DUPLICATES 1110 PERCENT_DUPLICATION 0.002087 ESTIMATED_LIBRARY_SIZE 12062126176 |
As seen above, BWA results in less unmapped reads with 18'534 against 131'689 for bowtie2. Other metrics differ and are left to the reader to evaluate. More BAM QC can be performed using different tools and is not detailed further.
Call variants with 'samtools mpileup' & 'bcftools'
Perform local re-alignment of reads and output to BCF and VCF
## cryptic parameters below read as follows: ## samtools mpileup # -u, --uncompressed generate uncompressed VCF/BCF output # -g, --BCF generate genotype likelihoods in BCF format # -f, --fasta-ref FILE faidx indexed reference sequence file ## bcftools # -v, --variants-only output variant sites only # -m, --multiallelic-caller alternative model for multiallelic and rare-variant calling (conflicts with -c) # -O, --output-type <b|u|z|v> output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v] # -o, --output <file> write output to a file [standard output] samtools mpileup -ugf <ref.fa> <pos-sorted_mdup.bam> | bcftools call -vmO z -o <results.vcf.gz> # the same command split in two steps would be # -o, --output FILE write output to FILE [standard output] # samtools mpileup -go <results.bcf> -f <ref.fa> <pos-sorted_mdup.bam> # bcftools call -vmO z -o <results.vcf.gz> <results.bcf> # index the results for fast query tabix -p vcf <results.vcf.gz>
samtools mpileup command
Usage: samtools mpileup [options] in1.bam [in2.bam [...]]
Input options:
-6, --illumina1.3+ quality is in the Illumina-1.3+ encoding
-A, --count-orphans do not discard anomalous read pairs
-b, --bam-list FILE list of input BAM filenames, one per line
-B, --no-BAQ disable BAQ (per-Base Alignment Quality)
-C, --adjust-MQ INT adjust mapping quality; recommended:50, disable:0 [0]
-d, --max-depth INT max per-BAM depth; avoids excessive memory usage [250]
-E, --redo-BAQ recalculate BAQ on the fly, ignore existing BQs
-f, --fasta-ref FILE faidx indexed reference sequence file
-G, --exclude-RG FILE exclude read groups listed in FILE
-l, --positions FILE skip unlisted positions (chr pos) or regions (BED)
-q, --min-MQ INT skip alignments with mapQ smaller than INT [0]
-Q, --min-BQ INT skip bases with baseQ/BAQ smaller than INT [13]
-r, --region REG region in which pileup is generated
-R, --ignore-RG ignore RG tags (one BAM = one sample)
--rf, --incl-flags STR|INT required flags: skip reads with mask bits unset []
--ff, --excl-flags STR|INT filter flags: skip reads with mask bits set
[UNMAP,SECONDARY,QCFAIL,DUP]
-x, --ignore-overlaps disable read-pair overlap detection
Output options:
-o, --output FILE write output to FILE [standard output]
-g, --BCF generate genotype likelihoods in BCF format
-v, --VCF generate genotype likelihoods in VCF format
Output options for mpileup format (without -g/-v):
-O, --output-BP output base positions on reads
-s, --output-MQ output mapping quality
Output options for genotype likelihoods (when -g/-v is used):
-t, --output-tags LIST optional tags to output: DP,DPR,DV,DP4,INFO/DPR,SP []
-u, --uncompressed generate uncompressed VCF/BCF output
SNP/INDEL genotype likelihoods options (effective with -g/-v):
-e, --ext-prob INT Phred-scaled gap extension seq error probability [20]
-F, --gap-frac FLOAT minimum fraction of gapped reads [0.002]
-h, --tandem-qual INT coefficient for homopolymer errors [100]
-I, --skip-indels do not perform indel calling
-L, --max-idepth INT maximum per-sample depth for INDEL calling [250]
-m, --min-ireads INT minimum number gapped reads for indel candidates [1]
-o, --open-prob INT Phred-scaled gap open seq error probability [40]
-p, --per-sample-mF apply -m and -F per-sample for increased sensitivity
-P, --platforms STR comma separated list of platforms for indels [all]
Notes: Assuming diploid individuals.
bcftools call command
About: SNP/indel variant calling from VCF/BCF. To be used in conjunction with samtools mpileup.
This command replaces the former "bcftools view" caller. Some of the original
functionality has been temporarily lost in the process of transition to htslib,
but will be added back on popular demand. The original calling model can be
invoked with the -c option.
Usage: bcftools call [options] <in.vcf.gz>
File format options:
-o, --output <file> write output to a file [standard output]
-O, --output-type <b|u|z|v> output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
-r, --regions <region> restrict to comma-separated list of regions
-R, --regions-file <file> restrict to regions listed in a file
-s, --samples <list> list of samples to include [all samples]
-S, --samples-file <file> PED file or a file with optional second column for ploidy (0, 1 or 2) [all samples]
-t, --targets <region> similar to -r but streams rather than index-jumps
-T, --targets-file <file> similar to -R but streams rather than index-jumps
Input/output options:
-A, --keep-alts keep all possible alternate alleles at variant sites
-f, --format-fields <list> output format fields: GQ,GP (lowercase allowed) []
-g, --gvcf <minDP> output gVCF blocks of homozygous REF calls
-i, --insert-missed output also sites missed by mpileup but present in -T
-M, --keep-masked-ref keep sites with masked reference allele (REF=N)
-V, --skip-variants <type> skip indels/snps
-v, --variants-only output variant sites only
Consensus/variant calling options:
-c, --consensus-caller the original calling method (conflicts with -m)
-C, --constrain <str> one of: alleles, trio (see manual)
-m, --multiallelic-caller alternative model for multiallelic and rare-variant calling (conflicts with -c)
-n, --novel-rate <float>,[...] likelihood of novel mutation for constrained trio calling, see man page for details [1e-8,1e-9,1e-9]
-p, --pval-threshold <float> variant if P(ref|D)<FLOAT with -c [0.5]
-P, --prior <float> mutation rate [1e-3]
-X, --chromosome-X haploid output for male samples (requires PED file with -s)
-Y, --chromosome-Y haploid output for males and skips females (requires PED file with -s)
tabix command
Version: 1.0
Usage: tabix [OPTIONS] [FILE] [REGION [...]]
Options:
-0, --zero-based coordinates are zero-based
-b, --begin INT column number for region start [4]
-c, --comment CHAR skip comment lines starting with CHAR [null]
-e, --end INT column number for region end (if no end, set INT to -b) [5]
-f, --force overwrite existing index without asking
-h, --print-header print also the header lines
-H, --only-header print only the header lines
-l, --list-chroms list chromosome names
-m, --min-shift INT set the minimal interval size to 1<<INT; 0 for the old tabix index [0]
-p, --preset STR gff, bed, sam, vcf, bcf, bam
-r, --reheader FILE replace the header with the content of FILE
-s, --sequence INT column number for sequence names (suppressed by -p) [1]
-S, --skip-lines INT skip first INT lines [0]
review BCF and VCF results
top of the bowtie2 BCF data
##FILTER=<ID=PASS,Description="All filters passed">
##samtoolsVersion=1.0+htslib-1.0
##samtoolsCommand=samtools mpileup -go results.bcf -f ../ref/HiSeq_UCSC_hg19.fa pos-sorted.bam
##reference=file://../ref/HiSeq_UCSC_hg19.fa
##contig=<ID=chrM,length=16571>
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
##ALT=<ID=X,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version=3>
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##bcftools_viewVersion=1.0+htslib-1.0
##bcftools_viewCommand=view results.bcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT pos-sorted.bam
chr1 9986 . N C,<X> 0 . DP=1;I16=0,0,1,0,0,0,38,1444,0,0,2,4,0,0,0,0;QS=0,1,0;SGB=-0.379885;MQ0F=0 PL 4,3,0,4,3,4
chr1 9987 . N T,<X> 0 . DP=1;I16=0,0,1,0,0,0,38,1444,0,0,2,4,0,0,1,1;QS=0,1,0;SGB=-0.379885;MQ0F=0 PL 4,3,0,4,3,4
chr1 9988 . N A,<X> 0 . DP=1;I16=0,0,1,0,0,0,38,1444,0,0,2,4,0,0,2,4;QS=0,1,0;SGB=-0.379885;MQ0F=0 PL 4,3,0,4,3,4
chr1 9989 . N A,<X> 0 . DP=1;I16=0,0,1,0,0,0,38,1444,0,0,2,4,0,0,3,9;QS=0,1,0;SGB=-0.379885;MQ0F=0 PL 4,3,0,4,3,4
chr1 9990 . N C,<X> 0 . DP=1;I16=0,0,1,0,0,0,38,1444,0,0,2,4,0,0,4,16;QS=0,1,0;SGB=-0.379885;MQ0F=0 PL 4,3,0,4,3,4
top of the bowtie2 VCF data
##FILTER=<ID=PASS,Description="All filters passed">
##samtoolsVersion=1.0+htslib-1.0
##samtoolsCommand=samtools mpileup -ugf ref/HiSeq_UCSC_hg19.fa samtools_call/pos-sorted.bam
##reference=file://ref/HiSeq_UCSC_hg19.fa
##contig=<ID=chrM,length=16571>
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
##ALT=<ID=X,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version=3>
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=ICB,Number=1,Type=Float,Description="Inbreeding Coefficient Binomial test (bigger is better)">
##INFO=<ID=HOB,Number=1,Type=Float,Description="Bias in the number of HOMs number (smaller is better)">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callVersion=1.0+htslib-1.0
##bcftools_callCommand=call -vmO z -o samtools_call/results.vcf.gz
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT samtools_call/pos-sorted.bam
chr1 12298469 . T C 10.1993 . DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,1,0;MQ=40 GT:PL 0/1:38,3,0
chr1 55384052 . GAA GAAA 39.9213 . INDEL;IDV=2;IMF=1;DP=2;VDB=0.26;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,2,0;MQ=42 GT:PL 1/1:68,6,0
chr1 97631230 . C T 10.1993 . DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,1,0;MQ=42 GT:PL 0/1:38,3,0
chr1 119251615 . C T 10.4759 . DP=3;VDB=0.0681597;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=15 GT:PL 1/1:38,9,0
chr1 119251625 . G A 10.4759 . DP=3;VDB=0.0681597;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=15 GT:PL 1/1:38,9,0
plot statistics from the bcftools calls
bcftools stats -F <ref.fa> -s - <results.vcf.gz> > <results.vcf.gz.stats> mkdir bcftools_plots plot-vcfstats -p bcftools_plots/ <results.vcf.gz.stats>
plot-vcfstats command
Usage: plot-vcfstats [OPTIONS] file.chk ...
plot-vcfstats -p outdir/ file.chk ...
Options:
-m, --merge Merge vcfstats files to STDOUT, skip plotting.
-p, --prefix <path> The output files prefix, add a slash to create new directory.
-P, --no-PDF Skip the PDF creation step.
-r, --rasterize Rasterize PDF images for fast rendering.
-s, --sample-names Use sample names for xticks rather than numeric IDs.
-t, --title <string> Identify files by these titles in plots. Can be given multiple times.
-T, --main-title <string> Main title for the PDF.
-h, -?, --help This help message.
- plots for the bcftools data bowtie2 PDF file bwa PDF file
Add filter field to flag lower quality data
bcftools filter -O z -o <results-filtered.vcf.gz> -s LOWQUAL -i'%QUAL>10' <results.vcf.gz>
bcftools filter command
Usage: bcftools filter [options] <in.vcf.gz>
Options:
-e, --exclude <expr> exclude sites for which the expression is true (e.g. '%TYPE="snp" && %QUAL>=10 && (DP4[2]+DP4[3] > 2')
-g, --SnpGap <int> filter SNPs within <int> base pairs of an indel
-G, --IndelGap <int> filter clusters of indels separated by <int> or fewer base pairs allowing only one to pass
-i, --include <expr> include only sites for which the expression is true
-m, --mode [+x] "+": do not replace but add to existing FILTER; "x": reset filters at sites which pass
-o, --output <file> write output to a file [standard output]
-O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
-r, --regions <region> restrict to comma-separated list of regions
-R, --regions-file <file> restrict to regions listed in a file
-s, --soft-filter <string> annotate FILTER column with <string> or unique filter name ("Filter%d") made up by the program ("+")
-S, --set-GTs <.|0> set genotypes of failed samples to missing (.) or ref (0)
-t, --targets <region> similar to -r but streams rather than index-jumps
-T, --targets-file <file> similar to -R but streams rather than index-jumps
Filter expressions may contain:
- numerical constants, string constants, file names
.. 1, 1.0, 1e-4
.. "String"
.. @file_name
- arithmetic operators: +,*,-,/
- comparison operators: == (same as =), >, >=, <=, <, !=
- regex operator for string comparison: ~, !~
.. INFO/HAYSTACK ~ "needle"
- parentheses for grouping: (, )
- logical operators: &&, &, ||, |
- INFO tags, FORMAT tags, column names
.. INFO/DP or DP
.. FORMAT/DV, FMT/DV, or DV
.. %FILTER=="PASS", %QUAL>10, %ID!="."
.. %ID=@file, %ID!=@file .. selects IDs present/absent in the file
- 1 (or 0) to test the presence (or absence) of a flag
.. FlagA=1 && FlagB=0
- %TYPE for variant type in REF,ALT columns: indel,snp,mnp,ref,other
.. %TYPE="indel" | %TYPE="snp"
- array subscripts, * for any field:
.. (DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0.3
.. DP4[*]==0
- operations on FORMAT fields: MAX, MIN, AVG
.. %MIN(DV)>5
.. %MIN(DV/DP)>0.3
.. %MIN(DP)>10 & %MIN(DV)>3
.. %QUAL>10 | FMT/GQ>10 .. selects only GQ>10 samples
.. %QUAL>10 || FMT/GQ>10 .. selects all samples at QUAL>10 sites
top of the bowtie2 VCF data before and after filtering
chr1 12298469 . T C 10.1993 . DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,1,0;MQ=40 GT:PL 0/1:38,3,0
chr1 55384052 . GAA GAAA 39.9213 . INDEL;IDV=2;IMF=1;DP=2;VDB=0.26;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,2,0;MQ=42 GT:PL 1/1:68,6,0
chr1 97631230 . C T 10.1993 . DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,1,0;MQ=42 GT:PL 0/1:38,3,0
chr1 119251615 . C T 10.4759 . DP=3;VDB=0.0681597;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=15 GT:PL 1/1:38,9,0
chr1 119251625 . G A 10.4759 . DP=3;VDB=0.0681597;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=15 GT:PL 1/1:38,9,0
chr1 142537340 . G A 36.069 . DP=3;VDB=0.0249187;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,3,0;MQ=34 GT:PL 1/1:64,9,0
chr1 142537354 . T G 48.0682 . DP=3;VDB=0.0249187;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,3,0;MQ=34 GT:PL 1/1:76,9,0
chr1 142537398 . T A 17.2748 . DP=3;VDB=0.02;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,2,0;MQ=32 GT:PL 1/1:45,6,0
chr1 142537406 . A G 21.2256 . DP=3;VDB=0.02;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,2,0;MQ=32 GT:PL 1/1:49,6,0
chr1 142537556 . A G 52 . DP=3;VDB=0.199299;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=34 GT:PL 1/1:80,9,0
chr1 142687393 . A T 6.71778 . DP=12;VDB=3.87232e-06;SGB=-0.680642;MQSB=0.707404;MQ0F=0;AC=2;AN=2;DP4=0,0,9,3;MQ=6 GT:PL 1/1:33,36,0
chr1 142687417 . C T 10.1923 . DP=16;VDB=0.00584031;SGB=-0.686358;MQSB=0.606531;MQ0F=0;AC=2;AN=2;DP4=0,0,9,5;MQ=6 GT:PL 1/1:37,42,0
chr1 143415060 . A G 5.15896 . DP=9;VDB=0.0112009;SGB=-0.662043;MQSB=0;MQ0F=0.222222;AC=2;AN=2;DP4=0,0,2,7;MQ=4 GT:PL 1/1:31,27,0
chr1 143415068 . A T 5.15896 . DP=9;VDB=0.0112009;SGB=-0.662043;MQSB=0;MQ0F=0.222222;AC=2;AN=2;DP4=0,0,2,7;MQ=4 GT:PL 1/1:31,27,0
chr1 148881181 . A T 5.36537 . DP=29;VDB=0.00512804;SGB=-0.692976;RPB=1;MQB=1;MQSB=0.114218;BQB=1;MQ0F=0;AC=2;AN=2;DP4=0,1,24,2;MQ=6 GT:PL 1/1:31,67,0
chr1 148885930 . T C 10.7902 . DP=6;VDB=0.0041585;SGB=-0.590765;MQ0F=0;AC=2;AN=2;DP4=0,0,5,0;MQ=10 GT:PL 1/1:38,15,0
## and after applying the command
chr1 12298469 . T C 10.1993 PASS DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,1,0;MQ=40 GT:PL 0/1:38,3,0
chr1 55384052 . GAA GAAA 39.9213 PASS INDEL;IDV=2;IMF=1;DP=2;VDB=0.26;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,2,0;MQ=42 GT:PL 1/1:68,6,0
chr1 97631230 . C T 10.1993 PASS DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,1,0;MQ=42 GT:PL 0/1:38,3,0
chr1 119251615 . C T 10.4759 PASS DP=3;VDB=0.0681597;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=15 GT:PL 1/1:38,9,0
chr1 119251625 . G A 10.4759 PASS DP=3;VDB=0.0681597;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=15 GT:PL 1/1:38,9,0
chr1 142537340 . G A 36.069 PASS DP=3;VDB=0.0249187;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,3,0;MQ=34 GT:PL 1/1:64,9,0
chr1 142537354 . T G 48.0682 PASS DP=3;VDB=0.0249187;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,3,0;MQ=34 GT:PL 1/1:76,9,0
chr1 142537398 . T A 17.2748 PASS DP=3;VDB=0.02;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,2,0;MQ=32 GT:PL 1/1:45,6,0
chr1 142537406 . A G 21.2256 PASS DP=3;VDB=0.02;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,2,0;MQ=32 GT:PL 1/1:49,6,0
chr1 142537556 . A G 52 PASS DP=3;VDB=0.199299;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=34 GT:PL 1/1:80,9,0
chr1 142687393 . A T 6.71778 LOWQUAL DP=12;VDB=3.87232e-06;SGB=-0.680642;MQSB=0.707404;MQ0F=0;AC=2;AN=2;DP4=0,0,9,3;MQ=6 GT:PL 1/1:33,36,0
chr1 142687417 . C T 10.1923 PASS DP=16;VDB=0.00584031;SGB=-0.686358;MQSB=0.606531;MQ0F=0;AC=2;AN=2;DP4=0,0,9,5;MQ=6 GT:PL 1/1:37,42,0
chr1 143415060 . A G 5.15896 LOWQUAL DP=9;VDB=0.0112009;SGB=-0.662043;MQSB=0;MQ0F=0.222222;AC=2;AN=2;DP4=0,0,2,7;MQ=4 GT:PL 1/1:31,27,0
chr1 143415068 . A T 5.15896 LOWQUAL DP=9;VDB=0.0112009;SGB=-0.662043;MQSB=0;MQ0F=0.222222;AC=2;AN=2;DP4=0,0,2,7;MQ=4 GT:PL 1/1:31,27,0
chr1 148881181 . A T 5.36537 LOWQUAL DP=29;VDB=0.00512804;SGB=-0.692976;RPB=1;MQB=1;MQSB=0.114218;BQB=1;MQ0F=0;AC=2;AN=2;DP4=0,1,24,2;MQ=6 GT:PL 1/1:31,67,0
chr1 148885930 . T C 10.7902 PASS DP=6;VDB=0.0041585;SGB=-0.590765;MQ0F=0;AC=2;AN=2;DP4=0,0,5,0;MQ=10 GT:PL 1/1:38,15,0
Call with 'samtools mpileup' & 'Varscan2'
Varscan2 can be used to call variants from a samtools pileup in heuristic/statistic manner rather than using a probabilistic methods like bcftools. Reports claim that varscan2 can be as efficient as bcftools for this last step. We therefore provide the varscan2 command with default settings to allow users make their own mind. Please read more about this software on the varscan official site[2].
Varscan2 has predefined methods for aims (incl cancer pairs), please read more on the Varscan2 pages
Call Germline variants from a mpileup
# lines starting with '--' show here default values but can be altered to fit user needs # samples can be renamed using the '--vcf-sample-list' parameter ## process with pipes to save disk space samtools mpileup -f <ref.fa> <pos-sorted_mdup.bam> | \ java -Xmx8G -jar $BIOTOOLS/varscan2/VarScan.v2.3.7.jar mpileup2cns \ --min-coverage 8 \ --min-reads2 2 \ --min-avg-qual 15 \ --min-var-freq 0.01 \ --min-freq-for-hom 0.75 \ --p-value 99e-02 \ --strand-filter 1 \ --variants \ --output-vcf 1 | \ bgzip -c > varscan-results.vcf.gz # same commands in sequence # note that we do not repeat the default options here to keep it short #samtools mpileup -f <ref.fa> <pos-sorted_mdup.bam> > myData.pileup #cat myData.pileup | java -Xmx8G -jar $BIOTOOLS/varscan2/VarScan.v2.3.7.jar mpileup2cns \ # --variants \ # --output-vcf 1 | \ # bgzip -c > <varscan-results.vcf.gz> # index results tabix -p vcf <varscan-results.vcf.gz>
varscan2 results
Only variants will be reported
Warning: No p-value threshold provided, so p-values will not be calculated
Min coverage: 8
Min reads2: 2
Min var freq: 0.2
Min avg qual: 15
P-value thresh: 0.01
Input stream not ready, waiting for 5 seconds...
Reading input from STDIN
35412026 bases in pileup file
85707 variant positions (74640 SNP, 11067 indel)
1539 were failed by the strand-filter
84168 variant positions reported (73186 SNP, 10982 indel)
## bowtie varscan2 results
Only variants will be reported
Warning: No p-value threshold provided, so p-values will not be calculated
Min coverage: 8
Min reads2: 2
Min var freq: 0.2
Min avg qual: 15
P-value thresh: 0.01
Input stream not ready, waiting for 5 seconds...
Reading input from STDIN
35491310 bases in pileup file
85911 variant positions (74569 SNP, 11342 indel)
1346 were failed by the strand-filter
84565 variant positions reported (73303 SNP, 11262 indel)
varscan2 mpileup2cns command
Warning: No p-value threshold provided, so p-values will not be calculated
Min coverage: 8
Min reads2: 2
Min var freq: 0.2
Min avg qual: 15
P-value thresh: 0.01
USAGE: java -jar VarScan.jar mpileup2cns [pileup file] OPTIONS
mpileup file - The SAMtools mpileup file
OPTIONS:
--min-coverage Minimum read depth at a position to make a call [8]
--min-reads2 Minimum supporting reads at a position to call variants [2]
--min-avg-qual Minimum base quality at a position to count a read [15]
--min-var-freq Minimum variant allele frequency threshold [0.01]
--min-freq-for-hom Minimum frequency to call homozygote [0.75]
--p-value Default p-value threshold for calling variants [99e-02]
--strand-filter Ignore variants with >90% support on one strand [1]
--output-vcf If set to 1, outputs in VCF format
--vcf-sample-list For VCF output, a list of sample names in order, one per line
--variants Report only variant (SNP/indel) positions [0]
plot statistics from the varscan2 calls
bcftools stats -F <ref.fa> -s - <varscan-results.vcf.gz> > <varscan-results.vcf.gz.stats> mkdir varscan_plots plot-vcfstats -p varscan_plots/ <varscan-results.vcf.gz.stats>
- plots for the varscan2 data bowtie2 PDF file bwa PDF file
Comparing BWA & Bowtie2 results
The main question motivating this analysis was to quantify the impact of the choice of the mapper on the final result being the list of predicted variants in the NA18507 genome. The data selected for this experiment is ideal in some aspects and biased in others as detailed below.
plus points
- this SRA experiment is of high coverage
- the reads are paired which facilitates their unambiguous mapping
- the read length is sufficient to favor unique mapping
- all steps were performed in one go using the same software tools and identical reference genome (hg19)
- bowtie2 is used here and is more compared to bwa than bowtie1 usually reported in similar comparison studies
minus points
- the reads used here represent the chr21 subset of all reads mapped using different tools in a former mapping (against hg18) and therefore lack unmapped reads or ambiguous reads that could slightly change the results
- additional cleanup using GATK would be good to improve the indel part of the calls and reduce FPR in the final results (but should only slightlty change the global results)
intersecting results with VCFTools
We fist extract the chr21 calls from each vcf file to compare apples with apples
filter chr21 calls
# extract the chr21 subset of calls # vcftools --gzvcf <full-genome.vcf.gz> \ # --chr chr21 \ # --out <chr21.vcf.gz> \ # --recode ## repeat this for both bwa and bowtie2 VCF sets flagged here with <met> # bowtie calls vcftools --gzvcf results.vcf.gz \ --chr chr21 \ --out chr21_ \ --recode # perform some cleanup bgzip -c chr21_.recode.vcf > <met>_chr21_results.vcf.gz && rm chr21_.recode.vcf tabix -p vcf <met>_chr21_results.vcf.gz # varscan calls vcftools --gzvcf varscan-results.vcf.gz \ --chr chr21 \ --out chr21_varscan \ --recode # perform some cleanup bgzip -c chr21_varscan.recode.vcf > <met>_chr21_varscan-results.vcf.gz && rm chr21_varscan.recode.vcf tabix -p vcf <met>_chr21_results.vcf.gz
We then intersect the 4 files using vcftools to get data for a Venn plot
vcftools intersect 4 chr21 call-sets
# define input files to compare bowtie_bcftools=bowtie2_NA18507-data/bowtie2_chr21_results.vcf.gz bowtie_varscan=bowtie2_NA18507-data/bowtie2_chr21_varscan-results.vcf.gz bwa_bcftools=bwa_NA18507-data/bwa_chr21_results.vcf.gz bwa_varscan=bwa_NA18507-data/bwa_chr21_varscan-results.vcf.gz vcf-compare ${bowtie_bcftools} \ ${bowtie_varscan} \ ${bwa_bcftools} \ ${bwa_varscan} | \ grep ^VN | cut -f 2- | column -t \ > vcf-compare_4.txt # substitute names to ease reading perl -p -e 's:bowtie2_NA18507-data/bowtie2_chr21_results.vcf.gz:A:g' vcf-compare_4.txt | \ perl -p -e 's:bowtie2_NA18507-data/bowtie2_chr21_varscan-results.vcf.gz:B:g' | \ perl -p -e 's:bwa_NA18507-data/bwa_chr21_results.vcf.gz:C:g' | \ perl -p -e 's:bwa_NA18507-data/bwa_chr21_varscan-results.vcf.gz:D:g' | column -t 8 A (0.0%) D (0.0%) 109 A (0.1%) B (0.1%) D (0.1%) 179 D (0.2%) 267 A (0.3%) B (0.3%) 304 A (0.4%) C (0.3%) D (0.4%) 304 B (0.4%) C (0.3%) 610 A (0.7%) B (0.7%) C (0.7%) 684 B (0.8%) 835 C (0.9%) D (1.0%) 1942 B (2.4%) D (2.4%) 2038 A (2.5%) 3899 B (4.8%) C (4.4%) D (4.8%) 4100 C (4.6%) 4933 A (6.0%) C (5.5%) 74160 A (90.0%) B (90.5%) C (83.2%) D (91.1%) # the obtained value can be fed to the in-house developed Venn plotting tool 4DVenn.R --ad.count=8 --abd.count=109 --d.count=179 --ab.count=267 --acd.count=304 --bc.count=304 \ --abc.count=610 --b.count=684 --cd.count=835 --bd.count=1942 --a.count=2038 --bcd.count=3899 \ --c.count=4100 --ac.count=4933 --abcd.count=74160 -t "bowtie2 bwa comparison" -x 1 -o 4Dvenn -u 2
Legend:
- A=bowtie2_bcftools
- B=bowtie2_varscan
- C=bwa_bcftools
- D=bwa_varscan
comparing results to the validated data for NA18507
Intersecting calls should be complemented by the validation of the randomly selected variant calls. We present here only the intersection with HapMap3 data obtained from the Broad ftp site.
Comparing bowtie and bwa results to HapMap3 validated variants
filter chr21 calls
# define input files to compare bowtie_bcftools=bowtie2_chr21_results.vcf.gz bowtie_varscan=bowtie2_chr21_varscan-results.vcf.gz bwa_bcftools=bwa_chr21_results.vcf.gz bwa_varscan=bwa_chr21_varscan-results.vcf.gz hapmap=chr21_NA18507_hapmap_3.3.hg19.vcf.gz ## bowtie2 data vcf-compare ${bowtie_bcftools} \ ${bowtie_varscan} \ ${hapmap} | \ grep ^VN | cut -f 2- | column -t \ > vcf-bowtie-hapmap-compare.txt # substitute names to ease reading perl -p -e 's:bowtie2_chr21_results.vcf.gz:A:g' vcf-bowtie-hapmap-compare.txt | \ perl -p -e 's:bowtie2_chr21_varscan-results.vcf.gz:B:g' | \ perl -p -e 's:chr21_NA18507_hapmap_3.3.hg19.vcf.gz:C:g' | \ column -t > vcf-bowtie-hapmap-compare_ref.txt # plotting 3DVenn.R --bc.count=16 --ac.count=41 --c.count=66 --b.count=6813 \ --a.count=7242 --abc.count=9929 --ab.count=65217 \ -t "bowtie vs HapMap3.3" -o bowtie-vs-hapmap -u 2 -x 1 ## bwa data vcf-compare ${bwa_bcftools} \ ${bwa_varscan} \ ${hapmap} | \ grep ^VN | cut -f 2- | column -t \ > vcf-bwa-hapmap-compare.txt # substitute names to ease reading perl -p -e 's:bwa_chr21_results.vcf.gz:A:g' vcf-bwa-hapmap-compare.txt | \ perl -p -e 's:bwa_chr21_varscan-results.vcf.gz:B:g' | \ perl -p -e 's:chr21_NA18507_hapmap_3.3.hg19.vcf.gz:C:g' | \ column -t > vcf-bwa-hapmap-compare_ref.txt # plotting 3DVenn.R --bc.count=0 --ac.count=42 --c.count=65 --b.count=2238 \ --a.count=9905 --abc.count=9945 --ab.count=69253 \ -t "bwa vs HapMap3.3" -o bwa-vs-hapmap -u 2 -x 1
Legend:
- A=bowtie2_bcftools
- B=bowtie2_varscan
- C=HapMap3.3
Legend:
- A=bwa_bcftools
- B=bwa_varscan
- C=HapMap3.3
Conclusion
Several conclusion can be drawn from these results:
- the recall between both Bowtie2 ands Bwa mapping data is quite high and this is comforting for users prefering one maper tothe other. Keep however in mind that this is an ideal dataset with high amount of reads and that comes from pre-mapped data and therefore does not contain many difficult reads.
- the variants validated by HapMap are found in very similar wether the data is obtained from bowtie or bwa.
- the choice of the caller does not seem to cause large differences in count or recall by HapMap, suggesting that users may choose between bcftools or varscan2
- Varscan2 calls less private variants and is 'seems' less noisy. It is likely that filtering out low quality mappings before calling could reverse this.
- one additional control should be comparing all these calls with those obtained from the GATK post-processed workflow. This goes far beyond the simple aim of this experiment and some elements of answer can be found in our separate hands-on results presented in Hands-on_introduction_to_NGS_variant_analysis
References:
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis ]