Call variants with samtools 1.0

From BITS wiki
Jump to: navigation, search

Call chr21 variants from newly mapped BAM data using samtools (htslib) and bcftools (htslib) OR varscan2

[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis ]


Introduction

This page content is largely inspired from the official htslib page[1] and is dedicated to one single step in the variant analysis workflow and to the recently released new version of Samtols based on the htslib. Commands presented here will differ from former commands using samtools until version 0.19 which was based on other C libraries. When calling several genomes together (replicates, family members, or patient control cohorts), the commands can be adapted by providing several BAM files. The call from several genomes should be done simultaneously in order to include full call information across all input genomes (no-calls, ref-calls, half-calls).

Technical.png Samtools mpileup was designed to process DIPLOID genomes, if your genome is not diploid, please consider using better adapted tools (eg. Broad GATK)

Technical.png file names shown between '<>' can be replaced by your own. If you copy the code, please remove the '<' and <>' delimiters

Workflow

Map reads to the reference genome

We chose to take only chr21 reads for this experiment in order to speed up the process while retaining full genome as a reference.

This part is not detailed here in order to come to the point. Both BWA and Bowtie2 were used with standard parameters to generate a SAM file for each mapper. The mapping was performed on a Centos computer with 8 threads and a maximum of 4GB RAM to provide material reproducible by the average user.

mapping commands

## Bowtie2 mapping command
bowtie2 -p 8 -x /data/biodata/bowtie-index/HiSeq_UCSC_hg19 -q \
    -1 PE_NA18507_GAIIx_100_1.fq.gz -2 PE_NA18507_GAIIx_100_2.fq.gz --phred33 --fr -S  bowtie_mapped-reads.sam
# Real time: 991 sec
 
## BWA mapping command
bwa mem -R @RG\tID:NA18507\tLB:lib-NA18507\tPU:unkn-0.0\tPL:ILLUMINA\tSM:GAIIx-chr21-BWA.mem \
    -M -t 8 ref/HiSeq_UCSC_hg19.fa PE_NA18507_GAIIx_100_1.fq.gz PE_NA18507_GAIIx_100_2.fq.gz
# Real time: 922 sec
BWA monitoring Bowtie2 monitoring
bwa_usage1401552529_usage.png
bowtie2_usage1401561413_usage.png
bwa_usage1401560123_usage.png
bowtie2_usage1401548356_usage.png

Time recording of each mapping returned values reproduced here as reference and showing no speed advantage for Bowtie with ~15min for bwa and 17 min for bowtie2. This is in contradiction to what has been reported by others using older versions of both BWA and Bowtie.

Clean the mapping data from mate errors

Handicon.png We use here NA18507 chr21 data introduced in the Hands-on_introduction_to_NGS_variant_analysis NGS_Exercise.5 and mapped using bowtie2

clean up read pairing information and flags and sort

Samtools mpileup needs the BAM file be position-sorted instead of unsorted or read-name-sorted as obtained from some processing tools like Picard. The following command will sort a BAM files to comply. Furthermore, data obtained from bowtie or BWA may have issues with mated reads. It is best to fix these potential issues now.

Handicon.png The next commands use 8threads and 12GB RAM, adapt to your own machine by tuning -m and -@

# sort reads by name required by the next command
samtools sort -n -m 12G -@ 8 -O bam -T </tmp/sort1> -o <name-sorted.bam> <mapped-reads.sam>
 
# fix mate information and output to bam
samtools fixmate -O bam <name-sorted.bam> <fixmate.bam>
 
# sort the bam data by position
samtools sort -m 12G -@ 8 -O bam -T </tmp/sort2> -o <pos-sorted.bam> <fixmate.bam>

samtools sort command

$ samtools sort
Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 to 9 [-1]
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n         Sort by read name
  -o FILE    Write final output to FILE rather than standard output
  -O FORMAT  Write output as FORMAT ('sam'/'bam'/'cram')   (either -O or
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam       -T is required)
  -@ INT     Set number of sorting and compression threads [1]

Legacy usage: samtools sort [options...] <in.bam> <out.prefix>
Options:
  -f         Use <out.prefix> as full final filename rather than prefix
  -o         Write final output to stdout rather than <out.prefix>.bam
  -l,m,n,@   Similar to corresponding options above

samtools fixmate command

$ samtools fixmate
Usage: samtools fixmate <in.nameSrt.bam> <out.nameSrt.bam>

Options:
  -r         Remove unmapped reads and secondary alignments
  -p         Disable FR proper pair check
  -O FORMAT  Write output as FORMAT ('sam'/'bam'/'cram')
As elsewhere in samtools, use '-' as the filename for stdin/stdout. The input
file must be grouped by read name (e.g. sorted by name). Coordinated sorted
input is not accepted.

Evaluate the data with 'samtools flagstat' during the cleaning process

samtools flagstat output before fixing mates

$ samtools flagstat name-sorted.bam
14990194 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
14858505 + 0 mapped (99.12%:-nan%)
14990194 + 0 paired in sequencing
7495097 + 0 read1
7495097 + 0 read2
14725068 + 0 properly paired (98.23%:-nan%)
14774636 + 0 with itself and mate mapped
83869 + 0 singletons (0.56%:-nan%)
38992 + 0 with mate mapped to a different chr
26405 + 0 with mate mapped to a different chr (mapQ>=5)

samtools flagstat output after fixing mates

$ samtools flagstat fixmate.bam
14990194 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
14858505 + 0 mapped (99.12%:-nan%)
14990194 + 0 paired in sequencing
7495097 + 0 read1
7495097 + 0 read2
14725068 + 0 properly paired (98.23%:-nan%)
14774636 + 0 with itself and mate mapped
83869 + 0 singletons (0.56%:-nan%)
38992 + 0 with mate mapped to a different chr
26405 + 0 with mate mapped to a different chr (mapQ>=5)

Nothing changed during the fixmate process which is a good thing.

Mark duplicates using Picard tools

Read duplicates (PCR duplicates) can arise from excessive PCR amplification of the sequencing library. Similarly, optical duplicates can result from re-scanning the same read cluster on the Illumina chip and create distinct read entries from the same physical spot. Both types of duplicates contribute to the amplification of sequencing error and create false positive calls and should best be 'marked' (but not removed) before proceeding with the analysis. Duplicate marking can be done using several existing tools but most often with Picard (here Picard version 119).

# requirements: input is coordinate-sorted
java -jar $PICARD/MarkDuplicates.jar \
    I=<pos-sorted.bam> \
    O=<pos-sorted_mdup.bam> \
    M=<MarkDuplicate_metrics.txt> \
    REMOVE_DUPLICATES=false \
    AS=true

Picard markduplicates command

USAGE: MarkDuplicates [options]
Documentation: http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates
Examines aligned records in the supplied SAM or BAM file to locate duplicate molecules.
All records are then written to the output file with the duplicate records flagged.
Version: 1.119(d44cdb51745f5e8075c826430a39d8a61f1dd832_1408991805)

Options:
--help
-h                            Displays options specific to this tool.

--stdhelp
-H                            Displays options specific to this tool AND options common to all Picard command line
                              tools.

--version                     Displays program version.

INPUT=File
I=File                        One or more input SAM or BAM files to analyze. Must be coordinate sorted.  
                              May not be a stream because file is read twice.  This option may be specified 0 or more times.

OUTPUT=File
O=File                        The output file to write marked records to  Required.

METRICS_FILE=File
M=File                        File to write duplication metrics to  Required.

PROGRAM_RECORD_ID=String
PG=String                     The program record ID for the @PG record(s) created by this program. Set to null to
                              disable PG record creation.  This string may have a suffix appended to avoid collision
                              with other program record IDs.  Default value: MarkDuplicates. This option can be set to
                              'null' to clear the default value.

PROGRAM_GROUP_VERSION=String
PG_VERSION=String             Value of VN tag of PG record to be created. If not specified, the version will be
                              detected automatically.  Default value: null.

PROGRAM_GROUP_COMMAND_LINE=String
PG_COMMAND=String             Value of CL tag of PG record to be created. If not supplied the command line will be
                              detected automatically.  Default value: null.

PROGRAM_GROUP_NAME=String
PG_NAME=String                Value of PN tag of PG record to be created.  Default value: MarkDuplicates. This option
                              can be set to 'null' to clear the default value.

COMMENT=String
CO=String                     Comment(s) to include in the output file's header.  This option may be specified 0 or
                              more times.

REMOVE_DUPLICATES=Boolean     If true do not write duplicates to the output file instead of writing them with
                              appropriate flags set.  Default value: false. This option can be set to 'null' to clear
                              the default value. Possible values: {true, false}

ASSUME_SORTED=Boolean
AS=Boolean                    If true, assume that the input file is coordinate sorted even if the header says
                              otherwise.  Default value: false. This option can be set to 'null' to clear the default
                              value. Possible values: {true, false}

MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=Integer
MAX_SEQS=Integer              This option is obsolete. ReadEnds will always be spilled to disk.  Default value: 50000.
                              This option can be set to 'null' to clear the default value.

MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=Integer
MAX_FILE_HANDLES=Integer      Maximum number of file handles to keep open when spilling read ends to disk. Set this
                              number a little lower than the per-process maximum number of file that may be open. This
                              number can be found by executing the 'ulimit -n' command on a Unix system.  Default
                              value: 8000. This option can be set to 'null' to clear the default value.

SORTING_COLLECTION_SIZE_RATIO=Double
                              This number, plus the maximum RAM available to the JVM, determine the memory footprint
                              used by some of the sorting collections.  If you are running out of memory, try reducing
                              this number.  Default value: 0.25. This option can be set to 'null' to clear the default
                              value.

READ_NAME_REGEX=String        Regular expression that can be used to parse read names in the incoming SAM file. Read
                              names are parsed to extract three variables: tile/region, x coordinate and y coordinate.
                              These values are used to estimate the rate of optical duplication in order to give a more
                              accurate estimated library size. Set this option to null to disable optical duplicate
                              detection. The regular expression should contain three capture groups for the three
                              variables, in order. It must match the entire read name. Note that if the default regex
                              is specified, a regex match is not actually done, but instead the read name  is split on
                              colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be
                              tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements
                              are assumed to be tile, x and y values.  Default value:
                              [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*. This option can be set to 'null' to
                              clear the default value.

OPTICAL_DUPLICATE_PIXEL_DISTANCE=Integer
                              The maximum offset between two duplicte clusters in order to consider them optical
                              duplicates. This should usually be set to some fairly small number (e.g. 5-10 pixels)
                              unless using later versions of the Illumina pipeline that multiply pixel values by 10, in
                              which case 50-100 is more normal.  Default value: 100. This option can be set to 'null'
                              to clear the default value.
BWA duplicates Bowtie2 duplicates
LIBRARY                       lib-NA18507
UNPAIRED_READS_EXAMINED       17268
READ_PAIRS_EXAMINED           7477196
UNMAPPED_READS                18534
UNPAIRED_READ_DUPLICATES      4726
READ_PAIR_DUPLICATES          5467
READ_PAIR_OPTICAL_DUPLICATES  1311
PERCENT_DUPLICATION           0.001046
ESTIMATED_LIBRARY_SIZE        6721383713
LIBRARY                       Unknown-Library
UNPAIRED_READS_EXAMINED       83869
READ_PAIRS_EXAMINED           7387318
UNMAPPED_READS                131689
UNPAIRED_READ_DUPLICATES      24272
READ_PAIR_DUPLICATES          3371
READ_PAIR_OPTICAL_DUPLICATES  1110
PERCENT_DUPLICATION           0.002087
ESTIMATED_LIBRARY_SIZE        12062126176

As seen above, BWA results in less unmapped reads with 18'534 against 131'689 for bowtie2. Other metrics differ and are left to the reader to evaluate. More BAM QC can be performed using different tools and is not detailed further.

Call variants with 'samtools mpileup' & 'bcftools'

Perform local re-alignment of reads and output to BCF and VCF

Technical.png The 'ref.fa' file is the exact same genome fasta file used to build the bowtie/bwa mapping index

## cryptic parameters below read as follows:
 
## samtools mpileup
# -u, --uncompressed      generate uncompressed VCF/BCF output
# -g, --BCF               generate genotype likelihoods in BCF format
# -f, --fasta-ref FILE    faidx indexed reference sequence file
 
## bcftools
# -v, --variants-only             output variant sites only
# -m, --multiallelic-caller       alternative model for multiallelic and rare-variant calling (conflicts with -c)
# -O, --output-type <b|u|z|v>     output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
# -o, --output <file>             write output to a file [standard output]
samtools mpileup -ugf <ref.fa> <pos-sorted_mdup.bam> | bcftools call -vmO z -o <results.vcf.gz>
 
# the same command split in two steps would be
# -o, --output FILE       write output to FILE [standard output]
# samtools mpileup -go <results.bcf> -f <ref.fa> <pos-sorted_mdup.bam>
# bcftools call -vmO z -o <results.vcf.gz> <results.bcf>
 
# index the results for fast query
tabix -p vcf <results.vcf.gz>

samtools mpileup command

$ samtools mpileup

Usage: samtools mpileup [options] in1.bam [in2.bam [...]]

Input options:
  -6, --illumina1.3+      quality is in the Illumina-1.3+ encoding
  -A, --count-orphans     do not discard anomalous read pairs
  -b, --bam-list FILE     list of input BAM filenames, one per line
  -B, --no-BAQ            disable BAQ (per-Base Alignment Quality)
  -C, --adjust-MQ INT     adjust mapping quality; recommended:50, disable:0 [0]
  -d, --max-depth INT     max per-BAM depth; avoids excessive memory usage [250]
  -E, --redo-BAQ          recalculate BAQ on the fly, ignore existing BQs
  -f, --fasta-ref FILE    faidx indexed reference sequence file
  -G, --exclude-RG FILE   exclude read groups listed in FILE
  -l, --positions FILE    skip unlisted positions (chr pos) or regions (BED)
  -q, --min-MQ INT        skip alignments with mapQ smaller than INT [0]
  -Q, --min-BQ INT        skip bases with baseQ/BAQ smaller than INT [13]
  -r, --region REG        region in which pileup is generated
  -R, --ignore-RG         ignore RG tags (one BAM = one sample)
  --rf, --incl-flags STR|INT  required flags: skip reads with mask bits unset []
  --ff, --excl-flags STR|INT  filter flags: skip reads with mask bits set
                                            [UNMAP,SECONDARY,QCFAIL,DUP]
  -x, --ignore-overlaps   disable read-pair overlap detection

Output options:
  -o, --output FILE       write output to FILE [standard output]
  -g, --BCF               generate genotype likelihoods in BCF format
  -v, --VCF               generate genotype likelihoods in VCF format

Output options for mpileup format (without -g/-v):
  -O, --output-BP         output base positions on reads
  -s, --output-MQ         output mapping quality

Output options for genotype likelihoods (when -g/-v is used):
  -t, --output-tags LIST  optional tags to output: DP,DPR,DV,DP4,INFO/DPR,SP []
  -u, --uncompressed      generate uncompressed VCF/BCF output

SNP/INDEL genotype likelihoods options (effective with -g/-v):
  -e, --ext-prob INT      Phred-scaled gap extension seq error probability [20]
  -F, --gap-frac FLOAT    minimum fraction of gapped reads [0.002]
  -h, --tandem-qual INT   coefficient for homopolymer errors [100]
  -I, --skip-indels       do not perform indel calling
  -L, --max-idepth INT    maximum per-sample depth for INDEL calling [250]
  -m, --min-ireads INT    minimum number gapped reads for indel candidates [1]
  -o, --open-prob INT     Phred-scaled gap open seq error probability [40]
  -p, --per-sample-mF     apply -m and -F per-sample for increased sensitivity
  -P, --platforms STR     comma separated list of platforms for indels [all]

Notes: Assuming diploid individuals.

bcftools call command

$ bcftools call

About:   SNP/indel variant calling from VCF/BCF. To be used in conjunction with samtools mpileup.
         This command replaces the former "bcftools view" caller. Some of the original
         functionality has been temporarily lost in the process of transition to htslib,
         but will be added back on popular demand. The original calling model can be
         invoked with the -c option.
Usage:   bcftools call [options] <in.vcf.gz>

File format options:
   -o, --output <file>             write output to a file [standard output]
   -O, --output-type <b|u|z|v>     output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
   -r, --regions <region>          restrict to comma-separated list of regions
   -R, --regions-file <file>       restrict to regions listed in a file
   -s, --samples <list>            list of samples to include [all samples]
   -S, --samples-file <file>       PED file or a file with optional second column for ploidy (0, 1 or 2) [all samples]
   -t, --targets <region>          similar to -r but streams rather than index-jumps
   -T, --targets-file <file>       similar to -R but streams rather than index-jumps

Input/output options:
   -A, --keep-alts                 keep all possible alternate alleles at variant sites
   -f, --format-fields <list>      output format fields: GQ,GP (lowercase allowed) []
   -g, --gvcf <minDP>              output gVCF blocks of homozygous REF calls
   -i, --insert-missed             output also sites missed by mpileup but present in -T
   -M, --keep-masked-ref           keep sites with masked reference allele (REF=N)
   -V, --skip-variants <type>      skip indels/snps
   -v, --variants-only             output variant sites only

Consensus/variant calling options:
   -c, --consensus-caller          the original calling method (conflicts with -m)
   -C, --constrain <str>           one of: alleles, trio (see manual)
   -m, --multiallelic-caller       alternative model for multiallelic and rare-variant calling (conflicts with -c)
   -n, --novel-rate <float>,[...]  likelihood of novel mutation for constrained trio calling, see man page for details [1e-8,1e-9,1e-9]
   -p, --pval-threshold <float>    variant if P(ref|D)<FLOAT with -c [0.5]
   -P, --prior <float>             mutation rate [1e-3]
   -X, --chromosome-X              haploid output for male samples (requires PED file with -s)
   -Y, --chromosome-Y              haploid output for males and skips females (requires PED file with -s)

tabix command

$ tabix

Version: 1.0
Usage:   tabix [OPTIONS] [FILE] [REGION [...]]
Options:
   -0, --zero-based        coordinates are zero-based
   -b, --begin INT         column number for region start [4]
   -c, --comment CHAR      skip comment lines starting with CHAR [null]
   -e, --end INT           column number for region end (if no end, set INT to -b) [5]
   -f, --force             overwrite existing index without asking
   -h, --print-header      print also the header lines
   -H, --only-header       print only the header lines
   -l, --list-chroms       list chromosome names
   -m, --min-shift INT     set the minimal interval size to 1<<INT; 0 for the old tabix index [0]
   -p, --preset STR        gff, bed, sam, vcf, bcf, bam
   -r, --reheader FILE     replace the header with the content of FILE
   -s, --sequence INT      column number for sequence names (suppressed by -p) [1]
   -S, --skip-lines INT    skip first INT lines [0]

review BCF and VCF results

top of the bowtie2 BCF data

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##samtoolsVersion=1.0+htslib-1.0
##samtoolsCommand=samtools mpileup -go results.bcf -f ../ref/HiSeq_UCSC_hg19.fa pos-sorted.bam
##reference=file://../ref/HiSeq_UCSC_hg19.fa
##contig=<ID=chrM,length=16571>
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
##ALT=<ID=X,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version=3>
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##bcftools_viewVersion=1.0+htslib-1.0
##bcftools_viewCommand=view results.bcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  pos-sorted.bam
chr1    9986    .       N       C,<X>   0       .       DP=1;I16=0,0,1,0,0,0,38,1444,0,0,2,4,0,0,0,0;QS=0,1,0;SGB=-0.379885;MQ0F=0      PL      4,3,0,4,3,4
chr1    9987    .       N       T,<X>   0       .       DP=1;I16=0,0,1,0,0,0,38,1444,0,0,2,4,0,0,1,1;QS=0,1,0;SGB=-0.379885;MQ0F=0      PL      4,3,0,4,3,4
chr1    9988    .       N       A,<X>   0       .       DP=1;I16=0,0,1,0,0,0,38,1444,0,0,2,4,0,0,2,4;QS=0,1,0;SGB=-0.379885;MQ0F=0      PL      4,3,0,4,3,4
chr1    9989    .       N       A,<X>   0       .       DP=1;I16=0,0,1,0,0,0,38,1444,0,0,2,4,0,0,3,9;QS=0,1,0;SGB=-0.379885;MQ0F=0      PL      4,3,0,4,3,4
chr1    9990    .       N       C,<X>   0       .       DP=1;I16=0,0,1,0,0,0,38,1444,0,0,2,4,0,0,4,16;QS=0,1,0;SGB=-0.379885;MQ0F=0     PL      4,3,0,4,3,4

top of the bowtie2 VCF data

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##samtoolsVersion=1.0+htslib-1.0
##samtoolsCommand=samtools mpileup -ugf ref/HiSeq_UCSC_hg19.fa samtools_call/pos-sorted.bam
##reference=file://ref/HiSeq_UCSC_hg19.fa
##contig=<ID=chrM,length=16571>
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
##ALT=<ID=X,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version=3>
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=ICB,Number=1,Type=Float,Description="Inbreeding Coefficient Binomial test (bigger is better)">
##INFO=<ID=HOB,Number=1,Type=Float,Description="Bias in the number of HOMs number (smaller is better)">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callVersion=1.0+htslib-1.0
##bcftools_callCommand=call -vmO z -o samtools_call/results.vcf.gz
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  samtools_call/pos-sorted.bam
chr1    12298469        .       T       C       10.1993 .       DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,1,0;MQ=40     GT:PL   0/1:38,3,0
chr1    55384052        .       GAA     GAAA    39.9213 .       INDEL;IDV=2;IMF=1;DP=2;VDB=0.26;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,2,0;MQ=42        GT:PL   1/1:68,6,0
chr1    97631230        .       C       T       10.1993 .       DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,1,0;MQ=42     GT:PL   0/1:38,3,0
chr1    119251615       .       C       T       10.4759 .       DP=3;VDB=0.0681597;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=15     GT:PL   1/1:38,9,0
chr1    119251625       .       G       A       10.4759 .       DP=3;VDB=0.0681597;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=15     GT:PL   1/1:38,9,0

plot statistics from the bcftools calls

bcftools stats -F <ref.fa> -s - <results.vcf.gz> > <results.vcf.gz.stats>
mkdir bcftools_plots
plot-vcfstats -p bcftools_plots/ <results.vcf.gz.stats>

plot-vcfstats command

plot-vcfstats
Usage: plot-vcfstats [OPTIONS] file.chk ...
       plot-vcfstats -p outdir/ file.chk ...
Options:
   -m, --merge                         Merge vcfstats files to STDOUT, skip plotting.
   -p, --prefix <path>                 The output files prefix, add a slash to create new directory.
   -P, --no-PDF                        Skip the PDF creation step.
   -r, --rasterize                     Rasterize PDF images for fast rendering.
   -s, --sample-names                  Use sample names for xticks rather than numeric IDs.
   -t, --title <string>                Identify files by these titles in plots. Can be given multiple times.
   -T, --main-title <string>           Main title for the PDF.
   -h, -?, --help                      This help message.

Add filter field to flag lower quality data

bcftools filter -O z -o <results-filtered.vcf.gz> -s LOWQUAL -i'%QUAL>10' <results.vcf.gz>

bcftools filter command

About:   Apply fixed-threshold filters.
Usage:   bcftools filter [options] <in.vcf.gz>

Options:
    -e, --exclude <expr>          exclude sites for which the expression is true (e.g. '%TYPE="snp" && %QUAL>=10 && (DP4[2]+DP4[3] > 2')
    -g, --SnpGap <int>            filter SNPs within <int> base pairs of an indel
    -G, --IndelGap <int>          filter clusters of indels separated by <int> or fewer base pairs allowing only one to pass
    -i, --include <expr>          include only sites for which the expression is true
    -m, --mode [+x]               "+": do not replace but add to existing FILTER; "x": reset filters at sites which pass
    -o, --output <file>           write output to a file [standard output]
    -O, --output-type <b|u|z|v>   b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
    -r, --regions <region>        restrict to comma-separated list of regions
    -R, --regions-file <file>     restrict to regions listed in a file
    -s, --soft-filter <string>    annotate FILTER column with <string> or unique filter name ("Filter%d") made up by the program ("+")
    -S, --set-GTs <.|0>           set genotypes of failed samples to missing (.) or ref (0)
    -t, --targets <region>        similar to -r but streams rather than index-jumps
    -T, --targets-file <file>     similar to -R but streams rather than index-jumps

Filter expressions may contain:
    - numerical constants, string constants, file names
        .. 1, 1.0, 1e-4
        .. "String"
        .. @file_name
    - arithmetic operators: +,*,-,/
    - comparison operators: == (same as =), >, >=, <=, <, !=
    - regex operator for string comparison: ~, !~
        .. INFO/HAYSTACK ~ "needle"
    - parentheses for grouping: (, )
    - logical operators: &&, &, ||, |
    - INFO tags, FORMAT tags, column names
        .. INFO/DP or DP
        .. FORMAT/DV, FMT/DV, or DV
        .. %FILTER=="PASS", %QUAL>10, %ID!="."
        .. %ID=@file, %ID!=@file  .. selects IDs present/absent in the file
    - 1 (or 0) to test the presence (or absence) of a flag
        .. FlagA=1 && FlagB=0
    - %TYPE for variant type in REF,ALT columns: indel,snp,mnp,ref,other
        .. %TYPE="indel" | %TYPE="snp"
    - array subscripts, * for any field:
        .. (DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0.3
        .. DP4[*]==0
    - operations on FORMAT fields: MAX, MIN, AVG
        .. %MIN(DV)>5
        .. %MIN(DV/DP)>0.3
        .. %MIN(DP)>10 & %MIN(DV)>3
        .. %QUAL>10 |  FMT/GQ>10   .. selects only GQ>10 samples
        .. %QUAL>10 || FMT/GQ>10   .. selects all samples at QUAL>10 sites

top of the bowtie2 VCF data before and after filtering

## top rows before filtering

chr1  12298469   .  T    C     10.1993  .  DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,1,0;MQ=40                                    GT:PL  0/1:38,3,0
chr1  55384052   .  GAA  GAAA  39.9213  .  INDEL;IDV=2;IMF=1;DP=2;VDB=0.26;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,2,0;MQ=42                       GT:PL  1/1:68,6,0
chr1  97631230   .  C    T     10.1993  .  DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,1,0;MQ=42                                    GT:PL  0/1:38,3,0
chr1  119251615  .  C    T     10.4759  .  DP=3;VDB=0.0681597;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=15                                    GT:PL  1/1:38,9,0
chr1  119251625  .  G    A     10.4759  .  DP=3;VDB=0.0681597;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=15                                    GT:PL  1/1:38,9,0
chr1  142537340  .  G    A     36.069   .  DP=3;VDB=0.0249187;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,3,0;MQ=34                                    GT:PL  1/1:64,9,0
chr1  142537354  .  T    G     48.0682  .  DP=3;VDB=0.0249187;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,3,0;MQ=34                                    GT:PL  1/1:76,9,0
chr1  142537398  .  T    A     17.2748  .  DP=3;VDB=0.02;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,2,0;MQ=32                                         GT:PL  1/1:45,6,0
chr1  142537406  .  A    G     21.2256  .  DP=3;VDB=0.02;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,2,0;MQ=32                                         GT:PL  1/1:49,6,0
chr1  142537556  .  A    G     52       .  DP=3;VDB=0.199299;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=34                                     GT:PL  1/1:80,9,0
chr1  142687393  .  A    T     6.71778  .  DP=12;VDB=3.87232e-06;SGB=-0.680642;MQSB=0.707404;MQ0F=0;AC=2;AN=2;DP4=0,0,9,3;MQ=6                    GT:PL  1/1:33,36,0
chr1  142687417  .  C    T     10.1923  .  DP=16;VDB=0.00584031;SGB=-0.686358;MQSB=0.606531;MQ0F=0;AC=2;AN=2;DP4=0,0,9,5;MQ=6                     GT:PL  1/1:37,42,0
chr1  143415060  .  A    G     5.15896  .  DP=9;VDB=0.0112009;SGB=-0.662043;MQSB=0;MQ0F=0.222222;AC=2;AN=2;DP4=0,0,2,7;MQ=4                       GT:PL  1/1:31,27,0
chr1  143415068  .  A    T     5.15896  .  DP=9;VDB=0.0112009;SGB=-0.662043;MQSB=0;MQ0F=0.222222;AC=2;AN=2;DP4=0,0,2,7;MQ=4                       GT:PL  1/1:31,27,0
chr1  148881181  .  A    T     5.36537  .  DP=29;VDB=0.00512804;SGB=-0.692976;RPB=1;MQB=1;MQSB=0.114218;BQB=1;MQ0F=0;AC=2;AN=2;DP4=0,1,24,2;MQ=6  GT:PL  1/1:31,67,0
chr1  148885930  .  T    C     10.7902  .  DP=6;VDB=0.0041585;SGB=-0.590765;MQ0F=0;AC=2;AN=2;DP4=0,0,5,0;MQ=10                                    GT:PL  1/1:38,15,0

## and after applying the command

chr1  12298469   .  T    C     10.1993  PASS     DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,1,0;MQ=40                                    GT:PL  0/1:38,3,0
chr1  55384052   .  GAA  GAAA  39.9213  PASS     INDEL;IDV=2;IMF=1;DP=2;VDB=0.26;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,2,0;MQ=42                       GT:PL  1/1:68,6,0
chr1  97631230   .  C    T     10.1993  PASS     DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,1,0;MQ=42                                    GT:PL  0/1:38,3,0
chr1  119251615  .  C    T     10.4759  PASS     DP=3;VDB=0.0681597;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=15                                    GT:PL  1/1:38,9,0
chr1  119251625  .  G    A     10.4759  PASS     DP=3;VDB=0.0681597;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=15                                    GT:PL  1/1:38,9,0
chr1  142537340  .  G    A     36.069   PASS     DP=3;VDB=0.0249187;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,3,0;MQ=34                                    GT:PL  1/1:64,9,0
chr1  142537354  .  T    G     48.0682  PASS     DP=3;VDB=0.0249187;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,3,0;MQ=34                                    GT:PL  1/1:76,9,0
chr1  142537398  .  T    A     17.2748  PASS     DP=3;VDB=0.02;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,2,0;MQ=32                                         GT:PL  1/1:45,6,0
chr1  142537406  .  A    G     21.2256  PASS     DP=3;VDB=0.02;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,2,0;MQ=32                                         GT:PL  1/1:49,6,0
chr1  142537556  .  A    G     52       PASS     DP=3;VDB=0.199299;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=34                                     GT:PL  1/1:80,9,0
chr1  142687393  .  A    T     6.71778  LOWQUAL  DP=12;VDB=3.87232e-06;SGB=-0.680642;MQSB=0.707404;MQ0F=0;AC=2;AN=2;DP4=0,0,9,3;MQ=6                    GT:PL  1/1:33,36,0
chr1  142687417  .  C    T     10.1923  PASS     DP=16;VDB=0.00584031;SGB=-0.686358;MQSB=0.606531;MQ0F=0;AC=2;AN=2;DP4=0,0,9,5;MQ=6                     GT:PL  1/1:37,42,0
chr1  143415060  .  A    G     5.15896  LOWQUAL  DP=9;VDB=0.0112009;SGB=-0.662043;MQSB=0;MQ0F=0.222222;AC=2;AN=2;DP4=0,0,2,7;MQ=4                       GT:PL  1/1:31,27,0
chr1  143415068  .  A    T     5.15896  LOWQUAL  DP=9;VDB=0.0112009;SGB=-0.662043;MQSB=0;MQ0F=0.222222;AC=2;AN=2;DP4=0,0,2,7;MQ=4                       GT:PL  1/1:31,27,0
chr1  148881181  .  A    T     5.36537  LOWQUAL  DP=29;VDB=0.00512804;SGB=-0.692976;RPB=1;MQB=1;MQSB=0.114218;BQB=1;MQ0F=0;AC=2;AN=2;DP4=0,1,24,2;MQ=6  GT:PL  1/1:31,67,0
chr1  148885930  .  T    C     10.7902  PASS     DP=6;VDB=0.0041585;SGB=-0.590765;MQ0F=0;AC=2;AN=2;DP4=0,0,5,0;MQ=10                                    GT:PL  1/1:38,15,0

Call with 'samtools mpileup' & 'Varscan2'

VarScan-logo.jpg

Varscan2 can be used to call variants from a samtools pileup in heuristic/statistic manner rather than using a probabilistic methods like bcftools. Reports claim that varscan2 can be as efficient as bcftools for this last step. We therefore provide the varscan2 command with default settings to allow users make their own mind. Please read more about this software on the varscan official site[2].

Technical.png Varscan2 has predefined methods for aims (incl cancer pairs), please read more on the Varscan2 pages

Call Germline variants from a mpileup

Technical.png Note the '--variants ' options that limits output to non-Ref calls

#  lines starting with '--' show here default values but can be altered to fit user needs
# samples can be renamed using the '--vcf-sample-list' parameter
 
## process with pipes to save disk space
samtools mpileup -f <ref.fa> <pos-sorted_mdup.bam> | \
    java -Xmx8G -jar $BIOTOOLS/varscan2/VarScan.v2.3.7.jar mpileup2cns \
    --min-coverage 8 \
    --min-reads2 2 \
    --min-avg-qual 15 \
    --min-var-freq 0.01 \
    --min-freq-for-hom 0.75 \
    --p-value 99e-02 \
    --strand-filter 1 \
    --variants \
    --output-vcf 1 | \
    bgzip -c > varscan-results.vcf.gz
 
# same commands in sequence
# note that we do not repeat the default options here to keep it short
#samtools mpileup -f <ref.fa> <pos-sorted_mdup.bam> > myData.pileup
#cat myData.pileup | java -Xmx8G -jar $BIOTOOLS/varscan2/VarScan.v2.3.7.jar mpileup2cns \
#    --variants \
#    --output-vcf 1 | \
#    bgzip -c > <varscan-results.vcf.gz>
 
# index results
tabix -p vcf <varscan-results.vcf.gz>

varscan2 results

## bwa varscan2 results

Only variants will be reported
Warning: No p-value threshold provided, so p-values will not be calculated
Min coverage:   8
Min reads2:     2
Min var freq:   0.2
Min avg qual:   15
P-value thresh: 0.01
Input stream not ready, waiting for 5 seconds...
Reading input from STDIN
35412026 bases in pileup file
85707 variant positions (74640 SNP, 11067 indel)
1539 were failed by the strand-filter
84168 variant positions reported (73186 SNP, 10982 indel)

## bowtie varscan2 results

Only variants will be reported
Warning: No p-value threshold provided, so p-values will not be calculated
Min coverage:   8
Min reads2:     2
Min var freq:   0.2
Min avg qual:   15
P-value thresh: 0.01
Input stream not ready, waiting for 5 seconds...
Reading input from STDIN
 35491310 bases in pileup file
85911 variant positions (74569 SNP, 11342 indel)
1346 were failed by the strand-filter
84565 variant positions reported (73303 SNP, 11262 indel)

varscan2 mpileup2cns command

varscan2 mpileup2cns
Warning: No p-value threshold provided, so p-values will not be calculated
Min coverage:   8
Min reads2:     2
Min var freq:   0.2
Min avg qual:   15
P-value thresh: 0.01
USAGE: java -jar VarScan.jar mpileup2cns [pileup file] OPTIONS
        mpileup file - The SAMtools mpileup file

        OPTIONS:
        --min-coverage  Minimum read depth at a position to make a call [8]
        --min-reads2    Minimum supporting reads at a position to call variants [2]
        --min-avg-qual  Minimum base quality at a position to count a read [15]
        --min-var-freq  Minimum variant allele frequency threshold [0.01]
        --min-freq-for-hom      Minimum frequency to call homozygote [0.75]
        --p-value       Default p-value threshold for calling variants [99e-02]
        --strand-filter Ignore variants with >90% support on one strand [1]
        --output-vcf    If set to 1, outputs in VCF format
        --vcf-sample-list       For VCF output, a list of sample names in order, one per line
        --variants      Report only variant (SNP/indel) positions [0]

plot statistics from the varscan2 calls

bcftools stats -F <ref.fa> -s - <varscan-results.vcf.gz> > <varscan-results.vcf.gz.stats>
mkdir varscan_plots
plot-vcfstats -p varscan_plots/ <varscan-results.vcf.gz.stats>

Comparing BWA & Bowtie2 results

The main question motivating this analysis was to quantify the impact of the choice of the mapper on the final result being the list of predicted variants in the NA18507 genome. The data selected for this experiment is ideal in some aspects and biased in others as detailed below.

plus points

  • this SRA experiment is of high coverage
  • the reads are paired which facilitates their unambiguous mapping
  • the read length is sufficient to favor unique mapping
  • all steps were performed in one go using the same software tools and identical reference genome (hg19)
  • bowtie2 is used here and is more compared to bwa than bowtie1 usually reported in similar comparison studies

minus points

  • the reads used here represent the chr21 subset of all reads mapped using different tools in a former mapping (against hg18) and therefore lack unmapped reads or ambiguous reads that could slightly change the results
  • additional cleanup using GATK would be good to improve the indel part of the calls and reduce FPR in the final results (but should only slightlty change the global results)

intersecting results with VCFTools

We fist extract the chr21 calls from each vcf file to compare apples with apples

filter chr21 calls

# extract the chr21 subset of calls
# vcftools --gzvcf <full-genome.vcf.gz> \
#	--chr chr21 \
#	--out <chr21.vcf.gz> \
#	--recode
 
## repeat this for both bwa and bowtie2 VCF sets flagged here with <met>
 
# bowtie calls
vcftools --gzvcf results.vcf.gz \
	--chr chr21 \
	--out chr21_ \
	--recode
 
# perform some cleanup
bgzip -c chr21_.recode.vcf > <met>_chr21_results.vcf.gz && rm chr21_.recode.vcf
tabix -p vcf <met>_chr21_results.vcf.gz
 
# varscan calls
vcftools --gzvcf varscan-results.vcf.gz \
	--chr chr21 \
	--out chr21_varscan \
	--recode
 
# perform some cleanup
bgzip -c chr21_varscan.recode.vcf > <met>_chr21_varscan-results.vcf.gz && rm chr21_varscan.recode.vcf
tabix -p vcf <met>_chr21_results.vcf.gz

We then intersect the 4 files using vcftools to get data for a Venn plot

vcftools intersect 4 chr21 call-sets

# define input files to compare
bowtie_bcftools=bowtie2_NA18507-data/bowtie2_chr21_results.vcf.gz
bowtie_varscan=bowtie2_NA18507-data/bowtie2_chr21_varscan-results.vcf.gz
bwa_bcftools=bwa_NA18507-data/bwa_chr21_results.vcf.gz
bwa_varscan=bwa_NA18507-data/bwa_chr21_varscan-results.vcf.gz
 
vcf-compare ${bowtie_bcftools} \
        ${bowtie_varscan} \
	${bwa_bcftools} \
        ${bwa_varscan} | \
	grep ^VN | cut -f 2- | column -t \
	> vcf-compare_4.txt
 
# substitute names to ease reading
perl -p -e 's:bowtie2_NA18507-data/bowtie2_chr21_results.vcf.gz:A:g' vcf-compare_4.txt | \
    perl -p -e 's:bowtie2_NA18507-data/bowtie2_chr21_varscan-results.vcf.gz:B:g' | \
    perl -p -e 's:bwa_NA18507-data/bwa_chr21_results.vcf.gz:C:g' | \
    perl -p -e 's:bwa_NA18507-data/bwa_chr21_varscan-results.vcf.gz:D:g' | column -t
 
8      A  (0.0%)   D  (0.0%)
109    A  (0.1%)   B  (0.1%)   D  (0.1%)
179    D  (0.2%)
267    A  (0.3%)   B  (0.3%)
304    A  (0.4%)   C  (0.3%)   D  (0.4%)
304    B  (0.4%)   C  (0.3%)
610    A  (0.7%)   B  (0.7%)   C  (0.7%)
684    B  (0.8%)
835    C  (0.9%)   D  (1.0%)
1942   B  (2.4%)   D  (2.4%)
2038   A  (2.5%)
3899   B  (4.8%)   C  (4.4%)   D  (4.8%)
4100   C  (4.6%)
4933   A  (6.0%)   C  (5.5%)
74160  A  (90.0%)  B  (90.5%)  C  (83.2%)  D  (91.1%)
 
# the obtained value can be fed to the in-house developed Venn plotting tool
4DVenn.R --ad.count=8 --abd.count=109 --d.count=179 --ab.count=267 --acd.count=304 --bc.count=304 \
    --abc.count=610 --b.count=684 --cd.count=835 --bd.count=1942 --a.count=2038 --bcd.count=3899 \
    --c.count=4100 --ac.count=4933 --abcd.count=74160 -t "bowtie2 bwa comparison" -x 1 -o 4Dvenn -u 2

Legend:

  • A=bowtie2_bcftools
  • B=bowtie2_varscan
  • C=bwa_bcftools
  • D=bwa_varscan


4Dvenn.png

comparing results to the validated data for NA18507

Intersecting calls should be complemented by the validation of the randomly selected variant calls. We present here only the intersection with HapMap3 data obtained from the Broad ftp site.

Comparing bowtie and bwa results to HapMap3 validated variants

filter chr21 calls

# define input files to compare
bowtie_bcftools=bowtie2_chr21_results.vcf.gz
bowtie_varscan=bowtie2_chr21_varscan-results.vcf.gz
bwa_bcftools=bwa_chr21_results.vcf.gz
bwa_varscan=bwa_chr21_varscan-results.vcf.gz
hapmap=chr21_NA18507_hapmap_3.3.hg19.vcf.gz
 
## bowtie2 data
vcf-compare ${bowtie_bcftools} \
	${bowtie_varscan} \
        ${hapmap} | \
	grep ^VN | cut -f 2- | column -t \
	> vcf-bowtie-hapmap-compare.txt
 
# substitute names to ease reading
perl -p -e 's:bowtie2_chr21_results.vcf.gz:A:g' vcf-bowtie-hapmap-compare.txt | \
    perl -p -e 's:bowtie2_chr21_varscan-results.vcf.gz:B:g' | \
    perl -p -e 's:chr21_NA18507_hapmap_3.3.hg19.vcf.gz:C:g' | \
    column -t > vcf-bowtie-hapmap-compare_ref.txt
 
# plotting
3DVenn.R --bc.count=16 --ac.count=41 --c.count=66 --b.count=6813 \
    --a.count=7242 --abc.count=9929 --ab.count=65217 \
    -t "bowtie vs HapMap3.3" -o bowtie-vs-hapmap -u 2 -x 1
 
## bwa data
vcf-compare ${bwa_bcftools} \
	${bwa_varscan} \
        ${hapmap} | \
	grep ^VN | cut -f 2- | column -t \
	> vcf-bwa-hapmap-compare.txt
 
# substitute names to ease reading
perl -p -e 's:bwa_chr21_results.vcf.gz:A:g' vcf-bwa-hapmap-compare.txt | \
    perl -p -e 's:bwa_chr21_varscan-results.vcf.gz:B:g' | \
    perl -p -e 's:chr21_NA18507_hapmap_3.3.hg19.vcf.gz:C:g' | \
    column -t > vcf-bwa-hapmap-compare_ref.txt
 
# plotting
3DVenn.R --bc.count=0 --ac.count=42 --c.count=65 --b.count=2238 \
    --a.count=9905 --abc.count=9945 --ab.count=69253 \
    -t "bwa vs HapMap3.3" -o bwa-vs-hapmap -u 2 -x 1

Legend:

  • A=bowtie2_bcftools
  • B=bowtie2_varscan
  • C=HapMap3.3


bowtie-vs-hapmap.png

Legend:

  • A=bwa_bcftools
  • B=bwa_varscan
  • C=HapMap3.3


bwa-vs-hapmap.png

Conclusion

Several conclusion can be drawn from these results:

  • the recall between both Bowtie2 ands Bwa mapping data is quite high and this is comforting for users prefering one maper tothe other. Keep however in mind that this is an ideal dataset with high amount of reads and that comes from pre-mapped data and therefore does not contain many difficult reads.
  • the variants validated by HapMap are found in very similar wether the data is obtained from bowtie or bwa.
  • the choice of the caller does not seem to cause large differences in count or recall by HapMap, suggesting that users may choose between bcftools or varscan2
  • Varscan2 calls less private variants and is 'seems' less noisy. It is likely that filtering out low quality mappings before calling could reverse this.
  • one additional control should be comparing all these calls with those obtained from the GATK post-processed workflow. This goes far beyond the simple aim of this experiment and some elements of answer can be found in our separate hands-on results presented in Hands-on_introduction_to_NGS_variant_analysis

 


References:
  1. http://www.htslib.org/workflow/
  2. http://varscan.sourceforge.net/#why-use-varscan

[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis ]