NGS Exercise.5
(updated 2014 version)
Call variants by comparing the aligned reads with the reference sequence
Contents
- 1 quick intro on VCF format
- 2 About bgzip-compressed VCF data and indexing
- 3 call SNV and short indel variants
- 4 review bcftools (samtools) variant calls
- 5 perform QC on the obtained VCF files
- 6 download exercise files
quick intro on VCF format
Since the expansion of the 1000 Genomes Project[1], the Variant Call Format (VCF) has become increasingly popular and is today the default format for representing sequence variation. VCF is a tabular text format that provides rich information about each position that differs from the reference genome. It also stores the scores obtained during sequencing, alignment, and calling to support quality-based filtering, as well as added sequence annotations to support annotation-driven filtering. Various tools have been developed to manipulate VCF data; several of them are demonstrated during this session.
An example of VCF data is sketched below as a primer; more detail is given in NGS-formats and in the VCF documentation <http://en.wikipedia.org/wiki/Variant_Call_Format>[2].
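To give a feel for the format, here is a minimal, made-up VCF record (the values are illustrative only and do not come from the exercise data); the eight fixed columns are followed by a FORMAT column and one genotype column per sample:

#CHROM  POS      ID  REF  ALT  QUAL  FILTER  INFO           FORMAT    NA18507
chr21   9412676  .   A    G    222   PASS    DP=35;AF1=0.5  GT:GQ:DP  0/1:99:35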
About bgzip-compressed VCF data and indexing
Similarly to SAM versus BAM, and in order to speed up VCF-compatible tools, raw VCF data needs to be sorted, compressed, and indexed. For this purpose we created a simple custom bash function named vcf2index, which is defined (declared) on your training laptop. The function is applied below to generate bgzip-compressed versions and tabix indexes for each VCF result.
the vcf2index custom bash function
vcf2index () {
# takes a vcf file as argument ($1)
# keeps header and sorts remaining lines
# compresses sorted file with bgzip
# indexes it with tabix for vcftools applications
# author: Stephane Plaisance
# version: 1.0
if [ $# -eq 1 ]; then
    title=$(basename "$1" ".vcf");
    ( grep ^"#" $1 | perl -pi -e "s/$title.*$/$title/";
      grep -v ^"#" $1 | sort -k 1V,1 -k 2n,2 ) | bgzip -c > $1".gz";
    tabix -p vcf $1".gz";
else
    echo "usage: vcf2index <vcf-file>";
fi
}
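A quick usage check (the input file name is only a placeholder):

# sort, compress and index an arbitrary VCF file
vcf2index my_calls.vcf
# creates my_calls.vcf.gz (sorted, bgzip-compressed) and my_calls.vcf.gz.tbi (tabix index)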
Custom functions sourced from the '.bashrc' configuration file (the default behavior on the BITS laptops) are only active in interactive terminals, not from within scripts. To use your custom functions in scripts, simply source the function-containing file at the beginning of your script (our custom functions are stored under '$HOME/.myfunctions').
Effect of sourcing custom functions in bash script
#! /usr/bin/env bash
# 1) save this code in a file called 'test.sh' anywhere in your path
# 2) change the file rights to become executable: 'chmod a+x test.sh'
# 3) execute 'test.sh' and see what happens

# list vcf2index before sourcing
echo "# try before sourcing"
declare -f vcf2index

# list vcf2index again after sourcing
echo "# try again after sourcing"
. $HOME/.myfunctions
declare -f vcf2index
call SNV and short indel variants
A wikibook page (under construction) is dedicated to variant calling from NGS data (<http://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/DNA_Variants>).
Several methods are available to convert read-mapping information into variant calls. All of them are based on a local pileup of reads against the reference genome followed by consensus extraction, but different callers use different strategies to limit the effect of wrongly mapped or low-quality reads on the final result. Here we use samtools and bcftools to call SNVs and indels rather than the more powerful but harder-to-use GATK (see NGS_Exercise.5_GATK).
call variants with samtools and bcftools (samtools version)
Recent versions of samtools have replaced pileup with mpileup; keep this in mind when reading online documentation, as functionality changed significantly with mpileup. The workflow presented below chains several steps: piling up reads with mpileup, converting the results into VCF calls with bcftools, and finally compressing the obtained VCF, all in one piped workflow to avoid intermediate files and save disk space (a bare-bones sketch of the pipe is shown below).
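Before looking at the individual manpages, here is a minimal, hedged sketch of the pipe used below (file names are placeholders; the real commands and options are detailed in the full script):

# pile up reads, call variants, then filter; a bare-bones version of the pipeline below
samtools mpileup -uf reference.fa sample.bam | \
    bcftools view -bvcg - > sample.raw.bcf && \
bcftools view sample.raw.bcf | \
    vcfutils.pl varFilter -D1000 > sample.flt-D1000.vcf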
various manpages
samtools mpileup arguments
Usage: samtools mpileup [options] in1.bam [in2.bam [...]]

Input options:
       -6           assume the quality is in the Illumina-1.3+ encoding
       -A           count anomalous read pairs
       -B           disable BAQ computation
       -b FILE      list of input BAM filenames, one per line [null]
       -C INT       parameter for adjusting mapQ; 0 to disable [0]
       -d INT       max per-BAM depth to avoid excessive memory usage [250]
       -E           recalculate extended BAQ on the fly thus ignoring existing BQs
       -f FILE      faidx indexed reference sequence file [null]
       -G FILE      exclude read groups listed in FILE [null]
       -l FILE      list of positions (chr pos) or regions (BED) [null]
       -M INT       cap mapping quality at INT [60]
       -r STR       region in which pileup is generated [null]
       -R           ignore RG tags
       -q INT       skip alignments with mapQ smaller than INT [0]
       -Q INT       skip bases with baseQ/BAQ smaller than INT [13]
       --rf INT     required flags: skip reads with mask bits unset []
       --ff INT     filter flags: skip reads with mask bits set []

Output options:
       -D           output per-sample DP in BCF (require -g/-u)
       -g           generate BCF output (genotype likelihoods)
       -O           output base positions on reads (disabled by -g/-u)
       -s           output mapping quality (disabled by -g/-u)
       -S           output per-sample strand bias P-value in BCF (require -g/-u)
       -u           generate uncompress BCF output

SNP/INDEL genotype likelihoods options (effective with `-g' or `-u'):
       -e INT       Phred-scaled gap extension seq error probability [20]
       -F FLOAT     minimum fraction of gapped reads for candidates [0.002]
       -h INT       coefficient for homopolymer errors [100]
       -I           do not perform indel calling
       -L INT       max per-sample depth for INDEL calling [250]
       -m INT       minimum gapped reads for indel candidates [1]
       -o INT       Phred-scaled gap open sequencing error probability [40]
       -p           apply -m and -F per-sample to increase sensitivity
       -P STR       comma separated list of platforms for indels [all]

Notes: Assuming diploid individuals.
bcftools view (samtools) arguments
Usage: bcftools view [options] <in.bcf> [reg]

Input/output options:
       -A        keep all possible alternate alleles at variant sites
       -b        output BCF instead of VCF
       -D FILE   sequence dictionary for VCF->BCF conversion [null]
       -F        PL generated by r921 or before (which generate old ordering)
       -G        suppress all individual genotype information
       -l FILE   list of sites (chr pos) or regions (BED) to output [all sites]
       -L        calculate LD for adjacent sites
       -N        skip sites where REF is not A/C/G/T
       -Q        output the QCALL likelihood format
       -s FILE   list of samples to use [all samples]
       -S        input is VCF
       -u        uncompressed BCF output (force -b)

Consensus/variant calling options:
       -c        SNP calling (force -e)
       -d FLOAT  skip loci where less than FLOAT fraction of samples covered [0]
       -e        likelihood based analyses
       -g        call genotypes at variant sites (force -c)
       -i FLOAT  indel-to-substitution ratio [-1]
       -I        skip indels
       -m FLOAT  alternative model for multiallelic and rare-variant calling, include if P(chi^2)>=FLOAT
       -p FLOAT  variant if P(ref|D)<FLOAT [0.5]
       -P STR    type of prior: full, cond2, flat [full]
       -t FLOAT  scaled substitution mutation rate [0.001]
       -T STR    constrained calling; STR can be: pair, trioauto, trioxd and trioxs (see manual) [null]
       -v        output potential variant sites only (force -c)

Contrast calling and association test options:
       -1 INT    number of group-1 samples [0]
       -C FLOAT  posterior constrast for LRT<FLOAT and P(ref|D)<0.5 [1]
       -U INT    number of permutations for association testing (effective with -1) [0]
       -X FLOAT  only perform permutations for P(chi^2)<FLOAT [0.01]
The classical method
Given the observed average genome coverage of ~40x, loci with very high depth most likely originate from copy-number effects and should not be trusted. The vcfutils.pl varFilter -D1000 step therefore filters out calls supported by more than 1000 reads (excessive piling up of reads is indicative of a poorly mapped region). Additional filtering can be performed based on quality scores and several other varFilter parameters (a hedged reminder is sketched below); please feel free to play with these back in the lab.
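As a reminder of the main varFilter knobs (flag meanings as recalled from the samtools 0.1.19 vcfutils.pl usage message; run vcfutils.pl varFilter without arguments to see the authoritative list):

# hedged example: tighten filtering beyond the -D1000 maximum-depth cutoff
# -d minimum read depth, -D maximum read depth, -Q minimum RMS mapping quality for SNPs
bcftools view sample.raw.bcf | \
    vcfutils.pl varFilter -d 10 -D 1000 -Q 20 > sample.flt.vcf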
#! /usr/bin/env bash
# script samtools_call-variants.sh
# required
# samtools Version: 0.1.19
# bcftools(_samtools) view (samtools version), vcfutils(_samtools).pl
# bcftools(_htslib) call, vcfutils(_htslib).pl

# source user-defined functions (including vcf2index)
. /home/bits/.myfunctions

# we call from the full mapping results for chr21 provided in 'Final_Results'
# calling from the low coverage training data would not give confident calls
infolder=hg19_bwa-mapping/full_chromosome
name=shuffled_PE_NA18507_GAIIx_100_chr21
# shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_chr21.bam
# shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam

outfolder=bcftools_samtools_variants
mkdir -p ${outfolder}

ref=ref/HiSeq_UCSC_hg19.fa

# call variants from pileup, filter and convert calls to VCF format
# exclude calls where more than 1000 reads support a variation
for met in "aln" "mem"; do

# samtools mpileup arguments
# -u generate uncompressed BCF output (to feed the pipe)
# (not used) -r region in which pileup is generated (chr21)
# -f reference genome in fasta format related to the BWA index

# bcftools view (samtools version) arguments
# -b output BCF instead of VCF
# -v output potential variant sites only (force -c)
# -c SNP calling (force -e)
# -e likelihood based analyses (inherited from -c)
# -g call genotypes at variant sites (force -c)

## REM: the next command is made of two parts contained between '(...)' with separate error logs (2>...)
## only if (part-1) succeeds ('&&') => then (part-2) is started.
## part-1 creates a raw BCF file which is then filtered in part-2 to lead to the VCF data
(samtools mpileup -uf ${ref} ${infolder}/${name}_${met}-pe_mdup_chr21.bam | \
    bcftools view -bvceg - > \
    ${outfolder}/NA18507_${met}_var.raw.bcf \
    2>${outfolder}/samtools_mpileup_${met}_raw.err) && \
(bcftools view ${outfolder}/NA18507_${met}_var.raw.bcf | \
    vcfutils.pl varFilter -D1000 > \
    ${outfolder}/NA18507_${met}_var.flt-D1000.vcf \
    2>${outfolder}/samtools_mpileup_${met}_filter.err)

# sort, compress, and index the variant VCF files
vcf2index ${outfolder}/NA18507_${met}_var.flt-D1000.vcf && \
    rm ${outfolder}/NA18507_${met}_var.flt-D1000.vcf

# extract the chr21 subset of calls
vcftools --gzvcf ${outfolder}/NA18507_${met}_var.flt-D1000.vcf.gz \
    --chr chr21 \
    --out ${outfolder}/chr21_NA18507_${met}_var.flt-D1000 \
    --recode

# recoding introduces the word recode at the end of the file-name
# remove the recode tag from the name
mv ${outfolder}/chr21_NA18507_${met}_var.flt-D1000.recode.vcf \
    ${outfolder}/chr21_NA18507_${met}_var.flt-D1000.vcf

# sort, compress and index using our custom 'vcf2index' function
vcf2index ${outfolder}/chr21_NA18507_${met}_var.flt-D1000.vcf && \
    rm ${outfolder}/chr21_NA18507_${met}_var.flt-D1000.vcf

done
call variants with samtools version 1.x and bcftools 1.x (both using htslib)
various manpages
samtools 1.3 (htslib) arguments
Program: samtools (Tools for alignments in the SAM format)
Version: 1.3 (using htslib 1.3)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     rmdup          remove PCR duplicates
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA

  -- Statistics
     bedcov         read depth per BED region
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM
bcftools 1.x (htslib) arguments
Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
Version: 1.1 (using htslib 1.1)

Usage:   bcftools <command> <argument>

Commands:
  -- Indexing
     index        index VCF/BCF files

  -- VCF/BCF manipulation
     annotate     annotate and edit VCF/BCF files
     concat       concatenate VCF/BCF files from the same set of samples
     convert      convert VCF/BCF files to different formats and back
     isec         intersections of VCF/BCF files
     merge        merge VCF/BCF files from non-overlapping sample sets
     norm         left-align and normalize indels
     plugin       user-defined plugins
     query        transform VCF/BCF into user-defined formats
     reheader     modify VCF/BCF header, change sample names
     view         VCF/BCF conversion, view, subset and filter VCF/BCF files

  -- VCF/BCF analysis
     call         SNP/indel calling
     filter       filter VCF/BCF files using fixed thresholds
     gtcheck      check sample concordance, detect sample swaps and contamination
     roh          identify runs of autozygosity (HMM)
     stats        produce VCF/BCF stats

Most commands accept VCF, bgzipped VCF, and BCF with the file type detected automatically
even when streaming from a pipe. Indexed VCF and BCF will work in all situations.
Un-indexed VCF and BCF and streams will work in most but not all situations.
bcftools call (htslib) arguments
About:   SNP/indel variant calling from VCF/BCF. To be used in conjunction with samtools mpileup.
         This command replaces the former "bcftools view" caller. Some of the original functionality
         has been temporarily lost in the process of transition to htslib, but will be added back on
         popular demand. The original calling model can be invoked with the -c option.
Usage:   bcftools call [options] <in.vcf.gz>

File format options:
   -o, --output <file>             write output to a file [standard output]
   -O, --output-type <b|u|z|v>     output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
   -r, --regions <region>          restrict to comma-separated list of regions
   -R, --regions-file <file>       restrict to regions listed in a file
   -s, --samples <list>            list of samples to include [all samples]
   -S, --samples-file <file>       PED file or a file with optional second column for ploidy (0, 1 or 2) [all samples]
   -t, --targets <region>          similar to -r but streams rather than index-jumps
   -T, --targets-file <file>       similar to -R but streams rather than index-jumps

Input/output options:
   -A, --keep-alts                 keep all possible alternate alleles at variant sites
   -f, --format-fields <list>      output format fields: GQ,GP (lowercase allowed) []
   -g, --gvcf <minDP>              output gVCF blocks of homozygous REF calls
   -i, --insert-missed             output also sites missed by mpileup but present in -T
   -M, --keep-masked-ref           keep sites with masked reference allele (REF=N)
   -V, --skip-variants <type>      skip indels/snps
   -v, --variants-only             output variant sites only

Consensus/variant calling options:
   -c, --consensus-caller          the original calling method (conflicts with -m)
   -C, --constrain <str>           one of: alleles, trio (see manual)
   -m, --multiallelic-caller       alternative model for multiallelic and rare-variant calling (conflicts with -c)
   -n, --novel-rate <float>,[...]  likelihood of novel mutation for constrained trio calling, see man page for details [1e-8,1e-9,1e-9]
   -p, --pval-threshold <float>    variant if P(ref|D)<FLOAT with -c [0.5]
   -P, --prior <float>             mutation rate [1e-3]
   -X, --chromosome-X              haploid output for male samples (requires PED file with -s)
   -Y, --chromosome-Y              haploid output for males and skips females (requires PED file with -s)
bcftools view (htslib) arguments
About:   VCF/BCF conversion, view, subset and filter VCF/BCF files.
Usage:   bcftools view [options] <in.vcf.gz> [region1 [...]]

Output options:
    -G, --drop-genotypes              drop individual genotype information (after subsetting if -s option set)
    -h/H, --header-only/--no-header   print the header only/suppress the header in VCF output
    -l, --compression-level [0-9]     compression level: 0 uncompressed, 1 best speed, 9 best compression [-1]
    -o, --output-file <file>          output file name [stdout]
    -O, --output-type <b|u|z|v>       b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
    -r, --regions <region>            restrict to comma-separated list of regions
    -R, --regions-file <file>         restrict to regions listed in a file
    -t, --targets [^]<region>         similar to -r but streams rather than index-jumps. Exclude regions with "^" prefix
    -T, --targets-file [^]<file>      similar to -R but streams rather than index-jumps. Exclude regions with "^" prefix

Subset options:
    -a, --trim-alt-alleles        trim alternate alleles not seen in the subset
    -I, --no-update               do not (re)calculate INFO fields for the subset (currently INFO/AC and INFO/AN)
    -s, --samples [^]<list>       comma separated list of samples to include (or exclude with "^" prefix)
    -S, --samples-file [^]<file>  file of samples to include (or exclude with "^" prefix)
        --force-samples           only warn about unknown subset samples

Filter options:
    -c/C, --min-ac/--max-ac <int>[:<type>]    minimum/maximum count for non-reference (nref), 1st alternate (alt1) or minor (minor) alleles [nref]
    -f, --apply-filters <list>                require at least one of the listed FILTER strings (e.g. "PASS,.")
    -g, --genotype [^]<hom|het|miss>          require one or more hom/het/missing genotype or, if prefixed with "^", exclude sites with hom/het/missing genotypes
    -i/e, --include/--exclude <expr>          select/exclude sites for which the expression is true (see man page for details)
    -k/n, --known/--novel                     select known/novel sites only (ID is not/is '.')
    -m/M, --min-alleles/--max-alleles <int>   minimum/maximum number of alleles listed in REF and ALT (e.g. -m2 -M2 for biallelic sites)
    -p/P, --phased/--exclude-phased           select/exclude sites where all samples are phased
    -q/Q, --min-af/--max-af <float>[:<type>]  minimum/maximum frequency for non-reference (nref), 1st alternate (alt1) or minor (minor) alleles [nref]
    -u/U, --uncalled/--exclude-uncalled       select/exclude sites without a called genotype
    -v/V, --types/--exclude-types <list>      select/exclude comma-separated list of variant types: snps,indels,mnps,other [null]
    -x/X, --private/--exclude-private         select/exclude sites where the non-reference alleles are exclusive (private) to the subset samples
bcftools filter (htslib) arguments
About:   Apply fixed-threshold filters.
Usage:   bcftools filter [options] <in.vcf.gz>

Options:
    -e, --exclude <expr>          exclude sites for which the expression is true (see man page for details)
    -g, --SnpGap <int>            filter SNPs within <int> base pairs of an indel
    -G, --IndelGap <int>          filter clusters of indels separated by <int> or fewer base pairs allowing only one to pass
    -i, --include <expr>          include only sites for which the expression is true (see man page for details)
    -m, --mode [+x]               "+": do not replace but add to existing FILTER; "x": reset filters at sites which pass
    -o, --output <file>           write output to a file [standard output]
    -O, --output-type <b|u|z|v>   b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
    -r, --regions <region>        restrict to comma-separated list of regions
    -R, --regions-file <file>     restrict to regions listed in a file
    -s, --soft-filter <string>    annotate FILTER column with <string> or unique filter name ("Filter%d") made up by the program ("+")
    -S, --set-GTs <.|0>           set genotypes of failed samples to missing (.) or ref (0)
    -t, --targets <region>        similar to -r but streams rather than index-jumps
    -T, --targets-file <file>     similar to -R but streams rather than index-jumps
the script using the recent htslib versions of samtools and bcftools
#! /usr/bin/env bash
# script bcftools_call-variants.sh
# required
# samtools v1.x (htslib)
# bcftools v1.x (htslib)
# vcf2index custom function

# if not done yet, source user-defined functions (including vcf2index)
# . /home/bits/.myfunctions

# we call from the full mapping results for chr21
# calling from the low coverage training data would not give confident calls
infolder=hg19_bwa-mapping/
infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam
name=${infile%%.bam}

outfolder=bcftools_htslib_variants
mkdir -p ${outfolder}

ref=ref/HiSeq_UCSC_hg19.fa

# call variants from pileup, filter and convert calls to VCF format
# samtools mpileup arguments
# -u generate uncompressed BCF output (to feed the pipe)
# -f reference genome in fasta format related to the BWA index
# bcftools call arguments
# -v output variant sites only
# -c use the original (consensus) calling method
# -O u for uncompressed BCF output
# exclude calls where more than 1000 reads support a variation

## REM: the next command is made of two parts contained between '(...)' with separate error logs (2>...)
## only if (part-1) succeeds ('&&') => then (part-2) is started.
## part-1 creates a raw BCF file
## part-2 processes the BCF data, filters it, and creates a VCF dataset
(samtools mpileup -uf ${ref} ${infolder}/${infile} | \
    bcftools call -vc -O u > ${outfolder}/NA18507_var_bcftools.raw.bcf \
    2>${outfolder}/samtools_mpileup_bcftools_raw.err) && \
(bcftools view ${outfolder}/NA18507_var_bcftools.raw.bcf | \
    vcfutils.pl varFilter -D1000 > \
    ${outfolder}/NA18507_var_bcftools.flt-D1000.vcf \
    2>${outfolder}/samtools_mpileup_bcftools_filter.err)

# sort, compress, and index the variant VCF files
vcf2index ${outfolder}/NA18507_var_bcftools.flt-D1000.vcf && \
    rm ${outfolder}/NA18507_var_bcftools.flt-D1000.vcf

# extract the chr21 subset of calls
vcftools --gzvcf ${outfolder}/NA18507_var_bcftools.flt-D1000.vcf.gz \
    --chr chr21 \
    --out ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000 \
    --recode

# recoding introduces the word recode at the end of the file-name
# remove the recode tag from the name to simplify
mv ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000.recode.vcf \
    ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000.vcf

# sort, compress and index using our custom 'vcf2index' function
vcf2index ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000.vcf && \
    rm ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000.vcf
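Note that the bcftools call help shown above also documents a newer multiallelic caller (-m). A minimal, hedged sketch of the same pipe using it and writing bgzip-compressed VCF directly (file names are placeholders, not part of the exercise script):

# call with the multiallelic model and compress/index in one go
samtools mpileup -uf ref/HiSeq_UCSC_hg19.fa sample.bam | \
    bcftools call -mv -O z -o sample_mcalls.vcf.gz && \
    tabix -p vcf sample_mcalls.vcf.gz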
call variants with VarScan2
While samtools calls variants using a probabilistic approach, VarScan2[3] uses a simpler approach based on filtering out what the user considers background noise. Several cutoffs can be defined to filter out 'bad' observations and keep 'good' ones for calling.
This method is not discussed during the session but is presented here so that you can run it back in the lab and compare its results with those obtained using samtools.
The default VarScan2 built-in filtering parameters are:
- --min-coverage Minimum read depth at a position to make a call [8]
- --min-reads2 Minimum supporting reads at a position to call variants [2]
- --min-var-freq Minimum variant allele frequency threshold [0.20]*(0.01)
- --min-avg-qual Minimum base quality at a position to count a read [15]
- --min-freq-for-hom Minimum frequency to call homozygote [0.75]?
- --p-value Default p-value threshold for calling variants [0.01]*(0.99)
- --strand-filter Ignore variants with >90% support on one strand [1]?
Remarks:
(*) values reported during a dry run are different from the ones in the manual page
(?) are not reported and their default value could not be verified
read more in the varscan2 documentation pages
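Any of these defaults can be overridden on the command line. A hedged example (the jar path variable and the cutoff values are illustrative, not prescriptive):

# tune VarScan2 cutoffs at call time; $VARSCAN points to the VarScan jar
samtools mpileup -f ref/HiSeq_UCSC_hg19.fa sample.bam | \
    java -Xmx4G -jar $VARSCAN mpileup2snp \
    --min-coverage 10 --min-var-freq 0.10 --min-avg-qual 20 \
    --output-vcf 1 > sample_varscan_snp.vcf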
available varscan methods
methods for germline variants
- mpileup2snp - calls single nucleotide polymorphisms (SNPs)
- mpileup2indel - calls insertions and deletions (indels)
- mpileup2cns - calls a consensus genotype (reference, SNP, or indel)
The first two (mpileup2snp and mpileup2indel) report *only* positions at which a variant of the given type (SNP or indel) was called. The third command (mpileup2cns) reports all positions that meet the minimum coverage or, with the -v (--variants) parameter, only the positions at which a SNP or an indel was called.
method for somatic variants
This method expects two sequencing datasets, one for the tumor sample and one for the matched 'normal' sample. The syntax of the somatic mutation calling command differs somewhat from the germline calling subcommands.
java -jar VarScan.jar somatic normal.pileup tumor.pileup output.basename
The above command reports germline, somatic, and LOH events at positions where both the normal and tumor samples have sufficient coverage (default: 8). If the --validation flag is set to 1, it also reports non-variant (reference) calls at all positions with sufficient coverage, for the purpose of refuting false positives with validation data.
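In practice the two pileup files are produced with samtools mpileup first; a hedged sketch (file and output names are placeholders, $VARSCAN points to the VarScan jar):

# generate one pileup per sample, then call somatic variants
samtools mpileup -f ref/HiSeq_UCSC_hg19.fa normal.bam > normal.pileup
samtools mpileup -f ref/HiSeq_UCSC_hg19.fa tumor.bam > tumor.pileup
java -Xmx4G -jar $VARSCAN somatic normal.pileup tumor.pileup sample_somatic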
method for somatic CNV variants
A separate method (not discussed here) calls Somatic Copy Number Alteration (CNA) from cancer/normal sample pairs.
VarScan v2.2.4 and later includes novel functionality to infer somatic copy number changes using data from matched tumor-normal pairs. This is designed specifically for exome sequencing, in which a tumor sample and its matched normal were captured and sequenced under identical conditions. By performing a pairwise comparison of read depth between the samples at each position in the exome, it becomes possible to infer relative changes in copy number in the tumor sample. The output of this tool is a set of regions with chromosome, start, stop, and log-base-2 of the copy number change, which is similar to array-based copy number data and therefore amenable to the same segmentation algorithms.
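For reference, the corresponding subcommand is copynumber, invoked on the same kind of pileup pair; a hedged sketch (check the VarScan documentation for the exact arguments and output files):

# infer relative copy number changes from a matched normal/tumor pileup pair
java -Xmx4G -jar $VARSCAN copynumber normal.pileup tumor.pileup sample_cnv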
call germline SNVs and Indels together
#! /usr/bin/env bash
# script varscan2-mpileup2cns_call-variants.sh
# required
# samtools Version: 0.1.19
# varscan2 version 2.3.7

# source user-defined functions (including vcf2index)
. /home/bits/.myfunctions

# we call from the full mapping results for chr21 provided in 'Final_Results'
# calling from the low coverage training data would not give confident calls
infolder=hg19_bwa-mapping/full_chromosome
name=shuffled_PE_NA18507_GAIIx_100_chr21
# shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_chr21.bam
# shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam

outfolder=varscan2_variants
mkdir -p ${outfolder}

ref=ref/HiSeq_UCSC_hg19.fa

# call SNV and InDel variants from samtools mpileup
for met in "aln" "mem"; do

## call both in one GO
samtools mpileup -f ${ref} ${infolder}/${name}_${met}-pe_mdup_chr21.bam | \
    java -Xmx4G -jar /usr/bin/tools/VarScan.v2.3.7.jar mpileup2cns --variants --output-vcf 1 \
    >${outfolder}/NA18507_${met}_mpileup2cns.vcf \
    2>${outfolder}/varscan2_mpileup2cns_${met}.err

## compress and index result
bgzip -c ${outfolder}/NA18507_${met}_mpileup2cns.vcf \
    > ${outfolder}/NA18507_${met}_mpileup2cns.vcf.gz && \
    tabix -p vcf ${outfolder}/NA18507_${met}_mpileup2cns.vcf.gz

# extract the chr21 subset of calls
vcftools --gzvcf ${outfolder}/NA18507_${met}_mpileup2cns.vcf.gz \
    --chr chr21 \
    --out ${outfolder}/NA18507_${met}_varscan_chr21 \
    --recode

# recoding introduces the word recode at the end of the file-name
# remove the recode tag from the name
mv ${outfolder}/NA18507_${met}_varscan_chr21.recode.vcf \
    ${outfolder}/NA18507_${met}_varscan_chr21.vcf

# sort, compress and index using our custom 'vcf2index' function
vcf2index ${outfolder}/NA18507_${met}_varscan_chr21.vcf && \
    rm ${outfolder}/NA18507_${met}_varscan_chr21.vcf

done
review germline SNVs and Indels calls
# Varscan2 results for the 'aln' mpileup
34023715 bases in pileup file
81628 variant positions (70998 SNP, 10630 indel)
931 were failed by the strand-filter
80697 variant positions reported (70141 SNP, 10556 indel)
# Varscan2 results for the 'mem' mpileup
34025577 bases in pileup file
82408 variant positions (71478 SNP, 10930 indel)
979 were failed by the strand-filter
81429 variant positions reported (70569 SNP, 10860 indel)
call germline SNVs and Indels separately
bash script to call SNVs and Indels
#! /usr/bin/env bash
# script varscan2_call-variants.sh
# required
# samtools Version: 0.1.19
# varscan2 version 2.3.6

# source user-defined functions (including vcf2index)
. $HOME/.myfunctions

# we call from the full mapping results for chr21 provided in 'Final_Results'
# calling from the low coverage training data would not give confident calls
infolder=Final_Results/hg19_bwa-mapping
name=shuffled_PE_NA18507_GAIIx_100_chr21
# shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_chr21.bam
# shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam

outfolder=varscan2_variants
mkdir -p ${outfolder}

ref=ref/HiSeq_UCSC_hg19.fa

# call SNV and InDel variants from samtools mpileup
for met in "aln" "mem"; do

## call SNVs
## send results (STDOUT) to a VCF file and comments or errors (STDERR) to a separate '.err' file
samtools_0.1.19 mpileup -f ${ref} ${infolder}/${name}_${met}-pe_mdup_chr21.bam | \
    java -Xmx4G -jar ${BIOTOOLS}/varscan2/VarScan.v2.3.6.jar mpileup2snp --variants --output-vcf 1 \
    >${outfolder}/NA18507_${met}_mpileup2snp.vcf \
    2>${outfolder}/varscan2_mpileup2snp_${met}.err

## call Indels
## send results (STDOUT) to a VCF file and comments or errors (STDERR) to a separate '.err' file
samtools_0.1.19 mpileup -f ${ref} ${infolder}/${name}_${met}-pe_mdup_chr21.bam | \
    java -Xmx4G -jar ${BIOTOOLS}/varscan2/VarScan.v2.3.6.jar mpileup2indel --variants --output-vcf 1 \
    >${outfolder}/NA18507_${met}_mpileup2indel.vcf \
    2>${outfolder}/varscan2_mpileup2indel_${met}.err

# concat, sort, compress, and index the variant VCF files
cat ${outfolder}/NA18507_${met}_mpileup2snp.vcf \
    ${outfolder}/NA18507_${met}_mpileup2indel.vcf | \
    vcf-sort -c | bgzip -c > \
    ${outfolder}/NA18507_${met}_varscan.vcf.gz && \
    tabix -p vcf ${outfolder}/NA18507_${met}_varscan.vcf.gz

# extract the chr21 subset of calls
vcftools --gzvcf ${outfolder}/NA18507_${met}_varscan.vcf.gz \
    --chr chr21 \
    --out ${outfolder}/NA18507_${met}_varscan_chr21 \
    --recode

# recoding introduces the word recode at the end of the file-name
# remove the recode tag from the name
mv ${outfolder}/NA18507_${met}_varscan_chr21.recode.vcf \
    ${outfolder}/NA18507_${met}_varscan_chr21.vcf

# sort, compress and index using our custom 'vcf2index' function
vcf2index ${outfolder}/NA18507_${met}_varscan_chr21.vcf && \
    rm ${outfolder}/NA18507_${met}_varscan_chr21.vcf

done
review germline SNVs and Indels calls
# Varscan2 results for the 'aln' mpileup
SNVs
34023715 bases in pileup file
81628 variant positions (70998 SNP, 10630 indel)
931 were failed by the strand-filter
70141 variant positions reported (70141 SNP, 0 indel)
INDELs
34023715 bases in pileup file
81628 variant positions (70998 SNP, 10630 indel)
931 were failed by the strand-filter
10556 variant positions reported (0 SNP, 10556 indel)
# Varscan2 results for the 'mem' mpileup
SNVs
34025577 bases in pileup file
82408 variant positions (71478 SNP, 10930 indel)
979 were failed by the strand-filter
70569 variant positions reported (70569 SNP, 0 indel)
INDELs
34025577 bases in pileup file
82408 variant positions (71478 SNP, 10930 indel)
979 were failed by the strand-filter
10860 variant positions reported (0 SNP, 10860 indel)
call high quality SNP and Indel variants using the Broad GATK
The Broad Institute continuously develops a framework that was used for the 1000 Genomes Project and has become a gold standard in NGS DNA analysis. The GATK toolbox is not covered in this training because of licensing limitations on the use of Broad software: any scientist from a university lab is allowed to download and use GATK, but access to the software is more complicated for us as a non-profit institute facility.
A comparative analysis of the results obtained from the bwa mem mapping followed by bcftools or GATK calling is provided in a separate document for your convenience but will not be covered in this session (NGS Exercise.5_GATK: Call variants using the GATK).
review bcftools (samtools) variant calls
Variant calls obtained after remapping chr21 reads to the full human genome are expected to also include a small number of variants on chromosomes other than chr21. This is due to the high sequence similarity between loci on different chromosomes and to the ambiguous nature of short-read mapping.
Because we took the subset of reads that re-mapped to chr21 with BWA, we will not 'see' this off-target effect here.
# analysis of 'aln' calls
infolder=bcftools_samtools_variants
infile=NA18507_aln_var.flt-D1000.vcf.gz

zgrep -v "^#" ${infolder}/${infile} | \
    cut -f 1 | sort | uniq -c | sort -k 2V,2
variant counts for the aln method
83416 chr21
As seen here, all calls are located on chr21, as expected since the reads were pre-selected based on their original chr21 mapping. When calling from a full-genome remapping, samtools reports a small number of variants on other chromosomes as well; this is common in genome sequencing and reflects the redundant nature of the human genome, where regions of high similarity present on different chromosomes render the mapping of short reads ambiguous.
# analysis of 'mem' calls
infolder=bcftools_samtools_variants
infile=NA18507_mem_var.flt-D1000.vcf.gz

zgrep -v "^#" ${infolder}/${infile} | \
    cut -f 1 | sort | uniq -c | sort -k 2V,2
variant counts for the mem method
84799 chr21
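To see how much the two call sets agree, they can be intersected with bcftools isec (listed in the bcftools command overview above); a hedged sketch, assuming both files were compressed and indexed with vcf2index and writing to an arbitrary output folder:

# intersect the 'aln' and 'mem' call sets
bcftools isec -p aln_vs_mem_isec \
    bcftools_samtools_variants/NA18507_aln_var.flt-D1000.vcf.gz \
    bcftools_samtools_variants/NA18507_mem_var.flt-D1000.vcf.gz
# 0000.vcf: records private to 'aln', 0001.vcf: private to 'mem',
# 0002.vcf/0003.vcf: records shared by both sets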
perform QC on the obtained VCF files
As always in NGS, fresh data should be QC'ed to avoid surprises and to intercept errors and biases as early as possible.
We perform the QC on the data stored in bcftools_samtools_variants; the analogous QC was run on the bcftools_htslib_variants data and the results are available on our server.
use bcftools_0.2.0 stats and plot-vcfstats_0.2.0 to perform QC on VCF data
We next perform quality control using the bcftools_0.2.0 stats and plot-vcfstats_0.2.0 commands from bcftools (htslib).
bcftools stats arguments
About:   Parses VCF or BCF and produces stats which can be plotted using plot-vcfstats.
         When two files are given, the program generates separate stats for intersection and the complements.
Usage:   bcftools stats [options] <A.vcf.gz> [<B.vcf.gz>]

Options:
    -1, --1st-allele-only              include only 1st allele at multiallelic sites
    -c, --collapse <string>            treat as identical records with <snps|indels|both|all|some|none>, see man page for details [none]
    -d, --depth <int,int,int>          depth distribution: min,max,bin size [0,500,1]
        --debug                        produce verbose per-site and per-sample output
    -e, --exons <file.gz>              tab-delimited file with exons for indel frameshifts (chr,from,to; 1-based, inclusive, bgzip compressed)
    -f, --apply-filters <list>         require at least one of the listed FILTER strings (e.g. "PASS,.")
    -F, --fasta-ref <file>             faidx indexed reference sequence file to determine INDEL context
    -i, --split-by-ID                  collect stats for sites with ID separately (known vs novel)
    -r, --regions <region>             restrict to comma-separated list of regions
    -R, --regions-file <file>          restrict to regions listed in a file
    -s, --samples <list>               list of samples for sample stats, "-" to include all samples
    -S, --samples-file <file>          file of samples to include
    -t, --targets <region>             similar to -r but streams rather than index-jumps
    -T, --targets-file <file>          similar to -R but streams rather than index-jumps
    -u, --user-tstv <TAG[:min:max:n]>  collect Ts/Tv stats for any tag using the given binning [0:1,100]
plot-vcfstats arguments
Usage:   plot-vcfstats [OPTIONS] file.chk ...
         plot-vcfstats -p outdir/ file.chk ...

Options:
    -m, --merge                 Merge vcfstats files to STDOUT, skip plotting.
    -p, --prefix <path>         The output files prefix, add a slash to create new directory.
    -P, --no-PDF                Skip the PDF creation step.
    -r, --rasterize             Rasterize PDF images for fast rendering.
    -s, --sample-names          Use sample names for xticks rather than numeric IDs.
    -t, --title <string>        Identify files by these titles in plots. Can be given multiple times.
    -T, --main-title <string>   Main title for the PDF.
    -h, -?, --help              This help message.
Assessing our full-depth calls with this utility is detailed in the next code-block.
# analysing the chr21 subset of our bwa-calls
# take the trainer data
infolder=Final_Results/bcftools_samtools_variants

# save results in the session result folder
outfolder=bcftools_samtools_variants
mkdir -p ${outfolder}

############
# aln calls
bcftools_0.2.0 stats ${infolder}/chr21_NA18507_aln_var.flt-D1000.vcf.gz > \
    ${outfolder}/chr21_aln_var.check

plot-vcfstats_0.2.0 -t "NA18507_aln-calls" \
    -p ${outfolder}/chr21_aln_plots/ \
    ${outfolder}/chr21_aln_var.check

############
# mem calls
bcftools_0.2.0 stats ${infolder}/chr21_NA18507_mem_var.flt-D1000.vcf.gz > \
    ${outfolder}/chr21_mem_var.check

plot-vcfstats_0.2.0 -t "NA18507_mem-calls" \
    -p ${outfolder}/chr21_mem_plots/ \
    ${outfolder}/chr21_mem_var.check
If we prefer PNG files to PDF, we can use ImageMagick, already introduced in a former exercise. A custom pdf2png function has been prepared for you.
the pdf2png custom bash function
pdf2png () {
# convert PDF to PNG (all pages with suffix) using ImageMagick
# take the PDF file name from $1
# reuse the PDF file name for the PNG (extension swap)
# or take the PNG file name from $2
if [ $# -ge 1 ]; then
    convert -density 300 -depth 8 -quality 100 -resize 25% ${1} ${2:-${1%%.pdf}.png};
else
    echo "usage: pdf2png <input.pdf> <opt:output.png>"
fi
}
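For example, to convert the plot-vcfstats output produced above (the summary.pdf name inside the plot folder is an assumption; adapt it to your own output):

# convert the aln QC summary to a PNG next to the PDF
pdf2png bcftools_samtools_variants/chr21_aln_plots/summary.pdf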
The summary statistics derived from the VCF file obtained with an identical command from the original hg18 data are reported below.
hg18 variants PDF
The summary statistics derived from the VCF files indicate the overall quality of the mapping experiment and are reported below.
hg19 variants (aln) PDF | (mem) PDF
Inspection of the resulting *summary.pdf* files shows a ts/tv ratio of 2.00 for the *aln* method against 1.91 for the *mem* method. In the human genome, transitions are expected to be roughly twice as frequent as transversions (ts/tv ~2). This suggests that the *aln* method is closer to the expectation and reports more true calls. A ts/tv ratio of 2.4 is often reported as optimal by NGS experts.
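The same numbers can be pulled from the .check text files without opening the PDFs; a hedged one-liner (the TSTV section label is assumed for this bcftools stats version):

# print the transition/transversion summary lines for both call sets
grep ^TSTV bcftools_samtools_variants/chr21_aln_var.check
grep ^TSTV bcftools_samtools_variants/chr21_mem_var.check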
download exercise files
Download exercise files here
References:
- [1] The 1000 Genomes Project: http://www.1000genomes.org
- [2] Variant Call Format: http://en.wikipedia.org/wiki/Variant_Call_Format
- [3] Daniel C Koboldt, Ken Chen, Todd Wylie, David E Larson, Michael D McLellan, Elaine R Mardis, George M Weinstock, Richard K Wilson, Li Ding. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 2009, 25(17):2283-5. [PubMed:19542151]