NGS-Var Exercise.4
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.3 | NGS-Var Exercise.5 ]
Call variants against the human reference genome (hg19) with [ samtools | bcftools ] or [ samtools | varscan ]
A wikibook page (in construction) is dedicated to variant calling from NGS(<http://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/DNA_Variants>).
Several methods are available to convert read mapping information to variant calls. They are all based on local pileup alignment to the reference genome and consensus extraction but different callers use different strategies to limit the effect of wrongly mapped or low quality read sequences on the final result. We used here samtools with bcftools or varscan to call SNVs and small indels rather than the more performant but more difficult to use GATK competitor (see NGS_Exercise.5_GATK).
Contents
Preliminary info
quick intro on VCF format
Since the expansion of the 1000 genome project[1], the Variant Call Format has become more and more popular and is today the default format to represent sequence variation. VCF is a tabular text format that provides rich information about each position different from the reference genome. It also includes different scores obtained during sequencing, alignment, and calling to allow quality filtering as well as added sequence annotations to allow annotation-driven filtering. Various tools have been developed to manipulate VCF data and are exemplified during this session.
An example of VCF data is provided here as a primer, users will get more information in NGS-formats or from the VCF documentation <http://en.wikipedia.org/wiki/Variant_Call_Format>[2].
Recent efforts have been made to add the capability to call complex structural variants in VCF data, please refer to <https://samtools.github.io/hts-specs/VCFv4.3.pdf>[3] describing the current VCF 4.3 revision.
About bgzip-compressed VCF data and indexing
Similarly to SAM vs BAM and in order to speed VCF compatible tools, raw VCF data needs to be sorted and indexed. We created for this purpose a simple 'bash custom function' named vcf2index that is 'defined (declared)' on your training laptop. The function will be applied below to generate binary compressed gz versions and indexes for each VCF result.
the vcf2index custom bash function
vcf2index { # takes a vcf file as argument ($1) # keeps header and sorts remaining lines # compresses sorted file with bgzip # indexes it with tabix for vcftools applications # author: Stephane Plaisance # version: 1.0 if [ $# -eq 1 ]; then title=$(basename "$1" ".vcf") ( grep ^"#" $1 | perl -pi -e "s/$title.*$/$title/"; grep -v ^"#" $1 | sort -k 1V,1 -k 2n,2 ) | bgzip -c > $1".gz" tabix -p vcf $1".gz" else echo "usage: vcf2index <vcf-file>" fi }
Call variants with samtools and bcftools (both using htslib)
This method is historically the longest in use and is still valid for most applications.
Samtools in recent versions has replaces pileup with mpileup. Keep this in mind when you read online documentation as functionalities have significantly changed with mpileup. During the workflow presented below, several steps are undergone starting with mpileup of reads, assembly of the results with bcftools to produce vcf data, and finally compress the obtained VCF (all in one smooth piped workflow to avoid producing intermediate result files and save disk space).
samtools 1.3 (htslib) arguments
Program: samtools (Tools for alignments in the SAM format) Version: 1.3 (using htslib 1.3) Usage: samtools <command> [options] Commands: -- Indexing dict create a sequence dictionary file faidx index/extract FASTA index index alignment -- Editing calmd recalculate MD/NM tags and '=' bases fixmate fix mate information reheader replace BAM header rmdup remove PCR duplicates targetcut cut fosmid regions (for fosmid pool only) addreplacerg adds or replaces RG tags -- File operations collate shuffle and group alignments by name cat concatenate BAMs merge merge sorted alignments mpileup multi-way pileup sort sort alignment file split splits a file by read group quickcheck quickly check if SAM/BAM/CRAM file appears intact fastq converts a BAM to a FASTQ fasta converts a BAM to a FASTA -- Statistics bedcov read depth per BED region depth compute the depth flagstat simple stats idxstats BAM index stats phase phase heterozygotes stats generate stats (former bamcheck) -- Viewing flags explain BAM flags tview text alignment viewer view SAM<->BAM<->CRAM conversion depad convert padded BAM to unpadded BAM
bcftools 1.3 (htslib) arguments
Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs) Version: 1.1 (using htslib 1.1) Usage: bcftools <command> <argument> Commands: -- Indexing index index VCF/BCF files -- VCF/BCF manipulation annotate annotate and edit VCF/BCF files concat concatenate VCF/BCF files from the same set of samples convert convert VCF/BCF files to different formats and back isec intersections of VCF/BCF files merge merge VCF/BCF files files from non-overlapping sample sets norm left-align and normalize indels plugin user-defined plugins query transform VCF/BCF into user-defined formats reheader modify VCF/BCF header, change sample names view VCF/BCF conversion, view, subset and filter VCF/BCF files -- VCF/BCF analysis call SNP/indel calling filter filter VCF/BCF files using fixed thresholds gtcheck check sample concordance, detect sample swaps and contamination roh identify runs of autozygosity (HMM) stats produce VCF/BCF stats Most commands accept VCF, bgzipped VCF, and BCF with the file type detected automatically even when streaming from a pipe. Indexed VCF and BCF will work in all situations. Un-indexed VCF and BCF and streams will work in most but not all situations.
bcftools call (htslib) arguments
About: SNP/indel variant calling from VCF/BCF. To be used in conjunction with samtools mpileup. This command replaces the former "bcftools view" caller. Some of the original functionality has been temporarily lost in the process of transition to htslib, but will be added back on popular demand. The original calling model can be invoked with the -c option. Usage: bcftools call [options] <in.vcf.gz> File format options: -o, --output <file> write output to a file [standard output] -O, --output-type <b|u|z|v> output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v] -r, --regions <region> restrict to comma-separated list of regions -R, --regions-file <file> restrict to regions listed in a file -s, --samples <list> list of samples to include [all samples] -S, --samples-file <file> PED file or a file with optional second column for ploidy (0, 1 or 2) [all samples] -t, --targets <region> similar to -r but streams rather than index-jumps -T, --targets-file <file> similar to -R but streams rather than index-jumps Input/output options: -A, --keep-alts keep all possible alternate alleles at variant sites -f, --format-fields <list> output format fields: GQ,GP (lowercase allowed) [] -g, --gvcf <minDP> output gVCF blocks of homozygous REF calls -i, --insert-missed output also sites missed by mpileup but present in -T -M, --keep-masked-ref keep sites with masked reference allele (REF=N) -V, --skip-variants <type> skip indels/snps -v, --variants-only output variant sites only Consensus/variant calling options: -c, --consensus-caller the original calling method (conflicts with -m) -C, --constrain <str> one of: alleles, trio (see manual) -m, --multiallelic-caller alternative model for multiallelic and rare-variant calling (conflicts with -c) -n, --novel-rate <float>,[...] likelihood of novel mutation for constrained trio calling, see man page for details [1e-8,1e-9,1e-9] -p, --pval-threshold <float> variant if P(ref|D)<FLOAT with -c [0.5] -P, --prior <float> mutation rate [1e-3] -X, --chromosome-X haploid output for male samples (requires PED file with -s) -Y, --chromosome-Y haploid output for males and skips females (requires PED file with -s)
bcftools view (htslib) arguments
About: VCF/BCF conversion, view, subset and filter VCF/BCF files. Usage: bcftools view [options] <in.vcf.gz> [region1 [...]] Output options: -G, --drop-genotypes drop individual genotype information (after subsetting if -s option set) -h/H, --header-only/--no-header print the header only/suppress the header in VCF output -l, --compression-level [0-9] compression level: 0 uncompressed, 1 best speed, 9 best compression [-1] -o, --output-file <file> output file name [stdout] -O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v] -r, --regions <region> restrict to comma-separated list of regions -R, --regions-file <file> restrict to regions listed in a file -t, --targets [^]<region> similar to -r but streams rather than index-jumps. Exclude regions with "^" prefix -T, --targets-file [^]<file> similar to -R but streams rather than index-jumps. Exclude regions with "^" prefix Subset options: -a, --trim-alt-alleles trim alternate alleles not seen in the subset -I, --no-update do not (re)calculate INFO fields for the subset (currently INFO/AC and INFO/AN) -s, --samples [^]<list> comma separated list of samples to include (or exclude with "^" prefix) -S, --samples-file [^]<file> file of samples to include (or exclude with "^" prefix) --force-samples only warn about unknown subset samples Filter options: -c/C, --min-ac/--max-ac <int>[:<type>] minimum/maximum count for non-reference (nref), 1st alternate (alt1) or minor (minor) alleles [nref] -f, --apply-filters <list> require at least one of the listed FILTER strings (e.g. "PASS,.") -g, --genotype [^]<hom|het|miss> require one or more hom/het/missing genotype or, if prefixed with "^", exclude sites with hom/het/missing genotypes -i/e, --include/--exclude <expr> select/exclude sites for which the expression is true (see man page for details) -k/n, --known/--novel select known/novel sites only (ID is not/is '.') -m/M, --min-alleles/--max-alleles <int> minimum/maximum number of alleles listed in REF and ALT (e.g. -m2 -M2 for biallelic sites) -p/P, --phased/--exclude-phased select/exclude sites where all samples are phased -q/Q, --min-af/--max-af <float>[:<type>] minimum/maximum frequency for non-reference (nref), 1st alternate (alt1) or minor (minor) alleles [nref] -u/U, --uncalled/--exclude-uncalled select/exclude sites without a called genotype -v/V, --types/--exclude-types <list> select/exclude comma-separated list of variant types: snps,indels,mnps,other [null] -x/X, --private/--exclude-private select/exclude sites where the non-reference alleles are exclusive (private) to the subset samples
bcftools filter (htslib) arguments
About: Apply fixed-threshold filters. Usage: bcftools filter [options] <in.vcf.gz> Options: -e, --exclude <expr> exclude sites for which the expression is true (see man page for details) -g, --SnpGap <int> filter SNPs within <int> base pairs of an indel -G, --IndelGap <int> filter clusters of indels separated by <int> or fewer base pairs allowing only one to pass -i, --include <expr> include only sites for which the expression is true (see man page for details -m, --mode [+x] "+": do not replace but add to existing FILTER; "x": reset filters at sites which pass -o, --output <file> write output to a file [standard output] -O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v] -r, --regions <region> restrict to comma-separated list of regions -R, --regions-file <file> restrict to regions listed in a file -s, --soft-filter <string> annotate FILTER column with <string> or unique filter name ("Filter%d") made up by the program ("+") -S, --set-GTs <.|0> set genotypes of failed samples to missing (.) or ref (0) -t, --targets <region> similar to -r but streams rather than index-jumps -T, --targets-file <file> similar to -R but streams rather than index-jumps
samtools vcfutils.pl commands
Usage: vcfutils.pl <command> [<arguments>] Command: subsam get a subset of samples listsam list the samples fillac fill the allele count field qstats SNP stats stratified by QUAL hapmap2vcf convert the hapmap format to VCF ucscsnp2vcf convert UCSC SNP SQL dump to VCF varFilter filtering short variants (*) vcf2fq VCF->fastq (**) Notes: Commands with description ending with (*) may need bcftools specific annotations. ## vcfutils.pl varFilter Usage: vcfutils.pl varFilter [options] <in.vcf> Options: -Q INT minimum RMS mapping quality for SNPs [10] -d INT minimum read depth [2] -D INT maximum read depth [10000000] -a INT minimum number of alternate bases [2] -w INT SNP within INT bp around a gap to be filtered [3] -W INT window size for filtering adjacent gaps [10] -1 FLOAT min P-value for strand bias (given PV4) [0.0001] -2 FLOAT min P-value for baseQ bias [1e-100] -3 FLOAT min P-value for mapQ bias [0] -4 FLOAT min P-value for end distance bias [0.0001] -e FLOAT min P-value for HWE (plus F<0) [0.0001] -p print filtered variants Note: Some of the filters rely on annotations generated by SAMtools/BCFtools.
Given the observed average genome coverage of ~40x, very high depth loci can only originate from copy number effects and should not be trusted. The vcfutils.pl varFilter -D1000 filtering step ensures that read depth above 1000 will not be taken into account to call variants (excessive piling up of reads is indicative of poorly mapped region). Additional filtering can be performed based on quality scores or several other useful parameters listed below. Please feel free to play with these back in the lab.
#! /usr/bin/env bash # script bcftools_call-variants.sh # required # samtools v1.x (htslib) # bcftools v1.x (htslib) # vcftools 0.1.14 # vcf2index custom function # if not done yet, source the user-defined .myfunctions (including vcf2index) # . $HOME/.myfunctions vcf2index { # takes a vcf file as argument ($1) # keeps header and sorts remaining lines # compresses sorted file with bgzip # indexes it with tabix for vcftools applications # author: Stephane Plaisance # version: 1.0 if [ $# -eq 1 ] then title=$(basename "$1" ".vcf") ( grep ^"#" $1 | perl -pi -e "s/$title.*$/$title/"; grep -v ^"#" $1 | sort -k 1V,1 -k 2n,2 ) | bgzip -c > $1".gz" tabix -p vcf $1".gz" else echo "usage: vcf2index <vcf-file>" fi } # we call from the full mapping results for chr21 # calling from the low coverage training data would not give confident calls infolder=hg19_bwa-mapping infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam name=${infile%%.bam} outfolder=bcftools_htslib_variants mkdir -p ${outfolder2} ref=ref/HiSeq_UCSC_hg19.fa # call variants from pileup, filter and convert calls to VCF format # samtools mpileup arguments # -u generate uncompress BCF output (to feed the pipe) # -f reference genome in fasta format related to the BWA index # bcftools call arguments # -v output potential variant sites only (force -c) # -c SNP calling (force -e) # -O u for bcf-uncompressed output # exclude calls where more than 1000 reads support a variation ## REM: the next command is made of two parts contained between '(...)' with separate error logs (2>...) ## only if (part-1) succeeds ('&&') => then (part-2) is started. ## part-1 creates a raw BCF file ## part-2 processes the BCF data, filters it, and creates a VCF dataset (samtools mpileup -uf ${ref} ${infolder}/${infile} | \ bcftools call -vc -O u > ${outfolder}/NA18507_var_bcftools.raw.bcf \ 2>${outfolder}/samtools_mpileup_bcftools_raw.err) && \ (bcftools view ${outfolder}/NA18507_var_bcftools.raw.bcf | \ vcfutils varFilter -D1000 > \ ${outfolder}/NA18507_var_bcftools.flt-D1000.vcf \ 2>${outfolder}/samtools_mpileup_bcftools_filter.err) # Sort, compress, and index the variant VCF files vcf2index ${outfolder}/NA18507_var_bcftools.flt-D1000.vcf && \ rm ${outfolder}/NA18507_var_bcftools.flt-D1000.vcf # extract the chr21 subset of calls vcftools --gzvcf ${outfolder}/NA18507_var_bcftools.flt-D1000.vcf.gz \ --chr chr21 \ --out ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000 \ --recode # recoding introduces the word recode at the end of the file-name # remove the recode tag from the name to simplify mv ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000.recode.vcf \ ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000.vcf # sort, compress and index using our custom 'vcf2index' function) vcf2index ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000.vcf && \ rm ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000.vcf echo
review germline SNVs and Indels calls
> cat chr21_NA18507_var_bcftools.flt-D1000.log VCFtools - 0.1.14 (C) Adam Auton and Anthony Marcketta 2009 Parameters as interpreted: --gzvcf varscan2_variants/NA18507_mpileup2cns.vcf.gz --chr chr21 --out varscan2_variants/chr21_NA18507_varscan --recode Using zlib version: 1.2.8 After filtering, kept 1 out of 1 Individuals Outputting VCF file... After filtering, kept 81460 out of a possible 84224 Sites Run Time = 2.00 seconds # count variants per chromosome infolder=bcftools_htslib_variants infile=NA18507_var_bcftools.flt-D1000.vcf.gz zgrep -v "^#" ${infolder}/${infile} | \ cut -f 1 | sort | uniq -c | sort -k 2V,2 .... 84864 chr21 ....
Call variants with samtools and varscan
While Samtools calls variants using a probabilistic approach, Varscan2 [4] instead uses a more basic approach of filtering out what users consider as background. Several cutoffs can be defined to filter-out 'bad' mappings and keep 'good' ones for calling.
default varscan2 built-in filtering parameters are:
- --min-coverage Minimum read depth at a position to make a call [8]
- --min-reads2 Minimum supporting reads at a position to call variants [2]
- --min-var-freq Minimum variant allele frequency threshold [0.20]
- --min-avg-qual Minimum base quality at a position to count a read [15]
- --min-freq-for-hom Minimum frequency to call homozygote [0.75]
- --p-value Default p-value threshold for calling variants [0.01]
- --strand-filter Ignore variants with >90% support on one strand [1]
VarScan2 comes in different flavours; a method for somatic variants, another for somatic CNV, and a more often used method for germline variants (used today).
Three VarScan subcommands will invoke the germline variant calling model
- mpileup2snp - calls single nucleotide polymorphisms (SNPs)
- mpileup2indel - calls insertions and deletions (indels)
- mpileup2cns - calls a consensus genotype (reference, SNP, or indel)
These work with single-sample and multi-sample mpileup input. The first two (mpileup2snp and mpileup2indel) report *only* positions at which a variant of the given type (SNP and indel) was called. The third command (mpileup2cns) reports all positions that met the miniumum coverage, or (with the -v parameter), all positions at which a SNP or an indel was called.
read more in the varscan2 documentation pages[5]
The following bash script was prepared to perform variant calling and extract meaningful results.
VarScan.jar mpileup2cns command parameters
USAGE: java -jar VarScan.jar mpileup2cns [pileup file] OPTIONS mpileup file - The SAMtools mpileup file OPTIONS: --min-coverage Minimum read depth at a position to make a call [8] --min-reads2 Minimum supporting reads at a position to call variants [2] --min-avg-qual Minimum base quality at a position to count a read [15] --min-var-freq Minimum variant allele frequency threshold [0.01] --min-freq-for-hom Minimum frequency to call homozygote [0.75] --p-value Default p-value threshold for calling variants [99e-02] --strand-filter Ignore variants with >90% support on one strand [1] --output-vcf If set to 1, outputs in VCF format --vcf-sample-list For VCF output, a list of sample names in order, one per line --variants Report only variant (SNP/indel) positions [0]
#! /usr/bin/env bash # script varscan2-mpileup2cns_call-variants.sh # required # samtools version: 1.x (htslib) # varscan2 version 2.4.1 # vcftools 0.1.14 # vcfutils.pl (samtools) # vcf2index custom function # make sure you have defined VARSCAN in your .bashrc # export VARSCAN=<edit here the path to VarScan.v2.4.1.jar> # if not done yet, source user-defined .myfunctions (including vcf2index) # . $HOME/.myfunctions # vcf2index is reproduced here for training vcf2index { # takes a vcf file as argument ($1) # keeps header and sorts remaining lines # compresses sorted file with bgzip # indexes it with tabix for vcftools applications # author: Stephane Plaisance # version: 1.0 if [ $# -eq 1 ]; then title=$(basename "$1" ".vcf") ( grep ^"#" $1 | perl -pi -e "s/$title.*$/$title/"; grep -v ^"#" $1 | sort -k 1V,1 -k 2n,2 ) | bgzip -c > $1".gz" tabix -p vcf $1".gz" else echo "usage: vcf2index <vcf-file>" fi } # we call from the full mapping results for chr21 # calling from the low coverage training data would not give confident calls infolder=hg19_bwa-mapping infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam outfolder=varscan2_variants mkdir -p ${outfolder} ref=ref/HiSeq_UCSC_hg19.fa # call SNV and InDels variants from samtools mpileup ## limit allocated RAM maxmem="4G" samtools mpileup -f ${ref} ${infolder}/${infile} | \ java -Xmx${maxmem} -jar $VARSCAN/VarScan.v2.4.1.jar mpileup2cns \ --variants --output-vcf 1 --p-value 0.01 \ >${outfolder}/NA18507_mpileup2cns.vcf \ 2>${outfolder}/varscan2_mpileup2cns.err ## compress and index result bgzip -c ${outfolder}/NA18507_mpileup2cns.vcf \ > ${outfolder}/NA18507_mpileup2cns.vcf.gz && tabix -p vcf ${outfolder}/NA18507_mpileup2cns.vcf.gz # extract the chr21 subset of calls vcftools --gzvcf ${outfolder}/NA18507_mpileup2cns.vcf.gz \ --chr chr21 \ --out ${outfolder}/chr21_NA18507_varscan \ --recode # recoding introduces the word recode at the end of the file-name # remove the recode tag from the name mv ${outfolder}/chr21_NA18507_varscan.recode.vcf \ ${outfolder}/chr21_NA18507_varscan.vcf # sort, compress and index using our custom 'vcf2index' function) vcf2index ${outfolder}/chr21_NA18507_varscan.vcf && \ rm ${outfolder}/chr21_NA18507_varscan.vcf echo
review germline SNVs and Indels calls
> cat varscan2_mpileup2cns.err Only variants will be reported Min coverage: 8 Min reads2: 2 Min var freq: 0.2 Min avg qual: 15 P-value thresh: 0.01 Input stream not ready, waiting for 5 seconds... Reading input from STDIN 35410265 bases in pileup file 85777 variant positions (74690 SNP, 11087 indel) 1553 were failed by the strand-filter 84224 variant positions reported (73229 SNP, 10995 indel)
Perform QC on the obtained VCF files using bcftools
As always in NGS, fresh data should always be QC'ed to avoid any surprise and intercept errors and biases asap.
We next perform quality control using the bcftools stats and plot-vcfstats commands from bcftools 1.x (htslib).
bcftools 1.x commands
Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs) Version: 1.3 (using htslib 1.3) Usage: bcftools [--version|--version-only] [--help] <command> <argument> Commands: -- Indexing index index VCF/BCF files -- VCF/BCF manipulation annotate annotate and edit VCF/BCF files concat concatenate VCF/BCF files from the same set of samples convert convert VCF/BCF files to different formats and back isec intersections of VCF/BCF files merge merge VCF/BCF files files from non-overlapping sample sets norm left-align and normalize indels plugin user-defined plugins query transform VCF/BCF into user-defined formats reheader modify VCF/BCF header, change sample names view VCF/BCF conversion, view, subset and filter VCF/BCF files -- VCF/BCF analysis call SNP/indel calling consensus create consensus sequence by applying VCF variants cnv HMM CNV calling filter filter VCF/BCF files using fixed thresholds gtcheck check sample concordance, detect sample swaps and contamination roh identify runs of autozygosity (HMM) stats produce VCF/BCF stats Most commands accept VCF, bgzipped VCF, and BCF with the file type detected automatically even when streaming from a pipe. Indexed VCF and BCF will work in all situations. Un-indexed VCF and BCF and streams will work in most but not all situations.
bcftools stats arguments
About: Parses VCF or BCF and produces stats which can be plotted using plot-vcfstats. When two files are given, the program generates separate stats for intersection and the complements. By default only sites are compared, -s/-S must given to include also sample columns. Usage: bcftools stats [options] <A.vcf.gz> [<B.vcf.gz>] Options: -1, --1st-allele-only include only 1st allele at multiallelic sites -c, --collapse <string> treat as identical records with <snps|indels|both|all|some|none>, see man page for details [none] -d, --depth <int,int,int> depth distribution: min,max,bin size [0,500,1] -e, --exclude <expr> exclude sites for which the expression is true (see man page for details) -E, --exons <file.gz> tab-delimited file with exons for indel frameshifts (chr,from,to; 1-based, inclusive, bgzip compressed) -f, --apply-filters <list> require at least one of the listed FILTER strings (e.g. "PASS,.") -F, --fasta-ref <file> faidx indexed reference sequence file to determine INDEL context -i, --include <expr> select sites for which the expression is true (see man page for details) -I, --split-by-ID collect stats for sites with ID separately (known vs novel) -r, --regions <region> restrict to comma-separated list of regions -R, --regions-file <file> restrict to regions listed in a file -s, --samples <list> list of samples for sample stats, "-" to include all samples -S, --samples-file <file> file of samples to include -t, --targets <region> similar to -r but streams rather than index-jumps -T, --targets-file <file> similar to -R but streams rather than index-jumps -u, --user-tstv <TAG[:min:max:n]> collect Ts/Tv stats for any tag using the given binning [0:1:100] -v, --verbose produce verbose per-site and per-sample output
plot-vcfstats arguments
Usage: plot-vcfstats [OPTIONS] file.chk ... plot-vcfstats -p outdir/ file.chk ... Options: -m, --merge Merge vcfstats files to STDOUT, skip plotting. -p, --prefix <path> The output files prefix, add a slash to create new directory. -P, --no-PDF Skip the PDF creation step. -r, --rasterize Rasterize PDF images for fast rendering. -s, --sample-names Use sample names for xticks rather than numeric IDs. -t, --title <string> Identify files by these titles in plots. Can be given multiple times. -T, --main-title <string> Main title for the PDF. -h, -?, --help This help message.
Assessing our full-depth calls with this utility is detailed in the next code-block.
# analysing the chr21 subset of our bwa-calls # bcftools metrics infolder=bcftools_htslib_variants bcftools stats ${infolder}/chr21_NA18507_var_bcftools.flt-D1000.vcf.gz > \ ${infolder}/chr21_NA18507_var_bcftools.flt-D1000_var.check plot-vcfstats -t "NA18507_bcftools-calls" \ -p ${infolder}/chr21_bcftools_plots/ \ ${infolder}/chr21_NA18507_var_bcftools.flt-D1000_var.check # VarScan metrics infolder=varscan2_variants bcftools stats ${infolder}/chr21_NA18507_varscan.vcf.gz > \ ${infolder}/chr21_varscan_var.check plot-vcfstats -t "NA18507_varscan-calls" \ -p ${infolder}/chr21_varscan_plots/ \ ${infolder}/chr21_varscan_var.check
If we prefer PNG files to PDF we can use ImageMagick already introduced in a former exercise. A custom pdf2png function has been prepared for you.
the pdf2png custom bash function
pdf2png () { # convert PDF to PNG (all pages with suffix) # take PDF file from 1 # use PDF file name for PNG with extension swap # or take PNG file name from $2 if [ $# -ge 1 ] then convert -density 300 -depth 8 -quality 100 -resize 25% ${1} ${2:-${1%%.pdf}.png}; else echo "usage: pdf2png <input.pdf> <opt:output.png>" fi }
The summary statistics derived from the VCF files indicate the overall quality of the mapping experiment and are reported below.
hg19 variants (bcftools) PDF | (varscan2) PDF |
---|---|
The inspection of the resulting *summary.pdf* files shows a ts/tv ratio of 2.04 for the *bcftools* method against 2.05 for the *varscan2* method. In human genome, transitions are expected twice more frequently than transversions (ts/tv~2) and ts/tv ratio of 2.4 is often reported as optimal by NGS experts. The obtained data is in range.
download exercise files
Download exercise files here
References:
- ↑ http://www.1000genomes.org
- ↑ http://en.wikipedia.org/wiki/Variant_Call_Format
- ↑ https://samtools.github.io/hts-specs/VCFv4.3.pdf
- ↑
Daniel C Koboldt, Ken Chen, Todd Wylie, David E Larson, Michael D McLellan, Elaine R Mardis, George M Weinstock, Richard K Wilson, Li Ding
VarScan: variant detection in massively parallel sequencing of individual and pooled samples.
Bioinformatics: 2009, 25(17);2283-5
[PubMed:19542151] ##WORLDCAT## [DOI] (I p) - ↑ http://dkoboldt.github.io/varscan/using-varscan.html
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.3 | NGS-Var Exercise.5 ]