NGS-Var Exercise.4

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.3 | NGS-Var Exercise.5 ]


Call variants against the human reference genome (hg19) with [ samtools | bcftools ] or [ samtools | varscan ]


ex04_wf.png

A wikibook page (in construction) is dedicated to variant calling from NGS(<http://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/DNA_Variants>).

Several methods are available to convert read mapping information to variant calls. They are all based on local pileup alignment to the reference genome and consensus extraction but different callers use different strategies to limit the effect of wrongly mapped or low quality read sequences on the final result. We used here samtools with bcftools or varscan to call SNVs and small indels rather than the more performant but more difficult to use GATK competitor (see NGS_Exercise.5_GATK).

Preliminary info

quick intro on VCF format

Since the expansion of the 1000 genome project[1], the Variant Call Format has become more and more popular and is today the default format to represent sequence variation. VCF is a tabular text format that provides rich information about each position different from the reference genome. It also includes different scores obtained during sequencing, alignment, and calling to allow quality filtering as well as added sequence annotations to allow annotation-driven filtering. Various tools have been developed to manipulate VCF data and are exemplified during this session.

An example of VCF data is provided here as a primer, users will get more information in NGS-formats or from the VCF documentation <http://en.wikipedia.org/wiki/Variant_Call_Format>[2].

Recent efforts have been made to add the capability to call complex structural variants in VCF data, please refer to <https://samtools.github.io/hts-specs/VCFv4.3.pdf>[3] describing the current VCF 4.3 revision.

About bgzip-compressed VCF data and indexing

Similarly to SAM vs BAM and in order to speed VCF compatible tools, raw VCF data needs to be sorted and indexed. We created for this purpose a simple 'bash custom function' named vcf2index that is 'defined (declared)' on your training laptop. The function will be applied below to generate binary compressed gz versions and indexes for each VCF result.

the vcf2index custom bash function

vcf2index {
# takes a vcf file as argument ($1)
# keeps header and sorts remaining lines
# compresses sorted file with bgzip
# indexes it with tabix for vcftools applications
# author: Stephane Plaisance
# version: 1.0
if [ $# -eq 1 ]; then
	title=$(basename "$1" ".vcf")
	( grep ^"#" $1 | perl -pi -e "s/$title.*$/$title/";
	grep -v ^"#" $1 | sort -k 1V,1 -k 2n,2 ) | bgzip -c > $1".gz"
	tabix -p vcf $1".gz"
else
	echo "usage: vcf2index <vcf-file>"
fi
}

 

Call variants with samtools and bcftools (both using htslib)

This method is historically the longest in use and is still valid for most applications.

Samtools in recent versions has replaces pileup with mpileup. Keep this in mind when you read online documentation as functionalities have significantly changed with mpileup. During the workflow presented below, several steps are undergone starting with mpileup of reads, assembly of the results with bcftools to produce vcf data, and finally compress the obtained VCF (all in one smooth piped workflow to avoid producing intermediate result files and save disk space).

samtools 1.3 (htslib) arguments

Program: samtools (Tools for alignments in the SAM format)
Version: 1.3 (using htslib 1.3)
 
Usage:   samtools <command> [options]
 
Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     index          index alignment
 
  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     rmdup          remove PCR duplicates
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags
 
  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA
 
  -- Statistics
     bedcov         read depth per BED region
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)
 
  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

bcftools 1.3 (htslib) arguments

Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
Version: 1.1 (using htslib 1.1)
 
Usage:   bcftools <command> <argument>
 
Commands:
 
 -- Indexing
    index        index VCF/BCF files
 
 -- VCF/BCF manipulation
    annotate     annotate and edit VCF/BCF files
    concat       concatenate VCF/BCF files from the same set of samples
    convert      convert VCF/BCF files to different formats and back
    isec         intersections of VCF/BCF files
    merge        merge VCF/BCF files files from non-overlapping sample sets
    norm         left-align and normalize indels
    plugin       user-defined plugins
    query        transform VCF/BCF into user-defined formats
    reheader     modify VCF/BCF header, change sample names
    view         VCF/BCF conversion, view, subset and filter VCF/BCF files
 
 -- VCF/BCF analysis
    call         SNP/indel calling
    filter       filter VCF/BCF files using fixed thresholds
    gtcheck      check sample concordance, detect sample swaps and contamination
    roh          identify runs of autozygosity (HMM)
    stats        produce VCF/BCF stats
 
 Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
 automatically even when streaming from a pipe. Indexed VCF and BCF will work
 in all situations. Un-indexed VCF and BCF and streams will work in most but
 not all situations.

bcftools call (htslib) arguments

About:   SNP/indel variant calling from VCF/BCF. To be used in conjunction with samtools mpileup.
         This command replaces the former "bcftools view" caller. Some of the original
         functionality has been temporarily lost in the process of transition to htslib,
         but will be added back on popular demand. The original calling model can be
         invoked with the -c option.
Usage:   bcftools call [options] <in.vcf.gz>
 
File format options:
   -o, --output <file>             write output to a file [standard output]
   -O, --output-type <b|u|z|v>     output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
   -r, --regions <region>          restrict to comma-separated list of regions
   -R, --regions-file <file>       restrict to regions listed in a file
   -s, --samples <list>            list of samples to include [all samples]
   -S, --samples-file <file>       PED file or a file with optional second column for ploidy (0, 1 or 2) [all samples]
   -t, --targets <region>          similar to -r but streams rather than index-jumps
   -T, --targets-file <file>       similar to -R but streams rather than index-jumps
 
Input/output options:
   -A, --keep-alts                 keep all possible alternate alleles at variant sites
   -f, --format-fields <list>      output format fields: GQ,GP (lowercase allowed) []
   -g, --gvcf <minDP>              output gVCF blocks of homozygous REF calls
   -i, --insert-missed             output also sites missed by mpileup but present in -T
   -M, --keep-masked-ref           keep sites with masked reference allele (REF=N)
   -V, --skip-variants <type>      skip indels/snps
   -v, --variants-only             output variant sites only
 
Consensus/variant calling options:
   -c, --consensus-caller          the original calling method (conflicts with -m)
   -C, --constrain <str>           one of: alleles, trio (see manual)
   -m, --multiallelic-caller       alternative model for multiallelic and rare-variant calling (conflicts with -c)
   -n, --novel-rate <float>,[...]  likelihood of novel mutation for constrained trio calling, see man page for details [1e-8,1e-9,1e-9]
   -p, --pval-threshold <float>    variant if P(ref|D)<FLOAT with -c [0.5]
   -P, --prior <float>             mutation rate [1e-3]
   -X, --chromosome-X              haploid output for male samples (requires PED file with -s)
   -Y, --chromosome-Y              haploid output for males and skips females (requires PED file with -s)

bcftools view (htslib) arguments

About:   VCF/BCF conversion, view, subset and filter VCF/BCF files.
Usage:   bcftools view [options] <in.vcf.gz> [region1 [...]]
 
Output options:
    -G,   --drop-genotypes              drop individual genotype information (after subsetting if -s option set)
    -h/H, --header-only/--no-header     print the header only/suppress the header in VCF output
    -l,   --compression-level [0-9]     compression level: 0 uncompressed, 1 best speed, 9 best compression [-1]
    -o,   --output-file <file>          output file name [stdout]
    -O,   --output-type <b|u|z|v>       b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
    -r, --regions <region>              restrict to comma-separated list of regions
    -R, --regions-file <file>           restrict to regions listed in a file
    -t, --targets [^]<region>           similar to -r but streams rather than index-jumps. Exclude regions with "^" prefix
    -T, --targets-file [^]<file>        similar to -R but streams rather than index-jumps. Exclude regions with "^" prefix
 
Subset options:
    -a, --trim-alt-alleles        trim alternate alleles not seen in the subset
    -I, --no-update               do not (re)calculate INFO fields for the subset (currently INFO/AC and INFO/AN)
    -s, --samples [^]<list>       comma separated list of samples to include (or exclude with "^" prefix)
    -S, --samples-file [^]<file>  file of samples to include (or exclude with "^" prefix)
        --force-samples           only warn about unknown subset samples
 
Filter options:
    -c/C, --min-ac/--max-ac <int>[:<type>]      minimum/maximum count for non-reference (nref), 1st alternate (alt1) or minor (minor) alleles [nref]
    -f,   --apply-filters <list>                require at least one of the listed FILTER strings (e.g. "PASS,.")
    -g,   --genotype [^]<hom|het|miss>          require one or more hom/het/missing genotype or, if prefixed with "^", exclude sites with hom/het/missing genotypes
    -i/e, --include/--exclude <expr>            select/exclude sites for which the expression is true (see man page for details)
    -k/n, --known/--novel                       select known/novel sites only (ID is not/is '.')
    -m/M, --min-alleles/--max-alleles <int>     minimum/maximum number of alleles listed in REF and ALT (e.g. -m2 -M2 for biallelic sites)
    -p/P, --phased/--exclude-phased             select/exclude sites where all samples are phased
    -q/Q, --min-af/--max-af <float>[:<type>]    minimum/maximum frequency for non-reference (nref), 1st alternate (alt1) or minor (minor) alleles [nref]
    -u/U, --uncalled/--exclude-uncalled         select/exclude sites without a called genotype
    -v/V, --types/--exclude-types <list>        select/exclude comma-separated list of variant types: snps,indels,mnps,other [null]
    -x/X, --private/--exclude-private           select/exclude sites where the non-reference alleles are exclusive (private) to the subset samples

bcftools filter (htslib) arguments

About:   Apply fixed-threshold filters.
Usage:   bcftools filter [options] <in.vcf.gz>
 
Options:
    -e, --exclude <expr>          exclude sites for which the expression is true (see man page for details)
    -g, --SnpGap <int>            filter SNPs within <int> base pairs of an indel
    -G, --IndelGap <int>          filter clusters of indels separated by <int> or fewer base pairs allowing only one to pass
    -i, --include <expr>          include only sites for which the expression is true (see man page for details
    -m, --mode [+x]               "+": do not replace but add to existing FILTER; "x": reset filters at sites which pass
    -o, --output <file>           write output to a file [standard output]
    -O, --output-type <b|u|z|v>   b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
    -r, --regions <region>        restrict to comma-separated list of regions
    -R, --regions-file <file>     restrict to regions listed in a file
    -s, --soft-filter <string>    annotate FILTER column with <string> or unique filter name ("Filter%d") made up by the program ("+")
    -S, --set-GTs <.|0>           set genotypes of failed samples to missing (.) or ref (0)
    -t, --targets <region>        similar to -r but streams rather than index-jumps
    -T, --targets-file <file>     similar to -R but streams rather than index-jumps

samtools vcfutils.pl commands

Usage:   vcfutils.pl <command> [<arguments>]
 
Command: subsam       get a subset of samples
         listsam      list the samples
         fillac       fill the allele count field
         qstats       SNP stats stratified by QUAL
 
         hapmap2vcf   convert the hapmap format to VCF
         ucscsnp2vcf  convert UCSC SNP SQL dump to VCF
 
         varFilter    filtering short variants (*)
         vcf2fq       VCF->fastq (**)
 
Notes: Commands with description ending with (*) may need bcftools
       specific annotations.
 
## vcfutils.pl varFilter
Usage:   vcfutils.pl varFilter [options] <in.vcf>
 
Options: -Q INT    minimum RMS mapping quality for SNPs [10]
         -d INT    minimum read depth [2]
         -D INT    maximum read depth [10000000]
         -a INT    minimum number of alternate bases [2]
         -w INT    SNP within INT bp around a gap to be filtered [3]
         -W INT    window size for filtering adjacent gaps [10]
         -1 FLOAT  min P-value for strand bias (given PV4) [0.0001]
         -2 FLOAT  min P-value for baseQ bias [1e-100]
         -3 FLOAT  min P-value for mapQ bias [0]
         -4 FLOAT  min P-value for end distance bias [0.0001]
         -e FLOAT  min P-value for HWE (plus F<0) [0.0001]
         -p        print filtered variants
 
Note: Some of the filters rely on annotations generated by SAMtools/BCFtools.

 

Given the observed average genome coverage of ~40x, very high depth loci can only originate from copy number effects and should not be trusted. The vcfutils.pl varFilter -D1000 filtering step ensures that read depth above 1000 will not be taken into account to call variants (excessive piling up of reads is indicative of poorly mapped region). Additional filtering can be performed based on quality scores or several other useful parameters listed below. Please feel free to play with these back in the lab.

 

#! /usr/bin/env bash
# script bcftools_call-variants.sh
 
# required
# samtools v1.x (htslib)
# bcftools v1.x (htslib)
# vcftools 0.1.14
# vcf2index custom function
 
# if not done yet, source the user-defined .myfunctions (including vcf2index)
# . $HOME/.myfunctions
vcf2index {
# takes a vcf file as argument ($1)
# keeps header and sorts remaining lines
# compresses sorted file with bgzip
# indexes it with tabix for vcftools applications
# author: Stephane Plaisance
# version: 1.0
if [ $# -eq 1 ]
then
title=$(basename "$1" ".vcf")
( grep ^"#" $1 | perl -pi -e "s/$title.*$/$title/";
grep -v ^"#" $1 | sort -k 1V,1 -k 2n,2 ) | bgzip -c > $1".gz"
tabix -p vcf $1".gz"
else
echo "usage: vcf2index <vcf-file>"
fi
}
 
# we call from the full mapping results for chr21
# calling from the low coverage training data would not give confident calls
infolder=hg19_bwa-mapping
infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam
name=${infile%%.bam}
outfolder=bcftools_htslib_variants
mkdir -p ${outfolder2}
 
ref=ref/HiSeq_UCSC_hg19.fa
 
# call variants from pileup, filter and convert calls to VCF format
# samtools mpileup arguments
#  -u generate uncompress BCF output (to feed the pipe)
#  -f reference genome in fasta format related to the BWA index
# bcftools call arguments
#  -v output potential variant sites only (force -c)
#  -c SNP calling (force -e)
#  -O u for bcf-uncompressed output
# exclude calls where more than 1000 reads support a variation
 
## REM: the next command is made of two parts contained between '(...)' with separate error logs (2>...)
## only if (part-1) succeeds ('&&') => then (part-2) is started.
 
## part-1 creates a raw BCF file
## part-2 processes the BCF data, filters it, and creates a VCF dataset
(samtools mpileup -uf ${ref} ${infolder}/${infile} | \
	bcftools call -vc -O u > ${outfolder}/NA18507_var_bcftools.raw.bcf \
	2>${outfolder}/samtools_mpileup_bcftools_raw.err) && \
(bcftools view ${outfolder}/NA18507_var_bcftools.raw.bcf | \
	vcfutils varFilter -D1000 > \
	${outfolder}/NA18507_var_bcftools.flt-D1000.vcf \
	2>${outfolder}/samtools_mpileup_bcftools_filter.err)
 
# Sort, compress, and index the variant VCF files
vcf2index ${outfolder}/NA18507_var_bcftools.flt-D1000.vcf && \
	rm ${outfolder}/NA18507_var_bcftools.flt-D1000.vcf
 
# extract the chr21 subset of calls
vcftools --gzvcf ${outfolder}/NA18507_var_bcftools.flt-D1000.vcf.gz \
	--chr chr21 \
	--out ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000 \
	--recode
 
# recoding introduces the word recode at the end of the file-name
# remove the recode tag from the name to simplify
mv ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000.recode.vcf \
	${outfolder}/chr21_NA18507_var_bcftools.flt-D1000.vcf
 
# sort, compress and index using our custom 'vcf2index' function)
vcf2index ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000.vcf && \
	rm ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000.vcf
 
echo

review germline SNVs and Indels calls

> cat chr21_NA18507_var_bcftools.flt-D1000.log
VCFtools - 0.1.14
(C) Adam Auton and Anthony Marcketta 2009
 
Parameters as interpreted:
	--gzvcf varscan2_variants/NA18507_mpileup2cns.vcf.gz
	--chr chr21
	--out varscan2_variants/chr21_NA18507_varscan
	--recode
 
Using zlib version: 1.2.8
After filtering, kept 1 out of 1 Individuals
Outputting VCF file...
After filtering, kept 81460 out of a possible 84224 Sites
Run Time = 2.00 seconds
 
# count variants per chromosome
infolder=bcftools_htslib_variants
infile=NA18507_var_bcftools.flt-D1000.vcf.gz
 
zgrep -v "^#" ${infolder}/${infile} | \
	cut -f 1 | sort | uniq -c | sort -k 2V,2
  ....
  84864 chr21
  ....

 

Call variants with samtools and varscan

While Samtools calls variants using a probabilistic approach, Varscan2 [4] instead uses a more basic approach of filtering out what users consider as background. Several cutoffs can be defined to filter-out 'bad' mappings and keep 'good' ones for calling.

default varscan2 built-in filtering parameters are:

  • --min-coverage Minimum read depth at a position to make a call [8]
  • --min-reads2 Minimum supporting reads at a position to call variants [2]
  • --min-var-freq Minimum variant allele frequency threshold [0.20]
  • --min-avg-qual Minimum base quality at a position to count a read [15]
  • --min-freq-for-hom Minimum frequency to call homozygote [0.75]
  • --p-value Default p-value threshold for calling variants [0.01]
  • --strand-filter Ignore variants with >90% support on one strand [1]

VarScan2 comes in different flavours; a method for somatic variants, another for somatic CNV, and a more often used method for germline variants (used today).

Three VarScan subcommands will invoke the germline variant calling model

  • mpileup2snp - calls single nucleotide polymorphisms (SNPs)
  • mpileup2indel - calls insertions and deletions (indels)
  • mpileup2cns - calls a consensus genotype (reference, SNP, or indel)

These work with single-sample and multi-sample mpileup input. The first two (mpileup2snp and mpileup2indel) report *only* positions at which a variant of the given type (SNP and indel) was called. The third command (mpileup2cns) reports all positions that met the miniumum coverage, or (with the -v parameter), all positions at which a SNP or an indel was called.

read more in the varscan2 documentation pages[5]

The following bash script was prepared to perform variant calling and extract meaningful results.


VarScan.jar mpileup2cns command parameters

USAGE: java -jar VarScan.jar mpileup2cns [pileup file] OPTIONS
        mpileup file - The SAMtools mpileup file
 
        OPTIONS:
        --min-coverage  Minimum read depth at a position to make a call [8]
        --min-reads2    Minimum supporting reads at a position to call variants [2]
        --min-avg-qual  Minimum base quality at a position to count a read [15]
        --min-var-freq  Minimum variant allele frequency threshold [0.01]
        --min-freq-for-hom      Minimum frequency to call homozygote [0.75]
        --p-value       Default p-value threshold for calling variants [99e-02]
        --strand-filter Ignore variants with >90% support on one strand [1]
        --output-vcf    If set to 1, outputs in VCF format
        --vcf-sample-list       For VCF output, a list of sample names in order, one per line
        --variants      Report only variant (SNP/indel) positions [0]

Technical.png Use the --output-vcf 1 argument to get VCF 4.1 output

#! /usr/bin/env bash
# script varscan2-mpileup2cns_call-variants.sh
 
# required
# samtools version: 1.x (htslib)
# varscan2 version 2.4.1
# vcftools 0.1.14
# vcfutils.pl (samtools)
# vcf2index custom function
 
# make sure you have defined VARSCAN in your .bashrc
# export VARSCAN=<edit here the path to VarScan.v2.4.1.jar>
 
# if not done yet, source user-defined .myfunctions (including vcf2index)
# . $HOME/.myfunctions
# vcf2index is reproduced here for training
vcf2index {
# takes a vcf file as argument ($1)
# keeps header and sorts remaining lines
# compresses sorted file with bgzip
# indexes it with tabix for vcftools applications
# author: Stephane Plaisance
# version: 1.0
if [ $# -eq 1 ]; then
	title=$(basename "$1" ".vcf")
	( grep ^"#" $1 | perl -pi -e "s/$title.*$/$title/";
	grep -v ^"#" $1 | sort -k 1V,1 -k 2n,2 ) | bgzip -c > $1".gz"
	tabix -p vcf $1".gz"
else
	echo "usage: vcf2index <vcf-file>"
fi
}
 
# we call from the full mapping results for chr21
# calling from the low coverage training data would not give confident calls
infolder=hg19_bwa-mapping
infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam
 
outfolder=varscan2_variants
mkdir -p ${outfolder}
 
ref=ref/HiSeq_UCSC_hg19.fa
 
# call SNV and InDels variants from samtools mpileup
 
## limit allocated RAM
maxmem="4G"
 
samtools mpileup -f ${ref} ${infolder}/${infile} | \
	java -Xmx${maxmem} -jar $VARSCAN/VarScan.v2.4.1.jar mpileup2cns \
	--variants --output-vcf 1 --p-value 0.01 \
	>${outfolder}/NA18507_mpileup2cns.vcf \
	2>${outfolder}/varscan2_mpileup2cns.err
 
## compress and index result
bgzip -c ${outfolder}/NA18507_mpileup2cns.vcf \
	> ${outfolder}/NA18507_mpileup2cns.vcf.gz &&
	tabix -p vcf ${outfolder}/NA18507_mpileup2cns.vcf.gz
 
# extract the chr21 subset of calls
vcftools --gzvcf ${outfolder}/NA18507_mpileup2cns.vcf.gz \
	--chr chr21 \
	--out ${outfolder}/chr21_NA18507_varscan \
	--recode
 
# recoding introduces the word recode at the end of the file-name
# remove the recode tag from the name
mv ${outfolder}/chr21_NA18507_varscan.recode.vcf \
	${outfolder}/chr21_NA18507_varscan.vcf
 
# sort, compress and index using our custom 'vcf2index' function)
vcf2index ${outfolder}/chr21_NA18507_varscan.vcf && \
	rm ${outfolder}/chr21_NA18507_varscan.vcf
 
echo

review germline SNVs and Indels calls

> cat varscan2_mpileup2cns.err
Only variants will be reported
Min coverage:   8
Min reads2:     2
Min var freq:   0.2
Min avg qual:   15
P-value thresh: 0.01
Input stream not ready, waiting for 5 seconds...
Reading input from STDIN
35410265 bases in pileup file
85777 variant positions (74690 SNP, 11087 indel)
1553 were failed by the strand-filter
84224 variant positions reported (73229 SNP, 10995 indel)


 

Perform QC on the obtained VCF files using bcftools

As always in NGS, fresh data should always be QC'ed to avoid any surprise and intercept errors and biases asap.

We next perform quality control using the bcftools stats and plot-vcfstats commands from bcftools 1.x (htslib).

bcftools 1.x commands

Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
Version: 1.3 (using htslib 1.3)
 
Usage:   bcftools [--version|--version-only] [--help] <command> <argument>
 
Commands:
 
 -- Indexing
    index        index VCF/BCF files
 
 -- VCF/BCF manipulation
    annotate     annotate and edit VCF/BCF files
    concat       concatenate VCF/BCF files from the same set of samples
    convert      convert VCF/BCF files to different formats and back
    isec         intersections of VCF/BCF files
    merge        merge VCF/BCF files files from non-overlapping sample sets
    norm         left-align and normalize indels
    plugin       user-defined plugins
    query        transform VCF/BCF into user-defined formats
    reheader     modify VCF/BCF header, change sample names
    view         VCF/BCF conversion, view, subset and filter VCF/BCF files
 
 -- VCF/BCF analysis
    call         SNP/indel calling
    consensus    create consensus sequence by applying VCF variants
    cnv          HMM CNV calling
    filter       filter VCF/BCF files using fixed thresholds
    gtcheck      check sample concordance, detect sample swaps and contamination
    roh          identify runs of autozygosity (HMM)
    stats        produce VCF/BCF stats
 
 Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
 automatically even when streaming from a pipe. Indexed VCF and BCF will work
 in all situations. Un-indexed VCF and BCF and streams will work in most but
 not all situations.

bcftools stats arguments

About:   Parses VCF or BCF and produces stats which can be plotted using plot-vcfstats.
         When two files are given, the program generates separate stats for intersection
         and the complements. By default only sites are compared, -s/-S must given to include
         also sample columns.
Usage:   bcftools stats [options] <A.vcf.gz> [<B.vcf.gz>]
 
Options:
    -1, --1st-allele-only              include only 1st allele at multiallelic sites
    -c, --collapse <string>            treat as identical records with <snps|indels|both|all|some|none>, see man page for details [none]
    -d, --depth <int,int,int>          depth distribution: min,max,bin size [0,500,1]
    -e, --exclude <expr>               exclude sites for which the expression is true (see man page for details)
    -E, --exons <file.gz>              tab-delimited file with exons for indel frameshifts (chr,from,to; 1-based, inclusive, bgzip compressed)
    -f, --apply-filters <list>         require at least one of the listed FILTER strings (e.g. "PASS,.")
    -F, --fasta-ref <file>             faidx indexed reference sequence file to determine INDEL context
    -i, --include <expr>               select sites for which the expression is true (see man page for details)
    -I, --split-by-ID                  collect stats for sites with ID separately (known vs novel)
    -r, --regions <region>             restrict to comma-separated list of regions
    -R, --regions-file <file>          restrict to regions listed in a file
    -s, --samples <list>               list of samples for sample stats, "-" to include all samples
    -S, --samples-file <file>          file of samples to include
    -t, --targets <region>             similar to -r but streams rather than index-jumps
    -T, --targets-file <file>          similar to -R but streams rather than index-jumps
    -u, --user-tstv <TAG[:min:max:n]>  collect Ts/Tv stats for any tag using the given binning [0:1:100]
    -v, --verbose                      produce verbose per-site and per-sample output

plot-vcfstats arguments

Usage: plot-vcfstats [OPTIONS] file.chk ...
       plot-vcfstats -p outdir/ file.chk ...
Options:
   -m, --merge                         Merge vcfstats files to STDOUT, skip plotting.
   -p, --prefix <path>                 The output files prefix, add a slash to create new directory.
   -P, --no-PDF                        Skip the PDF creation step.
   -r, --rasterize                     Rasterize PDF images for fast rendering.
   -s, --sample-names                  Use sample names for xticks rather than numeric IDs.
   -t, --title <string>                Identify files by these titles in plots. Can be given multiple times.
   -T, --main-title <string>           Main title for the PDF.
   -h, -?, --help                      This help message.

Assessing our full-depth calls with this utility is detailed in the next code-block.

# analysing the chr21 subset of our bwa-calls
 
# bcftools metrics
infolder=bcftools_htslib_variants
 
bcftools stats ${infolder}/chr21_NA18507_var_bcftools.flt-D1000.vcf.gz > \
	${infolder}/chr21_NA18507_var_bcftools.flt-D1000_var.check
 
plot-vcfstats -t "NA18507_bcftools-calls" \
	-p ${infolder}/chr21_bcftools_plots/ \
	${infolder}/chr21_NA18507_var_bcftools.flt-D1000_var.check
 
# VarScan metrics
infolder=varscan2_variants
 
bcftools stats ${infolder}/chr21_NA18507_varscan.vcf.gz > \
	${infolder}/chr21_varscan_var.check
 
plot-vcfstats -t "NA18507_varscan-calls" \
	-p ${infolder}/chr21_varscan_plots/ \
	${infolder}/chr21_varscan_var.check

Handicon.png Open PDF files using evince

If we prefer PNG files to PDF we can use ImageMagick already introduced in a former exercise. A custom pdf2png function has been prepared for you.

the pdf2png custom bash function

pdf2png ()
{
# convert PDF to PNG (all pages with suffix)
# take PDF file from 1
# use PDF file name for PNG with extension swap
# or take PNG file name from $2
if [ $# -ge 1 ]
then
convert -density 300 -depth 8 -quality 100 -resize 25% ${1} ${2:-${1%%.pdf}.png};
else
echo "usage: pdf2png <input.pdf> <opt:output.png>"
fi
}

The summary statistics derived from the VCF files indicate the overall quality of the mapping experiment and are reported below.

hg19 variants (bcftools) PDF (varscan2) PDF
summary-0.png
summary-0.png
summary-1.png
summary-1.png
summary-2.png
summary-2.png
summary-2.png

The inspection of the resulting *summary.pdf* files shows a ts/tv ratio of 2.04 for the *bcftools* method against 2.05 for the *varscan2* method. In human genome, transitions are expected twice more frequently than transversions (ts/tv~2) and ts/tv ratio of 2.4 is often reported as optimal by NGS experts. The obtained data is in range.

 

download exercise files

Download exercise files here

Use the right application to open the files present in ex4-files

References:
  1. http://www.1000genomes.org
  2. http://en.wikipedia.org/wiki/Variant_Call_Format
  3. https://samtools.github.io/hts-specs/VCFv4.3.pdf
  4. Daniel C Koboldt, Ken Chen, Todd Wylie, David E Larson, Michael D McLellan, Elaine R Mardis, George M Weinstock, Richard K Wilson, Li Ding
    VarScan: variant detection in massively parallel sequencing of individual and pooled samples.
    Bioinformatics: 2009, 25(17);2283-5
    [PubMed:19542151] ##WORLDCAT## [DOI] (I p)

  5. http://dkoboldt.github.io/varscan/using-varscan.html

[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.3 | NGS-Var Exercise.5 ]