NGS Exercise.5

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.4 | NGS Exercise.5_GATK | NGS Exercise.6 ]
# updated 2014 version


Call variations by comparing the aligned reads with the reference sequence


ex05_wf.png

quick intro on VCF format

Since the expansion of the 1000 genome project[1], the Variant Call Format has become more and more popular and is today the default format to represent sequence variation. VCF is a tabular text format that provides rich information about each position different from the reference genome. It also includes different scores obtained during sequencing, alignment, and calling to allow quality filtering as well as added sequence annotations to allow annotation-driven filtering. Various tools have been developed to manipulate VCF data and are exemplified during this session.

An example of VCF data is provided here as a primer, users will get more information in NGS-formats or from the VCF documentation <http://en.wikipedia.org/wiki/Variant_Call_Format>[2].


vcf_fig1.png
vcf_fig2.png

About bgzip-compressed VCF data and indexing

Similarly to SAM vs BAM and in order to speed VCF compatible tools, raw VCF data needs to be sorted and indexed. We created for this purpose a simple 'bash custom function' named vcf2index that is 'defined (declared)' on your training laptop. The function will be applied below to generate binary compressed gz versions and indexes for each VCF result.

the vcf2index custom bash function

vcf2index ()
{
# takes a vcf file as argument ($1)
# keeps header and sorts remaining lines
# compresses sorted file with bgzip
# indexes it with tabix for vcftools applications
# author: Stephane Plaisance
# version: 1.0
    if [ $# -eq 1 ]; then
        title=$(basename "$1" ".vcf");
        ( grep ^"#" $1 | perl -pi -e "s/$title.*$/$title/";
        grep -v ^"#" $1 | sort -k 1V,1 -k 2n,2 ) | bgzip -c > $1".gz";
        tabix -p vcf $1".gz";
    else
        echo "usage: vcf2index <vcf-file>";
    fi
}

 

Technical.png Custom functions sourced from the '.bashrc' configuration file (default behavior on the BITS laptops) are only active in the terminal but not from within scripts. To use your custom functions in scripts, simply source the function-containing file in the begining of your script (our custom functions are stored under '$HOME/.myfunctions')

Effect of sourcing custom functions in bash script

! /usr/bin/env bash
# 1) save this code in a file called 'test.sh' anywhere in your path
# 2) change the file rights to become executable: 'chmod a+x test.sh'
# 3) execute 'test.sh' and see what happens
 
# list vcf2index
echo  "# try before sourcing"
declare -f vcf2index
 
# list vcf2index
echo  "# try again after sourcing"
. $HOME/.myfunctions
declare -f vcf2index

 

call SNV and short indel variants

A wikibook page (in construction) is dedicated to variant calling from NGS(<http://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/DNA_Variants>).

Several methods are available to convert read mapping information to variant calls. They are all based on local pileup alignment to the reference genome and consensus extraction but different callers use different strategies to limit the effect of wrongly mapped or low quality read sequences on the final result. We used here samtools and bcftools to call SNVs and indels rather than the more performant but difficult to use GATK competitor (see NGS_Exercise.5_GATK).

call variants with samtools and samtools bcftools

Samtools in recent versions has replaces pileup with mpileup. Keep this in mind when you read online documentation as functionalities have significantly changed with mpileup. During the workflow presented below, several steps are undergone starting with mpileup of reads, assembly of the results with bcftools to produce vcf data, and finally compress the obtained VCF (all in one smooth piped workflow to avoid producing intermediate result files and save disk space).

various manpages

samtools mpileup arguments

Usage: samtools mpileup [options] in1.bam [in2.bam [...]]
 
Input options:
 
       -6           assume the quality is in the Illumina-1.3+ encoding
       -A           count anomalous read pairs
       -B           disable BAQ computation
       -b FILE      list of input BAM filenames, one per line [null]
       -C INT       parameter for adjusting mapQ; 0 to disable [0]
       -d INT       max per-BAM depth to avoid excessive memory usage [250]
       -E           recalculate extended BAQ on the fly thus ignoring existing BQs
       -f FILE      faidx indexed reference sequence file [null]
       -G FILE      exclude read groups listed in FILE [null]
       -l FILE      list of positions (chr pos) or regions (BED) [null]
       -M INT       cap mapping quality at INT [60]
       -r STR       region in which pileup is generated [null]
       -R           ignore RG tags
       -q INT       skip alignments with mapQ smaller than INT [0]
       -Q INT       skip bases with baseQ/BAQ smaller than INT [13]
       --rf INT     required flags: skip reads with mask bits unset []
       --ff INT     filter flags: skip reads with mask bits set []
 
Output options:
 
       -D           output per-sample DP in BCF (require -g/-u)
       -g           generate BCF output (genotype likelihoods)
       -O           output base positions on reads (disabled by -g/-u)
       -s           output mapping quality (disabled by -g/-u)
       -S           output per-sample strand bias P-value in BCF (require -g/-u)
       -u           generate uncompress BCF output
 
SNP/INDEL genotype likelihoods options (effective with `-g' or `-u'):
 
       -e INT       Phred-scaled gap extension seq error probability [20]
       -F FLOAT     minimum fraction of gapped reads for candidates [0.002]
       -h INT       coefficient for homopolymer errors [100]
       -I           do not perform indel calling
       -L INT       max per-sample depth for INDEL calling [250]
       -m INT       minimum gapped reads for indel candidates [1]
       -o INT       Phred-scaled gap open sequencing error probability [40]
       -p           apply -m and -F per-sample to increase sensitivity
       -P STR       comma separated list of platforms for indels [all]
 
Notes: Assuming diploid individuals.

bcftools view (samtools) arguments

Usage: bcftools view [options] <in.bcf> [reg]
 
Input/output options:
 
       -A        keep all possible alternate alleles at variant sites
       -b        output BCF instead of VCF
       -D FILE   sequence dictionary for VCF->BCF conversion [null]
       -F        PL generated by r921 or before (which generate old ordering)
       -G        suppress all individual genotype information
       -l FILE   list of sites (chr pos) or regions (BED) to output [all sites]
       -L        calculate LD for adjacent sites
       -N        skip sites where REF is not A/C/G/T
       -Q        output the QCALL likelihood format
       -s FILE   list of samples to use [all samples]
       -S        input is VCF
       -u        uncompressed BCF output (force -b)
 
Consensus/variant calling options:
 
       -c        SNP calling (force -e)
       -d FLOAT  skip loci where less than FLOAT fraction of samples covered [0]
       -e        likelihood based analyses
       -g        call genotypes at variant sites (force -c)
       -i FLOAT  indel-to-substitution ratio [-1]
       -I        skip indels
       -m FLOAT  alternative model for multiallelic and rare-variant calling, include if P(chi^2)>=FLOAT
       -p FLOAT  variant if P(ref|D)<FLOAT [0.5]
       -P STR    type of prior: full, cond2, flat [full]
       -t FLOAT  scaled substitution mutation rate [0.001]
       -T STR    constrained calling; STR can be: pair, trioauto, trioxd and trioxs (see manual) [null]
       -v        output potential variant sites only (force -c)
 
Contrast calling and association test options:
 
       -1 INT    number of group-1 samples [0]
       -C FLOAT  posterior constrast for LRT<FLOAT and P(ref|D)<0.5 [1]
       -U INT    number of permutations for association testing (effective with -1) [0]
       -X FLOAT  only perform permutations for P(chi^2)<FLOAT [0.01]

The classical method

Given the observed average genome coverage of ~40x, very high depth loci can only originate from copy number effects and should not be trusted. The vcfutils.pl varFilter -D1000 filtering step ensures that read depth above 1000 will not be taken into account to call variants (Excessive piling up of reads is indicative of poorly mapped region). Additional filtering can be performed based on quality scores or several other useful parameters listed below. Please feel free to play with these back in the lab.

#! /usr/bin/env bash
# script samtools_call-variants.sh
 
# required
# samtools  Version: 0.1.19
# bcftools(_samtools) view (samtools version), vcfutils(_samtools).pl
# bcftools(_htslib) call, vcfutils(_htslib).pl
 
# source user-defined functions (including vcf2index)
. /home/bits/.myfunctions
 
# we call from the full mapping results for chr21 provided in 'Final_Results'
# calling from the low coverage training data would not give confident calls
infolder=hg19_bwa-mapping/full_chromosome
 
name=shuffled_PE_NA18507_GAIIx_100_chr21
# shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_chr21.bam
# shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam
 
outfolder=bcftools_samtools_variants
mkdir -p ${outfolder}
 
ref=ref/HiSeq_UCSC_hg19.fa
 
# call variants from pileup, filter and convert calls to VCF format
# exclude calls where more than 1000 reads support a variation
 
for met in "aln" "mem"; do
 
# samtools mpileup arguments
# -u generate uncompress BCF output (to feed the pipe)
# (not used) -r region in which pileup is generated (chr21)
# -f reference genome in fasta format related to the BWA index
 
# samtools bcftools view arguments
# -b output BCF instead of VCF
# -v output potential variant sites only (force -c)
# -c SNP calling (force -e)
# -e likelihood based analyses (inherited from -c)
# -g call genotypes at variant sites (force -c)
 
## REM: the next command is made of two parts contained between '(...)' with separate error logs (2>...)
## only if (part-1) succeeds ('&&') => then (part-2) is started.
## part-1 creates a raw BCF file which is then filtered in part-2 to lead to the VCF data
(samtools mpileup -uf ${ref} ${infolder}/${name}_${met}-pe_mdup_chr21.bam | \
 bcftools view -bvceg - > \
 ${outfolder}/NA18507_${met}_var.raw.bcf \
 2>${outfolder}/samtools_mpileup_${met}_raw.err) && \
(bcftools view ${outfolder}/NA18507_${met}_var.raw.bcf | \
 vcfutils.pl varFilter -D1000 > \
 ${outfolder}/NA18507_${met}_var.flt-D1000.vcf \
 2>${outfolder}/samtools_mpileup_${met}_filter.err)
 
# Sort, compress, and index the variant VCF files
vcf2index ${outfolder}/NA18507_${met}_var.flt-D1000.vcf && \
	rm ${outfolder}/NA18507_${met}_var.flt-D1000.vcf
 
# extract the chr21 subset of calls
vcftools --gzvcf ${outfolder}/NA18507_${met}_var.flt-D1000.vcf.gz \
	--chr chr21 \
	--out ${outfolder}/chr21_NA18507_${met}_var.flt-D1000 \
	--recode
 
# recoding introduces the word recode at the end of the file-name
# remove the recode tag from the name
mv ${outfolder}/chr21_NA18507_${met}_var.flt-D1000.recode.vcf \
	${outfolder}/chr21_NA18507_${met}_var.flt-D1000.vcf
 
# sort, compress and index using our custom 'vcf2index' function)
vcf2index ${outfolder}/chr21_NA18507_${met}_var.flt-D1000.vcf && \
	rm ${outfolder}/chr21_NA18507_${met}_var.flt-D1000.vcf
 
done

call variants with samtools version 1.x and bcftools 1.x (both using htslib)

various manpages

samtools 1.3 (htslib) arguments

Program: samtools (Tools for alignments in the SAM format)
Version: 1.3 (using htslib 1.3)
 
Usage:   samtools <command> [options]
 
Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     index          index alignment
 
  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     rmdup          remove PCR duplicates
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags
 
  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA
 
  -- Statistics
     bedcov         read depth per BED region
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)
 
  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

bcftools 1.3 (htslib) arguments

Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
Version: 1.1 (using htslib 1.1)
 
Usage:   bcftools <command> <argument>
 
Commands:
 
 -- Indexing
    index        index VCF/BCF files
 
 -- VCF/BCF manipulation
    annotate     annotate and edit VCF/BCF files
    concat       concatenate VCF/BCF files from the same set of samples
    convert      convert VCF/BCF files to different formats and back
    isec         intersections of VCF/BCF files
    merge        merge VCF/BCF files files from non-overlapping sample sets
    norm         left-align and normalize indels
    plugin       user-defined plugins
    query        transform VCF/BCF into user-defined formats
    reheader     modify VCF/BCF header, change sample names
    view         VCF/BCF conversion, view, subset and filter VCF/BCF files
 
 -- VCF/BCF analysis
    call         SNP/indel calling
    filter       filter VCF/BCF files using fixed thresholds
    gtcheck      check sample concordance, detect sample swaps and contamination
    roh          identify runs of autozygosity (HMM)
    stats        produce VCF/BCF stats
 
 Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
 automatically even when streaming from a pipe. Indexed VCF and BCF will work
 in all situations. Un-indexed VCF and BCF and streams will work in most but
 not all situations.

bcftools call (htslib) arguments

About:   SNP/indel variant calling from VCF/BCF. To be used in conjunction with samtools mpileup.
         This command replaces the former "bcftools view" caller. Some of the original
         functionality has been temporarily lost in the process of transition to htslib,
         but will be added back on popular demand. The original calling model can be
         invoked with the -c option.
Usage:   bcftools call [options] <in.vcf.gz>
 
File format options:
   -o, --output <file>             write output to a file [standard output]
   -O, --output-type <b|u|z|v>     output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
   -r, --regions <region>          restrict to comma-separated list of regions
   -R, --regions-file <file>       restrict to regions listed in a file
   -s, --samples <list>            list of samples to include [all samples]
   -S, --samples-file <file>       PED file or a file with optional second column for ploidy (0, 1 or 2) [all samples]
   -t, --targets <region>          similar to -r but streams rather than index-jumps
   -T, --targets-file <file>       similar to -R but streams rather than index-jumps
 
Input/output options:
   -A, --keep-alts                 keep all possible alternate alleles at variant sites
   -f, --format-fields <list>      output format fields: GQ,GP (lowercase allowed) []
   -g, --gvcf <minDP>              output gVCF blocks of homozygous REF calls
   -i, --insert-missed             output also sites missed by mpileup but present in -T
   -M, --keep-masked-ref           keep sites with masked reference allele (REF=N)
   -V, --skip-variants <type>      skip indels/snps
   -v, --variants-only             output variant sites only
 
Consensus/variant calling options:
   -c, --consensus-caller          the original calling method (conflicts with -m)
   -C, --constrain <str>           one of: alleles, trio (see manual)
   -m, --multiallelic-caller       alternative model for multiallelic and rare-variant calling (conflicts with -c)
   -n, --novel-rate <float>,[...]  likelihood of novel mutation for constrained trio calling, see man page for details [1e-8,1e-9,1e-9]
   -p, --pval-threshold <float>    variant if P(ref|D)<FLOAT with -c [0.5]
   -P, --prior <float>             mutation rate [1e-3]
   -X, --chromosome-X              haploid output for male samples (requires PED file with -s)
   -Y, --chromosome-Y              haploid output for males and skips females (requires PED file with -s)

bcftools view (htslib) arguments

About:   VCF/BCF conversion, view, subset and filter VCF/BCF files.
Usage:   bcftools view [options] <in.vcf.gz> [region1 [...]]
 
Output options:
    -G,   --drop-genotypes              drop individual genotype information (after subsetting if -s option set)
    -h/H, --header-only/--no-header     print the header only/suppress the header in VCF output
    -l,   --compression-level [0-9]     compression level: 0 uncompressed, 1 best speed, 9 best compression [-1]
    -o,   --output-file <file>          output file name [stdout]
    -O,   --output-type <b|u|z|v>       b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
    -r, --regions <region>              restrict to comma-separated list of regions
    -R, --regions-file <file>           restrict to regions listed in a file
    -t, --targets [^]<region>           similar to -r but streams rather than index-jumps. Exclude regions with "^" prefix
    -T, --targets-file [^]<file>        similar to -R but streams rather than index-jumps. Exclude regions with "^" prefix
 
Subset options:
    -a, --trim-alt-alleles        trim alternate alleles not seen in the subset
    -I, --no-update               do not (re)calculate INFO fields for the subset (currently INFO/AC and INFO/AN)
    -s, --samples [^]<list>       comma separated list of samples to include (or exclude with "^" prefix)
    -S, --samples-file [^]<file>  file of samples to include (or exclude with "^" prefix)
        --force-samples           only warn about unknown subset samples
 
Filter options:
    -c/C, --min-ac/--max-ac <int>[:<type>]      minimum/maximum count for non-reference (nref), 1st alternate (alt1) or minor (minor) alleles [nref]
    -f,   --apply-filters <list>                require at least one of the listed FILTER strings (e.g. "PASS,.")
    -g,   --genotype [^]<hom|het|miss>          require one or more hom/het/missing genotype or, if prefixed with "^", exclude sites with hom/het/missing genotypes
    -i/e, --include/--exclude <expr>            select/exclude sites for which the expression is true (see man page for details)
    -k/n, --known/--novel                       select known/novel sites only (ID is not/is '.')
    -m/M, --min-alleles/--max-alleles <int>     minimum/maximum number of alleles listed in REF and ALT (e.g. -m2 -M2 for biallelic sites)
    -p/P, --phased/--exclude-phased             select/exclude sites where all samples are phased
    -q/Q, --min-af/--max-af <float>[:<type>]    minimum/maximum frequency for non-reference (nref), 1st alternate (alt1) or minor (minor) alleles [nref]
    -u/U, --uncalled/--exclude-uncalled         select/exclude sites without a called genotype
    -v/V, --types/--exclude-types <list>        select/exclude comma-separated list of variant types: snps,indels,mnps,other [null]
    -x/X, --private/--exclude-private           select/exclude sites where the non-reference alleles are exclusive (private) to the subset samples

bcftools filter (htslib) arguments

About:   Apply fixed-threshold filters.
Usage:   bcftools filter [options] <in.vcf.gz>
 
Options:
    -e, --exclude <expr>          exclude sites for which the expression is true (see man page for details)
    -g, --SnpGap <int>            filter SNPs within <int> base pairs of an indel
    -G, --IndelGap <int>          filter clusters of indels separated by <int> or fewer base pairs allowing only one to pass
    -i, --include <expr>          include only sites for which the expression is true (see man page for details
    -m, --mode [+x]               "+": do not replace but add to existing FILTER; "x": reset filters at sites which pass
    -o, --output <file>           write output to a file [standard output]
    -O, --output-type <b|u|z|v>   b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
    -r, --regions <region>        restrict to comma-separated list of regions
    -R, --regions-file <file>     restrict to regions listed in a file
    -s, --soft-filter <string>    annotate FILTER column with <string> or unique filter name ("Filter%d") made up by the program ("+")
    -S, --set-GTs <.|0>           set genotypes of failed samples to missing (.) or ref (0)
    -t, --targets <region>        similar to -r but streams rather than index-jumps
    -T, --targets-file <file>     similar to -R but streams rather than index-jumps

the script using the recent htslib versions of samtools and bcftools

#! /usr/bin/env bash
# script bcftools_call-variants.sh
 
# required
# samtools v1.x (htslib)
# bcftools v1.x (htslib)
# vcf2index custom function
 
# make sure you have defined VARSCAN in your .bashrc
# export VARSCAN=<edit here the path to VarScan.v2.4.1.jar>
 
# if not done yet, source user-defined functions (including vcf2index)
# . /home/bits/.myfunctions
 
# we call from the full mapping results for chr21
# calling from the low coverage training data would not give confident calls
infolder=hg19_bwa-mapping/
infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam
name=${infile%%.bam}
outfolder=bcftools_htslib_variants
mkdir -p ${outfolder2}
 
ref=ref/HiSeq_UCSC_hg19.fa
 
# call variants from pileup, filter and convert calls to VCF format
# samtools mpileup arguments
#  -u generate uncompress BCF output (to feed the pipe)
#  -f reference genome in fasta format related to the BWA index
# bcftools call arguments
#  -v output potential variant sites only (force -c)
#  -c SNP calling (force -e)
#  -O u for bcf-uncompressed output
# exclude calls where more than 1000 reads support a variation
 
## REM: the next command is made of two parts contained between '(...)' with separate error logs (2>...)
## only if (part-1) succeeds ('&&') => then (part-2) is started.
 
## part-1 creates a raw BCF file
(samtools mpileup -uf ${ref} ${infolder}/${infile} | \
	bcftools call -vc -O u > ${outfolder}/NA18507_var_bcftools.raw.bcf \
	2>${outfolder}/samtools_mpileup_bcftools_raw.err) && \
 
## part-2 processes the BCF data, filters it, and creates a VCF dataset
(bcftools view ${outfolder}/NA18507_var_bcftools.raw.bcf | \
	vcfutils varFilter -D1000 > \
	${outfolder}/NA18507_var_bcftools.flt-D1000.vcf \
	2>${outfolder}/samtools_mpileup_bcftools_filter.err)
 
# Sort, compress, and index the variant VCF files
vcf2index ${outfolder}/NA18507_var_bcftools.flt-D1000.vcf && \
	rm ${outfolder}/NA18507_var_bcftools.flt-D1000.vcf
 
# extract the chr21 subset of calls
vcftools --gzvcf ${outfolder}/NA18507_var_bcftools.flt-D1000.vcf.gz \
	--chr chr21 \
	--out ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000 \
	--recode
 
# recoding introduces the word recode at the end of the file-name
# remove the recode tag from the name to simplify
mv ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000.recode.vcf \
	${outfolder}/chr21_NA18507_var_bcftools.flt-D1000.vcf
 
# sort, compress and index using our custom 'vcf2index' function)
vcf2index ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000.vcf && \
	rm ${outfolder}/chr21_NA18507_var_bcftools.flt-D1000.vcf
 
echo

call variants with varScan2

While Samtools calls variants using a probabilistic approach, Varscan2 [3] instead uses a more basic approach of filtering for what users consider as background. Several cutoffs can be defined to filter-out 'bad' mappings and keep 'good' ones for calling.

Handicon.png This method is not discussed during the session but is presented here so that you can run it back in the lab and compare results with those obtained using samtools

default varscan2 built-in filtering parameters are:

  • --min-coverage Minimum read depth at a position to make a call [8]
  • --min-reads2 Minimum supporting reads at a position to call variants [2]
  • --min-var-freq Minimum variant allele frequency threshold [0.20]*(0.01)
  • --min-avg-qual Minimum base quality at a position to count a read [15]
  • --min-freq-for-hom Minimum frequency to call homozygote [0.75]?
  • --p-value Default p-value threshold for calling variants [0.01]*(0.99)
  • --strand-filter Ignore variants with >90% support on one strand [1]?

Remarks:
(*) values reported during a dry run are different from the ones in the manual page (?) are not reported and their default value could not be verified

read more in the varscan2 documentation pages

available varscan methods

methods for germline variants

  • mpileup2snp - calls single nucleotide polymorphisms (SNPs)
  • mpileup2indel - calls insertions and deletions (indels)
  • mpileup2cns - calls a consensus genotype (reference, SNP, or indel)

The first two (mpileup2snp and mpileup2indel) report *only* positions at which a variant of the given type (SNP and indel) was called. The third command (mpileup2cns) reports all positions that met the miniumum coverage, or (with the -v parameter), all positions at which a SNP or an indel was called.

Technical.png Use the --output-vcf 1 argument to get VCF 4.1 output

method for somatic variants

this method expects two sequencing datasets, one for the tumor sample and one for the 'normal' matched sample. The syntax of the command for somatic mutation calling differs somewhat from germline calling subcommands.

java -jar VarScan.jar somatic normal.pileup tumor.pileup output.basename

The above command will report germline, somatic, and LOH events at positions where both normal and tumor samples have sufficient coverage (default: 8). If the --validation flag is set to 1, it also report non-variant (reference) calls at all positions with sufficient coverage, for the purpose of refuting false positives with validation data.

method for somatic CNV variants

A separate method (not discussed here) calls Somatic Copy Number Alteration (CNA) from cancer/normal sample pairs.

VarScan v2.2.4 and later includes novel functionality to infer somatic copy number changes using data from matched tumor-normal pairs. This is designed specifically for exome sequencing, in which a tumor sample and its matched normal were captured and sequenced under identical conditions. By performing a pairwise comparison of read depth between the samples at each position in the exome, it becomes possible to infer relative changes in copy number in the tumor sample. The output of this tool is a set of regions with chromosome, start, stop, and log-base-2 of the copy number change, which is similar to array-based copy number data and therefore amenable to the same segmentation algorithms.

call germline SNVs and Indels together

#! /usr/bin/env bash
# script varscan2-mpileup2cns_call-variants.sh
 
# required
# samtools  Version: 0.1.19
# varscan2 version 2.3.6
 
# source user-defined functions (including vcf2index)
. /home/bits/.myfunctions
 
# we call from the full mapping results for chr21 provided in 'Final_Results'
# calling from the low coverage training data would not give confident calls
infolder=hg19_bwa-mapping/full_chromosome
 
name=shuffled_PE_NA18507_GAIIx_100_chr21
# shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_chr21.bam
# shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam
 
outfolder=varscan2_variants
mkdir -p ${outfolder}
 
ref=ref/HiSeq_UCSC_hg19.fa
 
# call SNV and InDels variants from samtools mpileup
 
for met in "aln" "mem"; do
 
## call both in one GO
samtools mpileup -f ${ref} ${infolder}/${name}_${met}-pe_mdup_chr21.bam | \
   java -Xmx4G -jar /usr/bin/tools/VarScan.v2.3.7.jar  mpileup2cns --variants --output-vcf 1 \
   >${outfolder}/NA18507_${met}_mpileup2cns.vcf \
   2>${outfolder}/varscan2_mpileup2cns_${met}.err
 
## compress and index result
bgzip -c ${outfolder}/NA18507_${met}_mpileup2cns.vcf \
    > ${outfolder}/NA18507_${met}_mpileup2cns.vcf.gz &&
    tabix -p vcf ${outfolder}/NA18507_${met}_mpileup2cns.vcf.gz
 
# extract the chr21 subset of calls
vcftools --gzvcf ${outfolder}/NA18507_${met}_mpileup2cns.vcf.gz \
	--chr chr21 \
	--out ${outfolder}/NA18507_${met}_varscan_chr21 \
	--recode
 
# recoding introduces the word recode at the end of the file-name
# remove the recode tag from the name
mv ${outfolder}/NA18507_${met}_varscan_chr21.recode.vcf \
	${outfolder}/NA18507_${met}_varscan_chr21.vcf
 
# sort, compress and index using our custom 'vcf2index' function)
vcf2index ${outfolder}/NA18507_${met}_varscan_chr21.vcf && \
	rm ${outfolder}/NA18507_${met}_varscan_chr21.vcf
 
done

review germline SNVs and Indels calls

# Varscan2 results for the 'aln' mpileup
34023715 bases in pileup file
81628 variant positions (70998 SNP, 10630 indel)
931 were failed by the strand-filter
80697 variant positions reported (70141 SNP, 10556 indel)

# Varscan2 results for the 'mem' mpileup

34025577 bases in pileup file
82408 variant positions (71478 SNP, 10930 indel)
979 were failed by the strand-filter
81429 variant positions reported (70569 SNP, 10860 indel)

call germline SNVs and Indels separately

bash script to call SNVs and Indels

#! /usr/bin/env bash
# script varscan2_call-variants.sh
 
# required
# samtools  Version: 0.1.19
# varscan2 version 2.3.6
 
# source user-defined functions (including vcf2index)
. $HOME/.myfunctions
 
# we call from the full mapping results for chr21 provided in 'Final_Results'
# calling from the low coverage training data would not give confident calls
infolder=Final_Results/hg19_bwa-mapping
 
name=shuffled_PE_NA18507_GAIIx_100_chr21
# shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_chr21.bam
# shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam
 
outfolder=varscan2_variants
mkdir -p ${outfolder}
 
ref=ref/HiSeq_UCSC_hg19.fa
 
# call SNV and InDels variants from samtools mpileup
 
for met in "aln" "mem"; do
 
## call SNVs
## send results (STDOUT) to a VCF file and comments or errors (STDERR) to a separate '.err' file
samtools_0.1.19 mpileup -f ${ref} ${infolder}/${name}_${met}-pe_mdup_chr21.bam | \
   java -Xmx4G -jar ${BIOTOOLS}/varscan2/VarScan.v2.3.6.jar mpileup2snp --variants --output-vcf 1 \
   >${outfolder}/NA18507_${met}_mpileup2snp.vcf \
   2>${outfolder}/varscan2_mpileup2snp_${met}.err
 
## call Indels
## send results (STDOUT) to a VCF file and comments or errors (STDERR) to a separate '.err' file
samtools_0.1.19 mpileup -f ${ref} ${infolder}/${name}_${met}-pe_mdup_chr21.bam | \
   java -Xmx4G -jar ${BIOTOOLS}/varscan2/VarScan.v2.3.6.jar mpileup2indel --variants --output-vcf 1 \
   >${outfolder}/NA18507_${met}_mpileup2indel.vcf \
   2>${outfolder}/varscan2_mpileup2indel_${met}.err
 
# concat, sort, compress, and index the variant VCF files
cat ${outfolder}/NA18507_${met}_mpileup2snp.vcf \
   ${outfolder}/NA18507_${met}_mpileup2indel.vcf | \
   vcf-sort -c | bgzip -c > \
   ${outfolder}/NA18507_${met}_varscan.vcf.gz &&
   tabix -p vcf ${outfolder}/NA18507_${met}_varscan.vcf.gz
 
# extract the chr21 subset of calls
vcftools --gzvcf ${outfolder}/NA18507_${met}_varscan.vcf.gz \
	--chr chr21 \
	--out ${outfolder}/NA18507_${met}_varscan_chr21 \
	--recode
 
# recoding introduces the word recode at the end of the file-name
# remove the recode tag from the name
mv ${outfolder}/NA18507_${met}_varscan_chr21.recode.vcf \
	${outfolder}/NA18507_${met}_varscan_chr21.vcf
 
# sort, compress and index using our custom 'vcf2index' function)
vcf2index ${outfolder}/NA18507_${met}_varscan_chr21.vcf && \
	rm ${outfolder}/NA18507_${met}_varscan_chr21.vcf
 
done

review germline SNVs and Indels calls

# Varscan2 results for the 'aln' mpileup

SNVs

34023715 bases in pileup file
81628 variant positions (70998 SNP, 10630 indel)
931 were failed by the strand-filter
70141 variant positions reported (70141 SNP, 0 indel)

INDELs

34023715 bases in pileup file
81628 variant positions (70998 SNP, 10630 indel)
931 were failed by the strand-filter
10556 variant positions reported (0 SNP, 10556 indel)

# Varscan2 results for the 'mem' mpileup

SNVs

34025577 bases in pileup file
82408 variant positions (71478 SNP, 10930 indel)
979 were failed by the strand-filter
70569 variant positions reported (70569 SNP, 0 indel)

INDELs

34025577 bases in pileup file
82408 variant positions (71478 SNP, 10930 indel)
979 were failed by the strand-filter
10860 variant positions reported (0 SNP, 10860 indel)

call high quality SNP and Indel variants using the Broad GATK

The Broad Institute is continuously developing a framework used for the 1000genome project and that has become the gold standard in NGS DNA analysis. The GATK toolbox is today high-end standard but is not covered in this training due to license limitation for the use of Broad software. Any scientist from a University lab is allowed to download and use GATK but access to the software becomes more tricky for us as a non-profit institute facility.

A comparative analysis of results obtained from the bwa mem mapping followed by bcftools or GATK calling is provided in a separate document for your convenience but will not be included in this session. (NGS Exercise.5_GATK: Call variations using the GATK)

review bcftools (samtools) variant calls

Variant calls obtained after remapping chr21 reads to the full human genome are expected to also include a small number of variants on other chromosomes than chr21. This is due to the high sequence similarity of chromosomal loci across the genome and to the ambiguous nature of short read mapping.

Technical.png Because we took the subset of reads that re-mapped to chr21 with BWA, we will not 'see' this off-target effect

# analysis of 'aln' calls
infolder=bcftools_samtools_variants
infile=NA18507_aln_var.flt-D1000.vcf.gz
 
zgrep -v "^#" ${infolder}/${infile} | \
	cut -f 1 | sort | uniq -c | sort -k 2V,2

variant counts for the aln method

  83416 chr21

As seen here *samtools variants are predicted on all chromosomes while the reads were selected based on original chr21 mapping information. This is common in genome sequencing and reflects the redundant nature of the human genome where regions of high similarity are present in different chromosomes and render mapping of short reads ambiguous.

# analysis of 'mem' calls
infolder=bcftools_samtools_variants
infile=NA18507_mem_var.flt-D1000.vcf.gz
 
zgrep -v "^#" ${infolder}/${infile} | \
	cut -f 1 | sort | uniq -c | sort -k 2V,2

variant counts for the mem method

  84799 chr21

Handicon.png Note that the mem data contains 1366 more chr21 calls

perform QC on the obtained VCF files

As always in NGS, fresh data should always be QC'ed to avoid any surprise and intercept errors and biases asap.

Technical.png We do the QC on the data stored with bcftools_samtool_variant, the analog QC was run for the bcftools_htslib_variants data and results are available on our server

use bcftools_0.2.0 stats and plot-vcfstats_0.2.0 to perform QC on VCF data

We next perform quality control using the bcftools_0.2.0 stats and plot-vcfstats_0.2.0 commands from bcftools (htslib).

bcftools stats arguments

About:   Parses VCF or BCF and produces stats which can be plotted using plot-vcfstats.
         When two files are given, the program generates separate stats for intersection
         and the complements.
Usage:   bcftools stats [options] <A.vcf.gz> [<B.vcf.gz>]
 
Options:
    -1, --1st-allele-only              include only 1st allele at multiallelic sites
    -c, --collapse <string>            treat as identical records with <snps|indels|both|all|some|none>, see man page for details [none]
    -d, --depth <int,int,int>          depth distribution: min,max,bin size [0,500,1]
        --debug                        produce verbose per-site and per-sample output
    -e, --exons <file.gz>              tab-delimited file with exons for indel frameshifts (chr,from,to; 1-based, inclusive, bgzip compressed)
    -f, --apply-filters <list>         require at least one of the listed FILTER strings (e.g. "PASS,.")
    -F, --fasta-ref <file>             faidx indexed reference sequence file to determine INDEL context
    -i, --split-by-ID                  collect stats for sites with ID separately (known vs novel)
    -r, --regions <region>             restrict to comma-separated list of regions
    -R, --regions-file <file>          restrict to regions listed in a file
    -s, --samples <list>               list of samples for sample stats, "-" to include all samples
    -S, --samples-file <file>          file of samples to include
    -t, --targets <region>             similar to -r but streams rather than index-jumps
    -T, --targets-file <file>          similar to -R but streams rather than index-jumps
    -u, --user-tstv <TAG[:min:max:n]>  collect Ts/Tv stats for any tag using the given binning [0:1,100]

plot-vcfstats arguments

Usage: plot-vcfstats [OPTIONS] file.chk ...
       plot-vcfstats -p outdir/ file.chk ...
Options:
   -m, --merge                         Merge vcfstats files to STDOUT, skip plotting.
   -p, --prefix <path>                 The output files prefix, add a slash to create new directory.
   -P, --no-PDF                        Skip the PDF creation step.
   -r, --rasterize                     Rasterize PDF images for fast rendering.
   -s, --sample-names                  Use sample names for xticks rather than numeric IDs.
   -t, --title <string>                Identify files by these titles in plots. Can be given multiple times.
   -T, --main-title <string>           Main title for the PDF.
   -h, -?, --help                      This help message.

Assessing our full-depth calls with this utility is detailed in the next code-block.

# analysing the chr21 subset of our bwa-calls
 
# take the trainer data
infolder=Final_Results/bcftools_samtools_variants
 
# save results in the session result folder
outfolder=bcftools_samtools_variants
mkdir -p ${outfolder}
 
############
# aln calls
bcftools_0.2.0 stats ${infolder}/chr21_NA18507_aln_var.flt-D1000.vcf.gz > \
	${outfolder}/chr21_aln_var.check
 
plot-vcfstats_0.2.0 -t "NA18507_aln-calls" \
	-p ${outfolder}/chr21_aln_plots/ \
	${outfolder}/chr21_aln_var.check
 
############
# mem calls
bcftools_0.2.0 stats ${infolder}/chr21_NA18507_mem_var.flt-D1000.vcf.gz > \
	${outfolder}/chr21_mem_var.check
 
plot-vcfstats_0.2.0 -t "NA18507_mem-calls" \
	 -p ${outfolder}/chr21_mem_plots/ \
	 ${outfolder}/chr21_mem_var.check

Handicon.png Open PDF files using evince

If we prefer PNG files to PDF we can use ImageMagick already introduced in a former exercise. A custom pdf2png function has been prepared for you.

the pdf2png custom bash function

pdf2png ()
{
# convert PDF to PNG (all pages with suffix) using ImageMagick.
# take PDF file from 1
# use PDF file name for PNG with extension swap
# or take PNG file name from $2
 
if [ $# -ge 1 ]
then
    convert -density 300 -depth 8 -quality 100 -resize 25% ${1} ${2:-${1%%.pdf}.png};
else
    echo "usage: pdf2png <input.pdf> <opt:output.png>"
fi
}

The summary statistics derived from the VCF file obtained with an identical command from the original hg18 data are reported below.

hg18 variants PDF
plot-vcfcheck_hg18-0.png
plot-vcfcheck_hg18-1.png
plot-vcfcheck_hg18-2.png
plot-vcfcheck_hg18-3.png

The summary statistics derived from the VCF files indicate the overall quality of the mapping experiment and are reported below.

hg19 variants (aln) PDF (mem) PDF
plot-vcfcheck_bwa_aln-0.png
plot-vcfcheck_bwa_mem-0.png
plot-vcfcheck_bwa_aln-1.png
plot-vcfcheck_bwa_mem-1.png
plot-vcfcheck_bwa_aln-2.png
plot-vcfcheck_bwa_mem-2.png
plot-vcfcheck_bwa_aln-3.png
plot-vcfcheck_bwa_mem-3.png

The inspection of the resulting *summary.pdf* files shows a ts/tv ratio of 2.00 for the *aln* method against 1.91 for the *mem* method. In human genome, transitions are expected twice more frequent than transversions (ts/tv~2). This suggests that the *aln* method is closer to the expectation and reports more true calls. A ts/tv ratio of 2.4 is often reported as optimal by NGS experts.

download exercise files

Download exercise files here

Use the right application to open the files present in ex5-files

References:
  1. http://www.1000genomes.org
  2. http://en.wikipedia.org/wiki/Variant_Call_Format
  3. Daniel C Koboldt, Ken Chen, Todd Wylie, David E Larson, Michael D McLellan, Elaine R Mardis, George M Weinstock, Richard K Wilson, Li Ding
    VarScan: variant detection in massively parallel sequencing of individual and pooled samples.
    Bioinformatics: 2009, 25(17);2283-5
    [PubMed:19542151] ##WORLDCAT## [DOI] (I p)


[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.4 | NGS Exercise.5b | NGS Exercise.6 ]