NGS-Var2017 Exercise.4
Call variants against the human reference genome (hg19) with [ samtools | bcftools ] or [ samtools | varscan ]
Several methods are available to convert read mapping information into variant calls. They are all based on a local pileup of the aligned reads against the reference genome followed by consensus extraction, but different callers use different strategies to limit the effect of wrongly mapped or low-quality reads on the final result. Here we use samtools with bcftools or VarScan to call SNVs and small indels.
Preliminary info
Quick intro to the VCF format
Since the expansion of the 1000 Genomes Project [1], the Variant Call Format (VCF) has become increasingly popular and is today the default format for representing sequence variation. VCF is a tab-delimited text format that provides rich information about each position that differs from the reference genome. It also carries scores obtained during sequencing, alignment, and calling to allow quality filtering, as well as added sequence annotations to allow annotation-driven filtering. Various tools have been developed to manipulate VCF data, and several of them are used during this session.
An example of VCF data is provided here as a primer; more information can be found in NGS-formats or in the VCF documentation <http://en.wikipedia.org/wiki/Variant_Call_Format> [2].
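To make the column layout concrete, here is a minimal Python sketch that splits VCF text into its fixed columns, the INFO key/value pairs, and the per-sample genotype fields. The two records in `TOY_VCF` are invented for illustration and do not come from the exercise data, and the function name `parse_vcf` is hypothetical. For real work a dedicated VCF library is a safer choice than this hand-rolled reader.

```python
# Toy VCF content, invented for illustration only (not exercise data).
HEADER = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "Sample1"]
TOY_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "\t".join(HEADER),
    "\t".join(["chr21", "9411245", ".", "C", "A", "45.0", "PASS", "DP=12", "GT", "0/1"]),
    "\t".join(["chr21", "9411410", ".", "CT", "C", "30.5", "PASS", "DP=9", "GT", "1/1"]),
])


def parse_vcf(text):
    """Return (sample_names, records) from VCF text (simplified sketch)."""
    samples, records = [], []
    for line in text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags
        info = {}
        for item in fields[7].split(";"):
            if item and item != ".":
                key, _, value = item.partition("=")
                info[key] = value or True
        # FORMAT names the ':'-separated fields found in each sample column
        keys = fields[8].split(":") if len(fields) > 8 else []
        genotypes = {
            name: dict(zip(keys, column.split(":")))
            for name, column in zip(samples, fields[9:])
        }
        records.append({
            "chrom": fields[0],
            "pos": int(fields[1]),
            "ref": fields[3],
            "alt": fields[4].split(","),
            "qual": None if fields[5] == "." else float(fields[5]),
            "filter": fields[6],
            "info": info,
            "genotypes": genotypes,
        })
    return samples, records


samples, records = parse_vcf(TOY_VCF)
for rec in records:
    print(rec["chrom"], rec["pos"], rec["ref"], ">", ",".join(rec["alt"]), rec["genotypes"])
```

The first eight columns are fixed by the format; everything from the FORMAT column onward depends on the caller and on how many samples were analyzed.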
About bgzip-compressed VCF data and indexing
Similarly to SAM vs BAM, and in order to speed up VCF-compatible tools, raw VCF data needs to be sorted, compressed with bgzip, and indexed. For this purpose we created a simple module, BgzipAndTabixindex.
When you want to analyze variants in more than one genome (a family structure or a patient cohort) and compare the genomes with each other, you SHOULD call the variants in a single job so that you can distinguish no-calls (no read support at this position) from REF-calls (reads support that the genome is not mutated at this position).
To do this kind of analysis, submit all your BAM files to the caller at once: in the module, add them one after the other to the first BAM file (drag them onto the file input dialog). The resulting VCF file will show additional columns to the right for the additional genomes. The column names for the analyzed genomes in your VCF files will be Sample1, Sample2, ... in the same order in which you added the BAM files. This is valid for both caller modules used in this session.
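The difference between a no-call and a REF-call shows up in the genotype (GT) field of each sample column. The sketch below is a hedged illustration that assumes the usual VCF conventions: `.` stands for a missing allele, `0` for the reference allele, and `/` or `|` separates alleles. The function name `classify_genotype` and the example genotypes are hypothetical.

```python
import re


def classify_genotype(gt):
    """Classify a VCF GT string as 'no-call', 'REF-call' or 'variant'."""
    alleles = re.split(r"[/|]", gt)
    if any(allele == "." for allele in alleles):
        return "no-call"   # missing allele: no usable read support
    if all(allele == "0" for allele in alleles):
        return "REF-call"  # reads support the reference allele only
    return "variant"       # at least one non-reference allele


# Hypothetical genotypes of three samples at the same position
site = {"Sample1": "0/1", "Sample2": "0/0", "Sample3": "./."}
for sample, gt in site.items():
    print(sample, gt, classify_genotype(gt))
```

If the three samples were called in separate jobs, the position would simply be absent from the VCF files of Sample2 and Sample3, and the two situations could no longer be told apart.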
Call variants with samtools and bcftools (both using htslib)
This method has been in use the longest and is still valid for most applications. Here we use the 10% sample to avoid saturating the server.
- start the module BCFtools.Call_variants and link the latest BAM file and the hg19 reference
- set the parameters as shown
- run, wait and inspect the results
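For readers working outside GenePattern, the steps above can be approximated on the command line. The Python sketch below only builds and prints the two commands. It assumes a recent bcftools release in which `mpileup` is a bcftools subcommand, and it does not claim to reproduce what the BCFtools.Call_variants module runs internally. The file names are hypothetical placeholders.

```python
import shlex
import subprocess


def bcftools_pipeline(reference_fasta, bam_files, output_vcf):
    """Build the 'mpileup | call' command pair (a sketch, not the module's command)."""
    # -Ou: uncompressed BCF for piping, -f: indexed reference FASTA
    mpileup = ["bcftools", "mpileup", "-Ou", "-f", reference_fasta, *bam_files]
    # -m: multiallelic caller, -v: variant sites only, -Ov: plain VCF output
    call = ["bcftools", "call", "-m", "-v", "-Ov", "-o", output_vcf]
    return mpileup, call


def run_pipeline(mpileup, call):
    """Run the two commands connected by a pipe; requires bcftools on PATH."""
    first = subprocess.Popen(mpileup, stdout=subprocess.PIPE)
    second = subprocess.Popen(call, stdin=first.stdout)
    first.stdout.close()  # let 'mpileup' receive SIGPIPE if 'call' exits early
    second.communicate()
    first.wait()
    return first.returncode, second.returncode


# Hypothetical file names; several BAM files can be listed for joint calling
mpileup_cmd, call_cmd = bcftools_pipeline("hg19.fa", ["sample_10pct.bam"], "sample_10pct.vcf")
print(" ".join(shlex.quote(arg) for arg in mpileup_cmd), "|",
      " ".join(shlex.quote(arg) for arg in call_cmd))
```

Nothing is executed until `run_pipeline` is called, so the printed command line can be inspected first.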
Call variants with samtools and varscan
While samtools calls variants using a probabilistic approach, VarScan2 [3] uses a simpler approach: it filters out what the user considers background. Several cutoffs can be defined to filter out 'bad' mappings and keep 'good' ones for calling.
- start the module Varscan and link the latest BAM file and the hg19 reference
- set the parameters as shown
- wait for the results
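The idea of cutoff-based calling can be sketched in a few lines of Python. This is a deliberately simplified illustration and not VarScan's actual implementation: it looks at the bases piled up at one position and reports a variant only when coverage, supporting read count, and variant allele frequency all pass. The function name and the threshold values are assumptions chosen for the example; use the values set in the module for real analyses.

```python
from collections import Counter


def call_by_thresholds(ref_base, read_bases,
                       min_coverage=8, min_var_reads=2, min_var_freq=0.2):
    """Return (alt_base, alt_reads, alt_freq), or None if any cutoff fails.

    The default thresholds are illustrative assumptions.
    """
    coverage = len(read_bases)
    if coverage < min_coverage:
        return None  # too few reads to say anything
    ref_base = ref_base.upper()
    non_ref = Counter(base.upper() for base in read_bases if base.upper() != ref_base)
    if not non_ref:
        return None  # every read agrees with the reference
    alt_base, alt_reads = non_ref.most_common(1)[0]
    alt_freq = alt_reads / coverage
    if alt_reads < min_var_reads or alt_freq < min_var_freq:
        return None  # treated as background
    return alt_base, alt_reads, alt_freq


# Hypothetical pileups at a position where the reference base is C
print(call_by_thresholds("C", "CCCCCCAAAA"))  # 4 of 10 reads support A
print(call_by_thresholds("C", "CCCCCCCCCA"))  # a single A read: background
print(call_by_thresholds("C", "CCCA"))        # coverage below the cutoff
```

Raising the cutoffs trades sensitivity for specificity: fewer false calls from mapping or sequencing errors, at the cost of missing low-frequency variants.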
Compress and index the obtained VCF file(s) with BgzipAndTabixindex
- start the module BgzipAndTabixindex and link the VCF file
- set the parameters as shown and run
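Outside GenePattern, the same result is usually obtained with the `bgzip` and `tabix` programs distributed with htslib. The sketch below assumes that the BgzipAndTabixindex module wraps these two tools, which the page does not state explicitly, and that the input VCF is already sorted by position. The function names and the file name are hypothetical.

```python
import shlex
import subprocess


def bgzip_and_index_commands(vcf_path):
    """Build the compression and indexing commands for a sorted VCF file."""
    compressed = vcf_path + ".gz"  # bgzip replaces the input with <input>.gz
    return [
        ["bgzip", vcf_path],
        ["tabix", "-p", "vcf", compressed],  # writes <input>.gz.tbi
    ]


def compress_and_index(vcf_path):
    """Run both commands in order; requires bgzip and tabix on PATH."""
    for command in bgzip_and_index_commands(vcf_path):
        subprocess.run(command, check=True)


# Hypothetical file name, printed for inspection rather than executed
for command in bgzip_and_index_commands("sample_10pct.vcf"):
    print(" ".join(shlex.quote(arg) for arg in command))
```

The index is what allows tools to jump directly to a genomic region instead of reading the whole file.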
Perform QC on the obtained VCF files using bcftools
As always in NGS, fresh data should be QC'ed to avoid surprises and to catch errors and biases as early as possible. Tools performing quality control of VCF data are scarce; one of them is bcftools stats (not available on the GenePattern server).
We will get some QC info from the next exercise.
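As a small illustration of the kind of summary a VCF QC tool reports, the Python sketch below computes the transition/transversion (Ts/Tv) ratio, one of the statistics produced by bcftools stats. It is a minimal sketch that works on (REF, ALT) pairs. The function names and the example variants are hypothetical, and indels are simply skipped.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return pair == PURINES or pair == PYRIMIDINES


def count_ts_tv(variants):
    """Count transitions and transversions among (REF, ALT) pairs."""
    ts = tv = 0
    for ref, alt in variants:
        ref, alt = ref.upper(), alt.upper()
        # keep single-base substitutions between A, C, G and T only
        if len(ref) != 1 or len(alt) != 1 or ref == alt or not {ref, alt} <= set("ACGT"):
            continue
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    return ts, tv


def ts_tv_ratio(variants):
    """Return Ts/Tv, or None when there is no transversion to divide by."""
    ts, tv = count_ts_tv(variants)
    return ts / tv if tv else None


# Hypothetical calls: three transitions, one transversion, one indel (ignored)
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("CT", "C")]
print(count_ts_tv(calls), ts_tv_ratio(calls))
```

A Ts/Tv ratio far from what is expected for the organism and target region is a common warning sign of a high false-positive rate in the call set.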
Download exercise files
Download exercise files here
References:
- [1] http://www.1000genomes.org
- [2] http://en.wikipedia.org/wiki/Variant_Call_Format
- [3] Daniel C Koboldt, Ken Chen, Todd Wylie, David E Larson, Michael D McLellan, Elaine R Mardis, George M Weinstock, Richard K Wilson, Li Ding. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics: 2009, 25(17);2283-5. [PubMed:19542151]