NGS-Var2018 Exercise.4
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2018 | NGS-Var2018 Exercise.3 | NGS-Var2018 Exercise.5 ]
Call variants against the human reference genome (hg38) with GATK.GermlineSNPsAndIndelsCaller and produce genotypes with GATK.GermlineGenotyper
Several methods are available to convert read mapping information into variant calls. They all rely on a local pileup of the reads aligned to the reference genome followed by consensus extraction, but different callers use different strategies to limit the effect of wrongly mapped or low-quality reads on the final result. Here we use GATK4 to call SNVs and small indels. Previous training sessions used Varscan and bcftools to call variants. This year, we switched to the caller developed at the Broad Institute, which is generally accepted as the gold standard. Please refer to the other Wiki pages for information on how to use the other callers.
Preliminary info
Quick intro to the VCF format
Since the expansion of the 1000 Genomes Project[1], the Variant Call Format (VCF) has become increasingly popular and is today the default format for representing sequence variation. VCF is a tab-delimited text format that provides rich information about each position that differs from the reference genome. It also includes scores obtained during sequencing, alignment, and calling to allow quality filtering, as well as added sequence annotations to allow annotation-driven filtering. Various tools have been developed to manipulate VCF data, and several are demonstrated during this session.
An example of VCF data is provided here as a primer; users will find more information in NGS-formats or in the VCF documentation <http://en.wikipedia.org/wiki/Variant_Call_Format>[2].
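To make the tabular layout concrete, here is a minimal Python sketch that splits one VCF data line into its eight fixed columns plus the optional per-sample columns. The example record and the helper names (`parse_info`, `parse_vcf_line`) are made up for illustration; real files are better read with a dedicated VCF library.

```python
# Minimal sketch: split one VCF data line into its fixed columns.
# The example record below is invented for illustration only.

VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info_field):
    """Turn 'DP=14;QD=4.13;DB' into {'DP': '14', 'QD': '4.13', 'DB': True}."""
    info = {}
    if info_field == ".":
        return info
    for entry in info_field.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            info[key] = value
        else:
            info[entry] = True  # flag annotation without a value
    return info


def parse_vcf_line(line):
    """Parse a tab-separated VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several ALT alleles are possible
    record["INFO"] = parse_info(record["INFO"])
    # Column 9 is FORMAT, followed by one column per sample, when present.
    if len(fields) > 9:
        keys = fields[8].split(":")
        record["SAMPLES"] = [dict(zip(keys, s.split(":"))) for s in fields[9:]]
    return record


example = "chr21\t9411239\t.\tG\tA\t57.77\tPASS\tDP=14;QD=4.13\tGT:DP\t0/1:14"
rec = parse_vcf_line(example)
print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["SAMPLES"][0]["GT"])
```

The values are kept as strings except for POS, so that no assumption is made about the type of each annotation.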
When you want to analyse variants in more than one genome (a family structure or a patient cohort) and compare the genomes with each other, you should ideally call the variants in a single job in order to distinguish no-calls (no read support at this position) from REF-calls (the reads support that the genome is not mutated at this position).
This rapidly becomes impractical when you deal with large cohorts of patients and controls, so another solution had to be found in order to combine studies or call variants from multiple genomes. The solution is to produce an intermediate format called gVCF, which stores not only the variant calls but also the REF-calls and no-calls. The full information for each sample is stored in its own gVCF file, and these files can be combined in a second step (genotyping) after all relevant genomes have been processed individually.
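The distinction between no-calls, REF-calls and variant calls can be sketched in a few lines of Python working on the GT (genotype) field of a VCF record. The function name and the sample names are hypothetical, and classifying a partially called genotype by its called alleles is an assumption made for this sketch.

```python
def classify_genotype(gt):
    """Classify a VCF GT string as 'no-call', 'ref-call' or 'variant'."""
    # Alleles are separated by '/' (unphased) or '|' (phased).
    alleles = gt.replace("|", "/").split("/")
    # '.' means the allele could not be called (no read support).
    called = [a for a in alleles if a != "."]
    if not called:
        return "no-call"
    # Assumption: a partial call such as './0' is judged on its called alleles.
    if all(a == "0" for a in called):
        return "ref-call"  # reads support the reference allele
    return "variant"       # at least one non-reference allele


# Hypothetical trio genotypes at a single position.
samples = {"father": "0/0", "mother": "./.", "child": "0/1"}
for name, gt in samples.items():
    print(name, gt, classify_genotype(gt))
```

In a conventional single-sample VCF, the father's and the mother's positions would both simply be absent; keeping them apart is what the gVCF is for.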
Call variants with GATK.GermlineSNPsAndIndelsCaller
- start the module and link the latest BAM file and the hg38 reference
- wait for the results
Convert the gVCF data to conventional VCF format with GATK.GermlineGenotyper
- start the module and link the gVCF file produced in the previous step and the hg38 reference
- leave the variant set for recalibration on 'none' and instead use the quality filtering in the next field with the default cutoffs
QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 3.0 || QUAL < 30
- wait for the results
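The hard-filter expression is a logical OR: a variant is flagged as soon as any single test is true. The Python sketch below mimics that logic on a dictionary of annotation values, using SOR (strand odds ratio) as the GATK name of the strand-bias annotation. The function name and the two example records are invented, and ignoring a missing annotation rather than failing on it is an assumption of this sketch, not a statement about how the module behaves.

```python
import operator

# (annotation, comparison, threshold): a record is flagged if ANY test is true.
HARD_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 60.0),
    ("MQ", operator.lt, 40.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
    ("SOR", operator.gt, 3.0),
    ("QUAL", operator.lt, 30.0),
]


def failed_filters(annotations):
    """Return the names of the hard filters that a variant record fails."""
    failed = []
    for name, compare, threshold in HARD_FILTERS:
        value = annotations.get(name)
        # Assumption: an annotation absent from the record is not tested.
        if value is not None and compare(value, threshold):
            failed.append(name)
    return failed


# Invented example records.
good = {"QD": 12.4, "FS": 1.2, "MQ": 60.0, "SOR": 0.7, "QUAL": 250.0}
bad = {"QD": 1.1, "FS": 75.0, "MQ": 60.0, "QUAL": 45.0}

print(failed_filters(good))  # empty list: the record passes
print(failed_filters(bad))   # lists the failing annotations
```

Returning the list of failing tests rather than a single boolean makes it easier to see which cutoff removes most variants.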
VQSR - Recalibration reports
VQSR results from the full run
- GATK_variants-full.SNPs_annotation.R.pdf
- GATK_variants-full.INDELs_annotation.R.pdf
- GATK_variants-full.SNPs.tranches.pdf
Compress and index the VCF data
Downstream tools require the VCF data to be sorted by coordinate, compressed, and indexed so that coordinate-based queries can retrieve records very quickly.
- We produce this compressed and indexed version of the VCF data with the module BgzipAndTabixindex as shown next
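Indexing only works when the records are sorted, so it can be useful to check the order before compressing. The following Python sketch tests whether VCF data lines are grouped by contig and sorted by position within each contig; the function name and the example lines are hypothetical. Outside the module, compression and indexing are commonly done with the htslib tools `bgzip` and `tabix -p vcf`; whether the module wraps exactly these commands is an assumption.

```python
def is_coordinate_sorted(lines):
    """Return True if VCF data lines are grouped by contig and sorted by POS."""
    seen = set()      # contigs already encountered
    current = None    # contig of the previous data line
    last_pos = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos = line.split("\t", 2)[:2]
        pos = int(pos)
        if chrom != current:
            if chrom in seen:
                return False  # contig split into two separate blocks
            seen.add(chrom)
            current, last_pos = chrom, pos
        elif pos < last_pos:
            return False      # position goes backwards within a contig
        else:
            last_pos = pos
    return True


# Invented example lines (only the first columns are shown).
sorted_lines = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tG",
    "chr1\t250\t.\tC\tT",
    "chr2\t50\t.\tG\tA",
]
unsorted_lines = ["chr1\t250\t.\tC\tT", "chr1\t100\t.\tA\tG"]

print(is_coordinate_sorted(sorted_lines))
print(is_coordinate_sorted(unsorted_lines))
```

The check does not impose a particular contig order, only that each contig forms one contiguous block.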
download exercise files
Download exercise files here
References: