NGS-Var2020 Exercise.4

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.3 | NGS-Var2020 Exercise.5 ]


Call variants against the human reference genome (hg38) with GATK.GermlineSNPsAndIndelsCaller and produce genotypes with GATK.GermlineGenotyper


ex04_wf.png

Handicon.png A separate wikibook page can be consulted about variant calling from NGS: <http://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/DNA_Variants>

Several methods are available to convert read mapping information to variant calls. They are all based on local pileup alignment to the reference genome and consensus extraction but different callers use different strategies to limit the effect of wrongly mapped or low quality read sequences on the final result. We used here GATK4 to call SNVs and small indels. Previous training sessions made use of Varscan and bcftools to align and call variants. This year, we switched to the generally accepted gold standard caller developed at the Broad institute. Please refer to other Wiki pages for information on how to use the other callers.

Preliminary info

quick intro on VCF format

Since the expansion of the 1000 genome project[1], the Variant Call Format has become more and more popular and is today the default format to represent sequence variation. VCF is a tabular text format that provides rich information about each position different from the reference genome. It also includes different scores obtained during sequencing, alignment, and calling to allow quality filtering as well as added sequence annotations to allow annotation-driven filtering. Various tools have been developed to manipulate VCF data and are exemplified during this session.

An example of VCF data is provided here as a primer, users will get more information in NGS-formats or from the VCF documentation <http://en.wikipedia.org/wiki/Variant_Call_Format>[2].

Handicon.png As today several VCF subversion are used, please refer to the best documentation for your own application from this page

Handicon.png Currently, VCF version 4.2 is the latest build, read more about its secrets here


vcf_fig1.png
vcf_fig2.png

About analysing variants in multiple 'related' genomes

When you want to analyse variant in more than one genome (family structure of patient cohort) and compare the genomes with each other, you should ideally call the variants in a single job in order to make the difference between no-calls (absence of read support information) and REF-calls (the genome is not mutated at this position as supported by reads).

This becomes rapidly impossible when you deal with cohorts of patients/controls and another solution had to be found in order to combines studies or call from multiple genomes. The solution is to produce an intermediate format called gVCF which not only stores the variant calls but also REF calls and No-Calls. Each sample gets the full information stored in the gVCF files, which can be combined in a second step (genotyping) after processing all relevant genomes individually.

Handicon.png Broad page dedicated to explaining gVCF: https://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf


gvcf_fig1.png

Call variants with GATK.GermlineSNPsAndIndelsCaller

Technical.png Inorder to focus on chr22 only we provide an extra file with all chr22 sequences from the reference or only the assembled chr22

These files can be written in GATK list format or in BED format (tab-separated) as here

  • example of a list format for a sub-region of chr22 (chr22_15M_20M.list)
chr22:15000000-20000000
  • All chr22 sequences from the HG38 reference in BED format (chr22_all.bed)
chr22	0	50818468
chr22_KI270731v1_random	0	150754
chr22_KI270732v1_random	0	41543
chr22_KI270733v1_random	0	179772
chr22_KI270734v1_random	0	165050
chr22_KI270735v1_random	0	42811
chr22_KI270736v1_random	0	181920
chr22_KI270737v1_random	0	103838
chr22_KI270738v1_random	0	99375
chr22_KI270739v1_random	0	73985
chr22_KI270875v1_alt	0	259914
chr22_KI270878v1_alt	0	186262
chr22_KI270879v1_alt	0	304135
chr22_KI270876v1_alt	0	263666
chr22_KI270877v1_alt	0	101331
chr22_GL383583v2_alt	0	96924
chr22_GL383582v2_alt	0	162811
chr22_KB663609v1_alt	0	74013
chr22_KI270928v1_alt	0	176103
  • only the assembled chr22 from HG38 in BED format (chr22_only.bed)
chr22	0	50818468
  • start the module and link the latest BAM file and the hg38 reference
ex4_01.png
ex4_02.png
  • wait for the results
ex4_03.png

 

Convert the gVCF data to conventional VCF format with GATK.GermlineGenotyper

Handicon.png Because we use a 1-10% sample from a single chromosome, the VQSR recalibration is not applicable to our data. Only when the input is consequent, VQSR can be applied and produce improved call quality scores

Technical.png If you add known variants here they will be tagged in the 3rd column of the VCF data. If you do not do it now, no problem, you can add them later with SnpSift

  • start the module and link the latest BAM file and the hg38 reference
ex4_04.png
ex4_05.png
  • leave the variant set for recalibration on 'none' and use instead quality filtering from the next field with default cutoffs
 QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SQR > 3.0 || QUAL < 30
  • wait for the results
ex4_06.png

 

VQSR - Recalibration reports from the full data Run

Handicon.png VQSR is a very important step to obtain high quality variant calls. For the adventurous doing this at command-line, please read more on this GATK page

Handicon.png When available, the VQSR reports can be accessed from the PDF file produced by the run

VQSR results from the full run

ex4_07.png

Handicon.png VQSR gaussian results are 'explained' HERE

 

Handicon.png Our variants are ready as far as GATK is concerned and should now be gene-annotated and user-filtered to identify candidate disease/phenotype-related candidates.

Compress and index the VCF data

Downstream tools require the VCF data to be sorted by coordinate, compressed, and indexed so that random search can be operate very fast based on coordinate queries.

  • We produce this compressed and indexed version of the VCF data with the module BgzipAndTabixindex as shown next
ex4_08.png
ex4_09.png

download exercise files

Download exercise files here

Use the right application to open the files present in 1pc_ex4-files
Use the right application to open the files present in 100pc_ex4-files

References:
  1. http://www.1000genomes.org
  2. http://en.wikipedia.org/wiki/Variant_Call_Format

[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.3 | NGS-Var2020 Exercise.5 ]