Hands-on introduction to NGS variant analysis-2020

# Two-day training session, 2020/01/20-21

This session is a repeat of the 2018 GenePattern training (Hands-on_introduction_to_NGS_variant_analysis-2018).

Credits: All GenePattern modules used today have been created by Guy Bottu (VIB Bioinformatics and Nucleomics Core)

1 Aims of the NGS DNA variant analysis 2-days session
2 Summary
3 Prerequisites
- 3.1 Skills required to follow this training
- 3.2 Software and Arguments used during this training
4 Hands-On Exercises
5 Answers to your 2017 requests
6 PDF version of the former training
7 Find more tools and answers
8 Extra readings
9 GATK performance in numbers
10 Conclusion & Contact

Aims of the NGS DNA variant analysis 2-days session

Using a full publicly available chromosome read-set from one of the 1000 genomes^[1] samples:

Use the graphical environment provided by the BITS GenePattern server to perform all steps of a classical NGS variant workflow and feel the complexity of the task.
Perform a simplified analysis workflow including read mapping and variant calling against the current human reference genome.
Annotate the obtained variant calls and compare them to the public variant file for that genome.
Implicitly: get the motivation to go to the next level and learn the command line and R.

Scheme of today's workflow

More BITS Training Info

On the VIB website: http://www.vib.be/en/training/research-training/courses/Pages/Hands-On-introduction-to-NGS-variant-analysis.aspx
On the BITS website: https://www.bits.vib.be/index.php/training/201-ngs-variant-analysis

Summary

This training gives an introduction to the use of popular NGS analysis software packages through the GenePattern graphical interface. It reviews several interchangeable tools and provides hints to evaluate the quality and content of Genome-Seq data. Much more can be (and should be) done when working at the command line, and GenePattern will not replace advanced use of the terminal. However, this simplified workflow will allow inexperienced scientists to discover the practices involved in variant analysis. A review by Geraldine Van der Auwera et al. elaborates on many of these aspects ^[2].

The sequencing data used in this session was obtained from high coverage depth Illumina HiSeq sequencing of gDNA extracted from EBV-transformed B-lymphocytes of a healthy CEPH/UTAH Mormon mother (NA12878). More information about that sample is available from the Coriell repository from which all 1000g gDNA can be obtained ^[3].

The paired-end reads used in this training were extracted from the Genome In A Bottle (GIAB) novoalign BAM mappings of Illumina HiSeq 300x reads for NA12878 available on the ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/ server ^[4]. This pre-processing step is not covered in the training. Many analysis results are available on the dedicated GIAB page ^[5].

This year we use the state-of-the-art Genome Analysis Toolkit ('GATK') for variant analysis, and we strongly advise you to consider it in your work. GATK was compared to Varscan, which was used in former training sessions, and is clearly superior in sensitivity and specificity ^[6]. GATK is today the standard workflow for human genome analyses and is referred to in most publications. We do not apply here all the tricks GATK allows but rather present a streamlined pipeline with a single genome that will be a good foundation for most applications. Please read the GATK pages^[7] for more information.

The official GATK best practice inspiring this training is as follows:

Translated into separate exercises from today's training (red numbers next to boxes)

^[8]

Disclaimer: This training does not cover all currently available methods. It does not aim to bring you to the level of a professional NGS analyst but provides enough information to allow a motivated biologist to understand what DNA sequencing is in practice and, when necessary, to communicate knowledgeably with NGS experts for more in-depth needs.

Prerequisites

Skills required to follow this training

Basic Linux command-line skills are required to review some of the long text results in a terminal (GenePattern is not handy for reviewing data; it is mainly a job submission and management platform).
Basic knowledge of human genome structure and nomenclature is necessary to understand the training tasks and the results.
Basic knowledge of Illumina NGS read structure is also required, for the same reason.
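As a reminder of what "read structure" means in practice, here is a minimal Python sketch that splits FASTQ text into its four-line records and converts the quality string into Phred scores. It assumes the Phred+33 encoding used by current Illumina instruments; the two reads in `EXAMPLE` are invented for illustration and are not taken from the training data.

```python
from statistics import mean


def parse_fastq(text):
    """Yield (read_id, sequence, quality_string) from 4-line FASTQ records."""
    lines = [line for line in text.splitlines() if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        yield header[1:], seq, qual


def phred_scores(quality_string, offset=33):
    """Convert an ASCII quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


# Hypothetical reads, made up for this example only.
EXAMPLE = """@read1/1
ACGTACGT
+
IIIIIIII
@read2/1
ACGTNNNN
+
IIII####
"""

for name, seq, qual in parse_fastq(EXAMPLE):
    # Print read name, read length, and mean base quality.
    print(name, len(seq), round(mean(phred_scores(qual)), 1))
```

Tools such as fastQC report the same kind of per-base quality information, aggregated over millions of reads and with many additional checks.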

Software and Arguments used during this training

GATK has a new home since January 2020 at: https://gatk.broadinstitute.org/hc/en-us

Please refer to this page for details on the GATK4 command-line syntax
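To give a feel for that syntax, the Python sketch below only assembles a GATK4 `HaplotypeCaller` command line as a list of arguments and prints it; it does not run GATK. The file names are hypothetical placeholders, and the options actually used by the training modules may differ from this minimal set.

```python
import shlex


def haplotypecaller_command(reference, bam, output, gvcf=False):
    """Assemble a GATK4 HaplotypeCaller command line as an argument list."""
    cmd = ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", output]
    if gvcf:
        # Emit a per-sample GVCF instead of a plain VCF.
        cmd += ["-ERC", "GVCF"]
    return cmd


# Hypothetical file names, for illustration only.
cmd = haplotypecaller_command(
    "GRCh38.fasta", "NA12878.bam", "NA12878.g.vcf.gz", gvcf=True
)
print(shlex.join(cmd))
```

The printed line can be pasted into a terminal where GATK4 is installed, or the list can be passed to `subprocess.run` once the input files exist.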

All programs used in this training session were installed on the BITS GenePattern server, and specific modules were created for you by Guy Bottu (BITS). These modules are therefore absent from other GenePattern servers like those accessible at the Broad. If you plan to build your own GenePattern server (GP home page), you may ask Guy for copies of these modules (contact us).

The GATK4 Best Practices are a series of modules developed at the Broad to analyse sequencing data and are subject to regular changes.

At the time of preparing this training we obtained information from this link

The information used to build the GenePattern modules for this training was obtained from the link shown above, which points to the Best Practices Wiki pages, and from the corresponding optional arguments extracted from the WDL scripts posted there.

It is likely that these parameters will change at some point in the future, when the Broad team decides so.

Hands-On Exercises

Note: At the end of each page, a link to data available on our file server allows downloading some of the data for local use and self-training. Other files required for the training are present on the server and will be accessed directly from within GenePattern.

NGS-Var2020 Startup GenePattern: Locate data and tools in your training account

NGS-Var2020 Exercise.1: QC paired end reads using fastQC
NGS-Var2020 Exercise.2: Map to the human reference genome GRCh38 using the Burrows-Wheeler Aligner (BWA) and recalibrate the mappings with Picard and GATK4
NGS-Var2020 Exercise.3: Perform QC on the mapping data with Picard and Qualimap
NGS-Var2020 Exercise.4: Call variants and produce genotypes with customised Genepattern modules based on the GATK best practices
NGS-Var2020 Exercise.5: Intersect raw VCF files using VCF_intersect
NGS-Var2020 Exercise.6: Annotate VCF variant lists with SnpSift and filter candidates with SnpEff
NGS-Var2020 Exercise.7: Review read alignments, gene annotations, and variant calls in IGV
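The idea behind intersecting two VCF files (Exercise.5) can be sketched in a few lines of Python. This is not the implementation of the VCF_intersect module; it is a simplified illustration that identifies each variant by chromosome, position, reference allele and alternate allele, and it assumes both files describe variants in the same normalised representation. The records and the `chr21` coordinates are invented.

```python
def variant_keys(vcf_text):
    """Return the set of (CHROM, POS, REF, ALT) keys found in VCF text."""
    keys = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):  # one key per alternate allele
            keys.add((chrom, int(pos), ref, allele))
    return keys


HEADER = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
# Hypothetical records, made up for this example only.
CALLS = "\n".join(HEADER + [
    "chr21\t1000\t.\tA\tG\t50\tPASS\t.",
    "chr21\t2000\t.\tC\tT\t50\tPASS\t.",
    "chr21\t3000\t.\tG\tA,T\t50\tPASS\t.",
])
PUBLIC = "\n".join(HEADER + [
    "chr21\t1000\t.\tA\tG\t99\tPASS\t.",
    "chr21\t3000\t.\tG\tA\t99\tPASS\t.",
    "chr21\t4000\t.\tT\tC\t99\tPASS\t.",
])

ours, theirs = variant_keys(CALLS), variant_keys(PUBLIC)
shared = ours & theirs        # called by us and present in the public file
only_ours = ours - theirs     # called by us only
only_public = theirs - ours   # present in the public file only
print(len(shared), len(only_ours), len(only_public))
```

Real comparisons are harder than this because the same indel can be written in several equivalent ways, which is one reason dedicated tools are used in the exercise.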

Answers to your 2017 requests

Some of you asked about the possibility of calling variants from structured experiments like family trios or tumor-normal pairs. We found trio data for one of the public 1000 Genomes trios (CEPH CEU) as well as a tumor-normal pancreatic cancer dataset (WES), with which we prepared two walk-through tutorials for command-line analysis using Varscan2. Both documents are linked below.

call variants from a family trio (NA12878) and identify inherited and potential non-inherited variants (de-novo) using Varscan 2 (link)
call variants from a pair of samples (tumor and normal tissues of the same patient) and identify somatic variants and CNA (copy number abnormalities) in the tumor sample (link)
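The logic of spotting a potential de-novo variant in a trio can be illustrated with a small Python sketch. This is not how Varscan 2 works internally (it uses read counts and statistical tests); the function below is a hypothetical helper that only looks at VCF-style genotype strings and flags a site when the child carries an alternate allele seen in neither parent.

```python
def is_de_novo_candidate(child_gt, mother_gt, father_gt):
    """True when the child carries an ALT allele that neither parent carries."""
    def alleles(gt):
        # Accept both unphased ("0/1") and phased ("0|1") genotypes.
        return set(gt.replace("|", "/").split("/"))

    child, mother, father = alleles(child_gt), alleles(mother_gt), alleles(father_gt)
    if "." in child | mother | father:
        return False  # a missing genotype: no decision possible
    child_alt = child - {"0"}  # "0" is the reference allele
    return bool(child_alt - (mother | father))


print(is_de_novo_candidate("0/1", "0/0", "0/0"))  # child-only ALT allele
print(is_de_novo_candidate("0/1", "0/1", "0/0"))  # inherited from the mother
```

In real data most sites flagged this way are genotyping errors in one of the three samples, so candidates need to be checked against coverage and read evidence, for example in IGV.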

Please contact us for more info and to report inconsistencies in these documents

PDF version of the former training

A partial PDF export of these pages can be downloaded here
A LaTeX tutorial to perform a similar analysis at the command line can be found here

Find more tools and answers

There are many tools out there; finding them is often the easiest part. You are welcome to try as many as you wish and improve on the results obtained with our selected toolbox. When seeking advice, please consider using:

SeqAnswers^[9] for everything related to NGS
BioStar^[10] for questions about biocomputing and scripting for biologists
stackoverflow^[11] for questions related to coding.

bioinformatics.ca directory^[12] to find bioinformatics tools sorted by categories.

The EBI online training offers many very nice training sessions with slides and exercises.

explain SAM BAM flags^[13] (mirrored at BITS)
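What such a flag explainer does can be reproduced with a short Python sketch: the FLAG field of a SAM record is a sum of bit values, and each bit has a fixed meaning in the SAM specification. The function name `explain_flag` and the wording of the labels are our own.

```python
# Bit values as defined in the SAM specification; labels paraphrased.
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails platform/vendor quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag):
    """Return the list of properties encoded in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAGS if flag & bit]


for value in (99, 147):
    print(value, explain_flag(value))
```

The values 99 and 147 are the flags typically seen for the two mates of a properly mapped pair, with the first read on the forward strand and the second on the reverse strand.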

A number of reference files are used to run GATK. They are present on our GenePattern server, but if you want to run GATK on your own machine you will need to get these files. Among others, the GATK Bundle can be accessed here

GATK4 should also run on multicore machines using the built-in Spark system. Separate documentation about it should become available at some point HERE

Another recent BMC Bioinformatics paper ^[14] reviews ways to accelerate your pipeline.

Extra readings

Other packages, including mappers and callers, have been compared. Here is a starter for further reading; you will find more on Google.

A Comparison of Variant Calling Pipelines Using Genome in a Bottle as a Reference ^[15]
Performance Assessment of Variant Calling Pipelines using Human Whole Exome Sequencing and Simulated data ^[16]
VCF-Miner: GUI-based application for mining variants and annotations stored in VCF files ^[17]
Evaluating Variant Calling Tools for Non-Matched Next-Generation Sequencing Data ^[18]
appreci8: a pipeline for precise variant calling integrating 8 tools ^[19]

What about longer variants (structural!)?

Accurate detection of complex structural variations using single-molecule sequencing ^[20]

What about CNV and RNASeq variants?

GATK4 is developing new BestPractices including
- variant calling on RNAseq
- Somatic CNV
- and more, please look at the GATK pages for more info and resources.

The GATK4 home has moved from https://software.broadinstitute.org/gatk/^[21] to https://gatk.broadinstitute.org/hc/en-us

GATK performance in numbers

A recent Nature Biotechnology paper by Poplin et al. ^[22] compared GATK to other callers. Their results concerning GATK (version 3.8), taken from the first two tables, are pictured below.

A more recent comparison of GATK4, Varscan, and Strelka2 ^[23] was published by Chen et al. in 2019 ^[24] but is not detailed here.
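Caller comparisons of this kind are typically summarised with recall (sensitivity), precision and the F1 score, all derived from the counts of true positives, false positives and false negatives against a truth set such as GIAB. A minimal Python sketch of these standard formulas follows; the counts used in the example are invented and are not taken from any of the cited papers.

```python
def caller_metrics(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
precision, recall, f1 = caller_metrics(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

F1 is the harmonic mean of precision and recall, so a caller cannot score well by being very sensitive while producing many false calls, or the reverse.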

Conclusion & Contact

Your feedback on this introductory GenePattern session on NGS variant analysis is very important to us and will be used to improve this content for later sessions. If you need more training of this kind, please contact us and we will organise additional hands-on sessions based on your requests. More advanced sessions will depend on the availability of expert users within VIB who are willing to prepare dedicated material.

contact us

References:

[1] http://www.1000genomes.org
[2] Geraldine A Van der Auwera, Mauricio O Carneiro, Christopher Hartl, Ryan Poplin, Guillermo Del Angel, Ami Levy-Moonshine, Tadeusz Jordan, Khalid Shakir, David Roazen, Joel Thibault, Eric Banks, Kiran V Garimella, David Altshuler, Stacey Gabriel, Mark A DePristo. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics: 2013, 43(1110);11.10.1-11.10.33. [PubMed:25431634]
[3] http://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM12878
[4] ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/NHGRI_Illumina300X_novoalign_bams/
[5] http://jimb.stanford.edu/giab-resources/
[6] Charles D Warden, Aaron W Adamson, Susan L Neuhausen, Xiwei Wu. Detailed comparison of two popular variant calling packages for exome and targeted exon studies. PeerJ: 2014, 2;e600. [PubMed:25289185]
[7] https://gatk.broadinstitute.org/hc/en-us
[8] https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145
[9] SeqAnswers: http://seqanswers.com
[10] BioStar: http://www.biostars.org
[11] stackoverflow: http://stackoverflow.com
[12] http://bioinformatics.ca/links_directory
[13] https://broadinstitute.github.io/picard/explain-flags.html
[14] Jacob R Heldenbrand, Saurabh Baheti, Matthew A Bockol, Travis M Drucker, Steven N Hart, Matthew E Hudson, Ravishankar K Iyer, Michael T Kalmbach, Katherine I Kendig, Eric W Klee, Nathan R Mattson, Eric D Wieben, Mathieu Wiepert, Derek E Wildman, Liudmila S Mainzer. Recommendations for performance optimizations when using GATK3.8 and GATK4. BMC Bioinformatics: 2019, 20(1);557. [PubMed:31703611]
[15] https://www.researchgate.net/publication/283954255_A_Comparison_of_Variant_Calling_Pipelines_Using_Genome_in_a_Bottle_as_a_Reference
[16] https://www.biorxiv.org/content/early/2018/06/29/359109
[17] Steven N Hart, Patrick Duffy, Daniel J Quest, Asif Hossain, Mike A Meiners, Jean-Pierre Kocher. VCF-Miner: GUI-based application for mining variants and annotations stored in VCF files. Brief Bioinform: 2016, 17(2);346-51. [PubMed:26210358]
[18] Sarah Sandmann, Aniek O de Graaf, Mohsen Karimi, Bert A van der Reijden, Eva Hellström-Lindberg, Joop H Jansen, Martin Dugas. Evaluating Variant Calling Tools for Non-Matched Next-Generation Sequencing Data. Sci Rep: 2017, 7;43169. [PubMed:28233799]
[19] Sarah Sandmann, Mohsen Karimi, Aniek O de Graaf, Christian Rohde, Stefanie Göllner, Julian Varghese, Jan Ernsting, Gunilla Walldin, Bert A van der Reijden, Carsten Müller-Tidow, Luca Malcovati, Eva Hellström-Lindberg, Joop H Jansen, Martin Dugas. appreci8: a pipeline for precise variant calling integrating 8 tools. Bioinformatics: 2018, 34(24);4205-4212. [PubMed:29945233]
[20] Fritz J Sedlazeck, Philipp Rescheneder, Moritz Smolka, Han Fang, Maria Nattestad, Arndt von Haeseler, Michael C Schatz. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods: 2018, 15(6);461-468. [PubMed:29713083]
[21] https://software.broadinstitute.org/gatk/
[22] Ryan Poplin, Pi-Chuan Chang, David Alexander, Scott Schwartz, Thomas Colthurst, Alexander Ku, Dan Newburger, Jojo Dijamco, Nam Nguyen, Pegah T Afshar, Sam S Gross, Lizzie Dorfman, Cory Y McLean, Mark A DePristo. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol: 2018, 36(10);983-987. [PubMed:30247488]
[23] Sangtae Kim, Konrad Scheffler, Aaron L Halpern, Mitchell A Bekritsky, Eunho Noh, Morten Källberg, Xiaoyu Chen, Yeonbin Kim, Doruk Beyter, Peter Krusche, Christopher T Saunders. Strelka2: fast and accurate calling of germline and somatic variants. Nat Methods: 2018, 15(8);591-594. [PubMed:30013048]
[24] Jiayun Chen, Xingsong Li, Hongbin Zhong, Yuhuan Meng, Hongli Du. Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers. Sci Rep: 2019, 9(1);9345. [PubMed:31249349]
