NGS-Var2017 Exercise.4

From BITS wiki


[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2017 | NGS-Var2017 Exercise.3 | NGS-Var2017 Exercise.5 ]


Call variants against the human reference genome (hg19) with [ samtools | bcftools ] or [ samtools | varscan ]


ex04_wf.png
For more about variant calling from NGS data, see the separate wikibook page (<http://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/DNA_Variants>).

Several methods are available to convert read mapping information into variant calls. They are all based on a local pileup of the reads aligned to the reference genome, followed by consensus extraction, but different callers use different strategies to limit the effect of wrongly mapped or low-quality reads on the final result. Here we use samtools with either bcftools or varscan to call SNVs and small indels.

Preliminary info

quick intro on VCF format

Since the expansion of the 1000 Genomes Project[1], the Variant Call Format (VCF) has become increasingly popular and is today the default format for representing sequence variation. VCF is a tab-delimited text format that describes each position that differs from the reference genome. It also carries scores obtained during sequencing, alignment, and calling to allow quality filtering, and it can hold added sequence annotations to allow annotation-driven filtering. Various tools have been developed to manipulate VCF data, and several of them are used during this session.

An example of VCF data is provided here as a primer; users will find more information in NGS-formats or in the VCF documentation <http://en.wikipedia.org/wiki/Variant_Call_Format>[2].
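As a rough illustration of the VCF layout, the sketch below parses a tiny VCF text with plain Python. The record in `VCF_TEXT` is made up for this example (it is not course data), and `parse_vcf` is a hypothetical helper, not part of any tool used in this session. It only shows how the eight fixed columns, the INFO key=value pairs, and the per-sample FORMAT columns relate to each other.

```python
# Hypothetical single-record VCF, invented for illustration only.
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample1\n"
    "chr21\t9411239\t.\tG\tA\t45.0\tPASS\tDP=12;MQ=60\tGT:DP\t0/1:12\n"
)


def parse_vcf(text):
    """Return (sample_names, records) parsed from VCF-formatted text."""
    samples, records = [], []
    for line in text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags
        info = {}
        for item in fields[7].split(";"):
            if item == ".":
                continue
            key, _, value = item.partition("=")
            info[key] = value if value else True
        # FORMAT names the colon-separated subfields of each sample column
        format_keys = fields[8].split(":") if len(fields) > 8 else []
        genotypes = {
            name: dict(zip(format_keys, column.split(":")))
            for name, column in zip(samples, fields[9:])
        }
        records.append({
            "chrom": fields[0],
            "pos": int(fields[1]),
            "ref": fields[3],
            "alt": fields[4].split(","),
            "qual": fields[5],
            "filter": fields[6],
            "info": info,
            "genotypes": genotypes,
        })
    return samples, records


samples, records = parse_vcf(VCF_TEXT)
print(samples, records[0]["chrom"], records[0]["pos"], records[0]["genotypes"])
```

All values are kept as strings except POS, so that missing values written as "." do not need special handling in this sketch.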

About bgzip-compressed VCF data and indexing

Similarly to SAM vs BAM, and in order to speed up VCF-compatible tools, raw VCF data needs to be sorted, compressed with bgzip, and indexed with tabix. We created a simple module for this purpose: BgzipAndTabixindex.

About analyzing variants in multiple 'related' genomes

When you want to analyze variants in more than one genome (family structure or patient cohort) and compare the genomes with each other, you SHOULD call the variants in a single job. This is what makes it possible to distinguish no-calls (no read support information at the position) from REF-calls (reads support that the genome is not mutated at this position).

In order to do this kind of analysis, just submit all your BAM files to the caller at once by adding them in the module to the first BAM file one after the other (drag them on top of the file input dialog). The resulting VCF file will show additional columns to the right for the additional genomes. The column names for the analyzed genomes in your VCF files will be Sample1, Sample2, ... in the same order you added the BAM files. This is valid for both caller modules used in this session.
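The difference between a no-call and a REF-call shows up in the GT subfield of each sample column. The sketch below is a minimal illustration under stated assumptions: the sample columns are invented, `classify_genotype` is a hypothetical helper, and it assumes the usual VCF conventions that "." marks a missing allele and "0" the reference allele.

```python
def classify_genotype(gt):
    """Classify a VCF GT string as 'no-call', 'REF-call' or 'variant'."""
    # Phased (|) and unphased (/) separators are treated alike here
    alleles = gt.replace("|", "/").split("/")
    called = [allele for allele in alleles if allele != "."]
    if not called:
        return "no-call"   # e.g. ./. : no read support at this position
    if all(allele == "0" for allele in called):
        return "REF-call"  # e.g. 0/0 : reads support the reference allele
    return "variant"       # at least one non-reference allele


# Hypothetical FORMAT and sample columns of one multi-sample VCF line
format_keys = "GT:DP".split(":")
sample_columns = {"Sample1": "0/1:12", "Sample2": "0/0:9", "Sample3": "./.:0"}

for name, column in sample_columns.items():
    values = dict(zip(format_keys, column.split(":")))
    print(name, values["GT"], classify_genotype(values["GT"]))
```

If each genome were called in a separate job, positions absent from one VCF file could be either no-calls or REF-calls, and the two cases could not be told apart.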

Call variants with samtools and bcftools (both using htslib)

This method has been in use the longest and is still valid for most applications. We use the 10% sample here to avoid saturating the server.

  • start the module BCFtools.Call_variants and link the latest BAM file and the hg19 reference
ex4_01.png
  • set the parameters as shown
ex4_02.png
  • run, wait and inspect results
ex4_03.png
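For readers who prefer the command line, the module steps above might correspond roughly to the pipeline built below. This is a sketch, not the module's actual command: the exact command run by BCFtools.Call_variants is not shown on this page, and the file names `hg19.fa`, `sample_10pc.bam` and `calls.vcf` are placeholders. The flags used (`bcftools mpileup -f`, `bcftools call -m -v -O v -o`) exist in recent bcftools releases; older setups piped `samtools mpileup` into `bcftools call` instead.

```python
import shlex


def bcftools_call_pipeline(reference, bams, out_vcf):
    """Build a shell pipeline string: pileup of one or more BAMs, then calling."""
    # -f: indexed reference FASTA; several BAMs give a multi-sample VCF
    mpileup = ["bcftools", "mpileup", "-f", reference, *bams]
    # -m: multiallelic caller, -v: variant sites only, -O v: uncompressed VCF
    call = ["bcftools", "call", "-m", "-v", "-O", "v", "-o", out_vcf]
    return shlex.join(mpileup) + " | " + shlex.join(call)


# Placeholder file names, assumed for illustration only
command = bcftools_call_pipeline("hg19.fa", ["sample_10pc.bam"], "calls.vcf")
print(command)
# To execute it (requires bcftools on the PATH):
#   import subprocess; subprocess.run(command, shell=True, check=True)
```

The command is only printed here, so the sketch runs without bcftools installed.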

 

Call variants with samtools and varscan

While samtools/bcftools calls variants using a probabilistic approach, VarScan2 [3] instead uses a more basic approach: it filters out what the user considers to be background. Several cutoffs can be defined to filter out 'bad' mappings and keep 'good' ones for calling.

  • start the module Varscan and link the latest BAM file and the hg19 reference
ex4_04.png
  • set the parameters as shown
ex4_05.png
  • wait for the results
ex4_06.png
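On the command line, the equivalent of the steps above could look like the pipeline built below. It is a sketch under assumptions: the jar name `VarScan.jar`, the file names, and the cutoff values are placeholders and are not the settings shown in the screenshots. The VarScan options used (`mpileup2cns`, `--min-coverage`, `--min-var-freq`, `--p-value`, `--variants`, `--output-vcf`) are documented VarScan v2 options.

```python
import shlex


def varscan_pipeline(reference, bam, out_vcf, jar="VarScan.jar",
                     min_coverage=8, min_var_freq=0.2, p_value=0.05):
    """Build a shell pipeline string: samtools pileup piped into VarScan2."""
    mpileup = ["samtools", "mpileup", "-f", reference, bam]
    varscan = [
        "java", "-jar", jar, "mpileup2cns",
        "--min-coverage", str(min_coverage),  # minimum read depth at a position
        "--min-var-freq", str(min_var_freq),  # minimum variant allele frequency
        "--p-value", str(p_value),            # p-value threshold for calling
        "--variants",                         # report only variant positions
        "--output-vcf", "1",                  # write VCF instead of native format
    ]
    # VarScan writes to standard output, so redirect it to the output file
    return f"{shlex.join(mpileup)} | {shlex.join(varscan)} > {shlex.quote(out_vcf)}"


# Placeholder file names and cutoffs, assumed for illustration only
print(varscan_pipeline("hg19.fa", "sample_10pc.bam", "varscan_calls.vcf"))
```

Raising `min_coverage` or `min_var_freq` makes the caller stricter about what is treated as signal rather than background.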

 

Compress and index the obtained VCF file(s) with BgzipAndTabixindex

 

Do the following for one of the available 10%-sample VCF files. In the next step, we will use the full data stored on the server.

 

  • start the module BgzipAndTabixindex and link the VCF file
ex4_07.png
  • set the parameters as shown and run
ex4_08.png
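Outside GenePattern, the same compression and indexing can be done with the `bgzip` and `tabix` programs from htslib. The sketch below only builds the two commands; `calls.vcf` is a placeholder file name and `bgzip_and_tabix_commands` is a hypothetical helper. It assumes the VCF file is already sorted by chromosome and position, which tabix requires.

```python
import subprocess


def bgzip_and_tabix_commands(vcf_path):
    """Return the two argument lists that compress and index a sorted VCF."""
    compressed = vcf_path + ".gz"
    return [
        ["bgzip", "-f", vcf_path],          # replaces calls.vcf by calls.vcf.gz
        ["tabix", "-p", "vcf", compressed],  # writes the index calls.vcf.gz.tbi
    ]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


commands = bgzip_and_tabix_commands("calls.vcf")
for command in commands:
    print(" ".join(command))
# run_all(commands)  # uncomment on a machine where bgzip and tabix are installed
```

Plain gzip is not a substitute here: tabix needs the block-compressed output of bgzip to jump directly to a genomic region.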

Perform QC on the obtained VCF files using bcftools

As always in NGS, fresh data should be QC'ed to avoid surprises and to catch errors and biases as early as possible. Tools performing quality control of VCF data are scarce; one of them is bcftools stats (not available on the GenePattern server).

We will get some QC info from the next exercise.
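One of the summary values that bcftools stats reports is the transition/transversion (Ts/Tv) ratio of the SNVs. As a minimal stand-in while the tool is unavailable on the server, the sketch below computes that ratio from VCF lines in plain Python. The records are invented for the example, `ts_tv_ratio` is a hypothetical helper, and the counting is a simplification that does not aim to reproduce bcftools stats exactly.

```python
# Purine<->purine and pyrimidine<->pyrimidine substitutions are transitions
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(vcf_lines):
    """Count transitions and transversions among single-base substitutions."""
    ts = tv = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        ref = fields[3].upper()
        for alt in fields[4].upper().split(","):
            # Ignore indels, missing ALT (".") and symbolic alleles
            if len(ref) != 1 or len(alt) != 1:
                continue
            if ref not in "ACGT" or alt not in "ACGT":
                continue
            if (ref, alt) in TRANSITIONS:
                ts += 1
            else:
                tv += 1
    ratio = ts / tv if tv else float("nan")
    return ts, tv, ratio


# Hypothetical records: two transitions, one transversion, one indel (ignored)
example_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr21\t100\t.\tG\tA\t50\tPASS\t.",
    "chr21\t200\t.\tC\tT\t50\tPASS\t.",
    "chr21\t300\t.\tA\tC\t50\tPASS\t.",
    "chr21\t400\t.\tAT\tA\t50\tPASS\t.",
]
print(ts_tv_ratio(example_lines))
```

A Ts/Tv ratio far from what is expected for the sequenced region is a common hint that a call set contains many false positives.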

 

download exercise files

Download exercise files here

Use the appropriate application to open the files in ex4-files.

References:
  1. http://www.1000genomes.org
  2. http://en.wikipedia.org/wiki/Variant_Call_Format
  3. Daniel C Koboldt, Ken Chen, Todd Wylie, David E Larson, Michael D McLellan, Elaine R Mardis, George M Weinstock, Richard K Wilson, Li Ding
    VarScan: variant detection in massively parallel sequencing of individual and pooled samples.
    Bioinformatics: 2009, 25(17);2283-5
    [PubMed:19542151]

