NGS-Var2018 Exercise.5

From BITS wiki




Intersect VCF files


ex05_wf.png
From here on, we use data obtained from the full read set, since performance and speed are no longer a concern for the class.

Because the 1000 Genomes (1000g) sample used today is well characterized, we could download a gold-standard set of variants for chr22 of that sample. It is available for you in the 0_reference folder on the FTP.

Use the VCF_intersect module

We will use this file to identify the subset of GATK calls that are already known (and accepted as true high-quality variants).

  • start the VCF_intersect module with both the GATK variants and the gold-standard NA12878_HG001-chr22 subset
ex5_01.png
  • start the job and wait for results
ex5_02.png
The intersect file contains the File#1 records that are also present in File#2, and therefore carries all fields from File#1.
When relevant, you can run the tool a second time with the two datasets swapped.
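
Why the order of the two inputs matters can be sketched on toy data. The snippet below is only an illustration: the file names, the records and the `isec` helper are hypothetical, and the matching rule (identical CHROM, POS, REF and ALT) is an assumption, since the exact rule applied by the VCF_intersect module is not described here.

```shell
# Toy, headerless stand-ins for two VCF files (columns: CHROM POS ID REF ALT).
# File names and records are hypothetical.
tmp=$(mktemp -d)
printf 'chr22\t100\t.\tA\tG\nchr22\t200\t.\tC\tT\nchr22\t300\t.\tG\tA\n' > "$tmp/file1.vcf"
printf 'chr22\t200\trs1\tC\tT\nchr22\t300\trs2\tG\tC\nchr22\t400\trs3\tT\tA\n' > "$tmp/file2.vcf"

# isec FIRST SECOND: print the records of FIRST whose CHROM:POS:REF:ALT key
# also occurs in SECOND (assumed matching rule, not necessarily the module's).
isec() {
  awk -F'\t' 'NR == FNR { seen[$1 ":" $2 ":" $4 ":" $5]; next }
              ($1 ":" $2 ":" $4 ":" $5) in seen' "$2" "$1"
}

one_vs_two=$(isec "$tmp/file1.vcf" "$tmp/file2.vcf")   # keeps the File#1 record (ID ".")
two_vs_one=$(isec "$tmp/file2.vcf" "$tmp/file1.vcf")   # keeps the File#2 record (ID "rs1")
printf '%s\n%s\n' "$one_vs_two" "$two_vs_one"
rm -r "$tmp"
```

Both runs report the same shared site, but each output line carries the fields of whichever file was given first. On real bgzipped and indexed files, a dedicated tool such as `bcftools isec` serves the same purpose.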

The shared variants are now identified and can be annotated and screened for putative causative mutations, as done in the next exercise.

Rapid excursion in the terminal

The following operations are not yet possible in GenePattern.

A few terminal commands allow us to count the calls in the three files and build a Venn diagram.

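# count the variant records (non-header lines) in the gold-standard set
zgrep -c -v "^#" NA12878_HG001-chr22_gold.vcf.gz
42190

# count the variant records in the GATK call set
zgrep -c -v "^#" GATK_variants-full.recalibrated.vcf.gz
88875

# count the shared records in the intersect file (not compressed, hence grep)
grep -c -v "^#" GATK_variants-full.recalibrated.vcf_isec_NA12878_HG001-chr22_gold.vcf.vcf
40791

# gold-standard variants missing from the GATK calls
echo "42190-40791" | bc
1399

# GATK calls absent from the gold standard
echo "88875-40791" | bc
48084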
 
# using a custom script: https://github.com/BITS-VIB/venn-tools
2DVenn.R -a 48084 -b 1399 -i 40791 -A "GATK4" -B "Gold-standard" -x 1 -o recall -u 2
2DVenn.R -a 48084 -b 1399 -i 40791 -A "GATK4" -B "Gold-standard" -x 1 -o recall_pc -u 2 -P 1
recall.png
recall_pc.png

From the numbers one can compute various metrics as explained HERE (not done here but often seen in variant reports)

Simply put, we can say that:

  • 96.68% of the gold variants are present in the GATK data (high recall: GATK is quite sensitive)
  • 45.90% of the GATK variants are gold variants (low precision: more than half of the GATK calls are absent from the gold set)
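
The two percentages follow directly from the three counts obtained in the terminal. A minimal sketch of the arithmetic, assuming the gold set is taken as the truth (so every GATK call outside it is counted as a false positive); the helper name `pct` is hypothetical:

```shell
# Counts reported by the zgrep/grep commands in this exercise
gold=42190     # records in the gold-standard file
gatk=88875     # records in the GATK call set
shared=40791   # records in the intersect file

# Hypothetical helper: prints 100 * part / whole with two decimals
pct() { awk -v p="$1" -v w="$2" 'BEGIN { printf "%.2f", 100 * p / w }'; }

recall=$(pct "$shared" "$gold")      # share of gold variants found by GATK
precision=$(pct "$shared" "$gatk")   # share of GATK calls confirmed by the gold set

echo "recall: ${recall}%  precision: ${precision}%"
```

Recall is often called sensitivity in variant reports. The second figure is a precision (positive predictive value) rather than a specificity in the strict sense, because true negatives are not counted here.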

Users generally accept a number of false-positive calls in their results as the price for recovering most of the true-positive variants. This is what we observe here, and it motivates the use of GATK by most of today's scientists.

Download exercise files

Download exercise files here

Use the right application to open the files present in ex5-files
