NGS-Var2020 Exercise.5

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.4 | NGS-Var2020 Exercise.6 ]


Intersect VCF files


ex05_wf.png

Handicon.png From here on, we use data obtained from the full-read set as performance and speed are no issues anymore for the class

Because the 1000g sample used today is well known, we could download a gold-standard set of variants for the chr22 of that sample and placed it for you on the FTP 0_reference folder

Use the VCF_compare module

We will use this tool to identify the part of GATK calls that are unique to the hard-filtering, shared by both GATK methods, and common with the TRUTH dataset obtained from the 1000genome project pages

  • produce gzipped and indexed versions of the VCF files to be compared (use for that the module: BgzipAndTabixindex [not shown here])
  • start the VCF_compare with the gold-standard NA12878_HG001-chr22 subset and both GATK compressed VCF and indices
ex5_03.png
  • start the job and wait for results
ex5_04.png

Technical.png both GATK files are identical as a matter of variants calls, the difference between them (besides the different scores) lies in col-7 containing either PASS or . for accepted variants vs 'disqualified' variants with col-7 reporting the reason(s) for their failure

Technical.png Only PASS and . variants in col-7 have been compared using the defaults settings filter Yes - These are the trusted calls

Variant counts per type in col-7

# variant counts per type (col-7) in GATK_variants-VQSR-100pc.vcf.gz
 
  74627 PASS
    110 VQSRTrancheINDEL99.90to99.95
    610 VQSRTrancheINDEL99.95to100.00
   4524 VQSRTrancheSNP99.80to99.90
   1448 VQSRTrancheSNP99.90to99.95
   7737 VQSRTrancheSNP99.95to100.00
 
# variant counts per type (col-7) in GATK_variants-hardfilter-100pc.vcf.gz
 
  67782 PASS
    542 IndelQualFilter1
      8 IndelQualFilter2
      1 IndelQualFilter4
   1796 SnpQualFilter1
     14 SnpQualFilter1;SnpQualFilter2
      8 SnpQualFilter1;SnpQualFilter2;SnpQualFilter3
    115 SnpQualFilter1;SnpQualFilter2;SnpQualFilter3;SnpQualFilter6
      1 SnpQualFilter1;SnpQualFilter2;SnpQualFilter4
     50 SnpQualFilter1;SnpQualFilter2;SnpQualFilter4;SnpQualFilter6
      1 SnpQualFilter1;SnpQualFilter2;SnpQualFilter5
      1 SnpQualFilter1;SnpQualFilter2;SnpQualFilter5;SnpQualFilter6
    539 SnpQualFilter1;SnpQualFilter2;SnpQualFilter6
   1619 SnpQualFilter1;SnpQualFilter3
      1 SnpQualFilter1;SnpQualFilter3;SnpQualFilter4
      2 SnpQualFilter1;SnpQualFilter3;SnpQualFilter5
    539 SnpQualFilter1;SnpQualFilter3;SnpQualFilter6
    430 SnpQualFilter1;SnpQualFilter4
      9 SnpQualFilter1;SnpQualFilter4;SnpQualFilter5
      4 SnpQualFilter1;SnpQualFilter4;SnpQualFilter5;SnpQualFilter6
    216 SnpQualFilter1;SnpQualFilter4;SnpQualFilter6
      3 SnpQualFilter1;SnpQualFilter5
    542 SnpQualFilter1;SnpQualFilter6
    104 SnpQualFilter2
    142 SnpQualFilter2;SnpQualFilter3
      3 SnpQualFilter2;SnpQualFilter3;SnpQualFilter4
     18 SnpQualFilter2;SnpQualFilter3;SnpQualFilter4;SnpQualFilter6
      1 SnpQualFilter2;SnpQualFilter3;SnpQualFilter5;SnpQualFilter6
    772 SnpQualFilter2;SnpQualFilter3;SnpQualFilter6
     28 SnpQualFilter2;SnpQualFilter4
      1 SnpQualFilter2;SnpQualFilter4;SnpQualFilter5;SnpQualFilter6
    350 SnpQualFilter2;SnpQualFilter4;SnpQualFilter6
      1 SnpQualFilter2;SnpQualFilter5
      4 SnpQualFilter2;SnpQualFilter5;SnpQualFilter6
    691 SnpQualFilter2;SnpQualFilter6
   7131 SnpQualFilter3
    112 SnpQualFilter3;SnpQualFilter4
      1 SnpQualFilter3;SnpQualFilter4;SnpQualFilter5
      7 SnpQualFilter3;SnpQualFilter4;SnpQualFilter6
     12 SnpQualFilter3;SnpQualFilter5
   1862 SnpQualFilter3;SnpQualFilter6
   2720 SnpQualFilter4
     51 SnpQualFilter4;SnpQualFilter5
      3 SnpQualFilter4;SnpQualFilter5;SnpQualFilter6
    157 SnpQualFilter4;SnpQualFilter6
     25 SnpQualFilter5
    637 SnpQualFilter6
  • open the comparison file to see the text results and the PDF file to see the Venn plot
ex5_05.png

Specificity vs Completeness

  • HardFiltering find 6 gold variants not found by VQSR at the cost of 3141 noisy calls
  • VQSR find 160 gold variants not found by Hard-Filtering at the cost of 9410 noisy calls
  • both GATK methods call 23855 variants unknown to the gold set BUT they also find 40589 gold variants out of 42190 in the gold set (96.2%)

Use the VCF_intersect module

You can also use this other module to identify which variants are part of GATK calls shared between two files (eg: which are the gold variants found by GATK)

  • start the VCF_intersect with both GATK variants and the gold-standard NA12878_HG001-chr22 subset
ex5_01.png
  • start the job and wait for results
ex5_02.png

Handicon.png The content of the intersect file is the part of File#1 overlapped in File#2 and therefore contains all fields from File#1

Handicon.png When relevant, you can run the tool a second time and permute the two datasets

Handicon.png You can also intersect an intersection with a third file to get triple-intersects

The shared variants are now identified and can be annotated and screened for putative causative mutations as done in the next exercise

Rapid excursion in the terminal

Optional paragraph to illustrate the higher flexibility of the command-line

Handicon.png the following operations are not yet possible in Genepattern

A couple of terminal commands allow us to count the calls in the three files and build a VENN diagram.

To be more fair, we consider here only the calls made by both GATK methods and the counts unique to the gold standard, ignoring the GATK-specific counts on the left and right sides of the Venn above.

Gold_only=1435
GATK_only=23855
Intersect=40589
 
# using a custom script: https://github.com/BITS-VIB/venn-tools
2DVenn.R -a 1435 -b 23855 -i 40589 -A "Gold-standard" -B "GATK4_shared" -x 1 -o recall -u 2
2DVenn.R -a 1435 -b 23855 -i 40589 -A "Gold-standard" -B "GATK4_shared" -x 1 -o recall_pc -u 2 -P 1
recall.png
recall_pc.png

From the numbers one can compute various metrics as explained HERE (not done here but often seen in variant reports)

Simply put we can say that:

  • 40589/(40589+1435) = 96.59% of the gold variants are present in the shared-GATK data (GATK is quite sensitive)
  • 40589/(40589+23855) = 62.98% of the shared-GATK variants are gold variants (GATK was not really specific)

Users generally accept to find a number of False Positive calls in their results as a cost to also get most of the True positive variants. This is what we observe here and that motivate the use of GATK by most of today's scientists.

download exercise files

Download exercise files here

Use the right application to open the files present in ex5-files

References:

[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.4 | NGS-Var2020 Exercise.6 ]