NGS-Var2020 Exercise.5
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.4 | NGS-Var2020 Exercise.6 ]
Intersect VCF files
From here on, we use data obtained from the full-read set as performance and speed are no issues anymore for the class
Because the 1000g sample used today is well known, we could download a gold-standard set of variants for the chr22 of that sample and placed it for you on the FTP 0_reference folder
- compressed VCF: https://data.bits.vib.be/pub/trainingen/NGSVAR2020/0_reference/NA12878_HG001-chr22_gold.vcf.gz
- matching tabix index: https://data.bits.vib.be/pub/trainingen/NGSVAR2020/0_reference/NA12878_HG001-chr22_gold.vcf.gz.tbi
Contents
Use the VCF_compare module
We will use this tool to identify the part of GATK calls that are unique to the hard-filtering, shared by both GATK methods, and common with the TRUTH dataset obtained from the 1000genome project pages
- produce gzipped and indexed versions of the VCF files to be compared (use for that the module: BgzipAndTabixindex [not shown here])
- start the VCF_compare with the gold-standard NA12878_HG001-chr22 subset and both GATK compressed VCF and indices
- start the job and wait for results
both GATK files are identical as a matter of variants calls, the difference between them (besides the different scores) lies in col-7 containing either PASS or . for accepted variants vs 'disqualified' variants with col-7 reporting the reason(s) for their failure
Only PASS and . variants in col-7 have been compared using the defaults settings filter Yes - These are the trusted calls
Variant counts per type in col-7
# variant counts per type (col-7) in GATK_variants-VQSR-100pc.vcf.gz 74627 PASS 110 VQSRTrancheINDEL99.90to99.95 610 VQSRTrancheINDEL99.95to100.00 4524 VQSRTrancheSNP99.80to99.90 1448 VQSRTrancheSNP99.90to99.95 7737 VQSRTrancheSNP99.95to100.00 # variant counts per type (col-7) in GATK_variants-hardfilter-100pc.vcf.gz 67782 PASS 542 IndelQualFilter1 8 IndelQualFilter2 1 IndelQualFilter4 1796 SnpQualFilter1 14 SnpQualFilter1;SnpQualFilter2 8 SnpQualFilter1;SnpQualFilter2;SnpQualFilter3 115 SnpQualFilter1;SnpQualFilter2;SnpQualFilter3;SnpQualFilter6 1 SnpQualFilter1;SnpQualFilter2;SnpQualFilter4 50 SnpQualFilter1;SnpQualFilter2;SnpQualFilter4;SnpQualFilter6 1 SnpQualFilter1;SnpQualFilter2;SnpQualFilter5 1 SnpQualFilter1;SnpQualFilter2;SnpQualFilter5;SnpQualFilter6 539 SnpQualFilter1;SnpQualFilter2;SnpQualFilter6 1619 SnpQualFilter1;SnpQualFilter3 1 SnpQualFilter1;SnpQualFilter3;SnpQualFilter4 2 SnpQualFilter1;SnpQualFilter3;SnpQualFilter5 539 SnpQualFilter1;SnpQualFilter3;SnpQualFilter6 430 SnpQualFilter1;SnpQualFilter4 9 SnpQualFilter1;SnpQualFilter4;SnpQualFilter5 4 SnpQualFilter1;SnpQualFilter4;SnpQualFilter5;SnpQualFilter6 216 SnpQualFilter1;SnpQualFilter4;SnpQualFilter6 3 SnpQualFilter1;SnpQualFilter5 542 SnpQualFilter1;SnpQualFilter6 104 SnpQualFilter2 142 SnpQualFilter2;SnpQualFilter3 3 SnpQualFilter2;SnpQualFilter3;SnpQualFilter4 18 SnpQualFilter2;SnpQualFilter3;SnpQualFilter4;SnpQualFilter6 1 SnpQualFilter2;SnpQualFilter3;SnpQualFilter5;SnpQualFilter6 772 SnpQualFilter2;SnpQualFilter3;SnpQualFilter6 28 SnpQualFilter2;SnpQualFilter4 1 SnpQualFilter2;SnpQualFilter4;SnpQualFilter5;SnpQualFilter6 350 SnpQualFilter2;SnpQualFilter4;SnpQualFilter6 1 SnpQualFilter2;SnpQualFilter5 4 SnpQualFilter2;SnpQualFilter5;SnpQualFilter6 691 SnpQualFilter2;SnpQualFilter6 7131 SnpQualFilter3 112 SnpQualFilter3;SnpQualFilter4 1 SnpQualFilter3;SnpQualFilter4;SnpQualFilter5 7 SnpQualFilter3;SnpQualFilter4;SnpQualFilter6 12 SnpQualFilter3;SnpQualFilter5 1862 SnpQualFilter3;SnpQualFilter6 2720 SnpQualFilter4 51 SnpQualFilter4;SnpQualFilter5 3 SnpQualFilter4;SnpQualFilter5;SnpQualFilter6 157 SnpQualFilter4;SnpQualFilter6 25 SnpQualFilter5 637 SnpQualFilter6
- open the comparison file to see the text results and the PDF file to see the Venn plot
Specificity vs Completeness
- HardFiltering find 6 gold variants not found by VQSR at the cost of 3141 noisy calls
- VQSR find 160 gold variants not found by Hard-Filtering at the cost of 9410 noisy calls
- both GATK methods call 23855 variants unknown to the gold set BUT they also find 40589 gold variants out of 42190 in the gold set (96.2%)
Use the VCF_intersect module
You can also use this other module to identify which variants are part of GATK calls shared between two files (eg: which are the gold variants found by GATK)
- start the VCF_intersect with both GATK variants and the gold-standard NA12878_HG001-chr22 subset
- start the job and wait for results
The content of the intersect file is the part of File#1 overlapped in File#2 and therefore contains all fields from File#1
The shared variants are now identified and can be annotated and screened for putative causative mutations as done in the next exercise
Rapid excursion in the terminal
Optional paragraph to illustrate the higher flexibility of the command-line
A couple of terminal commands allow us to count the calls in the three files and build a VENN diagram.
To be more fair, we consider here only the calls made by both GATK methods and the counts unique to the gold standard, ignoring the GATK-specific counts on the left and right sides of the Venn above.
Gold_only=1435 GATK_only=23855 Intersect=40589 # using a custom script: https://github.com/BITS-VIB/venn-tools 2DVenn.R -a 1435 -b 23855 -i 40589 -A "Gold-standard" -B "GATK4_shared" -x 1 -o recall -u 2 2DVenn.R -a 1435 -b 23855 -i 40589 -A "Gold-standard" -B "GATK4_shared" -x 1 -o recall_pc -u 2 -P 1
From the numbers one can compute various metrics as explained HERE (not done here but often seen in variant reports)
Simply put we can say that:
- 40589/(40589+1435) = 96.59% of the gold variants are present in the shared-GATK data (GATK is quite sensitive)
- 40589/(40589+23855) = 62.98% of the shared-GATK variants are gold variants (GATK was not really specific)
Users generally accept to find a number of False Positive calls in their results as a cost to also get most of the True positive variants. This is what we observe here and that motivate the use of GATK by most of today's scientists.
download exercise files
Download exercise files here
References:
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.4 | NGS-Var2020 Exercise.6 ]