NGS-Var2018 Exercise.5
Intersect VCF files
Because the 1000g sample used today is well known, we downloaded a gold-standard set of variants for chr22 of that sample and placed it for you in the 0_reference folder on the FTP server:
- compressed VCF: https://data.bits.vib.be/pub/trainingen/NGSVAR2018/0_reference/NA12878_HG001-chr22_gold.vcf.gz
- matching tabix index: https://data.bits.vib.be/pub/trainingen/NGSVAR2018/0_reference/NA12878_HG001-chr22_gold.vcf.gz.tbi
Use the VCF_intersect module
We will use this file to identify the subset of GATK calls that are already known (and accepted as true high-quality variants).
- set up VCF_intersect with both the GATK variants and the gold-standard NA12878_HG001-chr22 subset as inputs
- start the job and wait for the results
The shared variants are now identified and can be annotated and screened for putative causative mutations, as done in the next exercise.
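For readers who want to see what such an intersection amounts to, here is a minimal sketch using only gzip and awk. It is not the VCF_intersect module itself: the function name `vcf_shared` is hypothetical, and the matching rule (identical CHROM, POS, REF and ALT fields) is an assumption that ignores variant normalization and multi-allelic records, so its counts may differ from those of a dedicated tool such as bcftools isec.

```shell
# Hypothetical helper: print the records of VCF A that also occur in VCF B.
# Usage: vcf_shared A.vcf[.gz] B.vcf[.gz]
vcf_shared() {
    keys=$(mktemp)
    # Collect CHROM/POS/REF/ALT keys from B (gzip -cdf also passes plain files through).
    gzip -cdf "$2" | awk -F'\t' '!/^#/ { print $1 "\t" $2 "\t" $4 "\t" $5 }' > "$keys"
    # Print every non-header record of A whose key was seen in B.
    gzip -cdf "$1" | awk -F'\t' -v keys="$keys" '
        BEGIN { while ((getline line < keys) > 0) seen[line] = 1 }
        /^#/  { next }
        {
            key = $1 "\t" $2 "\t" $4 "\t" $5
            if (key in seen) print
        }
    '
    rm -f "$keys"
}

# Example call (file names taken from this exercise):
# vcf_shared GATK_variants-full.recalibrated.vcf.gz NA12878_HG001-chr22_gold.vcf.gz | wc -l
```

Matching on the four fields rather than on position alone avoids counting two different alleles at the same site as a shared variant.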
Rapid excursion in the terminal
A few terminal commands allow us to count the calls in the three files and build a Venn diagram:
zgrep -c -v "^#" NA12878_HG001-chr22_gold.vcf.gz
42190
zgrep -c -v "^#" GATK_variants-full.recalibrated.vcf.gz
88875
grep -c -v "^#" GATK_variants-full.recalibrated.vcf_isec_NA12878_HG001-chr22_gold.vcf.vcf
40791
echo "42190-40791" | bc
1399
echo "88875-40791" | bc
48084
# using a custom script: https://github.com/BITS-VIB/venn-tools
2DVenn.R -a 48084 -b 1399 -i 40791 -A "GATK4" -B "Gold-standard" -x 1 -o recall -u 2
2DVenn.R -a 48084 -b 1399 -i 40791 -A "GATK4" -B "Gold-standard" -x 1 -o recall_pc -u 2 -P 1
From the numbers one can compute various metrics as explained HERE (not done here but often seen in variant reports)
Simply put, we can say that:
- 96.68% of the gold variants are present in the GATK data (high recall: GATK is quite sensitive)
- 45.90% of the GATK variants are gold variants (low precision: GATK is not very specific)
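The two percentages can be recomputed from the three counts obtained in the terminal. The sketch below assumes that the shared calls are treated as true positives, the gold-only variants as false negatives and the GATK-only calls as false positives. The function name `vcf_metrics` is hypothetical, and the F1 score is added only as one example of the further metrics often seen in variant reports.

```shell
# Hypothetical helper: vcf_metrics <gold_count> <caller_count> <shared_count>
vcf_metrics() {
    awk -v gold="$1" -v calls="$2" -v tp="$3" 'BEGIN {
        fneg = gold - tp               # gold variants missed by the caller
        fpos = calls - tp              # calls absent from the gold set
        recall = tp / gold             # share of gold variants recovered
        precision = tp / calls         # share of calls confirmed by the gold set
        f1 = 2 * precision * recall / (precision + recall)
        printf "FN=%d FP=%d\n", fneg, fpos
        printf "recall=%.2f%% precision=%.2f%% F1=%.2f%%\n",
               100 * recall, 100 * precision, 100 * f1
    }'
}

# Counts from the zgrep/grep commands of this exercise
vcf_metrics 42190 88875 40791
```

Doing the arithmetic in awk rather than bc keeps the floating-point divisions and the formatted output in one place.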
Users generally accept a number of false-positive calls in their results as the cost of also recovering most of the true-positive variants. This is what we observe here, and it is what motivates the use of GATK by most of today's scientists.
download exercise files
Download exercise files here