NGS-Var2020 Exercise.6
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.5 | NGS-Var2020 Exercise.7 ]
Annotate and filter VCF variant lists with SnpEff and SNPSift
Contents
- 1 Choose the right tool to enrich your VCF data
- 2 Annotate your variants with SnpEff
- 3 Explore the snpEff_summary.html report
- 4 Compress and index the VCF data
- 5 Add variant type information
- 6 Add dbSNP information for known variants
- 7 Filter and select relevant data from a VCF file with SnpSift
- 8 Extract minimal information from a Filtered VCF
- 9 Apply a more complex filter
- 10 download exercise files
Choose the right tool to enrich your VCF data
A growing number of tools are available to annotate and select from VCF files. The choice of the best tool for your application depends on several factors.
- when you need the job done and do not worry about the flexibility, we advise to use SnpEff and the companion SnpSift which are both easy to use java programs.
- if you wish to add annotations from third-party databases that are not present in the other tools, or if you work on a organism absent from the above tool, you may consider using Annovar that was included in our former training session ([1]).
- Another rising tool from the Broad Institute called Funcotator is not yet really mature but should develop into a nice tool in the near future. You can access the page of the Funcotator Beta version here
- when you only need to annotate a few VCF rows, you are welcome to use public servers like:
Submitting 'patentable' information to the WEB infringes the novelty clause and will expose patient information to the internet, and the size of input is limited to few 100's lines
- other tools have been used with success like vcfCodingSnps ([5])
Annotate your variants with SnpEff
- start the SnpEff module and link to the intersect VCF file with varscan as first set
- run the tool and wait for results
Explore the snpEff_summary.html report
The report can be found HERE
A second file (snpEff_genes.txt') was produced that can be open in excel and provides the counts of each different variant type for each gene. A transpose content for the first gene is reproduced below
#GeneName | A4GALT | A4GALT |
---|---|---|
GeneId | A4GALT | A4GALT |
TranscriptId | NM_001318038.1 | NM_017436.5 |
BioType | protein_coding | protein_coding |
variants_impact_HIGH | 0 | 0 |
variants_impact_LOW | 2 | 2 |
variants_impact_MODERATE | 1 | 1 |
variants_impact_MODIFIER | 139 | 139 |
variants_effect_3_prime_UTR_variant | 1 | 1 |
variants_effect_5_prime_UTR_premature_start_codon_gain_variant | 0 | 0 |
variants_effect_5_prime_UTR_variant | 0 | 1 |
variants_effect_conservative_inframe_deletion | 0 | 0 |
variants_effect_conservative_inframe_insertion | 0 | 0 |
variants_effect_disruptive_inframe_deletion | 0 | 0 |
variants_effect_disruptive_inframe_insertion | 0 | 0 |
variants_effect_downstream_gene_variant | 18 | 18 |
variants_effect_frameshift_variant | 0 | 0 |
variants_effect_intron_variant | 110 | 106 |
variants_effect_missense_variant | 1 | 1 |
variants_effect_non_coding_transcript_exon_variant | 0 | 0 |
variants_effect_non_coding_transcript_variant | 0 | 0 |
variants_effect_sequence_feature | 0 | 0 |
variants_effect_splice_acceptor_variant | 0 | 0 |
variants_effect_splice_donor_variant | 0 | 0 |
variants_effect_splice_region_variant | 0 | 0 |
variants_effect_stop_gained | 0 | 0 |
variants_effect_structural_interaction_variant | 0 | 0 |
variants_effect_synonymous_variant | 2 | 2 |
variants_effect_upstream_gene_variant | 10 | 13 |
Compress and index the VCF data
- Use the BgzipAndTabindex module
Add variant type information
- Use the SnpSift.varType module
splaisan@gbw-l-m0172:~/Downloads$ zgrep -v "^#" SnpEff_GATK_HG001_recalibrated_vartypes.vcf.gz | head -1 chr22 10510212 . A T 48.28 VQSRTrancheSNP99.80to99.90 AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=25.00;QD=24.14;SOR=2.303;VQSLOD=-2.123e+01;culprit=MQ;SNP;HOM;VARTYPE=SNP;ANN=T|intergenic_region|MODIFIER|CHR_START-LOC102723780|CHR_START-LOC102723780|intergenic_region|CHR_START-LOC102723780|||n.10510212A>T|||||| GT:AD:DP:GQ:PL 1/1:0,2:2:6:60,6,0
Add dbSNP information for known variants
If you did not select the known SNP annotation when you ran GATK.GermlineGenotyper, you can do this now using SNPSift as shown below.
The results will be the addition of the rsID in field #3 as well as the annotation of the dbSNP build used for this in the INFO field #8.
chr22 37185382 rs229526 G C 3803.6 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-1.404e+00;DB;DP=267;ExcessHet=3.0103;FS=2.701;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;POSITIVE_TRAIN_SITE;QD=14.25;ReadPosRankSum=7.000e-03;SOR=0.536;VQSLOD=18.47;culprit=MQ;ANN=C|missense_variant|MODERATE|C1QTNF6|C1QTNF6|transcript|NM_031910.3|protein_coding|2/3|c.125C>G|p.Pro42Arg|202/2936|125/837|42/278||,C|missense_variant|MODERATE|C1QTNF6|C1QTNF6|transcript|NM_182486.1|protein_coding|2/4|c.125C>G|p.Pro42Arg|202/1652|125/837|42/278|| GT:AD:DP:GQ:PL 0/1:132,135:267:99:3811,0,3760
- Use the SnpSift.Annotate module
##SnpSiftCmd="SnpSift Annotate /opt/genepattern/users/.cache/uploads/cache/data.gp.vib.be/pub/snpdb/Homo_sapiens.dbSNP_146.UCSC.hg38.vcf.gz /opt/genepattern/jobResults/1508/GATK_variants.recalibrated_VQSR.vcf.gz"
# after SnpSift.Annotate, the version of the dbSNP database can also be found for each row in the VCF (here: dbSNPBuildID=79)
chr22 37185382 rs229526 G C 3803.6 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-1.404e+00;DB;DP=267;ExcessHet=3.0103;FS=2.701;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;POSITIVE_TRAIN_SITE;QD=14.25;ReadPosRankSum=7.000e-03;SOR=0.536;VQSLOD=18.47;culprit=MQ;ANN=C|missense_variant|MODERATE|C1QTNF6|C1QTNF6|transcript|NM_031910.3|protein_coding|2/3|c.125C>G|p.Pro42Arg|202/2936|125/837|42/278||,C|missense_variant|MODERATE|C1QTNF6|C1QTNF6|transcript|NM_182486.1|protein_coding|2/4|c.125C>G|p.Pro42Arg|202/1652|125/837|42/278||;ASP;CAF=0.8109,0.1891,.;COMMON=1;G5;GENEINFO=C1QTNF6:114904;GNO;HD;KGPhase1;KGPhase3;MTP;NSM;PM;PMC;REF;RS=229526;RSPOS=37185382;RV;SAO=0;SLO;SSR=0;VC=SNV;VLD;VP=0x050128000a0515053f020100;WGT=1;dbSNPBuildID=79 GT:AD:DP:GQ:PL 0/1:132,135:267:99:3811,0,3760
Filter and select relevant data from a VCF file with SnpSift
SnpEff has added a large number of annotations and scores which allow us filter the data and find loci of interest based on our assumptions. Filtering is done using SnpSift, the companion tool of SnpEff
Find out more about the syntax for SnpSift queries on the SnpSift page
example queries:
- Q1: (ANN[*].EFFECT has 'missense_variant') (should same today but when more than one genome => on 'any' of them)
- Q2: (ANN[*].GENE = 'CLTCL1')
- Q3: ANN[*].IMPACT = 'HIGH'
- Q4: (ANN[*].GENE = 'CLTCL1') & ( ANN[*].IMPACT = 'HIGH' )
- Q5: ( CHROM = 'chr22' ) & ( POS > 15000000 ) & ( POS < 20000000 )
- Q6: ( ( REF == 'C' ) & ( ALT =='T' ) ) | ( ( REF == 'G' ) & ( ALT =='A' ) ) : Potential EMS variants (alkylation reaction puts a ethyl group on Guanine residues, leading to G:C=>A:T transitions Wikipedia)
- Q7: (( CHROM = 'chr22' ) & ( POS > 15000000 ) & ( POS < 20000000 ) ) & ( ANN[*].IMPACT = 'HIGH' )
A little terminal excursion and we get the counts
grep -c -v "^#" Q*.vcf Q1.vcf:247 Q2.vcf:117 Q3.vcf:28 Q4.vcf:0 Q5.vcf:12784 Q6.vcf:25418 Q7.vcf:5
more examples
- GEN[*].GT = '0/1'
- VARTYPE = 'SNP' (only available when you have added this info above using SnpSift.varType)
Extract minimal information from a Filtered VCF
We can extract chosen information from VCF files with another SnpSift command. Here is an example of extraction from the query results of
- start the SnpSift.extractFields module and the HIGH impact filtered dataset
the results should look like this
CHROM | POS | ID | REF | ALT | DP | EFF[0].AA | ANN[0].GENE | LOF |
---|---|---|---|---|---|---|---|---|
chr22 | 10960433 | T | G | 564 | LOC102723780 | |||
chr22 | 19524402 | rs885985 | G | A | 255 | p.Gln37* | CLDN5 | |
chr22 | 19763250 | rs41298814 | T | C | 245 | TBX1 | ||
chr22 | 19962712 | rs4633 | C | T | 243 | COMT | ||
chr22 | 19963684 | rs4818 | C | G | 242 | COMT | ||
chr22 | 19963748 | rs4680 | G | A | 234 | COMT | ||
chr22 | 21009455 | rs2075276 | T | C | 237 | TUBA3FP | ||
chr22 | 21183417 | rs9754324 | T | G | 61 | FAM230B | ||
chr22 | 24621069 | rs56214106 | C | T | 242 | GGT1 | ||
chr22 | 24728176 | rs73152579 | C | T | 352 | PIWIL3 | PIWIL3|4|0.50) | |
chr22 | 26625493 | rs5761637 | T | C | 204 | CRYBA4 | ||
chr22 | 30461461 | rs5749104 | A | G | 221 | SEC14L3 | ||
chr22 | 30468623 | rs4820853 | A | G | 244 | SEC14L3 | ||
chr22 | 31267856 | rs2228619 | C | G | 246 | LIMK2 | ||
chr22 | 36191797 | ACT | A | 286 | p.Glu111fs | APOL4 | APOL4|2|1.00) | |
chr22 | 36967952 | rs112317095;rs60821613 | T | TTTTATTTA | 308 | LL22NC01-81G9.3 | ||
chr22 | 37226775 | rs1064498 | A | G | 183 | RAC2 | ||
chr22 | 38961581 | rs1065183 | T | C | 41 | APOBEC3A | ||
chr22 | 41155035 | rs20552 | T | A | 282 | EP300 | ||
chr22 | 41940168 | rs5758511 | G | A | 238 | p.Arg3* | CENPM | |
chr22 | 42387064 | rs34963472 | G | A | 234 | p.Arg189* | NFAM1 | |
chr22 | 44606929 | rs12152184 | A | G | 201 | LINC00229 | ||
chr22 | 49922288 | rs758608747 | CCTCAGCAGTCAGGACCGGCCTCTCCGATTCTTACCCG | C | 226 | p.Val198_Pro208delinsAla | CRELD2 | |
chr22 | 50255868 | rs1129880 | A | G | 235 | MAPK12 | ||
chr22 | 50266232 | rs2076139 | T | C | 241 | MAPK11 | ||
chr22 | 50522253 | rs140524 | C | T | 258 | NCAPH2 | NCAPH2|3|0.67) | |
chr22 | 50525807 | rs11479 | G | A | 343 | TYMP | ||
chr22_KI270928v1_alt | 64853 | T | TC | 190 | p.Glu160fs | CYP2D6 | CYP2D6.7|1|1.00) |
Note that some variants have content in the LOF column; This column reports the fraction of alternate spliced transcripts that carry the variant and can be a good think to consider when you filter variants.
Apply a more complex filter
novel (no rsID in the 3rd column after applying 'annotate' dbSNP138) HET (from genotype field GT) SNP (after applying 'varType')
- ( CHROM = 'chr22' ) & ( GEN[*].GT = '0/1' ) & ( ANN[*].IMPACT = 'HIGH' )
- to have only novel mutant as compared to dbSNP146, you would add & ( ! exists ID )
Now extract the fields as above but from the output of the new filtered data
CHROM | POS | REF | ALT | ID | DP | EFF[0].AA | ANN[0].GENE | LOF | GEN[*].GT | ANN[0].EFFECT |
---|---|---|---|---|---|---|---|---|---|---|
chr22 | 10960433 | T | G | 564 | LOC102723780 | 0/1 | splice_acceptor_variant&intron_variant | |||
chr22 | 19524402 | G | A | rs885985 | 255 | p.Gln37* | CLDN5 | 0/1 | stop_gained | |
chr22 | 19763250 | T | C | rs41298814 | 245 | TBX1 | 0/1 | structural_interaction_variant | ||
chr22 | 19962712 | C | T | rs4633 | 243 | COMT | 0/1 | structural_interaction_variant | ||
chr22 | 19963684 | C | G | rs4818 | 242 | COMT | 0/1 | structural_interaction_variant | ||
chr22 | 19963748 | G | A | rs4680 | 234 | COMT | 0/1 | structural_interaction_variant | ||
chr22 | 21009455 | T | C | rs2075276 | 237 | TUBA3FP | 0/1 | splice_acceptor_variant&intron_variant | ||
chr22 | 24621069 | C | T | rs56214106 | 242 | GGT1 | 0/1 | structural_interaction_variant | ||
chr22 | 24728176 | C | T | rs73152579 | 352 | PIWIL3 | PIWIL3|4|0.50) | 0/1 | splice_donor_variant&intron_variant | |
chr22 | 30461461 | A | G | rs5749104 | 221 | SEC14L3 | 0/1 | structural_interaction_variant | ||
chr22 | 30468623 | A | G | rs4820853 | 244 | SEC14L3 | 0/1 | structural_interaction_variant | ||
chr22 | 36967952 | T | TTTTATTTA | rs112317095;rs60821613 | 308 | LL22NC01-81G9.3 | 0/1 | splice_acceptor_variant&intron_variant | ||
chr22 | 37226775 | A | G | rs1064498 | 183 | RAC2 | 0/1 | structural_interaction_variant | ||
chr22 | 41155035 | T | A | rs20552 | 282 | EP300 | 0/1 | structural_interaction_variant | ||
chr22 | 41940168 | G | A | rs5758511 | 238 | p.Arg3* | CENPM | 0/1 | stop_gained | |
chr22 | 42387064 | G | A | rs34963472 | 234 | p.Arg189* | NFAM1 | 0/1 | stop_gained | |
chr22 | 44606929 | A | G | rs12152184 | 201 | LINC00229 | 0/1 | splice_donor_variant&intron_variant | ||
chr22 | 50266232 | T | C | rs2076139 | 241 | MAPK11 | 0/1 | structural_interaction_variant | ||
chr22 | 50522253 | C | T | rs140524 | 258 | NCAPH2 | NCAPH2|3|0.67) | 0/1 | splice_donor_variant&intron_variant | |
chr22 | 50525807 | G | A | rs11479 | 343 | TYMP | 0/1 | structural_interaction_variant |
download exercise files
Download exercise files here
References:
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.5 | NGS-Var2020 Exercise.7 ]