[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.5 | NGS-Var2020 Exercise.7 ]

Annotate and filter VCF variant lists with SnpEff and SNPSift

1 Choose the right tool to enrich your VCF data
2 Annotate your variants with SnpEff
3 Explore the snpEff_summary.html report
4 Compress and index the VCF data
5 Add variant type information
6 Add dbSNP information for known variants
7 Filter and select relevant data from a VCF file with SnpSift
8 Extract minimal information from a Filtered VCF
9 Apply a more complex filter
10 download exercise files

Choose the right tool to enrich your VCF data

A growing number of tools are available to annotate and select from VCF files. The choice of the best tool for your application depends on several factors.

when you need the job done and do not worry about the flexibility, we advise to use SnpEff and the companion SnpSift which are both easy to use java programs.

if you wish to add annotations from third-party databases that are not present in the other tools, or if you work on a organism absent from the above tool, you may consider using Annovar that was included in our former training session ([1]).

Another rising tool from the Broad Institute called Funcotator is not yet really mature but should develop into a nice tool in the near future. You can access the page of the Funcotator Beta version here

when you only need to annotate a few VCF rows, you are welcome to use public servers like:
- the EnsEMBL VEP server (introduced in our related Wiki page [2])
- the UCSC VAI ([3])
- the SeatleSeq server ([4])

Submitting 'patentable' information to the WEB infringes the novelty clause and will expose patient information to the internet, and the size of input is limited to few 100's lines

other tools have been used with success like vcfCodingSnps ([5])

Annotate your variants with SnpEff

start the SnpEff module and link to the intersect VCF file with varscan as first set

run the tool and wait for results

Explore the snpEff_summary.html report

The report can be found HERE

A second file (snpEff_genes.txt') was produced that can be open in excel and provides the counts of each different variant type for each gene. A transpose content for the first gene is reproduced below

#GeneName	A4GALT	A4GALT
GeneId	A4GALT	A4GALT
TranscriptId	NM_001318038.1	NM_017436.5
BioType	protein_coding	protein_coding
variants_impact_HIGH	0	0
variants_impact_LOW	2	2
variants_impact_MODERATE	1	1
variants_impact_MODIFIER	139	139
variants_effect_3_prime_UTR_variant	1	1
variants_effect_5_prime_UTR_premature_start_codon_gain_variant	0	0
variants_effect_5_prime_UTR_variant	0	1
variants_effect_conservative_inframe_deletion	0	0
variants_effect_conservative_inframe_insertion	0	0
variants_effect_disruptive_inframe_deletion	0	0
variants_effect_disruptive_inframe_insertion	0	0
variants_effect_downstream_gene_variant	18	18
variants_effect_frameshift_variant	0	0
variants_effect_intron_variant	110	106
variants_effect_missense_variant	1	1
variants_effect_non_coding_transcript_exon_variant	0	0
variants_effect_non_coding_transcript_variant	0	0
variants_effect_sequence_feature	0	0
variants_effect_splice_acceptor_variant	0	0
variants_effect_splice_donor_variant	0	0
variants_effect_splice_region_variant	0	0
variants_effect_stop_gained	0	0
variants_effect_structural_interaction_variant	0	0
variants_effect_synonymous_variant	2	2
variants_effect_upstream_gene_variant	10	13

Compress and index the VCF data

Use the BgzipAndTabindex module

Add variant type information

Use the SnpSift.varType module

splaisan@gbw-l-m0172:~/Downloads$ zgrep -v "^#" SnpEff_GATK_HG001_recalibrated_vartypes.vcf.gz | head -1
chr22	10510212	.	A	T	48.28	VQSRTrancheSNP99.80to99.90	AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=25.00;QD=24.14;SOR=2.303;VQSLOD=-2.123e+01;culprit=MQ;SNP;HOM;VARTYPE=SNP;ANN=T|intergenic_region|MODIFIER|CHR_START-LOC102723780|CHR_START-LOC102723780|intergenic_region|CHR_START-LOC102723780|||n.10510212A>T||||||	GT:AD:DP:GQ:PL	1/1:0,2:2:6:60,6,0

Add dbSNP information for known variants

If you did not select the known SNP annotation when you ran GATK.GermlineGenotyper, you can do this now using SNPSift as shown below.

The results will be the addition of the rsID in field #3 as well as the annotation of the dbSNP build used for this in the INFO field #8.

# using GATK.GermlineGenotyper, only rs229526 was added in the 3rd field
chr22 37185382 rs229526 G C 3803.6 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-1.404e+00;DB;DP=267;ExcessHet=3.0103;FS=2.701;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;POSITIVE_TRAIN_SITE;QD=14.25;ReadPosRankSum=7.000e-03;SOR=0.536;VQSLOD=18.47;culprit=MQ;ANN=C|missense_variant|MODERATE|C1QTNF6|C1QTNF6|transcript|NM_031910.3|protein_coding|2/3|c.125C>G|p.Pro42Arg|202/2936|125/837|42/278||,C|missense_variant|MODERATE|C1QTNF6|C1QTNF6|transcript|NM_182486.1|protein_coding|2/4|c.125C>G|p.Pro42Arg|202/1652|125/837|42/278|| GT:AD:DP:GQ:PL 0/1:132,135:267:99:3811,0,3760

Use the SnpSift.Annotate module

# the VCF file gets an extra line describing the annotation process

##SnpSiftCmd="SnpSift Annotate /opt/genepattern/users/.cache/uploads/cache/data.gp.vib.be/pub/snpdb/Homo_sapiens.dbSNP_146.UCSC.hg38.vcf.gz /opt/genepattern/jobResults/1508/GATK_variants.recalibrated_VQSR.vcf.gz"

# after SnpSift.Annotate, the version of the dbSNP database can also be found for each row in the VCF (here: dbSNPBuildID=79)
chr22 37185382 rs229526 G C 3803.6 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-1.404e+00;DB;DP=267;ExcessHet=3.0103;FS=2.701;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;POSITIVE_TRAIN_SITE;QD=14.25;ReadPosRankSum=7.000e-03;SOR=0.536;VQSLOD=18.47;culprit=MQ;ANN=C|missense_variant|MODERATE|C1QTNF6|C1QTNF6|transcript|NM_031910.3|protein_coding|2/3|c.125C>G|p.Pro42Arg|202/2936|125/837|42/278||,C|missense_variant|MODERATE|C1QTNF6|C1QTNF6|transcript|NM_182486.1|protein_coding|2/4|c.125C>G|p.Pro42Arg|202/1652|125/837|42/278||;ASP;CAF=0.8109,0.1891,.;COMMON=1;G5;GENEINFO=C1QTNF6:114904;GNO;HD;KGPhase1;KGPhase3;MTP;NSM;PM;PMC;REF;RS=229526;RSPOS=37185382;RV;SAO=0;SLO;SSR=0;VC=SNV;VLD;VP=0x050128000a0515053f020100;WGT=1;dbSNPBuildID=79 GT:AD:DP:GQ:PL 0/1:132,135:267:99:3811,0,3760

Note that these new values can be filtered on using SnpSift filter introduced below

Filter and select relevant data from a VCF file with SnpSift

SnpEff has added a large number of annotations and scores which allow us filter the data and find loci of interest based on our assumptions. Filtering is done using SnpSift, the companion tool of SnpEff

Find out more about the syntax for SnpSift queries on the SnpSift page

example queries:

Q1: (ANN[*].EFFECT has 'missense_variant') (should same today but when more than one genome => on 'any' of them)
Q2: (ANN[*].GENE = 'CLTCL1')
Q3: ANN[*].IMPACT = 'HIGH'
Q4: (ANN[*].GENE = 'CLTCL1') & ( ANN[*].IMPACT = 'HIGH' )
Q5: ( CHROM = 'chr22' ) & ( POS > 15000000 ) & ( POS < 20000000 )
Q6: ( ( REF == 'C' ) & ( ALT =='T' ) ) | ( ( REF == 'G' ) & ( ALT =='A' ) ) : Potential EMS variants (alkylation reaction puts a ethyl group on Guanine residues, leading to G:C=>A:T transitions Wikipedia)
Q7: (( CHROM = 'chr22' ) & ( POS > 15000000 ) & ( POS < 20000000 ) ) & ( ANN[*].IMPACT = 'HIGH' )

A little terminal excursion and we get the counts

grep -c -v "^#" Q*.vcf
Q1.vcf:247
Q2.vcf:117
Q3.vcf:28
Q4.vcf:0
Q5.vcf:12784
Q6.vcf:25418
Q7.vcf:5

Error creating thumbnail: Unable to save thumbnail to destination

And many more possibilities by combining the queries

more examples

GEN[*].GT = '0/1'
VARTYPE = 'SNP' (only available when you have added this info above using SnpSift.varType)

Error creating thumbnail: Unable to save thumbnail to destination

OR (|) and NOT (!) can also be used with appropriate '()' grouping

Extract minimal information from a Filtered VCF

We can extract chosen information from VCF files with another SnpSift command. Here is an example of extraction from the query results of

start the SnpSift.extractFields module and the HIGH impact filtered dataset

the results should look like this

CHROM	POS	ID	REF	ALT	DP	EFF[0].AA	ANN[0].GENE	LOF
chr22	10960433		T	G	564		LOC102723780
chr22	19524402	rs885985	G	A	255	p.Gln37*	CLDN5
chr22	19763250	rs41298814	T	C	245		TBX1
chr22	19962712	rs4633	C	T	243		COMT
chr22	19963684	rs4818	C	G	242		COMT
chr22	19963748	rs4680	G	A	234		COMT
chr22	21009455	rs2075276	T	C	237		TUBA3FP
chr22	21183417	rs9754324	T	G	61		FAM230B
chr22	24621069	rs56214106	C	T	242		GGT1
chr22	24728176	rs73152579	C	T	352		PIWIL3	PIWIL3\|4\|0.50)
chr22	26625493	rs5761637	T	C	204		CRYBA4
chr22	30461461	rs5749104	A	G	221		SEC14L3
chr22	30468623	rs4820853	A	G	244		SEC14L3
chr22	31267856	rs2228619	C	G	246		LIMK2
chr22	36191797		ACT	A	286	p.Glu111fs	APOL4	APOL4\|2\|1.00)
chr22	36967952	rs112317095;rs60821613	T	TTTTATTTA	308		LL22NC01-81G9.3
chr22	37226775	rs1064498	A	G	183		RAC2
chr22	38961581	rs1065183	T	C	41		APOBEC3A
chr22	41155035	rs20552	T	A	282		EP300
chr22	41940168	rs5758511	G	A	238	p.Arg3*	CENPM
chr22	42387064	rs34963472	G	A	234	p.Arg189*	NFAM1
chr22	44606929	rs12152184	A	G	201		LINC00229
chr22	49922288	rs758608747	CCTCAGCAGTCAGGACCGGCCTCTCCGATTCTTACCCG	C	226	p.Val198_Pro208delinsAla	CRELD2
chr22	50255868	rs1129880	A	G	235		MAPK12
chr22	50266232	rs2076139	T	C	241		MAPK11
chr22	50522253	rs140524	C	T	258		NCAPH2	NCAPH2\|3\|0.67)
chr22	50525807	rs11479	G	A	343		TYMP
chr22_KI270928v1_alt	64853		T	TC	190	p.Glu160fs	CYP2D6	CYP2D6.7\|1\|1.00)

Note that some variants have content in the LOF column; This column reports the fraction of alternate spliced transcripts that carry the variant and can be a good think to consider when you filter variants.

When not all splicing isoforms carry a mutation, what is the consequence on the gene function?

Apply a more complex filter

novel (no rsID in the 3rd column after applying 'annotate' dbSNP138) HET (from genotype field GT) SNP (after applying 'varType')

( CHROM = 'chr22' ) & ( GEN[*].GT = '0/1' ) & ( ANN[*].IMPACT = 'HIGH' )
to have only novel mutant as compared to dbSNP146, you would add & ( ! exists ID )

Now extract the fields as above but from the output of the new filtered data

CHROM	POS	REF	ALT	ID	DP	EFF[0].AA	ANN[0].GENE	LOF	GEN[*].GT	ANN[0].EFFECT
chr22	10960433	T	G		564		LOC102723780		0/1	splice_acceptor_variant&intron_variant
chr22	19524402	G	A	rs885985	255	p.Gln37*	CLDN5		0/1	stop_gained
chr22	19763250	T	C	rs41298814	245		TBX1		0/1	structural_interaction_variant
chr22	19962712	C	T	rs4633	243		COMT		0/1	structural_interaction_variant
chr22	19963684	C	G	rs4818	242		COMT		0/1	structural_interaction_variant
chr22	19963748	G	A	rs4680	234		COMT		0/1	structural_interaction_variant
chr22	21009455	T	C	rs2075276	237		TUBA3FP		0/1	splice_acceptor_variant&intron_variant
chr22	24621069	C	T	rs56214106	242		GGT1		0/1	structural_interaction_variant
chr22	24728176	C	T	rs73152579	352		PIWIL3	PIWIL3\|4\|0.50)	0/1	splice_donor_variant&intron_variant
chr22	30461461	A	G	rs5749104	221		SEC14L3		0/1	structural_interaction_variant
chr22	30468623	A	G	rs4820853	244		SEC14L3		0/1	structural_interaction_variant
chr22	36967952	T	TTTTATTTA	rs112317095;rs60821613	308		LL22NC01-81G9.3		0/1	splice_acceptor_variant&intron_variant
chr22	37226775	A	G	rs1064498	183		RAC2		0/1	structural_interaction_variant
chr22	41155035	T	A	rs20552	282		EP300		0/1	structural_interaction_variant
chr22	41940168	G	A	rs5758511	238	p.Arg3*	CENPM		0/1	stop_gained
chr22	42387064	G	A	rs34963472	234	p.Arg189*	NFAM1		0/1	stop_gained
chr22	44606929	A	G	rs12152184	201		LINC00229		0/1	splice_donor_variant&intron_variant
chr22	50266232	T	C	rs2076139	241		MAPK11		0/1	structural_interaction_variant
chr22	50522253	C	T	rs140524	258		NCAPH2	NCAPH2\|3\|0.67)	0/1	splice_donor_variant&intron_variant
chr22	50525807	G	A	rs11479	343		TYMP		0/1	structural_interaction_variant

download exercise files

Download exercise files here

Use the right application to open the files present in ex6-files

References:

[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2020 | NGS-Var2020 Exercise.5 | NGS-Var2020 Exercise.7 ]

NGS-Var2020 Exercise.6

Contents

Choose the right tool to enrich your VCF data

Annotate your variants with SnpEff

Explore the snpEff_summary.html report

Compress and index the VCF data

Add variant type information

Add dbSNP information for known variants

Filter and select relevant data from a VCF file with SnpSift

Extract minimal information from a Filtered VCF

Apply a more complex filter

download exercise files

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox