NGS-Var2018 Exercise.6

From BITS wiki


[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2018 | NGS-Var2018 Exercise.5 | NGS-Var2018 Exercise.7 ]


Annotate and filter VCF variant lists with SnpEff and SnpSift



Choose the right tool to enrich your VCF data

A growing number of tools are available to annotate VCF files and to select variants from them. The best tool for your application depends on several factors.

  • when you need the job done and are not concerned about flexibility, we advise using SnpEff and its companion SnpSift, which are both easy-to-use Java programs.
  • if you wish to add annotations from third-party databases that are not available in those tools, or if you work on an organism that they do not support, you may consider using Annovar, which was covered in our former training session ([1]).
  • when you only need to annotate a few VCF rows, you are welcome to use public servers like:
    • the EnsEMBL VEP server (introduced in our related Wiki page [2])
    • the UCSC VAI ([3])
    • the SeattleSeq server ([4])

Note: submitting 'patentable' information to a public web server may compromise the novelty requirement and exposes patient information to the internet. In addition, the size of the input is limited to a few hundred lines.

  • other tools, such as vcfCodingSnps ([5]), have also been used with success

 

Annotate your variants with SnpEff

  • start the SnpEff module and select as input the intersect VCF file that has varscan as first set
ex6_01.png
  • run the tool and wait for results
ex6_02.png

Explore the snpEff_summary.html report

The report can be found HERE


ex6_03.png


ex6_04.png


ex6_05.png


ex6_06.png


ex6_07.png

A second file was produced that can be opened in Excel and provides the counts of each variant type for each gene. The content for the first gene, transposed, is reproduced below.

#GeneName A4GALT
GeneId A4GALT
TranscriptId NM_001318038.1
BioType protein_coding
variants_impact_HIGH 0
variants_impact_LOW 2
variants_impact_MODERATE 1
variants_impact_MODIFIER 143
variants_effect_3_prime_UTR_variant 1
variants_effect_5_prime_UTR_premature_start_codon_gain_variant 0
variants_effect_5_prime_UTR_variant 0
variants_effect_conservative_inframe_deletion 0
variants_effect_conservative_inframe_insertion 0
variants_effect_disruptive_inframe_deletion 0
variants_effect_disruptive_inframe_insertion 0
variants_effect_downstream_gene_variant 18
variants_effect_frameshift_variant 0
variants_effect_intron_variant 114
variants_effect_missense_variant 1
variants_effect_non_coding_transcript_exon_variant 0
variants_effect_non_coding_transcript_variant 0
variants_effect_sequence_feature 0
variants_effect_splice_acceptor_variant 0
variants_effect_splice_donor_variant 0
variants_effect_splice_region_variant 0
variants_effect_stop_gained 0
variants_effect_structural_interaction_variant 0
variants_effect_synonymous_variant 2
variants_effect_upstream_gene_variant 10
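
The transposed view above can also be produced without a spreadsheet. The following is a minimal Python sketch, assuming the per-gene file is tab-separated with a header row that starts with `#GeneName`. The inline `SAMPLE` is an abridged stand-in built from the A4GALT values shown above, not the full file, and the function name is only illustrative.

```python
import csv
import io

# Abridged stand-in for the per-gene table (tab-separated); the real file
# has many more columns and one row per gene/transcript.
SAMPLE = (
    "#GeneName\tGeneId\tTranscriptId\tBioType\t"
    "variants_impact_HIGH\tvariants_impact_LOW\t"
    "variants_impact_MODERATE\tvariants_impact_MODIFIER\n"
    "A4GALT\tA4GALT\tNM_001318038.1\tprotein_coding\t0\t2\t1\t143\n"
)


def transpose_first_gene(handle):
    """Return (column name, value) pairs for the first data row."""
    reader = csv.reader(handle, delimiter="\t")
    header = None
    for row in reader:
        if not row:
            continue
        # Skip free-text comment lines but keep the '#GeneName' header line.
        if row[0].startswith("#") and not row[0].startswith("#GeneName"):
            continue
        if header is None:
            header = row
            continue
        return list(zip(header, row))
    return []


pairs = transpose_first_gene(io.StringIO(SAMPLE))
for name, value in pairs:
    print(name, value)
```

To use it on a real file, pass an open file handle instead of the `io.StringIO` wrapper.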


##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=ExcessHet,Description="ExcessHet > 54.69">
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=VQSRTrancheINDEL99.90to99.95,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -10.9542 <= x < -7.3811">
##FILTER=<ID=VQSRTrancheINDEL99.95to100.00+,Description="Truth sensitivity tranche level for INDEL model at VQS Lod < -308.8418">
##FILTER=<ID=VQSRTrancheINDEL99.95to100.00,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -308.8418 <= x < -10.9542">
##FILTER=<ID=VQSRTrancheSNP99.80to99.90,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -30.1362 <= x < -17.8303">
##FILTER=<ID=VQSRTrancheSNP99.90to99.95,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -47.5879 <= x < -30.1362">
##FILTER=<ID=VQSRTrancheSNP99.95to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -1138.4944">
##FILTER=<ID=VQSRTrancheSNP99.95to100.00,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -1138.4944 <= x < -47.5879">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
...
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=NEGATIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the negative training set of bad variants">
##INFO=<ID=POSITIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the positive training set of good variants">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="Log odds of being a true variant versus being false under the trained gaussian mixture model">
##INFO=<ID=culprit,Number=1,Type=String,Description="The annotation which was the worst performing in the Gaussian mixture model, likely the reason why the variant was filtered out

Compress and index the VCF data

  • Use the BgzipAndTabindex module
ex6_08.png
ex6_09.png

Add variant type information

  • Use the SnpSift.varType module
ex6_10.png
ex6_10b.png
$ zgrep -v "^#" SnpEff_GATK_HG001_recalibrated_vartypes.vcf.gz | head -1
chr22	10510212	.	A	T	48.28	VQSRTrancheSNP99.80to99.90	AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=25.00;QD=24.14;SOR=2.303;VQSLOD=-2.123e+01;culprit=MQ;SNP;HOM;VARTYPE=SNP;ANN=T|intergenic_region|MODIFIER|CHR_START-LOC102723780|CHR_START-LOC102723780|intergenic_region|CHR_START-LOC102723780|||n.10510212A>T||||||	GT:AD:DP:GQ:PL	1/1:0,2:2:6:60,6,0
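
To see what the modules added to this record, it helps to split the INFO column into its parts. The sketch below is an illustration in plain Python, not part of SnpEff or SnpSift: it turns the INFO string of the record above into a dictionary and splits the `ANN` value into its 16 pipe-separated subfields. The subfield names in `ANN_KEYS` are shortened versions of the names SnpEff writes in the VCF header, and the helper names are hypothetical.

```python
# Names of the 16 pipe-separated ANN subfields, in the order written by SnpEff
# (shortened here for readability).
ANN_KEYS = [
    "Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID",
    "Feature_Type", "Feature_ID", "Transcript_BioType", "Rank", "HGVS.c",
    "HGVS.p", "cDNA.pos/length", "CDS.pos/length", "AA.pos/length",
    "Distance", "ERRORS",
]

# INFO column of the record shown above, copied verbatim.
INFO = (
    "AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;"
    "MQ=25.00;QD=24.14;SOR=2.303;VQSLOD=-2.123e+01;culprit=MQ;SNP;HOM;"
    "VARTYPE=SNP;ANN=T|intergenic_region|MODIFIER|CHR_START-LOC102723780|"
    "CHR_START-LOC102723780|intergenic_region|CHR_START-LOC102723780|||"
    "n.10510212A>T||||||"
)


def parse_info(info):
    """Split an INFO string into a dict; flags (no '=') map to True."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def parse_ann(ann):
    """Return one dict per comma-separated annotation in an ANN value."""
    return [dict(zip(ANN_KEYS, entry.split("|"))) for entry in ann.split(",")]


info = parse_info(INFO)
annotations = parse_ann(info["ANN"])
print(info["VARTYPE"], info["HOM"], annotations[0]["Annotation_Impact"])
```

The `SNP` and `HOM` flags and the `VARTYPE=SNP` key are the additions made by SnpSift.varType, while `ANN` comes from SnpEff.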

Add dbSNP information for known variants

If you did not select the known SNP annotation when you ran GATK.GermlineGenotyper, you can do this now using SnpSift as shown below.

The result is the addition of the rsID in field #3 (ID), as well as dbSNP annotations in the INFO field #8, including the dbSNP build in which each variant first appeared (dbSNPBuildID).

# using GATK.GermlineGenotyper, only rs78778839 was added in the 3rd field
chr22   15281327        rs78778839      C       T       1641.6  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=3.83;DB;DP=134;ExcessHet=3.0103;FS=0.671;MLEAC=1;MLEAF=0.500;MQ=35.02;MQRankSum=-1.097e+00;QD=12.25;ReadPosRankSum=-1.935e+00;SOR=0.774;VQSLOD=-1.115e+01;culprit=MQ;SNP;HET;VARTYPE=SNP;ANN=T|intergenic_region|MODIFIER|LOC102723769-OR11H1|LOC102723769-OR11H1|intergenic_region|LOC102723769-OR11H1|||n.15281327C>T||||||   GT:AD:DP:GQ:PL  0/1:71,63:134:99:1649,0,1946

# in addition, the dbSNP file used by GATK can be found in the VCF header after --dbsnp, but this only indicates the build of that file (here 138), while SnpSift.Annotate stores a build number for each SNP (dbSNPBuildID)
##GATKCommandLine=<ID=GenotypeGVCFs,CommandLine="GenotypeGVCFs  --output GATK_variants.genotyped.vcf --use-new-qual-calculator true --dbsnp /data/genepattern/users/.cache/uploads/cache/data.gp.vib.be/pub/snpdb/Homo_sapiens.dbSNP_138.UCSC.hg38.vcf.gz --variant /data/genepattern/jobResults/2593/HG001_100pc.recalibrated.g.vcf --reference /data/genepattern/users/.cache/uploads/cache/data.gp.vib.be/pub
  • Use the SnpSift.Annotate module
ex6_11.png
ex6_11b.png
# using SnpSift.Annotate: rs78778839 is now found in place of '.' in field #3, and dbSNP tags ending with 'dbSNPBuildID=131' were added at the end of the INFO field #8
chr22   15281327        rs78778839      C       T       1641.6  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=3.83;DB;DP=134;ExcessHet=3.0103;FS=0.671;MLEAC=1;MLEAF=0.500;MQ=35.02;MQRankSum=-1.097e+00;QD=12.25;ReadPosRankSum=-1.935e+00;SOR=0.774;VQSLOD=-1.115e+01;culprit=MQ;SNP;HET;VARTYPE=SNP;ANN=T|intergenic_region|MODIFIER|LOC102723769-OR11H1|LOC102723769-OR11H1|intergenic_region|LOC102723769-OR11H1|||n.15281327C>T||||||;ASP;CAF=[0.8669,0.1331];COMMON=1;KGPROD;KGPhase1;OTHERKG;RS=78778839;RSPOS=16696637;SAO=0;SSR=0;VC=SNV;VP=0x050000000005100016000100;WGT=1;dbSNPBuildID=131       GT:AD:DP:GQ:PL  0/1:71,63:134:99:1649,0,1946

Note: these new values can be used as filter criteria with SnpSift filter, introduced below.
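
Conceptually, this annotation step is a lookup of each variant in the dbSNP file. The sketch below is a deliberately simplified Python illustration of that idea and not the SnpSift implementation: `DBSNP` is a hypothetical in-memory stand-in for the dbSNP VCF, only one INFO tag is copied, and the ID is filled only when it is empty. The second test record uses a position from this exercise with a shortened INFO value.

```python
# Hypothetical in-memory stand-in for a dbSNP VCF: key = (CHROM, POS, REF, ALT).
# The rsID and build number are the ones shown in the record above.
DBSNP = {
    ("chr22", 15281327, "C", "T"): {"ID": "rs78778839", "dbSNPBuildID": "131"},
}


def annotate(fields, db):
    """Return a copy of the 8 fixed VCF columns with ID and INFO enriched."""
    chrom, pos, vid, ref, alt, qual, flt, info = fields
    hit = db.get((chrom, int(pos), ref, alt))
    if hit is None:
        return list(fields)  # unknown variant: leave the record untouched
    if vid == ".":
        vid = hit["ID"]  # simplification: fill the ID only when it is empty
    info = info + ";dbSNPBuildID=" + hit["dbSNPBuildID"]
    return [chrom, pos, vid, ref, alt, qual, flt, info]


# Shortened records: QUAL, FILTER and most INFO tags are left out or missing ('.').
known = annotate(["chr22", "15281327", ".", "C", "T", ".", ".", "AC=1"], DBSNP)
novel = annotate(["chr22", "18226626", ".", "G", "A", ".", ".", "AC=1"], DBSNP)
print(known[2], known[7])
print(novel[2], novel[7])
```

Variants that are absent from the database keep '.' as ID, which is what makes a later "novel variant" filter possible.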

Filter and select relevant data from a VCF file with SnpSift

SnpEff has added a large number of annotations and scores, which allow us to filter the data and find loci of interest based on our assumptions. Filtering is done using SnpSift, the companion tool of SnpEff.

Find out more about the syntax for queries on the SnpSift page


ex6_12.png

example queries:

  • ANN[0].EFFECT has 'missense_variant' (tests only the first annotation of each variant; SnpEff can report several annotations per variant, for instance one per affected transcript, and the numbering starts with '0')
  • (ANN[*].EFFECT has 'missense_variant') (tests 'any' of the annotations of each variant)
  • (ANN[*].GENE = 'CLTCL1')
  • ANN[*].IMPACT = 'HIGH'
  • (ANN[*].GENE = 'CLTCL1') & ( ANN[*].IMPACT = 'HIGH' )
  • ( CHROM = 'chr22' ) & ( POS > 15000000 ) & ( POS < 20000000 )
  • ( ( REF == 'C' ) & ( ALT =='T' ) ) | ( ( REF == 'G' ) & ( ALT =='A' ) ) : potential EMS variants (the alkylation reaction adds an ethyl group to guanine residues, leading to G:C=>A:T transitions; see Wikipedia)
  • ( ( CHROM = 'chr22' ) & ( POS > 15000000 ) & ( POS < 20000000 ) ) & ( ANN[*].IMPACT = 'HIGH' )
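
The difference between `ANN[0]` and `ANN[*]` is easiest to see on a variant that carries more than one annotation. The following Python sketch mimics three of the queries above; it is an illustration of their logic under simplifying assumptions, not SnpSift code. The INFO value is synthetic (gene and transcript names are made up, and each annotation is cut after the seventh subfield), and all function names are hypothetical.

```python
def annotations(info):
    """Return (effects, impact) for each annotation in an INFO string."""
    for item in info.split(";"):
        if item.startswith("ANN="):
            result = []
            for ann in item[4:].split(","):
                fields = ann.split("|")
                # Several effects can be joined with '&' in one annotation.
                result.append((fields[1].split("&"), fields[2]))
            return result
    return []


def first_has_effect(info, effect):
    """Mimic ANN[0].EFFECT has effect: test only the first annotation."""
    anns = annotations(info)
    return bool(anns) and effect in anns[0][0]


def any_has_effect(info, effect):
    """Mimic ANN[*].EFFECT has effect: test every annotation."""
    return any(effect in effects for effects, _ in annotations(info))


def in_region(chrom, pos, name="chr22", start=15000000, end=20000000):
    """Mimic ( CHROM = name ) & ( POS > start ) & ( POS < end )."""
    return chrom == name and start < pos < end


# Synthetic INFO value with two annotations on two made-up transcripts.
INFO = (
    "DP=100;ANN=A|stop_gained|HIGH|GENE1|GENE1|transcript|TX1,"
    "A|missense_variant|MODERATE|GENE1|GENE1|transcript|TX2"
)

print(first_has_effect(INFO, "missense_variant"))
print(any_has_effect(INFO, "missense_variant"))
print(in_region("chr22", 18226626), in_region("chr22", 24092897))
```

Here the missense effect sits in the second annotation, so only the `ANN[*]` style test finds it.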

A short excursion to the terminal gives the number of variants retained by each query

grep -c -v "^#" *.vcf
SnpEff_GATK_HG001_recalibrated_vartypes.vcf15M_20M.filtered.vcf:12878
SnpEff_GATK_HG001_recalibrated_vartypes.vcf15M_20M_HIGH.filtered.vcf:5
SnpEff_GATK_HG001_recalibrated_vartypes.vcfCLTCL1.filtered.vcf:110
SnpEff_GATK_HG001_recalibrated_vartypes.vcfCLTCL1_HIGH.filtered.vcf:0
SnpEff_GATK_HG001_recalibrated_vartypes.vcfCT_GA.filtered.vcf:26515
SnpEff_GATK_HG001_recalibrated_vartypes.vcfHIGH.filtered.vcf:34
SnpEff_GATK_HG001_recalibrated_vartypes.vcfmissense.filtered.vcf:267

Tip: many more selections are possible by combining the queries

more examples

  • GEN[*].GT = '0/1'
  • VARTYPE = 'SNP'

Tip: OR (|) and NOT (!) can also be used, with appropriate '()' grouping

 

Extract minimal information from a Filtered VCF

We can extract chosen fields from VCF files with another SnpSift command. Here is an example of extraction from the results of the HIGH impact query.

  • start the SnpSift.extractFields module with the HIGH impact filtered dataset as input
ex6_13.png
ex6_13b.png

the results should look like this

CHROM POS ID REF ALT DP EFF[0].AA EFF[0].GENE LOF
chr22 10960433 . T G 569 . LOC102723780 .
chr22 19524402 rs885985 G A 254 p.Gln37* CLDN5 .
chr22 19763250 rs41298814 T C 245 . TBX1 .
chr22 19962712 rs4633 C T 243 . COMT .
chr22 19963684 rs4818 C G 241 . COMT .
chr22 19963748 rs4680 G A 233 . COMT .
chr22 21009455 rs2075276 T C 236 . TUBA3FP .
chr22 21183417 rs9754324 T G 61 . FAM230B .
chr22 24621069 rs56214106 C T 240 . GGT1 .
chr22 24728176 rs73152579 C T 352 . PIWIL3 PIWIL3\|4\|0.50)
chr22 26625493 rs5761637 T C 206 . CRYBA4 .
chr22 30461461 rs5749104 A G 220 . SEC14L3 .
chr22 30468623 rs4820853 A G 245 . SEC14L3 .
chr22 31267856 rs2228619 C G 241 . LIMK2 .
chr22 36191797 . ACT A 286 p.Glu111fs APOL4 APOL4\|2\|1.00)
chr22 36967952 rs10637417 T TTTTATTTA 306 . LL22NC01-81G9.3 .
chr22 37226775 rs1064498 A G 183 . RAC2 .
chr22 38961581 rs202076860 T C 88 . APOBEC3A .
chr22 41155035 rs20552 T A 281 . EP300 .
chr22 41940168 rs5758511 G A 238 p.Arg3* CENPM .
chr22 42127526 rs1058172 C T 302 . CYP2D6 .
chr22 42128241 rs35742686 CT C 209 p.Arg259fs CYP2D6 CYP2D6\|2\|1.00)
chr22 42128945 rs3892097 C T 245 . CYP2D6 CYP2D6\|2\|1.00)
chr22 42129796 rs28371705 G C 162 . CYP2D6 .
chr22 42129809 rs28371704 T C 158 . CYP2D6 .
chr22 42130692 rs1065852 G A 307 . CYP2D6 .
chr22 42142513 . T TC 300 p.Glu160fs CYP2D6 CYP2D6.2\|1\|1.00)
chr22 42387064 rs34963472 G A 232 p.Arg189* NFAM1 .
chr22 44606929 rs12152184 A G 199 . LINC00229 .
chr22 49922288 . CCTCAGCAGTCAGGACCGGCCTCTCCGATTCTTACCCG C 223 p.Val198_Pro208delinsAla CRELD2 .
chr22 50255868 rs1129880 A G 235 . MAPK12 .
chr22 50266232 rs2076139 T C 239 . MAPK11 .
chr22 50522253 rs140524 C T 258 . NCAPH2 NCAPH2\|3\|0.67)
chr22 50525807 rs11479 G A 343 . TYMP .

Note that some variants have content in the LOF (loss of function) column. For the affected gene, this column reports the number of transcripts and the fraction of those alternatively spliced transcripts that carry the variant, which is a good thing to consider when you filter variants.
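
The LOF value can be turned into a transcript count with a few lines of code. This Python sketch assumes the complete LOF value has the form `(Gene_Name|Gene_ID|Number_of_transcripts_in_gene|Percent_of_transcripts_affected)`, as declared by SnpEff in the VCF header. The table above shows only part of each value, so the example string is rebuilt from the PIWIL3 row on the assumption that the gene ID equals the gene name. The function names are hypothetical.

```python
def parse_lof(value):
    """Parse one LOF value into (gene, gene_id, n_transcripts, fraction)."""
    gene, gene_id, n_tx, fraction = value.strip("()").split("|")
    return gene, gene_id, int(n_tx), float(fraction)


def affected_transcripts(value):
    """Estimate how many transcripts of the gene carry the LOF variant."""
    _, _, n_tx, fraction = parse_lof(value)
    return round(n_tx * fraction)


# Illustrative value rebuilt from the PIWIL3 row; the gene ID is assumed
# to equal the gene name.
lof = "(PIWIL3|PIWIL3|4|0.50)"
print(parse_lof(lof), affected_transcripts(lof))
```

A fraction of 1.00 means every known transcript of the gene is affected, while a lower value leaves some isoforms intact.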


Question: when not all splicing isoforms carry a mutation, what is the consequence for the gene function?

 

Apply a more complex filter

Select the HIGH impact variants on chr22 that are novel (no rsID in the 3rd column after applying 'annotate' with dbSNP138), HET (from the genotype field GT) and SNP (after applying 'varType'):

  • ( CHROM = 'chr22' ) & ( VARTYPE = 'SNP' ) & ( GEN[*].GT = '0/1' ) & ( ANN[*].IMPACT = 'HIGH' ) & ( ! exists ID )
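
Each clause of this filter maps to one simple test on a VCF record. The Python sketch below spells out that mapping for a single-sample VCF line; it illustrates the logic only and is not SnpSift code. The records are synthetic and abridged: position, alleles, depth, gene and effect come from the BID variant of this exercise, QUAL and FILTER are left missing, the ANN value is cut after the gene name, and `rs_example` is a made-up ID.

```python
def is_novel_het_high_snp(line):
    """Mimic the filter above on one tab-separated, single-sample VCF record."""
    cols = line.rstrip("\n").split("\t")
    chrom, vid, info = cols[0], cols[2], cols[7]
    # Pair FORMAT keys with the sample values to read the genotype (GT).
    genotype = dict(zip(cols[8].split(":"), cols[9].split(":"))).get("GT")

    info_items = info.split(";")
    ann = next((i[4:] for i in info_items if i.startswith("ANN=")), "")
    # The impact is the third pipe-separated subfield of each annotation.
    impacts = [a.split("|")[2] for a in ann.split(",") if a]

    return (
        chrom == "chr22"                  # ( CHROM = 'chr22' )
        and "VARTYPE=SNP" in info_items   # ( VARTYPE = 'SNP' )
        and genotype == "0/1"             # ( GEN[*].GT = '0/1' )
        and "HIGH" in impacts             # ( ANN[*].IMPACT = 'HIGH' )
        and vid == "."                    # ( ! exists ID )
    )


def make_record(vid, gt):
    """Build an abridged, synthetic record; only the tested fields matter."""
    info = "DP=101;VARTYPE=SNP;ANN=A|stop_gained|HIGH|BID"
    return "\t".join(
        ["chr22", "18226626", vid, "G", "A", ".", ".", info, "GT:DP", gt + ":101"])


print(is_novel_het_high_snp(make_record(".", "0/1")))           # novel HET SNP
print(is_novel_het_high_snp(make_record("rs_example", "0/1")))  # has an ID
print(is_novel_het_high_snp(make_record(".", "1/1")))           # homozygous
```

Only the first record passes, because the other two fail the ID clause and the genotype clause respectively.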

Now extract the fields as above, this time from the output of the new filter


ex6_14.png
CHROM POS ID REF ALT DP EFF[0].AA EFF[0].GENE LOF GEN[*].GT VARTYPE ANN[0].EFFECT
chr22 18226626 . G A 101 p.Glu102* BID BID\|7\|0.43) 0/1 SNP stop_gained
chr22 18347639 . C T 5 p.Trp1001* MICAL3 . 0/1 SNP stop_gained
chr22 18652610 . T C 195 . USP18 USP18\|1\|1.00) 0/1 SNP splice_acceptor_variant&intron_variant
chr22 19956140 . T C 74 . COMT . 0/1 SNP structural_interaction_variant
chr22 24092897 . T C 239 . ZNF70 ZNF70\|1\|1.00) 0/1 SNP splice_donor_variant&intron_variant
chr22 36680167 . G A 151 p.Glu1913* MYH9 . 0/1 SNP stop_gained
chr22 41566466 . A C 71 . EP300 . 0/1 SNP structural_interaction_variant

 

download exercise files

Download exercise files here

Use the right application to open the files present in ex6-files
