NGS Exercise.7 SnpEff

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.6 | NGS_Exercise.7 | NGS_Exercise.7_vcfCodingSnps | NGS_Exercise.7_annovar | NGS Exercise.8 ]
# updated 2014 version


Add annotations to variant calling format with SnpEff & SnpSift


SnpEff_logo.png
SnpSift_logo.png

Introduction

Cli tools.png The very powerful and complete toolbox provided with SnpEff and SnpSift will likely be part of future GATK versions. Code below is shown as appetizer. You can install the application on your own machine quite easily using the information provided in the home site and experience with the many possibilities of the software.

Overview of SnpEff and SnpSift commands

SnpEff & SnpSift [1][2] were developed by Pablo Cingolani after vcfCodingSnps (Yanming Li, Goncalo Abecasis)[3] to directly annotate VCF data and filter calls by many different ways. Both programs combine the richness of Annovar[4] annotations and the advantage of manipulating the VCF data directly and without changing format. This session only provides a starter to snpEff. Please refer to the SnpEff manual pages[5] and SnpSift manual pages[6] for more information.

  • Read SnpEff usage in the full GATK GuideBook and how SnpEff annotations can be added to GATK VCF data using the GATK VariantAnnotator tool (regularly check the GATK pages for more recent versions of these documents).

SnpEff annotates and predicts the effects of variants on genes (such as amino acid changes).

• snpEff commands

snpEff version SnpEff 3.6b (build 2014-05-01), by Pablo Cingolani
Usage: snpEff [command] [options] [files]
 
Run 'java -jar snpEff.jar command' for help on each specific command
 
Available commands: 
	[eff]                        : Calculate effect of variants. Default: eff (no command or 'eff').
	build                        : Build a SnpEff database.
	buildNextProt                : Build a SnpEff for NextProt (using NextProt's XML files).
	cds                          : Compare CDS sequences calculated form a SnpEff database to the one in a FASTA file.
	                               Used for checking databases correctness.
	closest                      : Annotate the closest genomic region.
	count                        : Count how many intervals (from a BAM, BED or VCF file) overlap with each genomic interval.
	databases                    : Show currently available databases (from local config file).
	download                     : Download a SnpEff database.
	dump                         : Dump to STDOUT a SnpEff database (mostly used for debugging).
	genes2bed                    : Create a bed file from a genes list.
	len                          : Calculate total genomic length for each marker type.
	protein                      : Compare protein sequences calculated form a SnpEff database to the one in a FASTA file.
	                               Used for checking databases correctness.
	spliceAnalysis               : Perform an analysis of splice sites. Experimental feature.
 
Generic options:
	-c , -config                 : Specify config file
	-d , -debug                  : Debug mode (very verbose).
	-dataDir <path>              : Override data_dir parameter from config file.
	-h , -help                   : Show this help and exit
	-if , -inOffset              : Offset input by a number of bases. E.g. '-inOffset 1' for one-based TXT input files
	-of , -outOffset             : Offset output by a number of bases. E.g. '-outOffset 1' for one-based TXT output files
	-noLog                       : Do not report usage statistics to server
	-t                           : Use multiple threads (implies '-noStats'). Default 'off'
	-q ,  -quiet                 : Quiet mode (do not show any messages or errors)
	-v , -verbose                : Verbose mode
 
Database options:
	-canon                       : Only use canonical transcripts.
	-interval                    : Use a custom intervals in TXT/BED/BigBed/VCF/GFF file (you may use this option many times)
	-motif                       : Annotate using motifs (requires Motif database).
	-nextProt                    : Annotate using NextProt (requires NextProt database).
	-reg <name>                  : Regulation track to use (this option can be used add several times).
	-onlyReg                     : Only use regulation tracks.
	-onlyTr <file.txt>           : Only use the transcripts in this file. Format: One transcript ID per line.
	-ss , -spliceSiteSize <int>  : Set size for splice sites (donor and acceptor) in bases. Default: 2
	-ud , -upDownStreamLen <int> : Set upstream downstream interval length (in bases)

SnpSift is a toolbox that allows you to filter and manipulate annotated files

• SnpSift commands

SnpSift version SnpSift 3.6b (build 2014-05-01), by Pablo Cingolani
Usage: java -jar SnpSift.jar [command] params...
Command is one of:
	alleleMat     : Create an allele matrix output.
	annotate      : Annotate 'ID' from a database (e.g. dbSnp). Assumes entries are sorted.
	annMem        : Annotate 'ID' from a database (e.g. dbSnp). Loads db in memory. Does not assume sorted entries.
	caseControl   : Compare how many variants are in 'case' and in 'control' groups; calculate p-values.
	ccs           : Case control summary. Case and control summaries by region, allele frequency and variant's functional effect.
	concordance   : Concordance metrics between two VCF files.
	covMat        : Create an covariance matrix output (allele matrix as input).
	dbnsfp        : Annotate with multiple entries from dbNSFP. <EXPERIMENTAL>
	epistasis     : Epistatic case control model (this feature is alpha).
	extractFields : Extract fields from VCF file into tab separated format.
	filter        : Filter using arbitrary expressions
	geneSets      : Annotate using MSigDb gene sets (MSigDb includes: GO, KEGG, Reactome, BioCarta, etc.)
	gt            : Add Genotype to INFO fields and remove genotype fields when possible.
	gwasCat       : Annotate using GWAS catalog
	hwe           : Calculate Hardy-Weimberg parameters and perform a godness of fit test.
	intersect     : Intersect intervals (genomic regions).
	intervals     : Keep variants that intersect with intervals.
	intIdx        : Keep variants that intersect with intervals. Index-based method: 
	                Used for large VCF file and a few intervals to retrieve
	join          : Join files by genomic region.
	phastCons     : Annotate using conservation scores (phastCons).
	private       : Annotate if a variant is private to a family or group.
	rmRefGen      : Remove reference genotypes.
	rmInfo        : Remove INFO fields.
	sift          : Annotate using SIFT scores from a VCF file.
	split         : Split VCF by chromosome.
	tstv          : Calculate transiton to transversion ratio.
	varType       : Annotate variant type (SNP,MNP,INS,DEL or MIXED).
	vcf2tped      : Convert VCF to TPED.

download and install SnpEff and SnpSift on your local machine

Technical.png No need to run this code, the program is ready to be used. Use these commands back in the lab to install your own version and get used to the command syntax.

# download and install the software according to the instructions
# http://snpeff.sourceforge.net/download.html
 
# edit the snpEff.config to point to the correct place for the data to be saved
# default is './data' in the program folder
 
# create a variable to point to the SnpEff installation folder in your .profile (or .bashrc)
## export SNPEFF=/path/to/snpeff
 
# this allows calling snpeff.jar directly from anywhere with
Java -Xmx4g $SNPEFF/snpEff.jar -c $SNPEFF/snpEff.config ...
 
# or if you have saved the reference genome file elsewhere
java -Xmx4g -jar $SNPEFF/snpEff.jar -dataDir <path> ...
 
# first get current version of the software and available command list
java -jar $SNPEFF/snpEff.jar
 
# then get the full list of human genomes to identify the correct human version to install
java -jar $SNPEFF/snpEff.jar databases | grep "Homo_sapiens"
 
# select one and download it (here the most recent build37)
java -jar $SNPEFF/snpEff.jar download -v GRCh37.75; # (37.75 at the time of this edit) favoured reference with full annotations
java -jar $SNPEFF/snpEff.jar download -v hg19; # limited annotations
 
## after running the last two lines, zip files will be left behind in the current folder. 
## You can safely delete these zip files if the installation reported a success.

Add human annotations to the demo 1kg VCF file

It is obvious that this example is just a minimal run just to show the power of the tools and generate data that can be loaded in IGV.
The SnpEff / SnpDiff documentation pages cited above are well made and can easily be read by biologists interested in such analysis.

Error creating thumbnail: Unable to save thumbnail to destination
SnpEff and SnpSift have been chosen by the Broad Institute in complement to the GATK, this is revealing of the quality of these packages

annotate the demo file with GRCh37 data

A test file is provided with the package, we use it here to annotate variants with the most recent human B37 version

# this example demo command annotated a demo file provided with the software 
# we use UCSC style chromosome naming as ##reference=GRCh37 is found in the input file
java -jar $SNPEFF/snpEff.jar \
    GRCh37.75 \
    $SNPEFF/demo.1kg.vcf \
    > $SNPEFF/gwas-annotated_demo.1kg.vcf
 
# some warning are issued during the process
WARNINGS: Some warning were detected
Warning type	Number of warnings
WARNING_TRANSCRIPT_INCOMPLETE	6134
WARNING_TRANSCRIPT_NO_START_CODON	1240

Inspect the 1kg results and isolate interesting calls

expand on the right to see the top results

# inspect the first two lines of input with some reformatting to fit the window size
grep -v "^#" demo.1kg.vcf | head -2 | tr "\t" "\ " | sed -e "s/.\{75\}/&\n/g"
 
1 10583 rs58108140 G A 100 PASS AVGPOST=0.7707;RSQ=0.4319;LDAF=0.2327;ERATE
=0.0161;AN=2184;VT=SNP;AA=.;THETA=0.0046;AC=314;SNPSOURCE=LOWCOV;AF=0.14;AS
N_AF=0.13;AMR_AF=0.17;AFR_AF=0.04;EUR_AF=0.21
1 10611 rs189107123 C G 100 PASS AN=2184;THETA=0.0077;VT=SNP;AA=.;AC=41;ERA
TE=0.0048;SNPSOURCE=LOWCOV;AVGPOST=0.9330;LDAF=0.0479;RSQ=0.3475;AF=0.02;AS
N_AF=0.01;AMR_AF=0.03;AFR_AF=0.01;EUR_AF=0.02
 
# inspect the first two lines after annotation with some reformatting to fit the window size
grep -v "^#"  gwas-annotated_demo.1kg.vcf | head -2 | tr "\t" "\ " | sed -e "s/.\{75\}/&\n/g"
 
1 10583 rs58108140 G A 100.0 PASS AVGPOST=0.7707;RSQ=0.4319;LDAF=0.2327;ERA
TE=0.0161;AN=2184;VT=SNP;AA=.;THETA=0.0046;AC=314;SNPSOURCE=LOWCOV;AF=0.14;
ASN_AF=0.13;AMR_AF=0.17;AFR_AF=0.04;EUR_AF=0.21;EFF=DOWNSTREAM(MODIFIER||37
80|||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000423562||1),DOWNSTREA
M(MODIFIER||3780|||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000438504
||1),DOWNSTREAM(MODIFIER||3780|||WASH7P|unprocessed_pseudogene|NON_CODING|E
NST00000541675||1),DOWNSTREAM(MODIFIER||3821|||WASH7P|unprocessed_pseudogen
e|NON_CODING|ENST00000488147||1),DOWNSTREAM(MODIFIER||3828|||WASH7P|unproce
ssed_pseudogene|NON_CODING|ENST00000538476||1),INTERGENIC(MODIFIER|||||||||
|1),UPSTREAM(MODIFIER||1286|||DDX11L1|processed_transcript|NON_CODING|ENST0
0000456328||1),UPSTREAM(MODIFIER||1289|||DDX11L1|transcribed_unprocessed_ps
eudogene|NON_CODING|ENST00000515242||1),UPSTREAM(MODIFIER||1291|||DDX11L1|t
ranscribed_unprocessed_pseudogene|NON_CODING|ENST00000518655||1),UPSTREAM(M
ODIFIER||1427|||DDX11L1|transcribed_unprocessed_pseudogene|NON_CODING|ENST0
0000450305||1)
1 10611 rs189107123 C G 100.0 PASS AN=2184;THETA=0.0077;VT=SNP;AA=.;AC=41;E
RATE=0.0048;SNPSOURCE=LOWCOV;AVGPOST=0.9330;LDAF=0.0479;RSQ=0.3475;AF=0.02;
ASN_AF=0.01;AMR_AF=0.03;AFR_AF=0.01;EUR_AF=0.02;EFF=DOWNSTREAM(MODIFIER||37
52|||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000423562||1),DOWNSTREA
M(MODIFIER||3752|||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000438504
||1),DOWNSTREAM(MODIFIER||3752|||WASH7P|unprocessed_pseudogene|NON_CODING|E
NST00000541675||1),DOWNSTREAM(MODIFIER||3793|||WASH7P|unprocessed_pseudogen
e|NON_CODING|ENST00000488147||1),DOWNSTREAM(MODIFIER||3800|||WASH7P|unproce
ssed_pseudogene|NON_CODING|ENST00000538476||1),INTERGENIC(MODIFIER|||||||||
|1),UPSTREAM(MODIFIER||1258|||DDX11L1|processed_transcript|NON_CODING|ENST0
0000456328||1),UPSTREAM(MODIFIER||1261|||DDX11L1|transcribed_unprocessed_ps
eudogene|NON_CODING|ENST00000515242||1),UPSTREAM(MODIFIER||1263|||DDX11L1|t
ranscribed_unprocessed_pseudogene|NON_CODING|ENST00000518655||1),UPSTREAM(M
ODIFIER||1399|||DDX11L1|transcribed_unprocessed_pseudogene|NON_CODING|ENST0
0000450305||1)
A detailed summary file is generated in html format together with a text file with a list of genes. These files can be reviewed on the BITS server as is for your convenience.

Filter the 1kg annotated VCF file

In this short example, we filter demo variants known from the GWAS catalog. Many other filtering options are available but by lack of time only one is shown here for both the demo set and for our own data.

# we first download the catalog
wget -np http://www.genome.gov/admin/gwascatalog.txt -P $SNPEFF/data
 
# then filter the annotated variants generated above and the GWAS catalog
# 'tee' saves a copy of the stdout to file while the data is ALSO sent to screen
java -jar $SNPEFF/SnpSift.jar \
      gwasCat $SNPEFF/data/gwascatalog.txt \
      $SNPEFF/gwas-annotated_demo.1kg.vcf | \
      tee $SNPEFF/gwas-annotated_demo.1kg-test.vcf

For our VCF data this gives the following results

infolder="Final_Results/vcf-venn/gold-set"
infile="chr21_common_NA18507.vcf.gz"
outfolder="snpeff_chr21-results"
mkdir -p ${outfolder}
 
# then filter the BWA variants with the GWAS catalog
java -jar $SNPEFF/SnpSift.jar \
      gwasCat $SNPEFF/data/gwascatalog.txt \
      ${infolder}/${infile} >\
      ${outfolder}/gwas-${infile%%.gz}
 
# count how many lines correspond to GWAS catalog records
grep -c GWASCAT ${outfolder}/gwas-${infile%%.gz}
> 82
 
# isolate top 9 lines with GWAS annotation
grep GWASCAT ${outfolder}/gwas-${infile%%.gz} | head

review GWAS hits

##INFO=<ID=GWASCAT,Number=.,Type=String,Description="Trait related to this chromosomal position, according to GWAS catalog">
chr21	11002011	.	T	C	81.0	.	SF=0,1,2;GWASCAT=Bipolar_disorder_and_schizophrenia	GT:PL:GQ	0/1:111,0,255:99
chr21	16340289	.	C	T	225.0	.	SF=0,1,2;GWASCAT=Cognitive_performance	GT:PL:GQ	0/1:255,0,255:99
chr21	16520832	.	G	A	222.0	.	SF=0,1,2;GWASCAT=Breast_cancer	GT:PL:GQ	1/1:255,138,0:99
chr21	16817051	.	A	G	225.0	.	SF=0,1,2;GWASCAT=Ulcerative_colitis	GT:PL:GQ	0/1:255,0,229:99
chr21	16914905	.	G	A	225.0	.	SF=0,1,2;GWASCAT=Parkinson_s_disease	GT:PL:GQ	0/1:255,0,255:99
chr21	17069166	.	C	T	225.0	.	SF=0,1,2;GWASCAT=Panic_disorder	GT:PL:GQ	0/1:255,0,255:99
chr21	17483133	.	A	T	222.0	.	SF=0,1,2;GWASCAT=Obesity_related_traits	GT:PL:GQ	1/1:255,120,0:99
chr21	17667720	.	T	C	225.0	.	SF=0,1,2;GWASCAT=Chronic_obstructive_pulmonary_disease_related_biomarkers	GT:PL:GQ	0/1:255,0,255:99
chr21	18046232	.	A	G	225.0	.	SF=0,1,2;GWASCAT=Amyotrophic_lateral_sclerosis	GT:PL:GQ	0/1:255,0,255:99

Annotate NA18507-chr21 common variants with hg19

The VCF file generated in [Exercise.6] was processed and results are available behind the link below.

whereami="$(pwd)"
infolder="../Final_Results/vcf-venn/gold-set"
infile="chr21_common_NA18507.vcf.gz"
outfolder="snpeff_chr21-results"
mkdir -p ${outfolder}
 
# we use here NCBI style chromosome naming as ##reference=hg19 is found in the input file
cd ${outfolder}
java -Xmx4g -jar $SNPEFF/snpEff.jar \
      -c $SNPEFF/snpEff.config \
      hg19 \
      ${infolder}/${infile} \
      > annotated_chr21_common_NA18507.vcf
 
# return to base
cd ${whereami}

VCF annotation results

head -100 /home/bits/NGS_DNASeq-training/snpeff_chr21-results/annotated_chr21_common_NA18507.vcf 
##fileformat=VCFv4.1
##samtoolsVersion=0.1.19+
##reference=file://ref/HiSeq_UCSC_hg19.fa
##contig=<ID=chrM,length=16571>
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=QBD,Number=1,Type=Float,Description="Quality by Depth: QUAL/#reads">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Read Position Bias">
##INFO=<ID=MDV,Number=1,Type=Integer,Description="Maximum number of high-quality nonRef reads in samples">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias (v2) for filtering splice-site artefacts in RNA-seq data. Note: this version may be broken.">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality non-reference bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
##fileDate=20120126
##source=CASAVA-1.8
##sitesMaxDepth_chr10=121.296617413957
##indelsMaxDepth_chr10=123.655927166378
##INFO=<ID=VARTYPE_SNV,Number=0,Type=Flag,Description="“Locus desribes a single nucleotide variant”">
##INFO=<ID=VARTYPE_INS,Number=0,Type=Flag,Description="“Locus describes at least one simple insertion allele”">
##INFO=<ID=VARTYPE_DEL,Number=0,Type=Flag,Description="“Local describes at least one simple deletion allele”">
##INFO=<ID=VARTYPE_COMPLEX,Number=0,Type=Flag,Description="“Locus desribes at least one complex indel allele expressed as an insertion/deletion combination">
##INFO=<ID=VARTYPE_BREAKEND,Number=0,Type=Flag,Description="“Locus desribes at least one open breakend allele">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="“Type of structural variant”">
##INFO=<ID=CIGAR,Number=A,Type=String,Description="CIGAR alignment for each alternate allele at loci containing indels other than breakends.">
##INFO=<ID=RU,Number=A,Type=String,Description="Smallest repeating sequence unit extended or contracted in the indel allele relative to the reference. RUs are not reported if longer than 20 bases or if both REFREP and IDREP are less than 2.">
##INFO=<ID=REFREP,Number=A,Type=String,Description="Number of times RU is repeated in reference.">
##INFO=<ID=IDREP,Number=A,Type=String,Description="Number of times RU is repeated in indel allele.">
##FILTER=<ID=QGT20,Description="Locus genotype quality is less than 20 or not computable.">
##FILTER=<ID=FILT30,Description="More than 30% of bases at a site are filtered out.">
##FILTER=<ID=SitesMaxDepth,Description="Site occurs at a filtered depth greater than 'sitesMaxDepth'.">
##FILTER=<ID=IndelsMaxDepth,Description="Indel occurs at a depth greater than 'indelsMaxDepth'.">
##FILTER=<ID=REPEAT8,Description="Indel occurs in a homopolymer or dinucleotide track with a reference repeat greater than 8.">
##FILTER=<ID=IndelConflict,Description="Conflicting indel/breakend evidence in region.">
##FILTER=<ID=SiteConflict,Description="Site conflicts with indel/breakend evidence in region.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DPU,Number=1,Type=Integer,Description="Basecalls used to genotype site after filtration">
##FORMAT=<ID=DPF,Number=1,Type=Integer,Description="Basecalls filtered from input prior to site genotyping">
##FORMAT=<ID=AU,Number=4,Type=Integer,Description="Used A,C,G,T basecalls greater than Q20">
##FORMAT=<ID=DPI,Number=1,Type=Integer,Description="Basecall depth associated with indel, taken from the preceeding site.">
##source_20131105.1=vcf-isec(r899) -f -o -n =3 -w 1 bcftools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz bcftools_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz public_info/Illumina_BaseSpace/chr21_NA18507_CASAVA-1.8_hg19.vcf.gz
##sourceFiles_20131105.1=0:bcftools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz,1:bcftools_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz,2:public_info/Illumina_BaseSpace/chr21_NA18507_CASAVA-1.8_hg19.vcf.gz
##INFO=<ID=SF,Number=.,Type=String,Description="Source File (index to sourceFiles, f when filtered)">
##SnpEffVersion="3.6b (build 2014-05-01), by Pablo Cingolani"
##SnpEffCmd="SnpEff  hg19 Final_Results/vcf-venn/gold-set/chr21_common_NA18507.vcf.gz "
##INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change| Amino_Acid_length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank  | Genotype_Number [ | ERRORS | WARNINGS ] )' ">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	GAIIx-chr21-BWA.mem
chr21	9467416	.	C	T	86.0	.	SF=0,1;EFF=INTERGENIC(MODIFIER||||||||||1)	GT:PL:GQ	0/1:116,0,126:99
chr21	9467417	.	A	C	111.0	.	SF=0,1,2;EFF=INTERGENIC(MODIFIER||||||||||1)	GT:PL:GQ	0/1:141,0,95:98
chr21	9471670	.	A	G	37.8	.	SF=0,1,2;EFF=INTERGENIC(MODIFIER||||||||||1)	GT:PL:GQ	1/1:69,6,0:10
chr21	9472931	.	T	G	52.0	.	SF=0,1,2;EFF=INTERGENIC(MODIFIER||||||||||1)	GT:PL:GQ	1/1:84,9,0:16
chr21	9473159	.	A	G	52.0	.	SF=0,1,2;EFF=INTERGENIC(MODIFIER||||||||||1)	GT:PL:GQ	1/1:84,9,0:16
chr21	9473186	.	T	C	89.5	.	SF=0,1,2;EFF=INTERGENIC(MODIFIER||||||||||1)	GT:PL:GQ	1/1:122,12,0:21
chr21	9474141	.	T	G	8.64	.	SF=0,1,2;EFF=INTERGENIC(MODIFIER||||||||||1)	GT:PL:GQ	0/1:38,0,89:40
chr21	9474404	.	A	C	47.0	.	SF=0,1,2;EFF=INTERGENIC(MODIFIER||||||||||1)	GT:PL:GQ	1/1:79,9,0:16
A detailed summary file is generated in html format together with a text file with a list of genes. These files can be reviewed on the BITS server as is for your convenience.

Technical.png Unlike GRCh37 which is very detailed and returns many nice graphics in html form, the hg19 genome database is quite limited and some of the SnpEff html output will be missing.

Use snpEff in a pipe

Combined with VCF filtering or subsetting, the annotation can be done on a specific subset of lines easily.

## not run!
# we use here the '-' symbol for <stdin> and the data comes from a piped 'cat' command stream
cat <some-data.vcf> | \
Java -Xmx4g -jar $SNPEFF/snpEff.jar \
      -c $SNPEFF/snpEff.config \
      hg19 \
      - \
      > annotated_subset_eff.vcf

download exercise files

Download exercise files here

Use the right application to open the files present in ex7-files

References:
  1. http://snpeff.sourceforge.net
  2. Pablo Cingolani, Adrian Platts, Le Lily Wang, Melissa Coon, Tung Nguyen, Luan Wang, Susan J Land, Xiangyi Lu, Douglas M Ruden
    A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.
    Fly (Austin): 2012, 6(2);80-92
    [PubMed:22728672] ##WORLDCAT## [DOI] (I p)

  3. http://www.sph.umich.edu/csg/liyanmin/vcfCodingSnps
  4. http://www.openbioinformatics.org/annovar/
  5. http://snpeff.sourceforge.net/SnpEff_manual.html
  6. http://snpeff.sourceforge.net/SnpSift.html

[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.6 | NGS_Exercise.7 | NGS_Exercise.7_vcfCodingSnps | NGS_Exercise.7_annovar | NGS Exercise.8 ]