NGS-Var Exercise.6



[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.5 | NGS-Var Exercise.7 ]


Annotate and filter VCF variant lists with SnpEff and SnpSift



Choose the right tool to enrich your VCF data

A growing number of tools are available to annotate and select from VCF files. The choice of the best tool for your application depends on several factors.

  • when you need the job done and flexibility is not a concern, we advise using SnpEff and its companion SnpSift, which are both easy-to-use Java programs.
  • if you wish to add annotations from third-party databases that are not present in the other tools, or if you work on an organism absent from the above tools, you may consider using Annovar, which was included in our former training session ([1]).
  • when you only need to annotate a few VCF rows, you are welcome to use public servers like:
    • the EnsEMBL VEP server (introduced in our related Wiki page [2])
    • the UCSC VAI ([3])
    • the SeattleSeq server ([4])

Note: Submitting 'patentable' information to a web server can compromise the novelty requirement of a later patent application and exposes patient information to the internet. The input size is also limited to a few hundred lines.

  • other tools, such as vcfCodingSnps ([5]), have also been used with success

Install and configure SnpEff

snpEff comes with default settings that need to be edited, and you also have to choose the relevant database for your genome of interest. Since we mapped against hg19, we should match this choice in snpEff and download the corresponding annotations. Please read more about setting up this program on the official snpEff pages.

snpEff configuration

# download and install the software according to the instructions
# http://snpeff.sourceforge.net/SnpEff_manual.html#install
 
# edit the snpEff.config to point to the correct place for the data to be saved
# default is './data' in the program folder
 
# create a variable to point to the SnpEff installation folder in your .profile (or .bashrc)
## export SNPEFF=/path/to/snpeff
## export SNPEFFDB=/path/to/snpeff/data
 
# identify the name of the reference data corresponding to your assembly
java -jar $SNPEFF/snpEff.jar databases | grep "Homo_sapiens"
 
GRCh37.70     Homo_sapiens   http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_GRCh37.70.zip
GRCh37.75     Homo_sapiens (OK)  http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_GRCh37.75.zip
GRCh37.GTEX   Homo_sapiens,Gencode 12,GTEX project  http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_GRCh37.GTEX.zip
GRCh38.81     Homo_sapiens   http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_GRCh38.81.zip
GRCh38.82     Homo_sapiens   http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_GRCh38.82.zip
hg19          Homo_sapiens   (UCSC OK)  http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_hg19.zip
hg19kg        Homo_sapiens   (UCSC KnownGenes)    http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_hg19kg.zip
hg38          Homo_sapiens   (UCSC)    http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_hg38.zip
hg38kg        Homo_sapiens   (UCSC KnownGenes)   http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_hg38kg.zip
testHg19ChrM  Homo_sapiens   (UCSC)  http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_testHg19ChrM.zip
 
# download and install the hg19 database (DO NOT RUN - was done for you already)
# java -jar $SNPEFF/snpEff.jar download -v hg19
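The long listing above can be narrowed to the databases of a single build. The following is a minimal sketch only: the helper name 'list_build_dbs' is hypothetical, and four lines copied from the listing stand in for the real output (in practice you would pipe 'java -jar $SNPEFF/snpEff.jar databases' into the helper).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the database names (column 1) that start
# with the given build prefix. Reads the 'databases' listing on stdin.
list_build_dbs() {
    local prefix="$1"
    awk -v p="$prefix" 'index($1, p) == 1 { print $1 }'
}

# Stand-in input: four lines copied from the listing above
sample_listing='GRCh37.75     Homo_sapiens (OK)  http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_GRCh37.75.zip
hg19          Homo_sapiens   (UCSC OK)  http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_hg19.zip
hg19kg        Homo_sapiens   (UCSC KnownGenes)    http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_hg19kg.zip
hg38          Homo_sapiens   (UCSC)    http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_hg38.zip'

hg19_dbs=$(printf '%s\n' "$sample_listing" | list_build_dbs hg19)
echo "$hg19_dbs"
```

Matching on the first column only avoids false hits from build names that appear inside the download URLs.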

 

Add annotations to VCF data with snpEff

SnpEff & SnpSift [1][2] were developed by Pablo Cingolani after vcfCodingSnps (Yanming Li, Goncalo Abecasis)[3] to annotate VCF data directly and filter calls in many different ways. Both programs combine the richness of Annovar[4] annotations with the advantage of manipulating VCF data directly, without changing format. This session only provides a starting point for snpEff. Please refer to the SnpEff manual pages[5] and SnpSift manual pages[6] for more information.

  • Please read about SnpEff usage in the full GATK GuideBook, including how SnpEff annotations can be added to GATK VCF data using the GATK VariantAnnotator tool (regularly check the GATK pages for more recent versions of these documents).

snpEff full command list

Usage: snpEff [command] [options] [files]
 
Run 'java -jar snpEff.jar command' for help on each specific command
 
Available commands:
        [eff|ann]                    : Annotate variants / calculate effects (you can use either 'ann' or 'eff', they mean the same). Default: ann (no command or 'ann').
        build                        : Build a SnpEff database.
        buildNextProt                : Build a SnpEff for NextProt (using NextProt's XML files).
        cds                          : Compare CDS sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
        closest                      : Annotate the closest genomic region.
        count                        : Count how many intervals (from a BAM, BED or VCF file) overlap with each genomic interval.
        databases                    : Show currently available databases (from local config file).
        download                     : Download a SnpEff database.
        dump                         : Dump to STDOUT a SnpEff database (mostly used for debugging).
        genes2bed                    : Create a bed file from a genes list.
        len                          : Calculate total genomic length for each marker type.
        pdb                          : Build interaction database (based on PDB data).
        protein                      : Compare protein sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
        show                         : Show a text representation of genes or transcripts coordiantes, DNA sequence and protein sequence.
 
Generic options:
        -c , -config                 : Specify config file
        -configOption name=value     : Override a config file option
        -d , -debug                  : Debug mode (very verbose).
        -dataDir <path>              : Override data_dir parameter from config file.
        -download                    : Download a SnpEff database, if not available locally. Default: true
        -nodownload                  : Do not download a SnpEff database, if not available locally.
        -h , -help                   : Show this help and exit
        -noLog                       : Do not report usage statistics to server
        -t                           : Use multiple threads (implies '-noStats'). Default 'off'
        -q , -quiet                  : Quiet mode (do not show any messages or errors)
        -v , -verbose                : Verbose mode
        -version                     : Show version number and exit
 
Database options:
        -canon                       : Only use canonical transcripts.
        -interaction                 : Annotate using inteactions (requires interaciton database). Default: true
        -interval <file>             : Use a custom intervals in TXT/BED/BigBed/VCF/GFF file (you may use this option many times)
        -maxTSL <TSL_number>         : Only use transcripts having Transcript Support Level lower than <TSL_number>.
        -motif                       : Annotate using motifs (requires Motif database). Default: true
        -nextProt                    : Annotate using NextProt (requires NextProt database).
        -noGenome                    : Do not load any genomic database (e.g. annotate using custom files).
        -noInteraction               : Disable inteaction annotations
        -noMotif                     : Disable motif annotations.
        -noNextProt                  : Disable NextProt annotations.
        -onlyReg                     : Only use regulation tracks.
        -onlyProtein                 : Only use protein coding transcripts. Default: false
        -onlyTr <file.txt>           : Only use the transcripts in this file. Format: One transcript ID per line.
        -reg <name>                  : Regulation track to use (this option can be used add several times).
        -ss , -spliceSiteSize <int>  : Set size for splice sites (donor and acceptor) in bases. Default: 2
        -spliceRegionExonSize <int>  : Set size for splice site region within exons. Default: 3 bases
        -spliceRegionIntronMin <int> : Set minimum number of bases for splice site region within intron. Default: 3 bases
        -spliceRegionIntronMax <int> : Set maximum number of bases for splice site region within intron. Default: 8 bases
        -strict                      : Only use 'validated' transcripts (i.e. sequence has been checked). Default: false
        -ud , -upDownStreamLen <int> : Set upstream downstream interval length (in bases)

annotate the snpEff demo file with hg19

A test file is provided with the package; we use it here to annotate variants against the human hg19 build.

# this example command annotates a demo file provided with the software
# we use ##reference=hg19
 
outfolder=snpEff-test
mkdir -p $BASE/${outfolder}
 
# take a small sample to save time
head -11 $SNPEFF/examples/test.1KG.vcf > $BASE/${outfolder}/test.1KG.vcf
 
java -jar $SNPEFF/snpEff.jar hg19 \
    $BASE/${outfolder}/test.1KG.vcf \
    > $BASE/${outfolder}/test.1KG_hg19.vcf
 
# inspect input
cat $BASE/${outfolder}/test.1KG.vcf
 
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	
1	10291	.	C	T	2373.79	.	AC=149
1	10303	.	C	T	294.20	.	AC=32
1	10309	.	C	T	164.52	.	AC=23
1	10315	.	C	T	394.78	.	AC=47
1	10457	.	A	C	217.73	.	AC=16
1	10469	rs117577454	C	G	365.78	.	AC=30
1	10492	rs55998931	C	T	1309.47	.	AC=72
1	10575	.	C	G	7.23	.	AC=1
1	10583	rs58108140	G	A	2817.71	.	AC=154
1	10611	.	C	G	200.55	.	AC=17
 
# the output is now (shown up to the first record to save space)
head -7 $BASE/${outfolder}/test.1KG_hg19.vcf
 
##SnpEffVersion="4.2 (build 2015-12-05), by Pablo Cingolani"
##SnpEffCmd="SnpEff  hg19 /home/bits/NGS/Variant/test.1KG.vcf "
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' ">
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected' ">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected' ">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    
1       10291   .       C       T       2373.79 .       AC=149;ANN=T|upstream_gene_variant|MODIFIER|DDX11L1|DDX11L1|transcript|NR_046018.2|pseudogene||n.-1583C>T|||||1583|,T|downstream_gene_variant|MODIFIER|WASH7P|WASH7P|transcript|NR_024540.1|pseudogene||n.*4071G>A|||||4071|,T|intergenic_region|MODIFIER|DDX11L1|DDX11L1|intergenic_region|DDX11L1|||n.10291C>T||||||

Note: Scroll to the right in the results above to see why you do not want to read VCF files as they come.

same output with lines wrapped at 80 characters (using the GNU command-line tool 'fold')
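The ANN value packs several annotations into one INFO entry: annotations are separated by commas and their subfields by '|', in the order given in the ##INFO=<ID=ANN...> header line. A minimal sketch of unpacking it into one row per annotation with awk is shown here. The function name 'split_ann' is hypothetical, and the input is the single annotated record shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one line per ANN entry of each VCF record on stdin.
# Output columns: CHROM, POS, Annotation, Annotation_Impact, Gene_Name, Feature_ID
split_ann() {
    awk -F'\t' '
        !/^#/ {
            n = split($8, info, ";")              # INFO is column 8
            for (i = 1; i <= n; i++) {
                if (info[i] ~ /^ANN=/) {
                    sub(/^ANN=/, "", info[i])
                    m = split(info[i], ann, ",")  # one element per annotation
                    for (j = 1; j <= m; j++) {
                        split(ann[j], f, "|")     # subfields in header order
                        printf "%s\t%s\t%s\t%s\t%s\t%s\n", $1, $2, f[2], f[3], f[4], f[7]
                    }
                }
            }
        }'
}

# The annotated demo record from the output above
record=$'1\t10291\t.\tC\tT\t2373.79\t.\tAC=149;ANN=T|upstream_gene_variant|MODIFIER|DDX11L1|DDX11L1|transcript|NR_046018.2|pseudogene||n.-1583C>T|||||1583|,T|downstream_gene_variant|MODIFIER|WASH7P|WASH7P|transcript|NR_024540.1|pseudogene||n.*4071G>A|||||4071|,T|intergenic_region|MODIFIER|DDX11L1|DDX11L1|intergenic_region|DDX11L1|||n.10291C>T||||||'

ann_rows=$(printf '%s\n' "$record" | split_ann)
echo "$ann_rows"
```

For this record the sketch prints three rows, one each for the upstream, downstream and intergenic annotations.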

head -7 $BASE/${outfolder}/test.1KG_hg19.vcf | fold -w 80
 
##SnpEffVersion="4.2 (build 2015-12-05), by Pablo Cingolani"
##SnpEffCmd="SnpEff  hg19 /home/bits/NGS/Variant/test.1KG.vcf "
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele
 | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature
_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS
.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' ">
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effe
cts for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_ge
ne | Percent_of_transcripts_affected' ">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated dec
ay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcript
s_in_gene | Percent_of_transcripts_affected' ">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	
1	10291	.	C	T	2373.79	.	AC=149;ANN=T|upstream_ge
ne_variant|MODIFIER|DDX11L1|DDX11L1|transcript|NR_046018.2|pseudogene||n.-1583C>
T|||||1583|,T|downstream_gene_variant|MODIFIER|WASH7P|WASH7P|transcript|NR_02454
0.1|pseudogene||n.*4071G>A|||||4071|,T|intergenic_region|MODIFIER|DDX11L1|DDX11L
1|intergenic_region|DDX11L1|||n.10291C>T||||||

Next to the annotated VCF, two more files are generated:

  • a text file reporting genes with one transcript model per row and variant type counts (saved here)
  • an HTML report with much more information that you can review and print when needed (saved here)

Tip: The additional files can be omitted with the extra argument '-noStats'.
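When the same annotation call is repeated with different options, a small wrapper can help to check the command line before running it. The sketch below is an illustration only: the function name 'annotate_vcf', the RUN switch and the fallback path '/path/to/snpeff' are assumptions, while '-noStats' is the snpEff option mentioned above. By default the wrapper only prints the command it would run.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around snpEff.
# usage: annotate_vcf <build> <in.vcf> <out.vcf> [extra snpEff options...]
# Prints the command unless RUN=1 is set in the environment.
annotate_vcf() {
    local build="$1" invcf="$2" outvcf="$3"
    shift 3
    # options must come before the build name and the input file
    local cmd=(java -jar "${SNPEFF:-/path/to/snpeff}/snpEff.jar" "$@" "$build" "$invcf")
    if [ "${RUN:-0}" = "1" ]; then
        "${cmd[@]}" > "$outvcf"
    else
        printf '%s ' "${cmd[@]}"
        printf '> %s\n' "$outvcf"
    fi
}

# Dry run: show the command that would skip the summary files
dry=$(annotate_vcf hg19 in.vcf out.vcf -noStats)
echo "$dry"
```

Calling it as 'RUN=1 annotate_vcf hg19 in.vcf out.vcf -noStats' would execute the command instead of printing it.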

annotate the Varscan2 and bcftools calls with the content of the SnpEff hg19 database

Note: Always match the annotation database to the reference build that was used for mapping; otherwise annotations may be placed at wrong positions wherever the two builds differ.
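One quick sanity check is to compare the '##reference=' line of the VCF header, when present, with the build you are about to annotate with. This is a minimal sketch under that assumption: the function name 'check_build' is hypothetical, not every variant caller writes a '##reference=' line, and its value is often a file path, so the test is a simple substring match.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a VCF on stdin and check that the header's
# ##reference= value mentions the expected build name.
# returns 0 = match, 1 = mismatch, 2 = no ##reference line found
check_build() {
    local build="$1" ref
    ref=$(grep -m1 '^##reference=' | cut -d= -f2-)
    case "$ref" in
        "")         echo "WARN: no ##reference line found" >&2; return 2 ;;
        *"$build"*) echo "OK: header reference '$ref' mentions $build" ;;
        *)          echo "MISMATCH: header reference '$ref' vs $build" >&2; return 1 ;;
    esac
}

# Toy header, using the ##reference=hg19 value mentioned for the demo file
toy_header=$'##fileformat=VCFv4.1\n##reference=hg19'

printf '%s\n' "$toy_header" | check_build hg19
printf '%s\n' "$toy_header" | check_build hg38 || echo "-> do not annotate with hg38"
```

For a compressed file the header could be supplied with 'zcat file.vcf.gz | head -1000 | check_build hg19'.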

##varscan2 calls
invcf=varscan2_variants/chr21_NA18507_varscan.vcf.gz
outfolder=VCF_annotation
outvcf=chr21_NA18507_varscan-hg19.vcf
build=hg19
 
mkdir -p ${outfolder}
 
# annotate and save all results to $outfolder
java -jar $SNPEFF/snpEff.jar \
    -htmlStats ${outfolder}/varscan_snpEff_summary.html \
    ${build} ${invcf} > $outfolder/${outvcf}
 
# create .gz version and index
vcf2index  $outfolder/${outvcf}
 
## bcftools calls
invcf=bcftools_htslib_variants/chr21_NA18507_var_bcftools.flt-D1000.vcf.gz
outfolder=VCF_annotation
outvcf=chr21_NA18507_bcftools-hg19.vcf
build=hg19
 
# annotate and save all results to $outfolder
java -jar $SNPEFF/snpEff.jar \
    -htmlStats ${outfolder}/bcftools_snpEff_summary.html \
    ${build} ${invcf} > $outfolder/${outvcf}
 
# create .gz version and index
vcf2index  $outfolder/${outvcf}
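'vcf2index' is not a standard command; it is presumably a helper defined elsewhere in this training. Based only on the comment above ("create .gz version and index"), an equivalent could look like the sketch below, assuming that htslib's 'bgzip' and 'tabix' are installed. The name 'vcf2index_sketch' is chosen to make clear that this is not the course's own definition.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the course helper 'vcf2index':
# compress a VCF with bgzip and build a tabix index next to it.
vcf2index_sketch() {
    local vcf="$1"
    if ! command -v bgzip >/dev/null 2>&1 || ! command -v tabix >/dev/null 2>&1; then
        echo "bgzip and tabix (htslib) are required" >&2
        return 1
    fi
    # -c writes to stdout so the uncompressed file is kept
    bgzip -c "$vcf" > "$vcf.gz" && tabix -p vcf "$vcf.gz"
}
```

It would be called like the original, for example 'vcf2index_sketch $outfolder/${outvcf}', and leaves both the .vcf.gz and the .vcf.gz.tbi files in the output folder.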

The results of this command were copied to the BITS server:

As you can see from the report, the annotation added very valuable information to an otherwise quite flat list of genomic coordinates. The next thing a user will want is to isolate damaging variants or variants with non-synonymous effects. This is done in the next part using SnpSift.
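Before filtering, a quick tally of the Annotation_Impact subfield (third position in each ANN entry) gives a feeling for how many annotations fall in each class. The following sketch uses awk rather than SnpSift so that it runs without Java. The three toy records are taken from results on this page, with each ANN entry truncated to its first four subfields to keep the example short.

```shell
#!/usr/bin/env bash
# Count ANN entries per impact class (HIGH, MODERATE, LOW, MODIFIER).
# Every annotation of every record is counted, not just the first one.
tally_impact() {
    awk -F'\t' '
        !/^#/ {
            n = split($8, info, ";")
            for (i = 1; i <= n; i++) if (info[i] ~ /^ANN=/) {
                sub(/^ANN=/, "", info[i])
                m = split(info[i], ann, ",")
                for (j = 1; j <= m; j++) {
                    split(ann[j], f, "|")
                    count[f[3]]++            # f[3] = Annotation_Impact
                }
            }
        }
        END { for (k in count) print k "\t" count[k] }' | sort
}

# Toy records (ANN entries truncated to four subfields for this example)
toy_vcf=$(printf '%s\n' \
    $'chr21\t35334572\t.\tC\tT\t225.009\t.\tANN=T|stop_gained|HIGH|LINC00649,T|intron_variant|MODIFIER|LINC00649\tGT:PL\t0/1:255,0,255' \
    $'chr21\t45670770\t.\tT\tC\t54.0072\t.\tANN=C|protein_protein_contact|HIGH|DNMT3L,C|missense_variant|MODERATE|DNMT3L\tGT:PL\t0/1:84,0,97' \
    $'chr21\t14437495\t.\tC\tG\t143.032\t.\tANN=G|splice_acceptor_variant&intron_variant|HIGH|ANKRD30BP2\tGT:PL\t1/1:176,24,0')

impact_tally=$(printf '%s\n' "$toy_vcf" | tally_impact)
echo "$impact_tally"
```

On a real file the same function could be fed with 'zcat VCF_annotation/chr21_NA18507_bcftools-hg19.vcf.gz | tally_impact'.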

Filter and select relevant data from a VCF file with SnpSift

SnpSift is a toolbox that allows you to filter and manipulate snpEff-annotated files.

SnpSift full command list

SnpSift version 4.2 (build 2015-12-05), by Pablo Cingolani
 
Usage: java -jar SnpSift.jar [command] params...
Command is one of:
        alleleMat     : Create an allele matrix output.
        annotate      : Annotate 'ID' from a database (e.g. dbSnp). Assumes entries are sorted.
        annMem        : Annotate 'ID' from a database (e.g. dbSnp). Loads db in memory. Does not assume sorted entries.
        caseControl   : Compare how many variants are in 'case' and in 'control' groups; calculate p-values.
        ccs           : Case control summary. Case and control summaries by region, allele frequency and variant's functional effect.
        concordance   : Concordance metrics between two VCF files.
        covMat        : Create an covariance matrix output (allele matrix as input).
        dbnsfp        : Annotate with multiple entries from dbNSFP.
        extractFields : Extract fields from VCF file into tab separated format.
        filter        : Filter using arbitrary expressions
        geneSets      : Annotate using MSigDb gene sets (MSigDb includes: GO, KEGG, Reactome, BioCarta, etc.)
        gt            : Add Genotype to INFO fields and remove genotype fields when possible.
        gtfilter      : Filter genotype using arbitrary expressions.
        gwasCat       : Annotate using GWAS catalog
        hwe           : Calculate Hardy-Weimberg parameters and perform a godness of fit test.
        intersect     : Intersect intervals (genomic regions).
        intervals     : Keep variants that intersect with intervals.
        intIdx        : Keep variants that intersect with intervals. Index-based method: Used for large VCF file and a few intervals to retrieve
        join          : Join files by genomic region.
        phastCons     : Annotate using conservation scores (phastCons).
        private       : Annotate if a variant is private to a family or group.
        rmRefGen      : Remove reference genotypes.
        rmInfo        : Remove INFO fields.
        split         : Split VCF by chromosome.
        tstv          : Calculate transiton to transversion ratio.
        varType       : Annotate variant type (SNP,MNP,INS,DEL or MIXED).
        vcfCheck      : Check that VCF file is well formed.
        vcf2tped      : Convert VCF to TPED.
 
Options common to all SnpSift commands:
        -d                   : Debug.
        -download            : Download database, if not available locally. Default: true.
        -noDownload          : Do not download a database, if not available locally.
        -noLog               : Do not report usage statistics to server.
        -h                   : Help.
        -v                   : Verbose.

find variants with impact

The first question about variants is which of them have a strong impact on the gene product (the protein).

infolder=VCF_annotation
infile=chr21_NA18507_bcftools-hg19.vcf.gz
# infile=chr21_NA18507_varscan-hg19.vcf.gz
 
outfolder=VCF_filtering
mkdir -p ${outfolder}
outfile=filtered-${infile}
 
# filter 'STOP' variants and display results on screen 
# (goody: in bunches of 80 characters with one blank line between two calls)
java -jar $SNPEFF/SnpSift.jar \
    filter "ANN[0].EFFECT has 'stop_gained'" ${infolder}/${infile} | \
    awk 1 ORS='\n\n' | fold -w 80
 
# only one variant is found by this command, in a region reported as pseudogene
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  GAIIx-chr21-BWA.mem
chr21   35334572        .       C       T       225.009 .       ANN=T|stop_gained|HIGH|LINC00649|LINC00649|transc
ript|NM_001288961.1|protein_coding|2/2|c.283C>T|p.Gln95*|573/2263|283/405|95/134
||,T|intron_variant|MODIFIER|LINC00649|LINC00649|transcript|NR_038883.1|pseudoge
ne|2/2|n.618-6966C>T||||||      GT:PL   0/1:255,0,255
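Note that 'ANN[0]' in the expression above refers to the first annotation of each record only. The SnpSift documentation also describes 'ANN[*]', which matches when any annotation satisfies the condition. The difference can be illustrated without Java by the awk sketch below, which is a stand-in and not SnpSift's implementation. The function name 'has_effect' is hypothetical, the first toy record is shortened from the result above, and the second one is made up so that 'stop_gained' sits in second position.

```shell
#!/usr/bin/env bash
# Hypothetical emulation of "ANN[0].EFFECT has X" (mode=first)
# versus "ANN[*].EFFECT has X" (mode=any); prints CHROM and POS of hits.
has_effect() {
    awk -F'\t' -v eff="$1" -v mode="$2" '
        !/^#/ {
            n = split($8, info, ";")
            for (i = 1; i <= n; i++) if (info[i] ~ /^ANN=/) {
                sub(/^ANN=/, "", info[i])
                m = split(info[i], ann, ",")
                last = (mode == "first") ? 1 : m
                for (j = 1; j <= last; j++) {
                    split(ann[j], f, "|")
                    k = split(f[2], e, "&")   # combined effects are joined by "&"
                    for (x = 1; x <= k; x++)
                        if (e[x] == eff) { print $1 "\t" $2; next }
                }
            }
        }'
}

toy=$(printf '%s\n' \
    $'chr21\t35334572\t.\tC\tT\t225.009\t.\tANN=T|stop_gained|HIGH|LINC00649,T|intron_variant|MODIFIER|LINC00649' \
    $'toy\t1\t.\tC\tT\t.\t.\tANN=T|intron_variant|MODIFIER|GENE_A,T|stop_gained|HIGH|GENE_A')

first_hits=$(printf '%s\n' "$toy" | has_effect stop_gained first)
any_hits=$(printf '%s\n' "$toy" | has_effect stop_gained any)
echo "first annotation only:"; echo "$first_hits"
echo "any annotation:";        echo "$any_hits"
```

The made-up record is reported only in 'any' mode, which is the kind of call that a filter restricted to the first annotation would leave out.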

Which are the non-synonymous variants?

infolder=VCF_annotation
infile=chr21_NA18507_bcftools-hg19.vcf.gz
# infile=chr21_NA18507_varscan-hg19.vcf.gz
 
outfolder=VCF_filtering
mkdir -p ${outfolder}
 
# filter 'missense_variant' variants and save results to file
java -jar $SNPEFF/SnpSift.jar \
    filter "ANN[0].EFFECT has 'missense_variant'" ${infolder}/${infile} \
    > ${outfolder}/non-synonymous-${infile%%.gz}
 
 
# compress and index
vcf2index ${outfolder}/non-synonymous-${infile%%.gz}
 
# how many did we get?
grep -c -v "^#" ${outfolder}/non-synonymous-chr21_NA18507_bcftools-hg19.vcf
210
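A filtered VCF is still hard to read, so the next step is often a compact table. SnpSift offers the 'extractFields' command listed above for this purpose. The awk sketch below is an alternative stand-in that runs without Java and prints one row per missense annotation. The function name 'missense_rows' is hypothetical, and the toy record is the DNMT3L call from this page with its ANN value shortened to three of its four entries.

```shell
#!/usr/bin/env bash
# Hypothetical helper: one row per missense annotation.
# Output columns: CHROM, POS, REF, ALT, Gene_Name, Feature_ID, HGVS.c, HGVS.p
missense_rows() {
    awk -F'\t' '
        !/^#/ {
            n = split($8, info, ";")
            for (i = 1; i <= n; i++) if (info[i] ~ /^ANN=/) {
                sub(/^ANN=/, "", info[i])
                m = split(info[i], ann, ",")
                for (j = 1; j <= m; j++) {
                    split(ann[j], f, "|")
                    k = split(f[2], e, "&")
                    for (x = 1; x <= k; x++) if (e[x] == "missense_variant") {
                        # f[4]=Gene_Name f[7]=Feature_ID f[10]=HGVS.c f[11]=HGVS.p
                        printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n", $1, $2, $4, $5, f[4], f[7], f[10], f[11]
                        break
                    }
                }
            }
        }'
}

# DNMT3L record from this page (ANN shortened to three entries)
record=$'chr21\t45670770\t.\tT\tC\t54.0072\t.\tANN=C|protein_protein_contact|HIGH|DNMT3L|DNMT3L|interaction|NM_175867.2:4U7T_B:238_278|protein_coding|10/12|c.832A>G||||||,C|missense_variant|MODERATE|DNMT3L|DNMT3L|transcript|NM_013369.3|protein_coding|10/12|c.832A>G|p.Arg278Gly|1316/1706|832/1164|278/387||,C|missense_variant|MODERATE|DNMT3L|DNMT3L|transcript|NM_175867.2|protein_coding|10/12|c.832A>G|p.Arg278Gly|1316/1703|832/1161|278/386||\tGT:PL\t0/1:84,0,97'

missense_table=$(printf '%s\n' "$record" | missense_rows)
echo "$missense_table"
```

Each transcript gets its own row, so one variant can appear several times in the table.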

How many HIGH-IMPACT variants are predicted?

infolder=VCF_annotation
infile=chr21_NA18507_bcftools-hg19.vcf.gz
# infile=chr21_NA18507_varscan-hg19.vcf.gz
 
outfolder=VCF_filtering
mkdir -p ${outfolder}
 
# filter 'HIGH impact' variants and save results to file
java -jar $SNPEFF/SnpSift.jar filter  "EFF[*].IMPACT = 'HIGH'" ${infolder}/${infile} \
    > ${outfolder}/HIGH-${infile%%.gz}
 
# compress and index
vcf2index ${outfolder}/HIGH-${infile%%.gz}
 
# how many did we get?
grep -c -v "^#" ${outfolder}/HIGH-chr21_NA18507_bcftools-hg19.vcf
11
 
# review them
grep -v "^#" ${outfolder}/HIGH-chr21_NA18507_bcftools-hg19.vcf | \
    awk 1 ORS='\n\n' | fold -w 80

HIGH impact results (n=11)

chr21   11098723        .       T       C       225.009 .       ANN=C|splice_donor_variant&intron_variant|HIGH|BA
GE4|BAGE4|transcript|NM_181704.1|protein_coding|1/8|c.14+1A>G||||||,C|splice_don
or_variant&intron_variant|HIGH|BAGE5|BAGE5|transcript|NM_182484.1|protein_coding
|1/8|c.14+1A>G||||||WARNING_TRANSCRIPT_INCOMPLETE,C|splice_donor_variant&intron_
variant|HIGH|BAGE|BAGE|transcript|NM_001187.1|protein_coding|1/4|c.14+1A>G||||||
WARNING_TRANSCRIPT_INCOMPLETE,C|5_prime_UTR_variant|MODIFIER|BAGE3|BAGE3|transcr
ipt|NM_182481.1|protein_coding|1/10|c.-6A>G|||||6|,C|5_prime_UTR_variant|MODIFIE
R|BAGE2|BAGE2|transcript|NM_182482.2|protein_coding|1/10|c.-6A>G|||||6|;LOF=(BAG
E|BAGE|1|1.00),(BAGE4|BAGE4|1|1.00),(BAGE5|BAGE5|1|1.00)        GT:PL   0/1:255,0,255
 
chr21   14437495        .       C       G       143.032 .       ANN=G|splice_acceptor_variant&intron_variant|HIGH
|ANKRD30BP2|ANKRD30BP2|transcript|NR_026916.1|pseudogene|8/11|n.2166-1C>G|||||| 
GT:PL   1/1:176,24,0
 
chr21   31913981        .       AG      A       214.458 .       ANN=A|frameshift_variant|HIGH|KRTAP19-6|KRTAP19-
6|transcript|NM_181612.3|protein_coding|1/1|c.171delC|p.Tyr58fs|202/330|171/177|
57/58||,A|splice_acceptor_variant&splice_donor_variant&intron_variant|HIGH|KRTAP
19-6|KRTAP19-6|transcript|NM_001303120.1|protein_coding|1/1|c.170+1delC||||||;LO
F=(KRTAP19-6|KRTAP19-6|2|0.50)  GT:PL   1/1:255,135,0
 
chr21   31971075        .       TA      TAA     217.468 .       ANN=TAA|frameshift_variant|HIGH|KRTAP6-2|KRTAP
6-2|transcript|NM_181604.1|protein_coding|1/1|c.117_118insT|p.Tyr40fs|117/189|11
7/189|39/62||,TAA|upstream_gene_variant|MODIFIER|KRTAP22-1|KRTAP22-1|transcript|
NM_181620.1|protein_coding||c.-2364dupA|||||2363|;LOF=(KRTAP6-2|KRTAP6-2|1|1.00)
        GT:PL   0/1:255,0,255
 
chr21   32201969        .       GAA     GA      214.458 .       ANN=GA|frameshift_variant&splice_region_varian
t|HIGH|KRTAP7-1|KRTAP7-1|transcript|NM_181606.2|protein_coding|1/2|c.47delT|p.Il
e16fs|81/693|47/264|16/87||;LOF=(KRTAP7-1|KRTAP7-1|1|1.00)      GT:PL   1/1:255,120,0
 
chr21   34948684        .       GA      GAA     214.458 .       ANN=GAA|frameshift_variant|HIGH|SON|SON|transc
ript|NM_138927.2|protein_coding|12/12|c.7236dupA|p.Ala2413fs|7292/8426|7237/7281
|2413/2426||,GAA|frameshift_variant|HIGH|SON|SON|transcript|NM_001291412.1|prote
in_coding|11/11|c.1320dupA|p.Ala441fs|1376/2510|1321/1365|441/454||,GAA|downstre
am_gene_variant|MODIFIER|DONSON|DONSON|transcript|NM_017613.3|protein_coding||c.
*1927_*1928insT|||||1173|,GAA|intron_variant|MODIFIER|SON|SON|transcript|NR_1037
97.1|pseudogene|12/12|n.7357-39dupA||||||       GT:PL   1/1:255,87,0
 
chr21   34948696        .       GA      G       139.457 .       ANN=G|frameshift_variant|HIGH|SON|SON|transcript
|NM_138927.2|protein_coding|12/12|c.7248delA|p.Arg2416fs|7303/8426|7248/7281|241
6/2426||,G|frameshift_variant|HIGH|SON|SON|transcript|NM_001291412.1|protein_cod
ing|11/11|c.1332delA|p.Arg444fs|1387/2510|1332/1365|444/454||,G|downstream_gene_
variant|MODIFIER|DONSON|DONSON|transcript|NM_017613.3|protein_coding||c.*1916del
T|||||1162|,G|intron_variant|MODIFIER|SON|SON|transcript|NR_103797.1|pseudogene|
12/12|n.7357-27delA||||||       GT:PL   1/1:180,66,0
 
chr21   35334572        .       C       T       225.009 .       ANN=T|stop_gained|HIGH|LINC00649|LINC00649|transc
ript|NM_001288961.1|protein_coding|2/2|c.283C>T|p.Gln95*|573/2263|283/405|95/134
||,T|intron_variant|MODIFIER|LINC00649|LINC00649|transcript|NR_038883.1|pseudoge
ne|2/2|n.618-6966C>T||||||      GT:PL   0/1:255,0,255
 
chr21   45670770        .       T       C       54.0072 .       ANN=C|protein_protein_contact|HIGH|DNMT3L|DNMT3L|
interaction|NM_175867.2:4U7T_B:238_278|protein_coding|10/12|c.832A>G||||||,C|pro
tein_protein_contact|HIGH|DNMT3L|DNMT3L|interaction|NM_175867.2:4U7T_D:238_278|p
rotein_coding|10/12|c.832A>G||||||,C|missense_variant|MODERATE|DNMT3L|DNMT3L|tra
nscript|NM_013369.3|protein_coding|10/12|c.832A>G|p.Arg278Gly|1316/1706|832/1164
|278/387||,C|missense_variant|MODERATE|DNMT3L|DNMT3L|transcript|NM_175867.2|prot
ein_coding|10/12|c.832A>G|p.Arg278Gly|1316/1703|832/1161|278/386||      GT:PL   0/1:84,
0,97
 
chr21   45994841        .       A       C       124.008 .       ANN=C|stop_lost|HIGH|KRTAP10-4|KRTAP10-4|transcri
pt|NM_198687.2|protein_coding|1/1|c.1206A>C|p.Ter402Cysext*?|1236/1643|1206/1206
|402/401||,C|downstream_gene_variant|MODIFIER|KRTAP10-5|KRTAP10-5|transcript|NM_
198694.3|protein_coding||c.*4799T>G|||||4491|,C|intron_variant|MODIFIER|TSPEAR|T
SPEAR|transcript|NM_144991.2|protein_coding|1/11|c.83-6952T>G||||||,C|intron_var
iant|MODIFIER|TSPEAR|TSPEAR|transcript|NM_001272037.1|protein_coding|2/12|c.-122
-6952T>G||||||  GT:PL   0/1:154,0,134
 
chr21   46703410        .       C       T       119.008 .       ANN=T|protein_protein_contact|HIGH|POFUT2|POFUT2|
interaction|NM_133635.4:4AP5_A:139_191|protein_coding|3/9|c.415G>A||||||,T|prote
in_protein_contact|HIGH|POFUT2|POFUT2|interaction|NM_133635.4:4AP5_A:139_193|pro
tein_coding|3/9|c.415G>A||||||,T|protein_protein_contact|HIGH|POFUT2|POFUT2|inte
raction|NM_133635.4:4AP5_B:139_191|protein_coding|3/9|c.415G>A||||||,T|protein_p
rotein_contact|HIGH|POFUT2|POFUT2|interaction|NM_133635.4:4AP5_B:139_193|protein
_coding|3/9|c.415G>A||||||,T|missense_variant|MODERATE|POFUT2|POFUT2|transcript|
NM_133635.4|protein_coding|3/9|c.415G>A|p.Val139Ile|440/2869|415/1290|139/429||,
T|missense_variant|MODERATE|POFUT2|POFUT2|transcript|NM_015227.4|protein_coding|
3/8|c.415G>A|p.Val139Ile|440/4823|415/1275|139/424||,T|upstream_gene_variant|MOD
IFIER|LOC642852|LOC642852|transcript|NR_026943.1|pseudogene||n.-4557C>T|||||4557
|,T|non_coding_exon_variant|MODIFIER|POFUT2|POFUT2|transcript|NR_004858.1|pseudo
gene|3/10|n.440G>A||||||        GT:PL   0/1:149,0,136
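Each record above ends with the sample genotype (GT:PL), which tells whether the call is heterozygous (0/1) or homozygous (1/1). A minimal sketch for counting both classes with awk follows. The function name 'count_zygosity' is hypothetical, and the three toy records are shortened from the results above, with ANN truncated to the first four subfields of the first entry.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count heterozygous and homozygous genotypes of the
# first sample (column 10) in the VCF records read from stdin.
# Note: any call with two identical alleles counts as "hom", including 0/0.
count_zygosity() {
    awk -F'\t' '
        !/^#/ {
            gt = ""
            n = split($9, fmt, ":")          # FORMAT keys, e.g. GT:PL
            split($10, smp, ":")             # sample values in the same order
            for (i = 1; i <= n; i++) if (fmt[i] == "GT") gt = smp[i]
            if (gt == "") next
            split(gt, a, "[/|]")             # unphased "/" or phased "|"
            zyg = (a[1] == a[2]) ? "hom" : "het"
            count[zyg]++
        }
        END { printf "het=%d hom=%d\n", count["het"], count["hom"] }'
}

toy_high=$(printf '%s\n' \
    $'chr21\t11098723\t.\tT\tC\t225.009\t.\tANN=C|splice_donor_variant&intron_variant|HIGH|BAGE4\tGT:PL\t0/1:255,0,255' \
    $'chr21\t14437495\t.\tC\tG\t143.032\t.\tANN=G|splice_acceptor_variant&intron_variant|HIGH|ANKRD30BP2\tGT:PL\t1/1:176,24,0' \
    $'chr21\t31913981\t.\tAG\tA\t214.458\t.\tANN=A|frameshift_variant|HIGH|KRTAP19-6\tGT:PL\t1/1:255,135,0')

zygosity=$(printf '%s\n' "$toy_high" | count_zygosity)
echo "$zygosity"
```

The same function can be applied to the full HIGH-impact file by piping it in with 'cat' or 'zcat'.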

Many more and diverse operations (and complex combinations thereof) can be done using this tool.
Please read the full documentation here and try some commands using our VCF data.  

download exercise files

Download exercise files here

Use the right application to open the files present in ex6-files

References:
  1. http://snpeff.sourceforge.net
  2. Pablo Cingolani, Adrian Platts, Le Lily Wang, Melissa Coon, Tung Nguyen, Luan Wang, Susan J Land, Xiangyi Lu, Douglas M Ruden
    A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.
    Fly (Austin): 2012, 6(2);80-92
    [PubMed:22728672]

    Pablo Cingolani, Viral M Patel, Melissa Coon, Tung Nguyen, Susan J Land, Douglas M Ruden, Xiangyi Lu
    Using Drosophila melanogaster as a Model for Genotoxic Chemical Mutational Studies with a New Program, SnpSift.
    Front Genet: 2012, 3;35
    [PubMed:22435069]

  3. http://www.sph.umich.edu/csg/liyanmin/vcfCodingSnps
  4. http://www.openbioinformatics.org/annovar/
  5. http://snpeff.sourceforge.net/SnpEff_manual.html
  6. http://snpeff.sourceforge.net/SnpSift.html
