NGS Exercise.7 annovar

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.6 | NGS_Exercise.7 | NGS Exercise.7_SnpEff | NGS_Exercise.7_vcfCodingSnps | NGS Exercise.8 ]
# updated 2014 version


Add annotations to variant calling format


ex07_wf.png

Annovar overview

Variant lists are important but often long and not easy to evaluate. In order to rank candidate variant for validation, we need to know where these variants occur and what effect they may have on the regulation of genes when close or included into a gene region or on the protein product when falling into exons.

Handicon.png Whatever software you apply to clean your variant calls, you will still need to validate what you get from NGS :o)

Several tools exist to annotate variant lists and predict variant effects, we present here two popular tools

Other valuable variant annotation tools exist, including:

And many more that you can find with Dr G. search now

Annovar was written in Perl and is very straightforward to use when one has bascic knowledge of CLI and file parsing. It is not a fully automated process and requires decisions at run time. Annovar requires installing a large number of reference and annotation files on your computer and therefore was not very adapted to today training. Only few relevant files were installed. Annovar supports most if not all UCSC table databases with minor or no modifications, it is also compatible with several key data formats including BED and VCF.

A considerable amount of resources can be attached to Annovar and used for annotation and filtering. Please read the Annovar database page [10] to get a clear idea.

Publication: Wang K, Li M, Hakonarson H. ANNOVAR: Functional annotation of genetic variants from next-generation sequencing data Nucleic Acids Research, 38:e164, 2010.

Handicon.png We recommend Annovar for users that need confidentiality and/or will regularly annotate variants since it allows building workflow scripts to do so in a straightforward way.

annotate_variation.pl command parameters

Usage:
     annotate_variation.pl [arguments] <query-file|table-name> <database-location>
 
     Optional arguments:
            -h, --help                      print help message
            -m, --man                       print complete documentation
            -v, --verbose                   use verbose output
 
            Arguments to download databases or perform annotations
                --downdb                    download annotation database
                --geneanno                  annotate variants by gene-based annotation (infer functional consequence on genes)
                --regionanno                annotate variants by region-based annotation (find overlapped regions in database)
                --filter                    annotate variants by filter-based annotation (find identical variants in database)
 
            Arguments to control input and output
                --outfile <file>            output file prefix
                --webfrom <string>          specify the source of database (ucsc or annovar or URL) (downdb operation)
                --dbtype <string>           specify database type
                --buildver <string>         specify genome build version (default: hg18 for human)
                --time                      print out local time during program run
                --comment                   print out comment line (those starting with #) in output files 
                --exonsort                  sort the exon number in output line (gene-based annotation)
                --transcript_function       use transcript name rather than gene name (gene-based annotation)
                --hgvs                      use HGVS format for exonic annotation (c.122C>T rather than c.C122T) (gene-based annotation)
                --separate                  separately print out all functions of a variant in several lines (gene-based annotation)
                --seq_padding               create a new file with cDNA sequence padded by this much either side (gene-based annotation)
                --(no)firstcodondel         treat first codon deletion as wholegene deletion (default: ON) (gene-based annotation)
                --aamatrix <file>           specify an amino acid substitution matrix file (gene-based annotation)
                --colsWanted <string>       specify which columns to output by comma-delimited numbers (region-based annotation)
                --scorecolumn <int>         the column with scores in DB file (region-based annotation)
                --gff3dbfile <file>         specify a DB file in GFF3 format (region-based annotation)
                --bedfile <file>            specify a DB file in BED format file (region-based annotation)
                --gff3attribute             output all fields in GFF3 attribute (default: ID and score only)
                --genericdbfile <file>      specify a DB file in generic format (filter-based annotation)
                --vcfdbfile <file>          specify a DB file in VCF format (filter-based annotation)
                --otherinfo                 print out additional columns in database file (filter-based annotation)
                --infoasscore               use INFO field in VCF file as score in output (filter-based annotation)
 
 
            Arguments to fine-tune the annotation procedure
                --batchsize <int>           batch size for processing variants per batch (default: 5m)
                --genomebinsize <int>       bin size to speed up search (default: 100k for -geneanno, 10k for -regionanno)
                --expandbin <int>           check nearby bin to find neighboring genes (default: 2m/genomebinsize)
                --neargene <int>            distance threshold to define upstream/downstream of a gene
                --exonicsplicing            report exonic variants near exon/intron boundary as 'exonic;splicing' variants
                --score_threshold <float>   minimum score of DB regions to use in annotation
                --reverse                   reverse directionality to compare to score_threshold
                --normscore_threshold <float> minimum normalized score of DB regions to use in annotation
                --rawscore                  output includes the raw score (not normalized score) in UCSC Browser Track
                --minqueryfrac <float>      minimum percentage of query overlap to define match to DB (default: 0)
                --splicing_threshold <int>  distance between splicing variants and exon/intron boundary (default: 2)
                --indel_splicing_threshold <int>    if set, use this value for allowed indel size for splicing variants (default: --splicing_threshold)
                --maf_threshold <float>     filter 1000G variants with MAF above this threshold (default: 0)
                --sift_threshold <float>    SIFT threshold for deleterious prediction for -dbtype avsift (default: 0.05)
                --precedence <string>       comma-delimited to specify precedence of variant function (default: exonic>intronic...)
                --indexfilter_threshold <float>     controls whether filter-based annotation use index if this fraction of bins need to be scanned (default: 0.9)
 
           Arguments to control memory usage
                --memfree <int>             ensure minimum amount of free system memory (default: 0)
                --memtotal <int>            limit total amount of memory used by ANNOVAR (default: 0, unlimited, in the order of kb)
                --chromosome <string>       examine these specific chromosomes in database file
 
 
     Function: annotate a list of genetic variants against genome annotation 
     databases stored at local disk.
 
     Example: #download gene annotation database (for hg18 build) and save to humandb/ directory
              annotate_variation.pl -downdb refGene humandb/
              annotate_variation.pl -buildver mm9 -downdb  mousedb/
              annotate_variation.pl -downdb -webfrom annovar esp6500si_all humandb/
 
              #gene-based annotation of variants in the varlist file (by default --geneanno is ON)
              annotate_variation.pl ex1.human humandb/
 
              #region-based annotate variants
              annotate_variation.pl -regionanno -dbtype cytoBand ex1.human humandb/
              annotate_variation.pl -regionanno -dbtype gff3 -gff3dbfile tfbs.gff3 ex1.human humandb/
 
              #filter rare or unreported variants (in 1000G/dbSNP) or predicted deleterious variants
              annotate_variation.pl -filter -dbtype 1000g2012apr_all -maf 0.01 ex1.human humandb/
              annotate_variation.pl -filter -dbtype snp130 ex1.human humandb/
              annotate_variation.pl -filter -dbtype avsift ex1.human humandb/
 
     Version: $LastChangedDate: 2013-08-23 11:32:41 -0700 (Fri, 23 Aug 2013) $

Annovar step by step

Convert vcf to annovar format

Annovar requires a strict format for input, the format is well documented in the web documentation and a command is ready to process our VCF data as follows:

# we use here the data obtained from the tripple vcf intersect described above
infolder=Final_Results/vcf-venn/gold-set
infile=chr21_common_NA18507.vcf.gz
outfolder=annovar-results
mkdir -p ${outfolder}
 
# we now decompress the data 'on the fly' and feed it to annovar
convert2annovar.pl -format vcf4 \
	<(zcat ${infolder}/${infile}) > \
	${outfolder}/chr21_common_NA18507.annovar
NOTICE: Finished reading 69940 lines from VCF file
NOTICE: A total of 69851 locus in VCF file passed QC threshold, representing 62895 SNPs \
    (42796 transitions and 20099 transversions) and 7074 indels/substitutions
NOTICE: Finished writting 62886 SNPs (42793 transitions and 20093 transversions) and 6965 indels/substitutions for 1 sample

The obtained file is ready for annotation. We will use the hg19 RefSeq gene table to add information in the intergenic and exonic parts hit by our chr21 variants.

# inspect the first 5 calls in the VCF file
infolder=Final_Results/vcf-venn/gold-set
infile=chr21_common_NA18507.vcf.gz
 
zgrep -v "^##" ${infolder}/${infile} | head -6 | column -t
#CHROM  POS      ID  REF  ALT  QUAL  FILTER  INFO      FORMAT    GAIIx-chr21-BWA.mem
chr21   9467416  .   C    T    86    .       SF=0,1    GT:PL:GQ  0/1:116,0,126:99
chr21   9467417  .   A    C    111   .       SF=0,1,2  GT:PL:GQ  0/1:141,0,95:98
chr21   9471670  .   A    G    37.8  .       SF=0,1,2  GT:PL:GQ  1/1:69,6,0:10
chr21   9472931  .   T    G    52    .       SF=0,1,2  GT:PL:GQ  1/1:84,9,0:16
chr21   9473159  .   A    G    52    .       SF=0,1,2  GT:PL:GQ  1/1:84,9,0:16
# inspect the top-5 lines in the annovar result ( | column -t to make it pretty)
infolder=annovar-results
 
head -5 ${infolder}/chr21_common_NA18507.annovar | column -t
chr21  9467416  9467416  C  T  het  86    .
chr21  9467417  9467417  A  C  het  111   .
chr21  9471670  9471670  A  G  hom  37.8  .
chr21  9472931  9472931  T  G  hom  52    .
chr21  9473159  9473159  A  G  hom  52    .

As seen by comparing both extracts, the before last 2 columns report respectively, the ploidy of the call, and the quality score.

The next step will use Annovar to find variants faling in RefSeq exons or close to a RefSeq gene.

Handicon.png no need to run the following command, it was done already for you

# obtain up-to-date annotation from the annovar file server for hg19:RefSeq
annovardb=$BIOTOOLS"/annovar/humandb/"
#not run# annotate_variation.pl -buildver hg19 -downdb refGene ${annovardb}

When run, the command reports downloading several files

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz ... OK
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refLink.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg19_refGeneMrna.fa.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the '/opt/biotools/annovar/humandb' directory

Genomic annotation of the converted data using Annovar

# define path
infolder=Final_Results/annovar-results
outfolder=annovar-results
mkdir -p ${outfolder}
 
annovardb=$BIOTOOLS"/annovar/humandb/"
 
# start annotating
annotate_variation.pl \
	--buildver hg19 \
	--geneanno \
	--outfile ${outfolder}/common_NA18507 \
	${infolder}/chr21_common_NA18507.annovar \
	${annovardb}

Technical.png The comment reports downloading several files and complains about discrepancies between our ref alleles and bases in annovar hg19 fasta file. This is often seen and reflects discrepancies between genbank mRNA records used for annotation and the reference genome sequence (no worries)

NOTICE: Reading gene annotation from /opt/biotools/annovar/humandb/hg19_refGene.txt ... Done with 47806 transcripts \
    (including 9829 without coding sequence annotation) for 25224 unique genes
NOTICE: Reading FASTA sequences from /opt/biotools/annovar/humandb/hg19_refGeneMrna.fa ... Done with 277 sequences
WARNING: A total of 3 sequences cannot be found in /opt/biotools/annovar/humandb/hg19_refGeneMrna.fa
 (example: NM_001291412 NM_001291411 NM_001290224)
WARNING: A total of 301 sequences will be ignored due to lack of correct ORF annotation
 
---------------------------------------------------------------------------------------
WARNING: 7 exonic SNPs have WRONG reference alleles specified in your input file!
WARNING: An example input line is <chr21	35467645	35467645	A	G	hom	222	.>
WARNING: ANNOVAR can still annotate exonic_variant_function for the mutation correctly!
WARNING: you may have used wrong -buildver, or specified incorrect reference allele, or used outdated mRNA FASTA file!
---------------------------------------------------------------------------------------
 
NOTICE: Finished gene-based annotation on 69851 genetic variants in annovar-results/chr21_common_NA18507.annovar
NOTICE: Output files were written to annovar-results/common_NA18507.variant_function, \
    annovar-results/common_NA18507.exonic_variant_function

Intergenic variants annotated by Annovar

infolder=annovar-results
head -5 ${infolder}/common_NA18507.variant_function | column -t

command results

intergenic  NONE(dist=NONE),MIR3648(dist=358416)  chr21  9467416  9467416  C  T  het  86    .
intergenic  NONE(dist=NONE),MIR3648(dist=358415)  chr21  9467417  9467417  A  C  het  111   .
intergenic  NONE(dist=NONE),MIR3648(dist=354162)  chr21  9471670  9471670  A  G  hom  37.8  .
intergenic  NONE(dist=NONE),MIR3648(dist=352901)  chr21  9472931  9472931  T  G  hom  52    .
intergenic  NONE(dist=NONE),MIR3648(dist=352673)  chr21  9473159  9473159  A  G  hom  52    .

Note that all the information present in the vcf-converted input is reported in annovar annotation files. This information can now be used to filter and decide of which variants to believe and validate.

Handicon.png You can easily create a bed formatted version of this output with gawk that can be viewed in IGV. The score column uses the quality score Q and the depth is stored in the name (empty in this case)

Process intergenic calls

When columns need to be reordered, AWK is here to help, the following command presents the first line of input with numbers to help us build a correct awk script.

infolder=Final_Results/annovar-results
 
testfile=${infolder}/common_NA18507.variant_function
 
head -1 ${testfile} | transpose -t | awk '{print NR,$0}'
 
1 intergenic
2 NONE(dist=NONE),MIR3648(dist=358416)
3 chr21
4 9467416
5 9467416
6 C
7 T
8 het
9 86
10 .
infolder=Final_Results/annovar-results
outfolder=annovar-results
mkdir -p ${outfolder}
 
# switch back to 0-based open coordinates
# replace spaces by '_'  in name field
gawk 'BEGIN{FS="\t"; OFS="\t"}
{
namef=$6"/"$7"|"$8"|"$2"|"$3"|DP="$10
gsub(/ /, "_", namef)
print $3, $4-1, $5, namef ,"+", $9
}' ${infolder}/common_NA18507.variant_function > \
	${outfolder}/common_NA18507.variant_function.bed
infolder=Final_Results/annovar-results
 
head -5 ${infolder}/common_NA18507.variant_function.bed

intergenic BED results

==> common_NA18507.variant_function.bed <==
chr21	9467415	9467416	C/T|het|NONE(dist=NONE),MIR3648(dist=358416)|chr21|DP=.	+	86
chr21	9467416	9467417	A/C|het|NONE(dist=NONE),MIR3648(dist=358415)|chr21|DP=.	+	111
chr21	9471669	9471670	A/G|hom|NONE(dist=NONE),MIR3648(dist=354162)|chr21|DP=.	+	37.8
chr21	9472930	9472931	T/G|hom|NONE(dist=NONE),MIR3648(dist=352901)|chr21|DP=.	+	52
chr21	9473158	9473159	A/G|hom|NONE(dist=NONE),MIR3648(dist=352673)|chr21|DP=.	+	52

Exomic variants annotated by Annovar

The obtained files report variants with additional fields where the coding was added for CDS variants and gene symbols for all exomic calls.

As seen below, the added columns inform about the type of variation, the change as compared to the RefSeq mRNA sequence, and finally, when applies, the change at protein level (incl frameshifts and added/removed start and stop CDS). A number of possible annotations are listed on the annovar web site. (please read the full documentation).

infolder=annovar-results
head -5 ${infolder}/common_NA18507.exonic_variant_function
bits@bits-Aspire-V3-571G ~/NGS_DNASeq-training $ head -5 ${infolder}/common_NA18507.exonic_variant_function 
line1351	stopgain SNV	TPTE:NM_199259:exon23:c.C1518A:p.C506X,TPTE:NM_199260:exon22:c.C1458A:p.C486X,TPTE:\
    NM_199261:exon24:c.C1572A:p.C524X,	chr21	10906989	10906989	C	T	het	22.
line1369	stoploss SNV	TPTE:NM_199259:exon21:c.G1391C:p.X464S,TPTE:NM_199260:exon20:c.G1331C:p.X444S,TPTE:\
    NM_199261:exon22:c.G1445C:p.X482S,	chr21	10910311	10910311	T	G	hom	22.
line1370	synonymous SNV	TPTE:NM_199259:exon21:c.T1362C:p.V454V,TPTE:NM_199260:exon20:c.T1302C:p.V434V,TPTE:\
    NM_199261:exon22:c.T1416C:p.V472V,	chr21	10910340	10910340	C	G	het	13.
line1371	nonsynonymous SNV	TPTE:NM_199259:exon21:c.G1355C:p.S452T,TPTE:NM_199260:exon20:c.G1295C:p.S432T,TPTE:\
    NM_199261:exon22:c.G1409C:p.S470T,	chr21	10910347	10910347	A	G	ho222	.
line1413	nonsynonymous SNV	TPTE:NM_199259:exon18:c.T1102G:p.C368G,TPTE:NM_199260:exon17:c.T1042G:p.C348G,TPTE:\
    NM_199261:exon19:c.T1156G:p.C386G,	chr21	10920098	10920098	T	C	ho222	.

Process exomic calls (! fields are different with one additional column in the Annovar file)

infolder=Final_Results/annovar-results
outfolder=annovar-results
mkdir -p ${outfolder}
 
# switch back to 0-based open coordinates
# replace spaces by '_'  in name field
gawk 'BEGIN{FS="\t"; OFS="\t"}
{
namef=$7"/"$8"|"$9"|"$2"|"$3"|DP="$11
gsub(/ /, "_", namef)
print $4, $5-1, $6, namef, "+", $10
}' ${infolder}/common_NA18507.exonic_variant_function > \
	${outfolder}/common_NA18507.exonic_variant_function.bed
infolder=Final_Results/annovar-results
head -5 ${infolder}/common_NA18507.exonic_variant_function.bed

exomic BED results

==> common_NA18507.exonic_variant_function.bed <==
chr21	10906988	10906989	C/T|het|stopgain_SNV|TPTE:NM_199259:exon23:c.C1518A:p.C506X,\
                                        TPTE:NM_199260:exon22:c.C1458A:p.C486X,TPTE:NM_199261:exon24:c.C1572A:p.C524X,|DP=.	\
                                        +	225
chr21	10910310	10910311	T/G|hom|stoploss_SNV|TPTE:NM_199259:exon21:c.G1391C:p.X464S,\
                                        TPTE:NM_199260:exon20:c.G1331C:p.X444S,\
                                        TPTE:NM_199261:exon22:c.G1445C:p.X482S,|DP=.	\
                                        +	222
chr21	10910339	10910340	C/G|het|synonymous_SNV|TPTE:NM_199259:exon21:c.T1362C:p.V454V,\
                                        TPTE:NM_199260:exon20:c.T1302C:p.V434V,TPTE:NM_199261:exon22:c.T1416C:p.V472V,|DP=.	\
                                        +	139
chr21	10910346	10910347	A/G|hom|nonsynonymous_SNV|TPTE:NM_199259:exon21:c.G1355C:p.S452T,\
                                        TPTE:NM_199260:exon20:c.G1295C:p.S432T,TPTE:NM_199261:exon22:c.G1409C:p.S470T,|DP=.	\
                                        +	222
chr21	10920097	10920098	T/C|hom|nonsynonymous_SNV|TPTE:NM_199259:exon18:c.T1102G:p.C368G,\
                                        TPTE:NM_199260:exon17:c.T1042G:p.C348G,TPTE:NM_199261:exon19:c.T1156G:p.C386G,|DP=.	\
                                        +	222

Handicon.png This BED file can be loaded in IGV very easily to create a track used to guide you through the exploration of each variant

Identify candidate deleterious variants

As we work today with genome information from a healthy donnor (1000g YRI person), we do not really expect finding disease associated variants.

We will however look for potentially detrimental exomic mutations based on gene annotations produced by annovar. One way to do this is to first identify interesting annotation categories in the annovar results then find where they occur in the genome. A simple example is provided here to find all possible exomic effects stored in column #2 of the annovar raw data.

infolder=Final_Results/annovar-results
cut -f 2 ${infolder}/common_NA18507.exonic_variant_function | \
	sort | uniq -c | sort -n

Handicon.png Note the second sort -n to rank the lines by counts; as you see here again, piping different commands in a meaningful sequence is very powerful and has no limits

      1 frameshift insertion
      1 nonframeshift insertion
      2 frameshift deletion
      2 stopgain SNV
      2 stoploss SNV
      3 nonframeshift deletion
    172 synonymous SNV
    187 nonsynonymous SNV

Searching for the unique 'stop-loss' call is easy with a grep command on the BED file (could also be done on the raw data).

infolder=annovar-results
grep "stoploss" ${infolder}/common_NA18507.exonic_variant_function.bed
chr21	42821113	42821113	T/C|het|stoploss SNV|MX1:NM_001178046:exon12:c.A1323C:p.X441Y,|DP=.	+	225
chr21	45994841	45994841	A/C|het|stoploss SNV|KRTAP10-4:NM_198687:exon1:c.A1206C:p.X402C,|DP=.	+	124

Technical.png Slight differences between RefSeq files used by Annovar and by other tools (SnpEff) will lead to slight differences in results. Some splice variants might be erroneously annotated in one file or simply absent leading to different VCF annotations added. You should always control selected results by eye using IGV to ensure that the reads really support the annotation

Searching for 'synonymous calls' is easy with a grep -w command on the annovar raw data or BED data.

infolder=Final_Results/annovar-results
 
# on raw data, the space before SNV can be used
grep -w "synonymous" ${infolder}/common_NA18507.exonic_variant_function.bed | head -5
 
# on BED data, _SNV shoul dbeadded
grep -w "synonymous_SNV" ${infolder}/common_NA18507.exonic_variant_function.bed | head -5
chr21	10910339	10910340	C/G|het|synonymous_SNV|TPTE:NM_199259:exon21:c.T1362C:p.V454V,\
                                        TPTE:NM_199260:exon20:c.T1302C:p.V434V,\
                                        TPTE:NM_199261:exon22:c.T1416C:p.V472V,|DP=.	\
                                        +	139
chr21	10970016	10970017	C/T|het|synonymous_SNV|TPTE:NM_199259:exon6:c.T111A:p.T37T,\
                                        TPTE:NM_199260:exon6:c.A111A:p.E37E,\
                                        TPTE:NM_199261:exon6:c.A111A:p.X37X,|DP=.	\
                                        +	225
chr21	15918576	15918577	G/C|hom|synonymous_SNV|SAMSN1:NM_022136:exon1:c.C6G:p.L2L,|DP=.	\
                                        +	222
chr21	16340288	16340289	C/T|het|synonymous_SNV|NRIP1:NM_003489:exon4:c.G225A:p.G75G,|DP=.	\
                                        +	225
chr21	17203784	17203785	A/T|het|synonymous_SNV|USP25:NM_001283041:exon16:c.A1830T:p.A610A,\
                                        USP25:NM_013396:exon16:c.A1830T:p.A610A,\
                                        USP25:NM_001283042:exon16:c.A1830T:p.A610A,|DP=.	\
                                        +	153

Handicon.png When using 'grep', adding the -w option requires that the searched word be surrounded by spaces or line ends, this avoids finding a string within a longer word (eg 'synonymous' within 'nonsynonymous')

IGV is used to confirm the call from the aligned reads (chr21:45994841-45994841, at the end of the TSPEAR gene)

IGV at the position of the call

chr21_45994841-45994841_stoploss.png

About half of the reads support the het call.

Handicon.png Any other type of interesting variant SHOULD be inspected similarly before claiming victory

NOTE: Annovar is not limited to annotating your data, it can also filter it to keep or trash variants based on genomic context or any external annotation database you can provide. Please refer to the Annovar online documentation for more info. As incentive, the next block of text was taken from the site and shows particular Annovar filters (non exhaustive).

ANNOVAR Function II: Region-based annotation
 
- Most conserved element annotation
- Transcription factor binding site annotation
- Identify cytogenetic band for genetic variants
- Identify variants located in segmental duplications
- Identify previously reported structural variants in DGV (Database of Genomic Variants)
- Identify variants reported in previously published GWAS (Genome-wide association studies
- Identify variants in ENCODE annotated regions (transcribed regions, 
	H3K4Me1 regions, H3K4Me3 regions, H3K27Ac regions, DNaseI hypersensitivity	
	regions, transcription factor ChIP-Seq regions, etc)
- Identify dbSNP variants in user-specified regions
- Identify variants in other genomic regions annotated with other functions
- Annotating custom-made databases conforming to GFF3 (Generic Feature Format version 3)
- Identifying variants specified in regions in BED file

Filter dbSNP variants based on dbSNP.138

As example, the Annovar filter command is applied here with the dbSNP137 database obtained from the UCSC table repository.

Handicon.png It is known that dbSNP versions > .129 contain disease variants and are not really suited for filtering

Recently, the UCSC team produced different subsets of dbSNP files including a set limited to known disease variants (flagged) and a set excluding known disease variants (common) that are better suited for filtering. Please refer to the UCSC table page dedicated to variants (see screenshot in the collapsed area below and [11]) for more info (your can download .txt.gz files from your browserfrom the UCSC FTP page[12])

dbSNP subsets

This track contains information about a subset of the single nucleotide polymorphisms and small insertions and deletions (indels) — collectively Simple Nucleotide Polymorphisms — from dbSNP build 138, available from ftp.ncbi.nih.gov/snp. Only SNPs that have a minor allele frequency of at least 1% and are mapped to a single location in the reference genome assembly are included in this subset. Frequency data are not available for all SNPs, so this subset is incomplete.

The selection of SNPs with a minor allele frequency of 1% or greater is an attempt to identify variants that appear to be reasonably common in the general population. Taken as a set, common variants should be less likely to be associated with severe genetic diseases due to the effects of natural selection, following the view that deleterious variants are not likely to become common in the population. However, the significance of any particular variant should be interpreted only by a trained medical geneticist using all available information.

The dbSNP.138 data is present in the following tracks:
* All SNPs(138) - all SNPs from dbSNP mapping to reference assembly.
 
* Common SNPs(138) - SNPs with >= 1% minor allele frequency (MAF), 
    mapping only once to reference assembly.
* Flagged SNPs(138) - SNPs < 1% minor allele frequency (MAF) (or unknown), 
    mapping only once to reference assembly, 
    flagged in dbSnp as "clinically associated" -- not necessarily a risk allele!
* Mult. SNPs(138) - SNPs mapping in more than one place on reference assembly.

The table page looks as follows and allows to download data in your browser

dbSNP_table.png

Handicon.png Note that all other tracks presented here are also suited for filtering

infolder=Final_Results/annovar-results
infile=chr21_common_NA18507.annovar
 
# annovar DB folder
annovardb=$BIOTOOLS"/annovar/humandb/"
# choose version here
dbsnpver=138
 
# download 'all' and 'common' dbSNP.${dbsnpver} data from the UCSC
 
# 1] download full database (~2GB gz file!! this can take some time!)
# the following lines were commented out to prevent you from re-running them ;-)
## annotate_variation.pl \
##     --buildver hg19 \
##     --downdb snp${dbsnpver} \
##     ${annovardb}
 
# filter all dbSNP.${dbsnpver} variants
# NOT RUN either, read comment below
## annotate_variation.pl \
##     --buildver hg19 \
##     --filter \
##     --dbtype snp${dbsnpver} \
##     ${infolder}/${infile} \
##     ${annovardb}

Cli tools.png It is SAFER to consider only common SNVs with >1% minor allele variants (unlikely disease-associated)

# 2] download common records only
# the following lines were commented out to prevent you from re-running them ;-)
## annotate_variation.pl \
##     --buildver hg19 \
##     --downdb snp${dbsnpver}Common \
##     ${annovardb}
 
infolder=Final_Results/annovar-results
infile=chr21_common_NA18507.annovar
 
# filter common dbSNP.${dbsnpver} variants
annotate_variation.pl \
    --buildver hg19 \
    --filter \
    --dbtype snp${dbsnpver}Common \
    ${infolder}/${infile} \
    ${annovardb}
# 3] other databases present on the UCSC ftp server include 'Flagged' and 'Mult' subsets 
# corresponding to the disease variants and SNPS mapping to multiple locations (likely artefacts).

The run produces two files, one for variants that correspond to the filter database (dropped) and one for the variants absent from the databases (filtered). The files are named by the input name followed by indication of the build and database used as follows (<infile>.<build>_<snp${dbsnpver}>_dropped|filtered)

filtering results for dbSNP.138_Common

NOTICE: Variants matching filtering criteria are written to 
    annovar-results/chr21_common_NA18507.annovar.hg19_snp138Common_dropped, 
    other variants are written to annovar-results/chr21_common_NA18507.annovar.hg19_snp138Common_filtered
NOTICE: Processing next batch with 69851 unique variants in 69851 input lines
NOTICE: Database index loaded. Total number of bins is 2719867 and the number of bins to be scanned is 25890
NOTICE: Scanning filter database /opt/biotools/biodata/humandb/hg19_snp138Common.txt...Done
 
# resulting records are as follows:
wc -l annovar-results/chr21_common_NA18507.annovar.hg19_snp138Common_*
  55634 annovar-results/chr21_common_NA18507.annovar.hg19_snp138Common_dropped
  14217 annovar-results/chr21_common_NA18507.annovar.hg19_snp138Common_filtered
  69851 total
 
head -5 annovar-results/chr21_common_NA18507.annovar.hg19_snp138Common_dropped | column -t
snp138Common  rs28880928  chr21  10703925  10703925  T  C  hom  222   .
snp138Common  rs76134637  chr21  10703974  10703974  C  A  het  225   .
snp138Common  rs76601778  chr21  10707729  10707729  T  G  het  163   .
snp138Common  rs28844326  chr21  10707784  10707784  C  T  het  191   .
snp138Common  rs74784077  chr21  10707945  10707945  C  T  het  18.1  .
 
head -5 annovar-results/chr21_common_NA18507.annovar.hg19_snp138Common_filtered | column -t
chr21  32548924  32548924  -   TAAACCAAG  het  217  .
chr21  34394243  34394243  C   A          het  214  .
chr21  28674151  28674151  T   -          het  217  .
chr21  9476303   9476303   G   T          hom  63   .
chr21  37876293  37876294  GA  -          het  217  .

BEDtools to create a subset of the original VCF from a subset of annotated records

As seen above, converting VCF for Annovar use leads to data loss that may be good to have in downstream analysis. One way to subset the full VCF is to intersect it with a subset of Annovar annotated lines using Bedtools.

infolder=vcf-venn/gold-set
outfolder=annovar-results
mkdir -p ${outfolder}
 
infile=chr21_common_NA18507.vcf.gz
 
# isolate the nonsynonymous calls counted above to create a filter set
# 187 nonsynonymous SNV
grep "nonsynonymous" ${outfolder}/common_NA18507.exonic_variant_function.bed > ${outfolder}/nonsynonymous_subset.bed
 
## INTERSECT with BEDTOOLS
# run bedtools intersect with specific options
#   -header replicates the vcf header in the output
#   -wa reports entries in 'a' = vcf file
#   -f 1 requires 100% reciprocal intersect (identity)
bedtools intersect -wa -header -f 1 -a ${infolder}/${infile} \
      -b ${outfolder}/nonsynonymous_subset.bed >\
      ${outfolder}/nonsynonymous_subset.vcf
 
# count results
grep -v "^#" -c ${outfolder}/nonsynonymous_subset.vcf
187

Technical.png It occures that different row counts are found when using differently formatted data. This is a consequence of the BED format being 0-based open and the VCF 1-based open and is not what we expect. When intersecting VCF data, it is preferable to use VCFtools rather than BED tools

VCFtools to create a subset of the original VCF from a subset of annotated records

infolder=vcf-venn/gold-set
infile=chr21_common_NA18507.vcf
outfolder=annovar-results
mkdir -p ${outfolder}
 
# isolate the nonsynonymous calls counted above to create a filter set
grep -c "nonsynonymous" ${outfolder}/common_NA18507.exonic_variant_function.bed
# 187 nonsynonymous SNV
grep "nonsynonymous" ${outfolder}/common_NA18507.exonic_variant_function.bed > ${outfolder}/nonsynonymous_subset.bed
 
## INTERSECT with VCFTOOLS
# decompress the vcf.gz file to the output folder
# vcftools sometimes accepts .gz files ... with --gzvcf but this is here an issue
gzip -cd ${infolder}/${infile}.gz > ${outfolder}/${infile}
 
vcftools --vcf ${outfolder}/${infile} \
      --out ${outfolder}/vcftools-nonsynonymous_subset \
      --recode \
      --bed ${outfolder}/nonsynonymous_subset.bed
VCFtools - v0.1.12a
(C) Adam Auton and Anthony Marcketta 2009
 
Parameters as interpreted:
	--vcf annovar-results/chr21_common_NA18507.vcf
	--out annovar-results/vcftools-nonsynonymous_subset
	--recode
	--bed annovar-results/nonsynonymous_subset.bed
 
After filtering, kept 1 out of 1 Individuals
Outputting VCF file...
	Read 187 BED file entries.
After filtering, kept 186 out of a possible 69845 Sites
Run Time = 0.00 seconds
# count results
grep -v "^#" -c ${outfolder}/vcftools-nonsynonymous_subset.recode.vcf
186

download exercise files

Download exercise files here

Use the right application to open the files present in ex7-files

References:
  1. http://www.openbioinformatics.org/annovar/
  2. Kai Wang, Mingyao Li, Hakon Hakonarson
    ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data.
    Nucleic Acids Res: 2010, 38(16);e164
    [PubMed:20601685] ##WORLDCAT## [DOI] (I p)

  3. http://www.ensembl.org/info/docs/variation/vep/index.html
  4. http://genome.ucsc.edu/cgi-bin/hgVai
  5. http://www.sph.umich.edu/csg/liyanmin/vcfCodingSnps
  6. http://snpeff.sourceforge.net
  7. http://varianttools.sourceforge.net/Annotation/HomePage
  8. http://vat.gersteinlab.org/download.php
  9. http://www.bioconductor.org/packages/2.12/bioc/html/VariantAnnotation.html
  10. http://www.openbioinformatics.org/annovar/annovar_db.html
  11. http://genome.ucsc.edu/cgi-bin/hgTables
  12. http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/

[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.6 | NGS_Exercise.7 | NGS Exercise.7_SnpEff | NGS_Exercise.7_vcfCodingSnps | NGS Exercise.8 ]