NGS Exercise.7 vcfCodingSnps

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.6 | NGS_Exercise.7 | NGS Exercise.7_SnpEff | NGS Exercise.7_annovar | NGS Exercise.8 ]
# updated 2014 version


Add annotations to variant calling format with vcfCodingSnps.v1.5


vcfCodingSnps.v1.5.png

Introduction

vcfCodingSnps.v1.5 was developped to directly annotate VCF data quite similarly to Annovar. A tutorial can be found on the website as well as more info on how to prepare anotation input data. As always, care should be taken to use the same naming system for chromosomes as that used during the mapping ('chr1' or '1' depending on the reference).

command parameters

vcfCodingSnps
 
##################################################################################################
 
vcfCodingSNP 1.5 -- vcf SNP annotating tool
(c) 2010.12 (2012-01) Yanming Li, Goncalo Abecasis
Comments and (or) suggestions are welcome! Please send to liyanmin@umich.edu.
 
##################################################################################################
 
The following parameters are in effect:
 
Availabe Options
    Input Files : --refgenome [referenceGenomes/genome.V36.fa], --snpfile [],
                  --genefile [geneLists/UCSCknownGene.B36.txt], --n1 [8],
                  --n2 [3], --ns [5], --nc [24]
   Output Files : --outfile [vcfCodingSNP.out.vcf], --log
 
## parameters read as follows:
-s VCF file	This option specifies the name of the input VCF-format SNP file 
-r refgenome	This option specifies the name of the imput reference genome FASTA file. 
                It should be of either NCBI release 36/hg18 or GRCH37/hg19 format. 
                By default it will load NCBI36 reference genome. 
                Users can chose to download other versions of reference genome files at the download page 
-o outputfile	Specifies the name of the input gene file, by default use a gene file (UCSCknownGene.B36.txt) 
                generated by UCSC genome browser 
-l logfile	Specifies the name of the log file.
-g genefile	Specifies the name of the output VCF-format SNP file, by default will be named vcfCodingSNP.out.vcf
-o outputfile	Specifies the name of the log file, log file gives more detailed information for each annotated SNP, 
                by default will be named vcfCodingSNP.log
--n1 number 	user defined number of bps into intron for splice site, by default will be set to 8
--n2 number 	user defined number of bps into extron for splice site, by default will be set to 3
--ns number 	user defined number of kbps for the range of upstream or downstream of a gene, 
                by default will be set to 5

manual example illustrating vcfCodingSnps results

Annotation output Interpretation of the output
5'UTR=A26C2[-] the SNP is in the 5'UTR region of gene A26C2 with a minus strand.
INTRONIC=POTEG[-] the SNP is in the intronic region of gene POTEG with a minus strand.
SYNONYMOUS_CODING=BARD1(uc002veu.2):His506His[-] the SNP is synonymous coding at the 506th codon in gene BARD1 with a minus strand and it keeps amino-acid His unchanged.
NON_SYNONYMOUS_CODING=BARD1(uc002veu.2):Arg658Cys[-] the SNP is non_synonymous coding at the 658th codon in gene BARD1 (ucsc gene name uc002veu.2)with a minus strand and it changes amino-acid from Arg to Cys.
SPLICE_SITE=FARP2(uc002wbi.1)[+] the SNP is in the SPLICE_SITE (5 bp within exon start or end positions in the coding region) of gene FARP2 (ucsc gene name uc002wbi.1) with a plus strand.
STOP_GAINED=C2orf83(uc002vph.1):Trp141stop[-] the SNP is the 141th codon in gene MAPK12 (ucsc gene name uc002vph.1) with a minus strand and it changes amino-acid Trp to a stop codon.
STOP_LOST=OR2M3(uc001ieb.1):stop313Arg[+] the SNP is the 313th codon in gene OR2M3 (ucsc gene name uc001ieb.1) with a plus strand and it changes a stop codon to amino-acid Arg.

Example results and meaning

vcfCodingSnps_example.png
Pos Alt SNP Ref SNP Alt SNP Codon Ref SNP Codon Alt SNP AA Ref SNP AA Anno Type
3 G A . . . . 5'UTR
5 A C CAT CCT His Pro Non_Synonymous
13 G T CCG CCT Pro Pro Synonymous
25 C A TAC TAA Tyr Stop Stop Loss
43 A C . . . . Splice Site
44 G T . . . . Intronic
50 C A . . . . Essential Splice Site
53 A C ACC CCC Thr Pro Non_Synonymous
66 C T . . . . 3'UTR
72 A C . . . . Downstram

full list of annotation types obtained with vcfCodingSnps

Category Definition used in vcfCoding Snps
stop gained a SNP in coding sequence and introducing a TAG, TAA, or TGA stop codon
stop lost a SNP in coding sequence and causing a loss of a TAG, TAA, or TGA stop codon
non- synonymous coding a SNP in coding sequence, located in a codon resulting in a change of amino acid, excluding SNPs that can be defined as either stop gained or stop lost
synonymous coding a SNP in coding sequence, located in a codon that not resulting in a change of amino acid
essential splice site a SNP changing the highly conserved GU in the first two basepairs of the intron or (AG) in the last two basepair of the intron
splice site a SNP occurring in 3 - N1 basepairs into the intron, or N2 basepairs into the exon . N1 by default is 8, N2 by default is 3. N1 and N2 can be defined by user through option --n1 --n2.
5' UTR a SNP located within the 5' UTR of a transcript
3' UTR a SNP located within the 3' UTR of a transcript
intronic a SNP in the intron of a known gene, and cannot be defined as essential splice site or splice site
upstream A SNP located within N kb from the transcript start site (5'-end) of a known gene, N by default is 5 and can be defined by user through option --ns
downstream A SNP located within N kb from the transcript end site (3'-end) of a known gene, N by default is 5 and can be defined by user through option --ns
introgenic A SNP not located within a known gene and also not identified as upstream or downstream of a knowngene

Annotate the VCF file obtained by intersecting 'aln' and 'mem' calls

generate a genepred table for refseq transcripts

vcfCodingSnps needs a specific format (genepred) for the gene models. This format is close to what the UCSC table provides and can be obtained as described next.

outfolder="vcfCodingSnps-results"
mkdir -p ${outfolder}
 
# Go to http://genome.ucsc.edu/ 
#    >>> Click "table" 
#    >>> Specify the fields required (clade: mammal, genome:human etc.) 
#    >>> In "track" (popup), choose ''RefSeq Genes'' 
#    >>> In tables select "refGene" 
#    >>> for the ''output format'', select fields shown in the hidden content below
#    >>> get output gene file (hg19_refGene.genepred.txt) and move it from ''Downloads'' to ${outfolder}

UCSC table screen-shots

Reproduce these settings for the refGene track
  • UCSC table selection for the refGene track
refgene_selection.png
  • Selection of fields to export to get a genepred output
genepred_fields.png

Hit 'get output' do start downloading

You should get output like shown next with the top two records
name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds name2
NM_032291 chr1 + 66999824 67210768 67000041 67208778 25 66999824,67091529,67098752,67101626,67105459,67108492,67109226,67126195,67133212,67136677,67137626,67138963,67142686,67145360,67147551,67154830,67155872,67161116,67184976,67194946,67199430,67205017,67206340,67206954,67208755, 67000051,67091593,67098777,67101698,67105516,67108547,67109402,67126207,67133224,67136702,67137678,67139049,67142779,67145435,67148052,67154958,67155999,67161176,67185088,67195102,67199563,67205220,67206405,67207119,67210768, SGIP1
NM_001080397 chr1 + 8378144 8404227 8378168 8404073 9 8378144,8384365,8385357,8385877,8390268,8395496,8397875,8399552,8403806, 8378246,8384786,8385450,8386102,8390996,8395650,8398052,8399758,8404227, SLC45A1 NM_032785 chr1 - 48998526 50489626 48999844 50489468 14 48998526,49000561,49005313,49052675,49056504,49100164,49119008,49128823,49332862,49511255,49711441,50162984,50317067,50489434, 48999965,49000588,49005410,49052838,49056657,49100276,49119123,49128913,49332902,49511472,49711536,50163109,50317190,50489626, AGBL4

...

Annotate the VCF data

infolder="Final_Results/vcf-venn/gold-set"
outfolder="vcfCodingSnps-results"
mkdir -p ${ourfolder}
 
infile="chr21_common_NA18507.vcf"
outfile="annotated_chr21_common_NA18507.vcf"
 
# annotation files
refgenome=ref/HiSeq_UCSC_hg19.fa
genepred=$BIOTOOLS/vcfCodingSnps/geneLists/refGene.B37.txt
 
# default parameters are:
#  --n1 [8], --n2 [3], --ns [5], --nc [24]
 
## option 1)
# we use here the long names for parameters, see above for the short names
# vcfCodingSnps does not read from bgzipped files
# decompress (-d) the VCF data but keep the original archive (-c)
bgzip -cd ${infolder}/${infile}.gz > ${outfolder}/${infile}
 
vcfCodingSnps --refgenome ${refgenome}  \
        --snpfile ${outfolder}/${infile} \
        --genefile ${genepred} \
        --outfile ${outfolder}/${outfile} \
        --log \
        2 > "${outfolder}/vcfCodingSnps_${outfile}.log"
cd
## option 2)
# we can also decompress on the fly with a redirection and short names
vcfCodingSnps -r ${refgenome}  \
        -s <(bgzip -cd ${infolder}/${infile}.gz) \
        -g ${genepred} \
        -o ${outfolder}/${outfile} \
        -l "${outfolder}/vcfCodingSnps_${outfile}.log"

bgzip command datails

Usage:   bgzip [options] [file] ...
 
Options: -c      write on standard output, keep original files unchanged
         -d      decompress
         -f      overwrite files without asking
         -b INT  decompress at virtual file pointer INT
         -s INT  decompress INT bytes in the uncompressed file
         -h      give this help

terminal output

##################################################################################################
 
vcfCodingSNP 1.5 -- vcf SNP annotating tool
(c) 2010.12 Yanming Li, Goncalo Abecasis
Commend and (or) suggestions are welcome! Please send to liyanmin@umich.edu.
 
##################################################################################################
 
The following parameters are in effect:
 
Availabe Options
    Input Files : --refgenome [ref/HiSeq_UCSC_hg19.fa], --snpfile [/dev/fd/63],
                  --genefile [/opt/biotools/vcfCodingSnps/geneLists/refGene.B37.txt],
                  --n1 [8], --n2 [3], --ns [5]
   Output Files : --outfile [vcfCodingSnps-results/annotated_chr21_common_NA18507.vcf],
                  --log [ON]
 
Reading chromosome >chrM...
Reading chromosome >chr1...
Reading chromosome >chr2...
Reading chromosome >chr3...
Reading chromosome >chr4...
Reading chromosome >chr5...
Reading chromosome >chr6...
Reading chromosome >chr7...
Reading chromosome >chr8...
Reading chromosome >chr9...
Reading chromosome >chr10...
Reading chromosome >chr11...
Reading chromosome >chr12...
Reading chromosome >chr13...
Reading chromosome >chr14...
Reading chromosome >chr15...
Reading chromosome >chr16...
Reading chromosome >chr17...
Reading chromosome >chr18...
Reading chromosome >chr19...
Reading chromosome >chr20...
Reading chromosome >chr21...
Reading chromosome >chr22...
Reading chromosome >chrX...
Reading chromosome >chrY...
start mapping
mapping snp file... ...DONE! snp mapsize = 69969
mapping gene file... ...DONE! gene mapsize = 37973
start annotating... ...
Error: The total number of bases between coding start and coding end are not right! genename=NM_182484 and base # btw cdstart and cdend=119
Error: The total number of bases between coding start and coding end are not right! genename=NM_001187 and base # btw cdstart and cdend=119
Error: The total number of bases between coding start and coding end are not right! genename=NM_181606 and base # btw cdstart and cdend=265
Error: The total number of bases between coding start and coding end are not right! genename=NM_181606 and base # btw cdstart and cdend=265
Error: The total number of bases between coding start and coding end are not right! genename=NM_181606 and base # btw cdstart and cdend=265
Error: The total number of bases between coding start and coding end are not right! genename=NM_181606 and base # btw cdstart and cdend=265
Error: The total number of bases between coding start and coding end are not right! genename=NM_181606 and base # btw cdstart and cdend=265
DONE! Complete annotating!!!

Inspect the results and isolate interesting calls

Looking for 'stop gain' calls (STOP_GAINED) in the VCF file is easy with a grep command

grep -w "STOP_GAINED" ${outfolder}/${outfile} | column -t
chr21  30339120  .  C   A  224  .  SF=0,1,2;STOP_GAINED=LTN1(NM_015565):Arg611stop[-] \
          GT:PL:GQ  0/1:254,0,255:99
chr21  31913981  .  AG  A  214  .  SF=0,1,2;STOP_GAINED=KRTAP19-6(NM_181612):Glu58stop[-] \
          GT:PL:GQ  1/1:255,135,0:99
chr21  40584598  .  C   T  225  .  SF=0,1,2;STOP_GAINED=BRWD1(NM_018963):Tyr1298stop[-];STOP_GAINED=BRWD1(NM_033656):Tyr1298stop[-] \
          GT:PL:GQ  0/1:255,0,199:99
chr21  45656920  .  C   T  223  .  SF=0,1,2;STOP_GAINED=ICOSLG(NM_015259):Leu79stop[-] \
          GT:PL:GQ  0/1:253,0,255:99
chr21  45993819  .  C   T  27   .  SF=0,1,2;INTRONIC=C21orf29(NM_144991)[-];STOP_GAINED=KRTAP10-4(NM_198687):Gln62stop[+] \
          GT:PL:GQ  0/1:57,0,70:60
Error creating thumbnail: Unable to save thumbnail to destination
remember that the same variant may have different effects depending on the splice isoform of a given gene.

download exercise files

Download exercise files here

Use the right application to open the files present in ex7-files

References:

[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.6 | NGS_Exercise.7 | NGS Exercise.7_SnpEff | NGS Exercise.7_annovar | NGS Exercise.8 ]