NGS-Var Exercise.7
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.6 ]
Review variants and annotations in IGV
Contents
Genome visualization tools
What tool should you use?
Many popular visualization tools are available that present data in ways it can be better evaluated. We list below a coupe of open-source free products, many other free or commercial tools exist but listing them all is not the focus of this document.
- Circos is a very popular graphic package to display circular plots with annotations . Please see one simple Circos example NGS_Exercise.4
- A recent paper describes the Personal Genome Browser (PGB) designed to visualize human variants (http://www.pgbrowser.org/). As NA18507 is part of the available genomes, you can use this resource to confirm variants found during this session and add our VCF data to the browser. You may also want to read the accompanying paper [1] that lists many more tools that can be used in the same context.
- Other related tools are cited in the PGB publication cited above or in A survey of tools for variant analysis of next-generation genome sequencing data [2]
IGV: The Interactive Genome Viewer
Some users may prefer other gene browsers and are welcome to keep using them but we chose IGV as favorite genome viewer because it is well documented, already contains links to precious public datasets including Encode and UCSC tacks and because it is easy to deploy, fast and robust.
- IGV supports a large number of standard formats that can be found on their information pages (<http://www.broadinstitute.org/igv/FileFormats>).
- Loading data into IGV is simple and often makes use of the IGVTools; also available from the Broad institute. IGV tools are efficient when tiling or indexing is required to speed-up data parsing (large bed files). Other lighter formats can be opened as such without performance drop.
- IGV can also be used in combination with bedtools to export screen-shots in batch for loci listed in a bed file. To discover this command, please type *bedtools igv -h* (previously *bedToIgv*) in your terminal window for more information about this feature.
IGV will be used during the session and is not further documented here, please refer to the very extensive online help content on the 'omnipresent' Broad Institute site for more information. (<http://www.broadinstitute.org/igv>).
igvtools from the command line
Besides the visualization Java application, you will also need to prepare data obtained from third-party. A Java GUI IGVTools application is provided from within IGV but is not always adapted to your needs. ‘’’igvtools’’’ can also be used at command line to compress, index or count from data.
igvtools command details
Program: igvtools. IGV Version 2.3.68 (81)01/13/2016 05:05 PM Usage: igvtools [command] [options] [input file/dir] [other arguments] Command: version print the version number sort sort an alignment file by start position. index index an alignment file toTDF convert an input file (cn, gct, wig) to tiled data format (tdf) count compute coverage density for an alignment file formatexp center, scale, and log2 normalize an expression file gui Start the gui help <command> display this help message, or help on a specific command See http://www.broadinstitute.org/software/igv/igvtools_commandline for more detailed help
example: from our mapping data zoom into the first exon of DSCAM
We arbitrarily choose a position in chr21 (copy paste the following address in IGV chr21:42,218,289-42,219,325) and load the mapping data (shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam) from file to view reads mapped to this region.
basic read mapping view in DSCAM1
Now add the variant calls obtained with bcftools from the file (chr21_NA18507_var_bcftools.flt-D1000.vcf.gz)
read mapping view with VCF data
example: add public data to explain and support private results
The example above with BWA mem shows a region where no reads mapped to the first exon of DSCAM and should be explainable using genome features in that region. Two common reasons for the absence of mapping are the presence of repeated elements and a very high level of GC in a locus. To support these, public tracks can be added from the built-in resources accessible through >File >Load from Server
adding public annotation stacks
Rearrange the mapping track as collapsed to save some space and move them down in the IGV window. It is clear that both GC% and simple repeats of the (CCG)n and GC_rich are present in the gap and are probably responsible for its existence.
result of adding public tracks
Now remove repeat tracks and add instead variation tracks from the server, you will find that our SNP call is known in dbSNP as rs9975082.
result of adding public variants
Extract parts of data files using samtools and bedtools
In this exercise, we will prepare samples from whole genome data files that can be viewed or used in many situations. Many more tools are available and we strongly encourage you to discover them. Have a look to the great online Bedtools2 documentation here[3]
example: samtools faidx to fetch a region of the reference genome
This operation is often required to get the context of a variant and possibly align it to other sequences for control. The samtools faidx indexing of the reference genome allows very rapid retrieval of sequence using the same command with start and stop coordinates. An example below extracts 50bps of reference sequence either sides of the variant cited above.
IGV view of a hom variant in the first DSCAM exon
- a track for bcftools calls
- the track showing all reads mapped in the BWA mem experiment
- the read coverage depth derived by IGV from the BAM track and showing no coverage around the Start codon (while Complete Genomics reads did show significant coverage in the same window - see above)
# requires indexing the reference file (done) # => samtools faidx ref/HiSeq_UCSC_hg19.fa # the following command adds 50bps at either ends of the DSCAM variant position: chr21: 42218928 reference=ref/HiSeq_UCSC_hg19.fa target="chr21:42218878-42218978" samtools faidx ${reference} ${target} >chr21:42218878-42218978 TCGGCAGGTGGAGAGAGCCGCAGACGCGGGCGTGAGCTCATCCCGGGCACTCGGCGCCCA GAATGCGCAGCCAGCGGCCTCCCGTGCGCCAGCCCTCACCC
IGV view around a variant
- translation of the currently viewed 5'-UTR (not relevnant! => alway inspect the type of exon displayed and its orientation)
- the read coverage depth derived by IGV from the BAM track
- the variant metrics extracted from the BWA-mem VCF track (mouse over the red histogram bar in the vcf-track to get this popup window)
example: bedtools getfasta to fetch multiple regions of the reference genome
What if you need many regions, obtained from bedtools manipulations or other sources? bedtools getfasta will store the hg19 DNA sequence of each region defined above into a multifasta file
getfasta command details
Tool: bedtools getfasta (aka fastaFromBed) Version: v2.25.0 Summary: Extract DNA sequences into a fasta file based on feature coordinates. Usage: bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf> -fo <fasta> Options: -fi Input FASTA file -bed BED/GFF/VCF file of ranges to extract from -fi -fo Output file (can be FASTA or TAB-delimited) -name Use the name field for the FASTA header -split given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons) -tab Write output in TAB delimited format. - Default is FASTA format. -s Force strandedness. If the feature occupies the antisense, strand, the sequence will be reverse complemented. - By default, strand information is ignored. -fullHeader Use full fasta header. - By default, only the word before the first space or tab is used.
# download only coding exons for hg19 chr21 using the UCSC table browser in BED format infolder=ref infile=hg19_chr21-refGene-CDS.bed head -5 ${infolder}/${infile} chr21 10906904 10907040 NM_199259_cds_0_0_chr21_10906905_r 0 - chr21 10908824 10908895 NM_199259_cds_1_0_chr21_10908825_r 0 - chr21 10910306 10910399 NM_199259_cds_2_0_chr21_10910307_r 0 - chr21 10914362 10914442 NM_199259_cds_3_0_chr21_10914363_r 0 - chr21 10916369 10916475 NM_199259_cds_4_0_chr21_10916370_r 0 - # the result is a redundant bed where alternative gene models share common exons # sort the lines by chr+coord sort -k 1V,1 -k 2n,2 -k 3n,3 ${infolder}/${infile} > ${infolder}/srt_${infile} head -5 ${infolder}/srt_${infile} chr21 10906904 10907040 NM_199259_cds_0_0_chr21_10906905_r 0 - chr21 10906904 10907040 NM_199260_cds_0_0_chr21_10906905_r 0 - chr21 10906904 10907040 NM_199261_cds_0_0_chr21_10906905_r 0 - chr21 10908824 10908895 NM_199259_cds_1_0_chr21_10908825_r 0 - chr21 10908824 10908895 NM_199260_cds_1_0_chr21_10908825_r 0 -
example: bedtools merge to collapse redundant annotations
bedtools merge will collapse annotations based on a field
bedtools merge command details
Tool: bedtools merge (aka mergeBed) Version: v2.25.0 Summary: Merges overlapping BED/GFF/VCF entries into a single interval. Usage: bedtools merge [OPTIONS] -i <bed/gff/vcf> Options: -s Force strandedness. That is, only merge features that are on the same strand. - By default, merging is done without respect to strand. -S Force merge for one specific strand only. Follow with + or - to force merge from only the forward or reverse strand, respectively. - By default, merging is done without respect to strand. -d Maximum distance between features allowed for features to be merged. - Def. 0. That is, overlapping & book-ended features are merged. - (INTEGER) - Note: negative values enforce the number of b.p. required for overlap. -c Specify columns from the B file to map onto intervals in A. Default: 5. Multiple columns can be specified in a comma-delimited list. -o Specify the operation that should be applied to -c. Valid operations: sum, min, max, absmin, absmax, mean, median, collapse (i.e., print a delimited list (duplicates allowed)), distinct (i.e., print a delimited list (NO duplicates allowed)), distinct_sort_num (as distinct, sorted numerically, ascending), distinct_sort_num_desc (as distinct, sorted numerically, desscending), distinct_only (delimited list of only unique values), count count_distinct (i.e., a count of the unique values in the column), first (i.e., just the first value in the column), last (i.e., just the last value in the column), Default: sum Multiple operations can be specified in a comma-delimited list. If there is only column, but multiple operations, all operations will be applied on that column. Likewise, if there is only one operation, but multiple columns, that operation will be applied to all columns. Otherwise, the number of columns must match the the number of operations, and will be applied in respective order. E.g., "-c 5,4,6 -o sum,mean,count" will give the sum of column 5, the mean of column 4, and the count of column 6. The order of output columns will match the ordering given in the command. -delim Specify a custom delimiter for the collapse operations. - Example: -delim "|" - Default: ",". -prec Sets the decimal precision for output (Default: 5) -bed If using BAM input, write output as BED. -header Print the header from the A file prior to results. -nobuf Disable buffered output. Using this option will cause each line of output to be printed as it is generated, rather than saved in a buffer. This will make printing large output files noticeably slower, but can be useful in conjunction with other software tools and scripts that need to process one line of bedtools output at a time. -iobuf Specify amount of memory to use for input buffer. Takes an integer argument. Optional suffixes K/M/G supported. Note: currently has no effect with compressed files. Notes: (1) The input file (-i) file must be sorted by chrom, then start.
# merge overlapping exons and keep alternative names separated by '|' bedtools merge -c 4 -o distinct_only -delim "|" -i ${infolder}/srt_${infile} > ${infolder}/merged_${infile} head -5 ${infolder}/merged_${infile} | sed -e "s/.\{60\}/&\n/g" chr21 10906904 10907040 NM_199259_cds_0_0_chr21_10906905_r|N M_199260_cds_0_0_chr21_10906905_r|NM_199261_cds_0_0_chr21_10 906905_r chr21 10908824 10908895 NM_199259_cds_1_0_chr21_10908825_r|N M_199260_cds_1_0_chr21_10908825_r|NM_199261_cds_1_0_chr21_10 908825_r chr21 10910306 10910399 NM_199259_cds_2_0_chr21_10910307_r|N M_199260_cds_2_0_chr21_10910307_r|NM_199261_cds_2_0_chr21_10 910307_r chr21 10914362 10914442 NM_199259_cds_3_0_chr21_10914363_r|N M_199260_cds_3_0_chr21_10914363_r|NM_199261_cds_3_0_chr21_10 914363_r chr21 10916369 10916475 NM_199259_cds_4_0_chr21_10916370_r|N M_199260_cds_4_0_chr21_10916370_r|NM_199261_cds_4_0_chr21_10 916370_r # extract fasta for each line infasta=ref/HiSeq_UCSC_hg19.fa outfasta=ref/merged_${infile%%.bed}.fasta bedtools getfasta -fi ${infasta} -bed ${infolder}/merged_${infile} -fo ${outfasta} # inspect the first three sequences # !GOODY: the last part of the pipe breaks the sequence at 60 characters to display it properly head -6 ${outfasta} | sed -e "s/.\{60\}/&\n/g" >chr21:10906904-10907040 TTAATCGGATCCAGCTACAACATCACTGGAAGTCATTTTCTCGCCAAAAAGTATCTCCAC GGCAAAATCTGATGGATAAATTCTCCGTGCTTTTTGTTTATGTAGATTATCCAATTCATT TTTTGGTAGATAAAGC >chr21:10908824-10908895 CTGTTATTTTCAATAAAAGATGTGTGCAACCAGAAGTAAAATGAGCAATTGTCATAGTAT GTAGGAAGATT >chr21:10910306-10910399 CGAATAGAAAAACTGCACTTTCACATCATCATACAGAGGTAGACCGTCGAATACATCAAT TAATATTTTGTCTGTTGTAATGTTATCAAGTAC # OR extract fasta for each line and use name field for the fasta header (-name) infasta=ref/HiSeq_UCSC_hg19.fa outfasta=ref/merged_name_${infile%%.bed}.fasta bedtools getfasta -name -fi ${infasta} -bed ${infolder}/merged_${infile} -fo ${outfasta} # inspect the first three sequences # !GOODY: the last part of the pipe breaks the sequence at 60 characters to display it properly head -6 ${outfasta} | sed -e "s/.\{60\}/&\n/g" >NM_199259_cds_0_0_chr21_10906905_r|NM_199260_cds_0_0_chr21_ 10906905_r|NM_199261_cds_0_0_chr21_10906905_r TTAATCGGATCCAGCTACAACATCACTGGAAGTCATTTTCTCGCCAAAAAGTATCTCCAC GGCAAAATCTGATGGATAAATTCTCCGTGCTTTTTGTTTATGTAGATTATCCAATTCATT TTTTGGTAGATAAAGC >NM_199259_cds_1_0_chr21_10908825_r|NM_199260_cds_1_0_chr21_ 10908825_r|NM_199261_cds_1_0_chr21_10908825_r CTGTTATTTTCAATAAAAGATGTGTGCAACCAGAAGTAAAATGAGCAATTGTCATAGTAT GTAGGAAGATT >NM_199259_cds_2_0_chr21_10910307_r|NM_199260_cds_2_0_chr21_ 10910307_r|NM_199261_cds_2_0_chr21_10910307_r CGAATAGAAAAACTGCACTTTCACATCATCATACAGAGGTAGACCGTCGAATACATCAAT TAATATTTTGTCTGTTGTAATGTTATCAAGTAC
bedtools igv to collect screenshots for a list of genomic loci
bedtools igv will create screenshots automatically from a list of loci from a BED file. This can be very useful if you want to document all variants from a list or all genes from a pathway using IGV views.
bedtools igv command details
Tool: bedtools igv (aka bedToIgv) Version: v2.25.0 Summary: Creates a batch script to create IGV images at each interval defined in a BED/GFF/VCF file. Usage: bedtools igv [OPTIONS] -i <bed/gff/vcf> Options: -path The full path to which the IGV snapshots should be written. (STRING) Default: ./ -sess The full path to an existing IGV session file to be loaded prior to taking snapshots. (STRING) Default is for no session to be loaded. -sort The type of BAM sorting you would like to apply to each image. Options: base, position, strand, quality, sample, and readGroup Default is to apply no sorting at all. -clps Collapse the aligned reads prior to taking a snapshot. Default is to no collapse. -name Use the "name" field (column 4) for each image's filename. Default is to use the "chr:start-pos.ext". -slop Number of flanking base pairs on the left & right of the image. - (INT) Default = 0. -img The type of image to be created. Options: png, eps, svg Default is png. Notes: (1) The resulting script is meant to be run from within IGV. (2) Unless you use the -sess option, it is assumed that prior to running the script, you've loaded the proper genome and tracks.
# instead of reading from a bed file we fake a BED line with a redirect '<()' and the echo function # (note the -e to translate the TABs required for bedtools) outfolder=$BASE/IGV_visualisation bedtools igv -i <(echo -e "chr21\t42218872\t42218983") -img png -path ${outfolder} > ${outfolder}/igv_cmd.sh # the file contains the following text snapshotDirectory $BASE/IGV_visualisation goto chr21_42218872_42218983 snapshot chr21_42218872_42218983.png # then in IGV, run the bash script from the tools menu
the chr21_42218872_42218983.png snapshot
download exercise files
Download exercise files here
References:
- ↑
Liran Juan, Mingxiang Teng, Tianyi Zang, Yafeng Hao, Zhenxing Wang, Chengwu Yan, Yongzhuang Liu, Jie Li, Tianjiao Zhang, Yadong Wang
The personal genome browser: visualizing functions of genetic variants.
Nucleic Acids Res: 2014, 42(Web Server issue);W192-7
[PubMed:24799434] ##WORLDCAT## [DOI] (I p) - ↑
Stephan Pabinger, Andreas Dander, Maria Fischer, Rene Snajder, Michael Sperk, Mirjana Efremova, Birgit Krabichler, Michael R Speicher, Johannes Zschocke, Zlatko Trajanoski
A survey of tools for variant analysis of next-generation genome sequencing data.
Brief Bioinform: 2014, 15(2);256-78
[PubMed:23341494] ##WORLDCAT## [DOI] (I p) - ↑ http://bedtools.readthedocs.org/en/latest/content/bedtools-suite.html
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.6 ]