NGS-Var Exercise.7

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.6 ]


Review variants and annotations in IGV


HUNK_no-cvg.png

Genome visualization tools

Error creating thumbnail: Unable to save thumbnail to destination
Due to the intrinsic noise in NGS data, Visualisation is vital in order to evaluate results and generate hypothesis prior to final validation. Visualisation also allows presenting the obtained data side by side with pieces of evidence originating from the zillion resources out-there.

What tool should you use?

Many popular visualization tools are available that present data in ways it can be better evaluated. We list below a coupe of open-source free products, many other free or commercial tools exist but listing them all is not the focus of this document.

  • Circos is a very popular graphic package to display circular plots with annotations . Please see one simple Circos example NGS_Exercise.4
  • A recent paper describes the Personal Genome Browser (PGB) designed to visualize human variants (http://www.pgbrowser.org/). As NA18507 is part of the available genomes, you can use this resource to confirm variants found during this session and add our VCF data to the browser. You may also want to read the accompanying paper [1] that lists many more tools that can be used in the same context.
  • Other related tools are cited in the PGB publication cited above or in A survey of tools for variant analysis of next-generation genome sequencing data [2]

IGV: The Interactive Genome Viewer

Some users may prefer other gene browsers and are welcome to keep using them but we chose IGV as favorite genome viewer because it is well documented, already contains links to precious public datasets including Encode and UCSC tacks and because it is easy to deploy, fast and robust.

  • IGV supports a large number of standard formats that can be found on their information pages (<http://www.broadinstitute.org/igv/FileFormats>).
  • Loading data into IGV is simple and often makes use of the IGVTools; also available from the Broad institute. IGV tools are efficient when tiling or indexing is required to speed-up data parsing (large bed files). Other lighter formats can be opened as such without performance drop.
  • IGV can also be used in combination with bedtools to export screen-shots in batch for loci listed in a bed file. To discover this command, please type *bedtools igv -h* (previously *bedToIgv*) in your terminal window for more information about this feature.

IGV will be used during the session and is not further documented here, please refer to the very extensive online help content on the 'omnipresent' Broad Institute site for more information. (<http://www.broadinstitute.org/igv>).

igvtools from the command line

Besides the visualization Java application, you will also need to prepare data obtained from third-party. A Java GUI IGVTools application is provided from within IGV but is not always adapted to your needs. ‘’’igvtools’’’ can also be used at command line to compress, index or count from data.

igvtools command details

Program: igvtools. IGV Version 2.3.68 (81)01/13/2016 05:05 PM
 
Usage: igvtools [command] [options] [input file/dir] [other arguments]
 
Command: version print the version number
	 sort    sort an alignment file by start position. 
	 index   index an alignment file
	 toTDF    convert an input file (cn, gct, wig) to tiled data format (tdf)
	 count   compute coverage density for an alignment file
	 formatexp  center, scale, and log2 normalize an expression file
	 gui      Start the gui
	 help <command>     display this help message, or help on a specific command
	 See http://www.broadinstitute.org/software/igv/igvtools_commandline for more detailed help

example: from our mapping data zoom into the first exon of DSCAM

We arbitrarily choose a position in chr21 (copy paste the following address in IGV chr21:42,218,289-42,219,325) and load the mapping data (shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam) from file to view reads mapped to this region.

basic read mapping view in DSCAM1

DSCAM_UTR.png

  Now add the variant calls obtained with bcftools from the file (chr21_NA18507_var_bcftools.flt-D1000.vcf.gz)

read mapping view with VCF data

DSCAM_UTR-variant.png

example: add public data to explain and support private results

The example above with BWA mem shows a region where no reads mapped to the first exon of DSCAM and should be explainable using genome features in that region. Two common reasons for the absence of mapping are the presence of repeated elements and a very high level of GC in a locus. To support these, public tracks can be added from the built-in resources accessible through >File >Load from Server

adding public annotation stacks

many tracks are available in the list, we select here only those related to repeats, structural variants, GC content
annotation_tracks.png

  Rearrange the mapping track as collapsed to save some space and move them down in the IGV window. It is clear that both GC% and simple repeats of the (CCG)n and GC_rich are present in the gap and are probably responsible for its existence.

result of adding public tracks

DSCAM_UTR-annotated.png

  Now remove repeat tracks and add instead variation tracks from the server, you will find that our SNP call is known in dbSNP as rs9975082.

result of adding public variants

DSCAM_UTR-public-variants.png

 

Extract parts of data files using samtools and bedtools

In this exercise, we will prepare samples from whole genome data files that can be viewed or used in many situations. Many more tools are available and we strongly encourage you to discover them. Have a look to the great online Bedtools2 documentation here[3]

example: samtools faidx to fetch a region of the reference genome

Error creating thumbnail: Unable to save thumbnail to destination
samtools faidx should have been applied to the reference genome fasta file in order to use this functionality

This operation is often required to get the context of a variant and possibly align it to other sequences for control. The samtools faidx indexing of the reference genome allows very rapid retrieval of sequence using the same command with start and stop coordinates. An example below extracts 50bps of reference sequence either sides of the variant cited above.

IGV view of a hom variant in the first DSCAM exon

In addition to the reference genome and refseq transcripts, this view shows:
  • a track for bcftools calls
  • the track showing all reads mapped in the BWA mem experiment
  • the read coverage depth derived by IGV from the BAM track and showing no coverage around the Start codon (while Complete Genomics reads did show significant coverage in the same window - see above)
DSCAM_UTR-variant.png
# requires indexing the reference file (done)
# => samtools faidx ref/HiSeq_UCSC_hg19.fa
 
# the following command adds 50bps at either ends of the DSCAM variant position: chr21: 42218928
reference=ref/HiSeq_UCSC_hg19.fa
target="chr21:42218878-42218978"
samtools faidx ${reference} ${target}
 
>chr21:42218878-42218978
TCGGCAGGTGGAGAGAGCCGCAGACGCGGGCGTGAGCTCATCCCGGGCACTCGGCGCCCA
GAATGCGCAGCCAGCGGCCTCCCGTGCGCCAGCCCTCACCC

IGV view around a variant

In addition to the sequence view from the IGV RefSeq track, users can see here:
  • translation of the currently viewed 5'-UTR (not relevnant! => alway inspect the type of exon displayed and its orientation)
  • the read coverage depth derived by IGV from the BAM track
  • the variant metrics extracted from the BWA-mem VCF track (mouse over the red histogram bar in the vcf-track to get this popup window)
DSCAM_variant_100bps.png

example: bedtools getfasta to fetch multiple regions of the reference genome

What if you need many regions, obtained from bedtools manipulations or other sources? bedtools getfasta will store the hg19 DNA sequence of each region defined above into a multifasta file

getfasta-glyph.png

getfasta command details

Tool:    bedtools getfasta (aka fastaFromBed)
Version: v2.25.0
Summary: Extract DNA sequences into a fasta file based on feature coordinates.
 
Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf> -fo <fasta> 
 
Options: 
        -fi     Input FASTA file
        -bed    BED/GFF/VCF file of ranges to extract from -fi
        -fo     Output file (can be FASTA or TAB-delimited)
        -name   Use the name field for the FASTA header
        -split  given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons)
        -tab    Write output in TAB delimited format.
                - Default is FASTA format.
 
        -s      Force strandedness. If the feature occupies the antisense,
                strand, the sequence will be reverse complemented.
                - By default, strand information is ignored.
 
        -fullHeader     Use full fasta header.
                - By default, only the word before the first space or tab is used.
# download only coding exons for hg19 chr21 using the UCSC table browser in BED format
infolder=ref
infile=hg19_chr21-refGene-CDS.bed
head -5 ${infolder}/${infile}
 
chr21	10906904	10907040	NM_199259_cds_0_0_chr21_10906905_r	0	-
chr21	10908824	10908895	NM_199259_cds_1_0_chr21_10908825_r	0	-
chr21	10910306	10910399	NM_199259_cds_2_0_chr21_10910307_r	0	-
chr21	10914362	10914442	NM_199259_cds_3_0_chr21_10914363_r	0	-
chr21	10916369	10916475	NM_199259_cds_4_0_chr21_10916370_r	0	-
 
# the result is a redundant bed where alternative gene models share common exons
# sort the lines by chr+coord
sort -k 1V,1 -k 2n,2 -k 3n,3 ${infolder}/${infile} > ${infolder}/srt_${infile}
head -5 ${infolder}/srt_${infile}
 
chr21	10906904	10907040	NM_199259_cds_0_0_chr21_10906905_r	0	-
chr21	10906904	10907040	NM_199260_cds_0_0_chr21_10906905_r	0	-
chr21	10906904	10907040	NM_199261_cds_0_0_chr21_10906905_r	0	-
chr21	10908824	10908895	NM_199259_cds_1_0_chr21_10908825_r	0	-
chr21	10908824	10908895	NM_199260_cds_1_0_chr21_10908825_r	0	-

example: bedtools merge to collapse redundant annotations

bedtools merge will collapse annotations based on a field

merge-glyph.png

bedtools merge command details

Tool:    bedtools merge (aka mergeBed)
Version: v2.25.0
Summary: Merges overlapping BED/GFF/VCF entries into a single interval.
 
Usage:   bedtools merge [OPTIONS] -i <bed/gff/vcf>
 
Options: 
        -s      Force strandedness.  That is, only merge features
                that are on the same strand.
                - By default, merging is done without respect to strand.
 
        -S      Force merge for one specific strand only.
                Follow with + or - to force merge from only
                the forward or reverse strand, respectively.
                - By default, merging is done without respect to strand.
 
        -d      Maximum distance between features allowed for features
                to be merged.
                - Def. 0. That is, overlapping & book-ended features are merged.
                - (INTEGER)
                - Note: negative values enforce the number of b.p. required for overlap.
 
        -c      Specify columns from the B file to map onto intervals in A.
                Default: 5.
                Multiple columns can be specified in a comma-delimited list.
 
        -o      Specify the operation that should be applied to -c.
                Valid operations:
                    sum, min, max, absmin, absmax,
                    mean, median,
                    collapse (i.e., print a delimited list (duplicates allowed)), 
                    distinct (i.e., print a delimited list (NO duplicates allowed)), 
                    distinct_sort_num (as distinct, sorted numerically, ascending),
                    distinct_sort_num_desc (as distinct, sorted numerically, desscending),
                    distinct_only (delimited list of only unique values),
                    count
                    count_distinct (i.e., a count of the unique values in the column), 
                    first (i.e., just the first value in the column), 
                    last (i.e., just the last value in the column), 
                Default: sum
                Multiple operations can be specified in a comma-delimited list.
 
                If there is only column, but multiple operations, all operations will be
                applied on that column. Likewise, if there is only one operation, but
                multiple columns, that operation will be applied to all columns.
                Otherwise, the number of columns must match the the number of operations,
                and will be applied in respective order.
                E.g., "-c 5,4,6 -o sum,mean,count" will give the sum of column 5,
                the mean of column 4, and the count of column 6.
                The order of output columns will match the ordering given in the command.
 
 
        -delim  Specify a custom delimiter for the collapse operations.
                - Example: -delim "|"
                - Default: ",".
 
        -prec   Sets the decimal precision for output (Default: 5)
 
        -bed    If using BAM input, write output as BED.
 
        -header Print the header from the A file prior to results.
 
        -nobuf  Disable buffered output. Using this option will cause each line
                of output to be printed as it is generated, rather than saved
                in a buffer. This will make printing large output files 
                noticeably slower, but can be useful in conjunction with
                other software tools and scripts that need to process one
                line of bedtools output at a time.
 
        -iobuf  Specify amount of memory to use for input buffer.
                Takes an integer argument. Optional suffixes K/M/G supported.
                Note: currently has no effect with compressed files.
 
Notes: 
        (1) The input file (-i) file must be sorted by chrom, then start.
# merge overlapping exons and keep alternative names separated by '|'
bedtools merge -c 4 -o distinct_only -delim "|" -i  ${infolder}/srt_${infile} > ${infolder}/merged_${infile}
head -5 ${infolder}/merged_${infile} | sed -e "s/.\{60\}/&\n/g"
 
chr21   10906904        10907040        NM_199259_cds_0_0_chr21_10906905_r|N
M_199260_cds_0_0_chr21_10906905_r|NM_199261_cds_0_0_chr21_10
906905_r
chr21   10908824        10908895        NM_199259_cds_1_0_chr21_10908825_r|N
M_199260_cds_1_0_chr21_10908825_r|NM_199261_cds_1_0_chr21_10
908825_r
chr21   10910306        10910399        NM_199259_cds_2_0_chr21_10910307_r|N
M_199260_cds_2_0_chr21_10910307_r|NM_199261_cds_2_0_chr21_10
910307_r
chr21   10914362        10914442        NM_199259_cds_3_0_chr21_10914363_r|N
M_199260_cds_3_0_chr21_10914363_r|NM_199261_cds_3_0_chr21_10
914363_r
chr21   10916369        10916475        NM_199259_cds_4_0_chr21_10916370_r|N
M_199260_cds_4_0_chr21_10916370_r|NM_199261_cds_4_0_chr21_10
916370_r
 
# extract fasta for each line
infasta=ref/HiSeq_UCSC_hg19.fa
outfasta=ref/merged_${infile%%.bed}.fasta
bedtools getfasta -fi ${infasta} -bed ${infolder}/merged_${infile} -fo ${outfasta}  
 
# inspect the first three sequences
# !GOODY: the last part of the pipe breaks the sequence at 60 characters to display it properly
head -6 ${outfasta} | sed -e "s/.\{60\}/&\n/g"
 
>chr21:10906904-10907040
TTAATCGGATCCAGCTACAACATCACTGGAAGTCATTTTCTCGCCAAAAAGTATCTCCAC
GGCAAAATCTGATGGATAAATTCTCCGTGCTTTTTGTTTATGTAGATTATCCAATTCATT
TTTTGGTAGATAAAGC
>chr21:10908824-10908895
CTGTTATTTTCAATAAAAGATGTGTGCAACCAGAAGTAAAATGAGCAATTGTCATAGTAT
GTAGGAAGATT
>chr21:10910306-10910399
CGAATAGAAAAACTGCACTTTCACATCATCATACAGAGGTAGACCGTCGAATACATCAAT
TAATATTTTGTCTGTTGTAATGTTATCAAGTAC
 
# OR extract fasta for each line and use name field for the fasta header (-name)
infasta=ref/HiSeq_UCSC_hg19.fa
outfasta=ref/merged_name_${infile%%.bed}.fasta
bedtools getfasta -name -fi ${infasta} -bed ${infolder}/merged_${infile} -fo ${outfasta}  
 
# inspect the first three sequences
# !GOODY: the last part of the pipe breaks the sequence at 60 characters to display it properly
head -6 ${outfasta} | sed -e "s/.\{60\}/&\n/g"
 
>NM_199259_cds_0_0_chr21_10906905_r|NM_199260_cds_0_0_chr21_
10906905_r|NM_199261_cds_0_0_chr21_10906905_r
TTAATCGGATCCAGCTACAACATCACTGGAAGTCATTTTCTCGCCAAAAAGTATCTCCAC
GGCAAAATCTGATGGATAAATTCTCCGTGCTTTTTGTTTATGTAGATTATCCAATTCATT
TTTTGGTAGATAAAGC
>NM_199259_cds_1_0_chr21_10908825_r|NM_199260_cds_1_0_chr21_
10908825_r|NM_199261_cds_1_0_chr21_10908825_r
CTGTTATTTTCAATAAAAGATGTGTGCAACCAGAAGTAAAATGAGCAATTGTCATAGTAT
GTAGGAAGATT
>NM_199259_cds_2_0_chr21_10910307_r|NM_199260_cds_2_0_chr21_
10910307_r|NM_199261_cds_2_0_chr21_10910307_r
CGAATAGAAAAACTGCACTTTCACATCATCATACAGAGGTAGACCGTCGAATACATCAAT
TAATATTTTGTCTGTTGTAATGTTATCAAGTAC

Technical.png Adding '

bedtools igv to collect screenshots for a list of genomic loci

bedtools igv will create screenshots automatically from a list of loci from a BED file. This can be very useful if you want to document all variants from a list or all genes from a pathway using IGV views.

bedtools igv command details

Tool:    bedtools igv (aka bedToIgv)
Version: v2.25.0
Summary: Creates a batch script to create IGV images 
         at each interval defined in a BED/GFF/VCF file.
 
Usage:   bedtools igv [OPTIONS] -i <bed/gff/vcf>
 
Options: 
        -path   The full path to which the IGV snapshots should be written.
                (STRING) Default: ./
 
        -sess   The full path to an existing IGV session file to be 
                loaded prior to taking snapshots.
 
                (STRING) Default is for no session to be loaded.
 
        -sort   The type of BAM sorting you would like to apply to each image. 
                Options: base, position, strand, quality, sample, and readGroup
                Default is to apply no sorting at all.
 
        -clps   Collapse the aligned reads prior to taking a snapshot. 
                Default is to no collapse.
 
        -name   Use the "name" field (column 4) for each image's filename. 
                Default is to use the "chr:start-pos.ext".
 
        -slop   Number of flanking base pairs on the left & right of the image.
                - (INT) Default = 0.
 
        -img    The type of image to be created. 
                Options: png, eps, svg
                Default is png.
 
Notes: 
        (1)  The resulting script is meant to be run from within IGV.
        (2)  Unless you use the -sess option, it is assumed that prior to 
                running the script, you've loaded the proper genome and tracks.
# instead of reading from a bed file we fake a BED line with a redirect '<()' and the echo function 
#    (note the -e to translate the TABs required for bedtools)
outfolder=$BASE/IGV_visualisation
bedtools igv -i <(echo -e "chr21\t42218872\t42218983") -img png -path ${outfolder} > ${outfolder}/igv_cmd.sh
 
# the file contains the following text
snapshotDirectory $BASE/IGV_visualisation
goto chr21_42218872_42218983
snapshot chr21_42218872_42218983.png
 
# then in IGV, run the bash script from the tools menu


tools_run-bash-script.png

the chr21_42218872_42218983.png snapshot

chr21_42218872_42218983.png

 

download exercise files

Download exercise files here

Use the right application to open the files present in ex7-files

References:
  1. Liran Juan, Mingxiang Teng, Tianyi Zang, Yafeng Hao, Zhenxing Wang, Chengwu Yan, Yongzhuang Liu, Jie Li, Tianjiao Zhang, Yadong Wang
    The personal genome browser: visualizing functions of genetic variants.
    Nucleic Acids Res: 2014, 42(Web Server issue);W192-7
    [PubMed:24799434] ##WORLDCAT## [DOI] (I p)

  2. Stephan Pabinger, Andreas Dander, Maria Fischer, Rene Snajder, Michael Sperk, Mirjana Efremova, Birgit Krabichler, Michael R Speicher, Johannes Zschocke, Zlatko Trajanoski
    A survey of tools for variant analysis of next-generation genome sequencing data.
    Brief Bioinform: 2014, 15(2);256-78
    [PubMed:23341494] ##WORLDCAT## [DOI] (I p)

  3. http://bedtools.readthedocs.org/en/latest/content/bedtools-suite.html

[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.6 ]