NGS Exercise.8

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.7 | NGS Exercise.9 ]
# updated 2014 version


Visualize all results in the Interactive Genome Viewer (IGV)


HUNK_no-cvg.png

Variant visualization tools

Handicon.png Due to the intrinsic noise in NGS data, Visualisation is vital in order to evaluate results and generate hypothesis prior to final validation. Visualisation also allows presenting the obtained data side by side with pieces of evidence originating from the zillion resources out-there.

What tool should you use?

Many popular visualization tools are available that present data in ways it can be better evaluated. We list below a coupe of open-source free products, many other free or commercial tools exist but listing them all is not the focus of this document.

  • Circos is a very popular graphic package to display circular plots with annotations . Please see one simple Circos example NGS_Exercise.4
  • A recent paper describes the Personal Genome Browser (PGB) designed to visualize human variants (http://www.pgbrowser.org/). As NA18507 is part of the available genomes, you can use this resource to confirm variants found during this session and add our VCF data to the browser. You may also want to read the accompanying paper [1] that lists many more tools that can be used in the same context.
  • Other related tools are cited in the PGB publication cited above or in A survey of tools for variant analysis of next-generation genome sequencing data [2]

IGV: The Interactive Genome Viewer

Some users may prefer other gene browsers and are welcome to keep using them but we chose IGV as favorite genome viewer because it is well documented, already contains links to precious public datasets including Encode and UCSC tacks and because it is easy to deploy, fast and robust.

  • IGV supports a large number of standard formats that can be found on their information pages (<http://www.broadinstitute.org/igv/FileFormats>).
  • Loading data into IGV is simple and often makes use of the IGVTools; also available from the Broad institute. IGV tools are efficient when tiling or indexing is required to speed-up data parsing (large bed files). Other lighter formats can be opened as such without performance drop.
  • IGV can also be used in combination with bedtools to export screen-shots in batch for loci listed in a bed file. To discover this command, please type *bedtools igv -h* (previously *bedToIgv*) in your terminal window for more information about this feature.

IGV will be used during the session and is not further documented here, please refer to the very extensive online help content on the 'omnipresent' Broad Institute site for more information. (<http://www.broadinstitute.org/igv>).

example: samtools view to fetch region-specific reads from the Internet

Handicon.png samtools view can read **directly from the internet** when the repository also contains the index file

This can be really interesting when you need reads from a small region of the genome and the BAM file on the internet is the full genome (many GBs).

The following code example extracts reads mapping to the DSCAM (your favorite gene?) region on chr21 directly from the WEB for the NA18507 genome sequenced by Complete Genomics. Only evidence reads are reported (those participating to Complete Genomics local realignment to find variants) which explains why reads are grouped in a region and not present all over the chromosome length. Regions without reads on the left and right of the screen did not contain putative calls and were not stored as evidenceSupport data (but the corresponding raw reads are present elsewhere in the CG data for computing coverage).

# add 1kb at either ends of the DSCAM locus: chr21:42,218,289-42,219,325
# result are stored to a small local file and indexed.
outfolder="public_info/CG_variants/"
 
baseurl="ftp://ftp.1000genomes.ebi.ac.uk"
path="/vol1/ftp/data/NA18507/cg_data/"
datafile="NA18507_buffy_SRR822930.mapped.COMPLETE_GENOMICS.CGworkflow2_2_evidenceSupport.YRI.high_coverage.20130401.bam"
 
filter="chr21:42,218,289-42,219,325"
 
samtools view -b -h ${baseurl}/${path}/${datafile} ${filter} > \
	${outfolder}/DSCAM_region.bam && samtools index \
	${outfolder}/DSCAM_region.bam

The BAM file can be visualised in **IGV** to show the sequencing read pileup and supporting evidence for variant calls (here zoomed on the exon1 of human DSCAM with T->C polymorphism at position 42'218'928 in blue: chr21:42,218,289-42,219,325).


IGV view of the first DSCAM exon (reverse orientation)

CG_DSCAM_ex1.png

Handicon.png A region was created to point to DSCAM (> Regions/Region Navigator). You can also type DSCAM in the search box to find the locus

REM: Additional information about colours found in alignments can be obtained from the online IGV documentation.

IGV color codes for reads

readpairorientations.jpg

Handicon.png feature jumping and exon jumping

  • To feature-jump, you select a feature track and press Ctrl-F for forward, Ctrl-B for back.
  • To exon-jump, you select a feature track and press Shift- Ctrl-F to center the next exon in your view, Shift- Ctrl-B to move back one exon.

example: samtools faidx to fetch a region of the reference genome

Handicon.png samtools faidx should have been applied to the reference genome fasta file in order to use this functionality

This operation is often required to get the context of a variant and possibly align it to other sequences for control. The samtools faidx indexing of the reference genome allows very rapid retrieval of sequence using the same command with start and stop coordinates. An example below extracts 50bps either sides of the variant cited above.

IGV view of a hom variant in the first DSCAM exon

In addition to the reference genome and refseq transcripts, this view shows:
  • a track for bcftools calls
  • the track showing all reads mapped in the BWA mem experiment
  • the read coverage depth derived by IGV from the BAM track and showing no coverage around the Start codon (while Complete Genomics reads did show significant coverage in the same window - see above)
DSCAM_UTR-variant.png
# requires indexing the reference file (done)
# => samtools faidx ref/HiSeq_UCSC_hg19.fa
 
# the following command adds 50bps at either ends of the DSCAM variant position: chr21: 42218928
reference=ref/HiSeq_UCSC_hg19.fa
target="chr21:42218878-42218978"
samtools faidx ${reference} ${target}
 
>chr21:42218878-42218978
TCGGCAGGTGGAGAGAGCCGCAGACGCGGGCGTGAGCTCATCCCGGGCACTCGGCGCCCA
GAATGCGCAGCCAGCGGCCTCCCGTGCGCCAGCCCTCACCC

IGV view around a variant

In addition to the sequence view from the IGV RefSeq track, users can see here:
  • translation of the currently viewed 5'-UTR (not relevnant! => alway inspect the type of exon displayed and its orientation)
  • the read coverage depth derived by IGV from the BAM track
  • the variant metrics extracted from the BWA-mem VCF track (mouse over the red histogram bar in the vcf-track to get this popup window)
DSCAM_variant_100bps.png

example: bedtools getfasta to fetch multiple regions of the reference genome

What if you need many regions, obtained from bedtools manipulations or other sources? bedtools getfasta will store the hg19 DNA sequence of each region defined above into a multifasta file

getfasta-glyph.png

getfasta command details

Tool:    bedtools getfasta (aka fastaFromBed)
Version: v2.17.0-111-gab7bbbe
Summary: Extract DNA sequences into a fasta file based on feature coordinates.
 
Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf> -fo <fasta> 
 
Options: 
	-fi	Input FASTA file
	-bed	BED/GFF/VCF file of ranges to extract from -fi
	-fo	Output file (can be FASTA or TAB-delimited)
	-name	Use the name field for the FASTA header
	-split	given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons)
	-tab	Write output in TAB delimited format.
		- Default is FASTA format.
 
	-s	Force strandedness. If the feature occupies the antisense,
		strand, the sequence will be reverse complemented.
		- By default, strand information is ignored.
# download only coding exons for hg19 chr21 using the UCSC table browser in BED format
infolder=ref
infile=hg19_chr21-refGene-CDS.bed
head -5 ${infolder}/${infile}
chr21	10906904	10907040	NM_199259_cds_0_0_chr21_10906905_r	0	-
chr21	10908824	10908895	NM_199259_cds_1_0_chr21_10908825_r	0	-
chr21	10910306	10910399	NM_199259_cds_2_0_chr21_10910307_r	0	-
chr21	10914362	10914442	NM_199259_cds_3_0_chr21_10914363_r	0	-
chr21	10916369	10916475	NM_199259_cds_4_0_chr21_10916370_r	0	-
# the result is a redundant bed where alternative gene models share common exons
# sort the lines by chr+coord
sort -k 1V,1 -k 2n,2 -k 3n,3 ${infolder}/${infile} > ${infolder}/srt_${infile}
head -5 ${infolder}/srt_${infile}
chr21	10906904	10907040	NM_199259_cds_0_0_chr21_10906905_r	0	-
chr21	10906904	10907040	NM_199260_cds_0_0_chr21_10906905_r	0	-
chr21	10906904	10907040	NM_199261_cds_0_0_chr21_10906905_r	0	-
chr21	10908824	10908895	NM_199259_cds_1_0_chr21_10908825_r	0	-
chr21	10908824	10908895	NM_199260_cds_1_0_chr21_10908825_r	0	-
# merge overlapping exons and keep alternative names separated by '|'
bedtools merge -nms -delim "|" -i  ${infolder}/srt_${infile} > ${infolder}/merged_${infile}
head -5 ${infolder}/merged_${infile} | sed -e "s/.\{60\}/&\n/g"
chr21  10906904  10907040  NM_199259_cds_0_0_chr21_10906905_r|NM_199260_cds_0_0_chr21_10906905_r|NM_199261_cds_0_0_chr21_10906905_r
chr21  10908824  10908895  NM_199259_cds_1_0_chr21_10908825_r|NM_199260_cds_1_0_chr21_10908825_r|NM_199261_cds_1_0_chr21_10908825_r
chr21  10910306  10910399  NM_199259_cds_2_0_chr21_10910307_r|NM_199260_cds_2_0_chr21_10910307_r|NM_199261_cds_2_0_chr21_10910307_r
chr21  10914362  10914442  NM_199259_cds_3_0_chr21_10914363_r|NM_199260_cds_3_0_chr21_10914363_r|NM_199261_cds_3_0_chr21_10914363_r
chr21  10916369  10916475  NM_199259_cds_4_0_chr21_10916370_r|NM_199260_cds_4_0_chr21_10916370_r|NM_199261_cds_4_0_chr21_10916370_r
# extract fasta for each line
infasta=ref/HiSeq_UCSC_hg19.fa
outfasta=ref/merged_${infile%%.bed}.fasta
bedtools getfasta -fi ${infasta} -bed ${infolder}/merged_${infile} -fo ${outfasta}  
 
# inspect the first three sequences
# !GOODY: the last part of the pipe breaks the sequence at 60 characters to display it properly
head -6 ${outfasta} | sed -e "s/.\{60\}/&\n/g"
>chr21:10906904-10907040
TTAATCGGATCCAGCTACAACATCACTGGAAGTCATTTTCTCGCCAAAAAGTATCTCCAC
GGCAAAATCTGATGGATAAATTCTCCGTGCTTTTTGTTTATGTAGATTATCCAATTCATT
TTTTGGTAGATAAAGC
>chr21:10908824-10908895
CTGTTATTTTCAATAAAAGATGTGTGCAACCAGAAGTAAAATGAGCAATTGTCATAGTAT
GTAGGAAGATT
>chr21:10910306-10910399
CGAATAGAAAAACTGCACTTTCACATCATCATACAGAGGTAGACCGTCGAATACATCAAT
TAATATTTTGTCTGTTGTAATGTTATCAAGTAC
# OR extract fasta for each line and use name field for the fasta header (-name)
infasta=ref/HiSeq_UCSC_hg19.fa
outfasta=ref/merged_name_${infile%%.bed}.fasta
bedtools getfasta -name -fi ${infasta} -bed ${infolder}/merged_${infile} -fo ${outfasta}  
 
# inspect the first three sequences
# !GOODY: the last part of the pipe breaks the sequence at 60 characters to display it properly
head -6 ${outfasta} | sed -e "s/.\{60\}/&\n/g"
>NM_199259_cds_0_0_chr21_10906905_r|NM_199260_cds_0_0_chr21
_10906905_r|NM_199261_cds_0_0_chr21_10906905_r
TTAATCGGATCCAGCTACAACATCACTGGAAGTCATTTTCTCGCCAAAAAGTATCTCCA
CGGCAAAATCTGATGGATAAATTCTCCGTGCTTTTTGTTTATGTAGATTATCCAATTCA
TTTTTTGGTAGATAAAGC
>NM_199259_cds_1_0_chr21_10908825_r|NM_199260_cds_1_0_chr21
_10908825_r|NM_199261_cds_1_0_chr21_10908825_r
CTGTTATTTTCAATAAAAGATGTGTGCAACCAGAAGTAAAATGAGCAATTGTCATAGTA
TGTAGGAAGATT
>NM_199259_cds_2_0_chr21_10910307_r|NM_199260_cds_2_0_chr21
_10910307_r|NM_199261_cds_2_0_chr21_10910307_r
CGAATAGAAAAACTGCACTTTCACATCATCATACAGAGGTAGACCGTCGAATACATCAA
TTAATATTTTGTCTGTTGTAATGTTATCAAGTAC

example: add public data to explain and support private results

The example above with BWA mem shows a region where no reads mapped to the first exon of DSCAM and should be explainable using genome features in that region. Two common reasons for the absence of maping are the presence of repeated elements and a very high level of GC in a locus. To support these, two public tracks can be added from the built-in resources accessible through >File >Load from Server

adding public annotation stracks

many tracks are available in the list, we select here only those related to repeats, structural variants, and GC content
annotation_tracks.png
It is clear that both GC% and simple repeats of the (CCG)n and GC_rich are present in the gap and are probably responsible for its existence.

result of adding public tracks

DSCAM_UTR-annotated.png

download exercise files

Download exercise files here

Use the right application to open the files present in ex8-files

References:
  1. Liran Juan, Mingxiang Teng, Tianyi Zang, Yafeng Hao, Zhenxing Wang, Chengwu Yan, Yongzhuang Liu, Jie Li, Tianjiao Zhang, Yadong Wang
    The personal genome browser: visualizing functions of genetic variants.
    Nucleic Acids Res: 2014, 42(Web Server issue);W192-7
    [PubMed:24799434] ##WORLDCAT## [DOI] (I p)

  2. Stephan Pabinger, Andreas Dander, Maria Fischer, Rene Snajder, Michael Sperk, Mirjana Efremova, Birgit Krabichler, Michael R Speicher, Johannes Zschocke, Zlatko Trajanoski
    A survey of tools for variant analysis of next-generation genome sequencing data.
    Brief Bioinform: 2014, 15(2);256-78
    [PubMed:23341494] ##WORLDCAT## [DOI] (I p)

  3. http://www.broadinstitute.org/igv/AlignmentData
  4. http://www.broadinstitute.org/software/igv/interpreting_pair_orientations

[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.7 | NGS Exercise.9 ]