Next-generation sequencing
High-throughput sequencers, also called 'next-generation' ('next-gen' or 'NGS') or sometimes 'second-generation' (as opposed to third-generation) sequencers, are technologies that deliver 10⁵ to several 10⁶ DNA reads, covering millions of bases. They are used to (re)sequence genomes, determine the DNA-binding sites of proteins (ChIP-seq), and sequence transcriptomes (RNA-seq).
These technologies bring analysis of sequence information to another level. Rethinking experiments is crucial.
Manufacturers and technologies
- Solexa/Illumina - 1-3 gigabases (Gb) in reads of 36 or 150 bp
- video - click on Technology
- Roche/454 - 0.1 Gb in reads of 400-700 bp
- ABI/SOLiD - 2-3 Gb in reads of 35-75 bp
- Helicos - 8 Gb in reads of 25-45 bp
- Complete Genomics - genome sequencing as a service; a partner of VIB for sequencing complete human genomes (check TechWatch; see the Complete Genomics BITS wiki page)
- Ion Torrent
Databases of reads
- NCBI Short Read Archive (see Short Read Archive Overview for more information)
- European Nucleotide Archive - Reads
- DDBJ Trace/Short Read Archive, Submissions, Data Release
Data formats
When you have millions of reads, you want to condense them as soon as possible, since a single read on its own carries little relevant information. Merging overlapping reads (assembly) can already reduce the data size considerably. If a reference genome is available, you can align the reads to it (mapping) and store only the positions, the counts and the sequence deviations relative to that reference.
- MAQ .map format (a compressed binary file specifically designed for short read alignment)
- AMOS (A Modular Open-Source Assembler) - assembly format used by Velvet
- SRF - Sequence Read Format (also called Short Read Format); converters: solid2srf, illumina2srf
- MINSEQE Minimum Information about a high-throughput SeQuencing Experiment
- FASTQ format is a common format for short reads with quality scores. It is supported in EMBOSS 6.1.0 as a sequence format. The quality scores are also used when the format is named more explicitly in EMBOSS: fastqsanger or fastqillumina.
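To make the FASTQ entry above more concrete, here is a minimal sketch of reading FASTQ records and decoding their quality strings. It assumes the simple four-lines-per-record layout (no wrapped sequence lines) and the usual Phred offsets of 33 for the Sanger variant and 64 for the Illumina 1.3+ variant. The function name `read_fastq` and the dictionary `PHRED_OFFSETS` are made up for this illustration and are not part of EMBOSS.

```python
from io import StringIO

# Assumed Phred offsets for the two FASTQ variants named in the text.
PHRED_OFFSETS = {"fastqsanger": 33, "fastqillumina": 64}


def read_fastq(handle, variant="fastqsanger"):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream."""
    offset = PHRED_OFFSETS[variant]
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record near {header!r}")
        # Each quality character encodes one Phred score as ASCII code minus offset.
        yield header[1:], seq, [ord(c) - offset for c in qual]


# Hypothetical one-record input, used only to exercise the parser.
example = StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
for name, seq, quals in records:
    print(name, seq, quals)
```

Decoding the same quality string with the wrong offset shifts every score by 31, which is why naming the variant explicitly matters.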
Standard Analysis Workflow of HTS DNA reads
Assembly
Depending on the sample, reads may be assembled before being mapped to a reference genome (if one exists). Assembly merges overlapping reads into longer contiguous sequences and is computationally very demanding.
- MIRA - also part of the EMBASSY package.
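The core idea of merging overlapping reads can be sketched as follows. This is a toy illustration only: it assumes exact suffix-prefix overlaps between two reads, whereas real assemblers such as MIRA must cope with sequencing errors, repeats and millions of reads. The function names and the `min_overlap` parameter are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Return the length of the longest suffix of `left` equal to a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0  # no overlap of at least min_overlap bases


def merge_reads(left, right, min_overlap=3):
    """Merge two reads into one sequence, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    # Keep `left` whole and append only the non-overlapping tail of `right`.
    return left + right[size:]


merged = merge_reads("ACGTTGCA", "TGCAGGTC")
print(merged)
```

Even this naive pairwise comparison hints at the cost: checking every read against every other read grows quadratically with the number of reads.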
Mapping
See Mappers.
Quality Assessment
High-throughput sequencing data contains a fair number of errors. To distinguish sequencing errors from genuine sequence differences, sufficient sequencing depth is needed, together with a good quality assessment of the reads.
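Why depth helps can be illustrated with a small calculation. The sketch below assumes that errors at a position are independent and occur at a fixed per-base rate, which is a simplification; the function names are chosen for this example. It converts a Phred quality score into an error probability and then computes the chance that at least k of n overlapping reads show a mismatch through error alone.

```python
from math import comb


def phred_to_error_probability(quality):
    """Convert a Phred score Q into an error probability, 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def prob_at_least_k_errors(depth, k, error_rate):
    """P(X >= k) for X ~ Binomial(depth, error_rate): k or more erroneous reads."""
    return sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k, depth + 1)
    )


error_rate = phred_to_error_probability(20)  # Q20 corresponds to 1 error in 100 bases

# Same mismatch fraction (25%) at two different depths.
low_depth = prob_at_least_k_errors(4, 1, error_rate)     # 1 of 4 reads differ
high_depth = prob_at_least_k_errors(40, 10, error_rate)  # 10 of 40 reads differ
print(low_depth, high_depth)
```

Under these assumptions, one mismatching read out of four is explained by error alone roughly 4% of the time, while ten out of forty is many orders of magnitude less likely, so the deeper observation is far stronger evidence of a genuine difference.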
Tools
See also this NBIC wiki page
HTS data analysis packages
- SAMtools, the program package distributed with the SAM format (Win,Linux,MacOS)
- BioConductor (R) packages for HTS data analysis
Using R packages
Manuals for HT Sequence Analysis with R and Bioconductor
- ShortRead - Quality assessment of the reads, finding duplicates, trimming, string pattern searches [8]
- Biostrings - Reading sequences in R [9]
- BSgenome - Reading in complete genomes and BioC annotation data [10]
- DEGseq - Identify differentially expressed genes from RNA-seq data. [11]
- IRanges - infrastructure for positional data.
- biomaRt - interface to BioMart annotations.
- rtracklayer - interface to online and other genome browsers.
- chipseq & ChIPpeakAnno - ChIP-seq analysis.
Visualisation
Stand-alone viewing tools for high-throughput sequencing data
- Tablet, SCRI
- Integrative Genomics Viewer (IGV), a very powerful Java-based viewer (Win, Mac, Linux)
- EagleView, the Marth Lab
- AnnoJ genome browser - Visualising deep sequencing data and other genome annotation data.
- MagicViewer - displays large-scale short reads in a zoomable interface with a user-defined color scheme, independent of the operating system
- MapView - Visualising short read alignments on a desktop computer
- JGI Genome Browser - Visual tool for viewing assembled genomes.
- Lightweight Genome Viewer - A small genome viewer
Purposes of HT sequencing
Transcriptome sequencing
Also called RNA-seq
SNP discovery
Sometimes referred to as "SNP-seq"
- GigaBayes
- SNPExpress, a database enabling researchers to input a SNP, gene, or a genomic region to investigate regions of interest for localized effects of SNPs on exon and gene level expression changes.
- PolyBayes - SNP discovery from MarthLab
Structural variation
Copy Number Variation
Promoter analysis
- ChIP-seq - DNA-protein interaction
Methylome sequencing
- BS-seq - bisulfite treatment and sequencing
Small RNA profiling
mRNA expression profiling
Digital gene expression (DGE)
Other useful information sources
- SeqAnswers - find all your answers on this popular next-generation sequencing wiki
- de novo transcriptome analysis using 454 - Experimental protocols, articles, but also bioinformatic scripts (Perl) - MatzLab
- [12] - a lot of software on Sequence Analysis