
High-throughput sequencers, also called 'next-generation' ('next-gen' or 'NGS') or sometimes 'second-generation' (as opposed to third-generation) sequencers, are technologies that deliver from 10⁵ to several 10⁶ DNA reads, covering millions of bases. They are used to (re)sequence genomes, determine the DNA-binding sites of proteins (ChIP-seq), and sequence transcriptomes (RNA-seq).

These technologies bring analysis of sequence information to another level. Rethinking experiments is crucial.

Manufacturers and technologies

Solexa/Illumina - 1-3 Gigabase (Gb) reads of 36 or 150 bp
- video - click on Technology
Roche/454 - 0.1 Gb reads of 400-700 bp
- video
ABI/SOLiD - 2-3 Gb reads of 35-75 bp
- video
Helicos - 8 Gb reads of 25-45 bp
- video
Complete Genomics - Genome sequencing as a service, a partner of VIB to sequence complete genomes of humans (check TechWatch) (see Complete Genomics BITS wiki page)
Ion Torrent
- video

Databases of reads

NCBI Short Read Archive (see Short Read Archive Overview for more information)
European Nucleotide Archive - Reads
DDBJ Trace/Short Read Archive, Submissions, Data Release

Data formats

When you have millions of reads, you want to reduce them to a more compact form as soon as possible, since a single read on its own carries little relevant information. Merging overlapping reads (assembling) can already lead to a large reduction in data size. If a reference genome is available, you can align the reads to it (mapping) and store only the positions, the counts and the sequence deviations relative to that reference.
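
As an illustration of what mapping lets you keep, the following Python sketch reduces a handful of aligned reads to per-position counts and a tally of deviations from the reference. It assumes ungapped alignments given as hypothetical (start, sequence) pairs; the function name and the toy data are made up for this example, and real mappers also deal with indels, strands and mapping quality.

```python
from collections import Counter

def summarize_mapped_reads(reference, mapped_reads):
    """Reduce ungapped alignments to per-position depth and mismatches.

    mapped_reads is an iterable of (start, sequence) pairs, where start is
    the 0-based position of the read's first base on the reference.
    """
    depth = [0] * len(reference)
    deviations = Counter()  # (position, reference base, read base) -> count
    for start, seq in mapped_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos >= len(reference):
                break  # ignore bases hanging off the end of the reference
            depth[pos] += 1
            if base != reference[pos]:
                deviations[(pos, reference[pos], base)] += 1
    return depth, deviations

# Toy data (hypothetical): three reads on a 10-base reference.
reference = "ACGTACGTAC"
reads = [(0, "ACGTA"), (2, "GTTCG"), (5, "CGTAC")]
depth, deviations = summarize_mapped_reads(reference, reads)
print(depth)             # [1, 1, 2, 2, 2, 2, 2, 1, 1, 1]
print(dict(deviations))  # {(4, 'A', 'T'): 1}
```

Once the depth list and the deviation tally are stored, the individual reads are no longer needed for many downstream questions, which is where the reduction in data size comes from.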

MAQ .map format (a compressed binary file specifically designed for short read alignment)
AMOS (A Modular Open-Source Assembler), assembly format used by Velvet
SRF Sequence Read Format (also called Short Read Format), solid2srf, illumina2srf
MINSEQE Minimum Information about a high-throughput SeQuencing Experiment
FASTQ is a common format for short reads with quality scores. EMBOSS 6.1.0 supports it as a sequence format. EMBOSS also uses the quality scores when the format is named more explicitly: fastqsanger or fastqillumina
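
A minimal Python sketch of reading FASTQ records, assuming the common four-lines-per-record layout and the usual ASCII offsets of 33 for fastqsanger and 64 for fastqillumina. The function name read_fastq and the sample record are illustrative only; a full parser such as the one in EMBOSS handles more cases (sequences wrapped over several lines, for instance).

```python
import io

# ASCII offsets assumed for the two FASTQ quality encodings named above.
PHRED_OFFSETS = {"fastqsanger": 33, "fastqillumina": 64}

def read_fastq(handle, variant="fastqsanger"):
    """Yield (read_id, sequence, quality_scores) from a four-line FASTQ stream."""
    offset = PHRED_OFFSETS[variant]
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) ends the stream
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Each quality character encodes one Phred score as ord(char) - offset.
        yield header[1:], seq, [ord(c) - offset for c in qual]

# Hypothetical single-record file held in memory.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
print(list(read_fastq(example)))  # [('read1', 'ACGT', [40, 40, 20, 0])]
```

The same quality string decodes to different scores under the two offsets, which is why the variant has to be stated explicitly before the scores can be used.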

Standard Analysis Workflow of HTS DNA reads

Assembly

Depending on the sample, reads may be assembled before being mapped to a reference genome (if one exists). Assembly merges overlapping sequences into one sequence and is a very computationally demanding task.

MIRA - also part of the EMBASSY package.
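
To make the idea of merging overlapping sequences concrete, here is a toy greedy assembler in Python. It is only a sketch under strong assumptions (error-free reads, exact suffix-prefix overlaps, a single strand) and is not how MIRA or any other production assembler works; the function names and the min_overlap parameter are invented for this example.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        # Compare every ordered pair: this is quadratic in the number of reads.
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left; the remaining contigs stay separate
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs

print(greedy_assemble(["ACGTAC", "TACGGA", "GGATTC"]))  # ['ACGTACGGATTC']
```

Even this naive version compares every pair of sequences in every round, which gives a feel for why assembly becomes so demanding once there are millions of reads.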

Mapping

See Mappers.

Quality Assessment

High-throughput sequencing data contain a fair number of errors. To distinguish sequencing errors from genuine sequence differences, sufficient sequencing depth is needed, together with a good quality assessment of the reads.
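
The two quantities mentioned here can be put into numbers with a short Python sketch. It relies on the standard Phred definition (a score Q corresponds to an error probability of 10^(-Q/10)) and on the usual estimate of average depth as number of reads times read length divided by genome size. The function names and the example figures are hypothetical.

```python
def error_probability(phred):
    """Probability that a base call is wrong, given its Phred score Q."""
    return 10 ** (-phred / 10)

def expected_errors(quality_scores):
    """Expected number of miscalled bases in one read."""
    return sum(error_probability(q) for q in quality_scores)

def mean_depth(read_count, read_length, genome_size):
    """Average number of reads covering each base: N * L / G."""
    return read_count * read_length / genome_size

# Hypothetical read with two good bases and two poor ones.
print(expected_errors([40, 40, 20, 10]))

# Hypothetical run: one million 36 bp reads on a 4 Mb genome.
print(mean_depth(1_000_000, 36, 4_000_000))
```

A deviation seen in a single low-quality read at low depth is hard to tell from an error, whereas the same deviation supported by many high-quality reads is much more likely to be genuine.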

Tools

Tool	NGS pipeline part	RPM?	Link to source
Cgatools	Format conversion - Analysis:SNP detection	No	http://cgatools.sourceforge.net/
FastQC	Quality control	No	http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
Fastxtoolkit	Quality control	Yes [1] and dep. [2]	http://hannonlab.cshl.edu/fastx_toolkit/
Filo	Format conversion	U.D. [3]	https://github.com/arq5x/filo/
HiTec	Quality control	No	http://www.csd.uwo.ca/~ilie/HiTEC/
Picard-tools	Format conversion	U.D.	http://picard.sourceforge.net/command-line-overview.shtml
Pindel	Analysis: SV	No	https://trac.nbic.nl/pindel/
Samstat	Quality control	No	http://samstat.sourceforge.net/
Samtools	Format conversion	Yes – rpmsearch	http://samtools.sourceforge.net/
Bamtools	Format conversion	No	https://github.com/pezmaster31/bamtools
Bamview	Visualisation	No	http://bamview.sourceforge.net/
Tabix	Format conversion	U.D. [4]	http://sourceforge.net/projects/samtools/files/
Vcftools	Format conversion - Data mining	U.D. [5]	http://vcftools.sourceforge.net/
Genomeanalysis TK	Quality control - Analysis	No	http://www.broadinstitute.org/gsa/wiki/index.php/Main_Page#The_Genome_Analysis_Toolkit_.28GATK.29
pe-asm	Assembling	No	http://code.google.com/p/pe-asm/
perl-Bio-SamTools	Format conversion	Yes [6]	http://code.google.com/p/pe-asm/
prinseq	Quality control	No	http://prinseq.sourceforge.net
Kent Tools	Format conversion	No	http://hgwdev.cse.ucsc.edu/~kent/
Bedtools	Data mining	Yes – [7]
Repeatmasker	Cleaning	No	http://www.repeatmasker.org/

HTS data analysis packages

SAMtools, the program package distributed with the SAM format (Win,Linux,MacOS)
BioConductor (R) packages for HTS data analysis

Using R packages

Manuals for HT Sequence Analysis with R and Bioconductor

ShortRead - Quality assessment of the reads, finding duplicates, trimming, string pattern searches [8]
Biostrings - Reading sequences in R [9]
BSgenome - Reading in complete genomes and BioC annotation data [10]
DEGseq - Identify differentially expressed genes from RNA-Seq data. [11]
IRanges - infrastructure for positional data.
biomaRt - interface to BioMart annotations.
rtracklayer - interface to online and other genome browsers.
chipseq & ChIPpeakAnno - ChIP-seq analysis.

Visualisation

Stand-alone viewing tools for high-throughput sequencing data

Tablet, SCRI
Integrative Genomics Viewer (IGV), a very powerful Java-based viewer (Win, Mac, Linux)
EagleView, the Marth Lab
AnnoJ genome browser - Visualising deep sequencing data and other genome annotation data.
MagicViewer - Displays large-scale short reads in a zoomable interface with a user-defined color scheme, independent of the operating system
MapView - Visualising short reads alignments on a desktop computer
JGI Genome Browser - Visual tool for viewing assembled genomes.
Lightweight Genome Viewer - A small genome viewer

Purposes of HT sequencing

Transcriptome sequencing

Also called RNA-seq

SNP discovery

Sometimes referred to as "SNP-seq"

GigaBayes
SNPExpress, a database enabling researchers to input a SNP, gene, or a genomic region to investigate regions of interest for localized effects of SNPs on exon and gene level expression changes.
PolyBayes - SNP discovery from MarthLab

Structural variation

Copy Number Variation

Promoter analysis

ChIP-seq - DNA-protein interaction

Methylome sequencing

BS-seq - bisulfite treatment and sequencing

Small RNA profiling

mRNA expression profiling

Digital gene expression (DGE)

Other useful information sources

SEQanswers - find all your answers on this popular next-generation sequencing wiki
de novo transcriptome analysis using 454 - Experimental protocols, articles, but also bioinformatic scripts (Perl) - MatzLab
[12] - a lot of software on Sequence Analysis

Next-generation sequencing
