Next-generation sequencing

From BITS wiki


High-throughput sequencers, also called 'next-generation' ('next-gen' or 'NGS') or sometimes 'second-generation' (as opposed to third-generation) sequencers, are technologies that deliver 10⁵ to several 10⁶ DNA reads, covering millions of bases. They are used to (re)sequence genomes, determine the DNA-binding sites of proteins (ChIP-seq), and sequence transcriptomes (RNA-seq).

These technologies bring analysis of sequence information to another level. Rethinking experiments is crucial.

Manufacturers and technologies

Databases of reads

Data formats

When you have millions of reads, you want to reduce them to a more compact representation as soon as possible, since a single read on its own carries little relevant information. Merging overlapping reads (assembly) can already reduce the data size considerably. If a reference genome is available, you can align the reads to it (mapping) and store only the positions, the counts and the sequence deviations relative to that reference.
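As a rough illustration of the mapping idea, the sketch below reduces a handful of already-aligned reads to per-position counts and a list of deviations from the reference. The reference string, the read tuples and the function name `summarize_alignments` are made up for this example; real mappers also deal with gaps, strands and base qualities, which are ignored here.

```python
from collections import Counter

def summarize_alignments(reference, aligned_reads):
    """Reduce gap-free alignments to coverage counts and mismatches.

    aligned_reads: iterable of (start, sequence) with 0-based start
    positions on the reference (hypothetical input layout).
    """
    coverage = [0] * len(reference)
    deviations = Counter()  # (position, ref_base, read_base) -> count
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos >= len(reference):
                break  # ignore bases hanging over the reference end
            coverage[pos] += 1
            if base != reference[pos]:
                deviations[(pos, reference[pos], base)] += 1
    return coverage, deviations

# Toy data, invented for the example
reference = "ACGTACGTAC"
reads = [(0, "ACGTA"), (2, "GTTCG"), (5, "CGTAC")]
coverage, deviations = summarize_alignments(reference, reads)
```

Once `coverage` and `deviations` are stored, the individual reads are no longer needed for many downstream questions, which is the data reduction described above.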

  • MAQ .map format (a compressed binary file specifically designed for short read alignment)
  • AMOS A Modular Open-Source Assembler, assembly format used by velvet
  • SRF Sequence Read Format (also called Short Read Format), solid2srf, illumina2srf
  • MINSEQE Minimum Information about a high-throughput SeQuencing Experiment
  • FASTQ is a common format for short reads with quality scores. EMBOSS 6.1.0 supports it as a sequence format; the quality scores are also used if the format is named more explicitly in EMBOSS: fastqsanger or fastqillumina
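To make the FASTQ quality encoding concrete, here is a minimal reader, assuming the simple four-lines-per-record layout. The function name `read_fastq` is invented for this sketch. The offset of 33 matches the Sanger variant, while older Illumina files (pipeline 1.3 to 1.7) used an offset of 64, so the right value depends on where the file came from.

```python
import io

def read_fastq(handle, offset=33):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ.

    offset=33 corresponds to the Sanger encoding; Illumina 1.3-1.7 files
    used offset=64. Multi-line records are not handled in this sketch.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or an empty line) stops the reader
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, content ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # Each quality character encodes one Phred score
        yield header[1:], seq, [ord(c) - offset for c in qual]

# In-memory example record, invented for the illustration
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
```

For real data an established parser (for instance ShortRead in Bioconductor or EMBOSS itself) is the safer choice; the sketch only shows what the quality line encodes.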

Standard Analysis Workflow of HTS DNA reads


Depending on the sample, reads may be assembled before being mapped to a reference genome (if one is available). Assembly merges overlapping sequences into a single sequence and is computationally very demanding.

  • MIRA - also part of the EMBASSY package.
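The core idea of merging overlapping sequences can be sketched for two reads as follows. This is a toy illustration with exact matching only; the function name `merge_overlap` and the `min_overlap` threshold are assumptions for the example, and real assemblers such as MIRA use far more elaborate, error-tolerant strategies.

```python
def merge_overlap(left, right, min_overlap=3):
    """Merge two reads if a suffix of `left` equals a prefix of `right`.

    Returns the merged sequence, or None when no exact overlap of at
    least `min_overlap` bases exists (hypothetical threshold, >= 1).
    """
    longest = min(len(left), len(right))
    # Try the longest possible overlap first, then shrink
    for size in range(longest, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None

# Toy reads, invented for the example
contig = merge_overlap("ACGTTGCA", "TGCAGGTC")
```

Comparing every read against every other read in this way scales quadratically with the number of reads, which gives a feel for why assembly is so demanding on millions of reads.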


See Mappers.

Quality Assessment

High-throughput sequencing data contains a fair number of errors. To distinguish sequencing errors from genuine sequence differences, sufficient sequencing depth is needed, together with a good quality assessment of the reads.
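The interplay between depth and quality can be illustrated with a small sketch. A Phred score Q corresponds to an error probability of 10^(-Q/10); the function `supports_variant` and its three thresholds are purely illustrative assumptions, not recommended settings or the method of any particular tool.

```python
def phred_to_error_probability(quality):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-quality / 10)

def supports_variant(ref_base, calls, min_quality=20, min_depth=8,
                     min_fraction=0.2):
    """Decide whether base calls at one position support a non-reference allele.

    calls: list of (base, phred_quality) observed at the position.
    All three thresholds are illustrative assumptions, not recommendations.
    """
    # Discard low-quality calls before counting anything
    trusted = [base for base, q in calls if q >= min_quality]
    if len(trusted) < min_depth:
        return False  # too shallow to separate errors from real differences
    alt = sum(1 for base in trusted if base != ref_base)
    return alt / len(trusted) >= min_fraction

# Invented pileup: 6 good 'A' calls, 4 good 'G' calls, 3 poor 'G' calls
calls = [("A", 35)] * 6 + [("G", 32)] * 4 + [("G", 5)] * 3
decision = supports_variant("A", calls)
```

Here the three low-quality calls are ignored, and the remaining ten calls are deep enough for the four 'G' observations to count as support for a difference rather than as noise.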


Tool               NGS pipeline part                            RPM?                   Link to source
Cgatools           Format conversion - Analysis: SNP detection  No
FastQC             Quality control                              No
Fastxtoolkit       Quality control                              Yes [1] and dep. [2]
Filo               Format conversion                            U.D. [3]
HiTec              Quality control                              No
Picard-tools       Format conversion                            U.D.
Pindel             Analysis: SV                                 No
Samstat            Quality control                              No
Samtools           Format conversion                            Yes, rpmsearch
Bamtools           Format conversion                            No
Bamview            Visualisation                                No
Tabix              Format conversion                            U.D. [4]
Vcftools           Format conversion - Data mining              U.D. [5]
Genomeanalysis TK  Quality control - Analysis                   No
pe-asm             Assembling                                   No
perl-Bio-SamTools  Format conversion                            Yes [6]
prinseq            Quality control                              No
Kent Tools         Format conversion                            No
Bedtools           Data mining                                  Yes [7]
Repeatmasker       Cleaning                                     No

See also this NBIC wiki page

HTS data analysis packages

Using R packages

Manuals for HT Sequence Analysis with R and Bioconductor

  1. ShortRead - quality assessment of the reads, finding duplicates, trimming, string pattern searches [8]
  2. Biostrings - reading sequences in R [9]
  3. BSgenome - reading in complete genomes and BioC annotation data [10]
  4. DEGseq - identifying differentially expressed genes from RNA-seq data [11]
  5. IRanges - infrastructure for positional data
  6. biomaRt - interface to BioMart annotations
  7. rtracklayer - interface to online and other genome browsers
  8. chipseq & ChIPpeakAnno - ChIP-seq analysis


Stand-alone viewing tools for high-throughput sequencing data

Purposes of HT sequencing

Transcriptome sequencing

Also called RNA-seq

SNP discovery

Sometimes referred to as "SNP-seq"

  • GigaBayes
  • SNPExpress, a database enabling researchers to input a SNP, gene, or a genomic region to investigate regions of interest for localized effects of SNPs on exon and gene level expression changes.
  • PolyBayes - SNP discovery from MarthLab

Structural variation

Copy Number Variation

Promoter analysis

  • ChIP-seq - DNA-protein interaction

Methylome sequencing

  • BS-seq - bisulfite treatment and sequencing

Small RNA profiling

mRNA expression profiling

Digital gene expression (DGE)

Other useful information sources