Next-generation sequencing

From BITS wiki



High-throughput sequencers, also called 'next-generation' ('next-gen' or 'NGS') or sometimes 'second-generation' sequencers (as opposed to third-generation ones), are technologies that deliver 10⁵ to several 10⁶ DNA reads, covering millions of bases. They are used to (re)sequence genomes, to determine the DNA-binding sites of proteins (ChIP-seq) and to sequence transcriptomes (RNA-seq).

These technologies bring analysis of sequence information to another level. Rethinking experiments is crucial.

Manufacturers and technologies

Databases of reads

Data formats

When you have millions of reads, you want to replace them with a more compact representation as soon as possible, since a single read on its own does not contain relevant information. Merging overlapping reads (assembly) can already reduce the data size considerably. If a reference genome is available, you can align the reads to it (mapping) and store only the positions, the counts and the sequence deviations from that reference.
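The data reduction that mapping offers can be sketched in a few lines of Python. This is a toy illustration only: it assumes ungapped placement by brute force, and the names `map_read`, `summarise` and the `max_mismatches` parameter are hypothetical, not part of any mapper mentioned on this page. Real mappers use indexes and handle gaps, base qualities and both strands.

```python
from collections import Counter

def map_read(read, reference, max_mismatches=1):
    """Return (position, deviations) for the best ungapped placement, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        # Each deviation is (reference position, reference base, read base).
        diffs = [(pos + i, ref_base, read_base)
                 for i, (ref_base, read_base) in enumerate(zip(window, read))
                 if ref_base != read_base]
        if len(diffs) <= max_mismatches and (best is None or len(diffs) < len(best[1])):
            best = (pos, diffs)
    return best

def summarise(reads, reference, max_mismatches=1):
    """Reduce reads to start-position counts and counts of deviations."""
    counts = Counter()
    deviations = Counter()
    for read in reads:
        hit = map_read(read, reference, max_mismatches)
        if hit is None:
            continue  # unmapped read: nothing is stored for it in this sketch
        pos, diffs = hit
        counts[pos] += 1
        for deviation in diffs:
            deviations[deviation] += 1
    return counts, deviations

reference = "ACGTACGTTAGC"
reads = ["ACGTAC", "ACGTAC", "GTTAGC", "GTTGGC", "TTTTTT"]
counts, deviations = summarise(reads, reference)
print(dict(counts))      # start position -> number of reads
print(dict(deviations))  # (position, reference base, read base) -> number of reads
```

Five reads are reduced to two start positions with their counts and one recorded deviation; the read that cannot be placed is dropped.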

  • MAQ .map format (a compressed binary file format designed specifically for short-read alignments)
  • AMOS (A Modular Open-Source Assembler): assembly format, used by Velvet
  • SRF: Sequence Read Format (also called Short Read Format); related tools: solid2srf, illumina2srf
  • MINSEQE: Minimum Information about a high-throughput SeQuencing Experiment
  • FASTQ: a common format for short reads with quality scores. EMBOSS 6.1.0 supports it as a sequence format. EMBOSS also uses the quality scores if the format is named more explicitly: fastqsanger or fastqillumina
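As an illustration of why the FASTQ variant matters, the sketch below reads four-line FASTQ records and decodes the quality string into Phred scores. It assumes that the Sanger variant encodes scores with an ASCII offset of 33 and the Illumina 1.3+ variant with an offset of 64. The function names `read_fastq` and `phred_scores` are hypothetical, and the parser does not handle sequences wrapped over several lines.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file (or a blank line) ends the sketch parser
        sequence = handle.readline().rstrip()
        handle.readline()  # the '+' separator line is skipped
        quality = handle.readline().rstrip()
        if not header.startswith("@"):
            raise ValueError("FASTQ header must start with '@': %r" % header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ in %r" % header)
        yield header[1:], sequence, quality

def phred_scores(quality, offset=33):
    """Decode a quality string; offset 33 assumed for Sanger, 64 for Illumina 1.3+."""
    return [ord(char) - offset for char in quality]

example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
name, sequence, quality = records[0]
print(name, sequence, phred_scores(quality))  # scores decoded with the Sanger offset
```

Decoding the same quality string with the wrong offset shifts every score by 31, which is why the explicit format name is needed before quality scores can be trusted.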

Standard Analysis Workflow of HTS DNA reads

Assembly

Depending on the sample, reads may be assembled before being mapped to a reference genome (if one exists). Assembly merges overlapping sequences into a single sequence and is computationally very demanding.

  • MIRA - also part of the EMBASSY package.
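To make "merging overlapping sequences" concrete, here is a minimal greedy sketch that repeatedly joins the two sequences with the longest exact suffix-prefix overlap. It is not how MIRA or any production assembler works: it assumes error-free reads from one strand, and the names `overlap`, `greedy_assemble` and `min_length` are hypothetical. The all-against-all comparison in every round also hints at why assembly is so demanding.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if too short)."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads, min_length=3):
    """Merge the best-overlapping pair until no overlap of min_length remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_length, best_i, best_j = 0, None, None
        # Compare every ordered pair of sequences: quadratic work per round.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    length = overlap(a, b, min_length)
                    if length > best_length:
                        best_length, best_i, best_j = length, i, j
        if best_length == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_length:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)

reads = ["ACGTAC", "TACGGA", "GGATTC"]
contigs = greedy_assemble(reads)
print(contigs)
```

In this example the three 6-base reads collapse into a single 12-base contig.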

Mapping

See Mappers.

Quality Assessment

High-throughput sequencing data contain a fair number of errors. To distinguish sequencing errors from genuine sequence differences, sufficient sequencing depth is needed, together with a good quality assessment of the reads.
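The interplay between base quality and depth can be made concrete with a small calculation. The sketch below assumes the standard Phred definition (error probability 10^(-Q/10)), the usual average-depth estimate (reads × read length / genome size) and independent errors between reads. The function names and the example numbers are illustrative assumptions, not values from this page.

```python
from math import comb

def error_probability(phred):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-phred / 10)

def expected_depth(read_count, read_length, genome_size):
    """Average number of reads covering a base, assuming uniform coverage."""
    return read_count * read_length / genome_size

def prob_at_least(k, depth, error_rate):
    """Chance that k or more of `depth` reads are wrong at one position
    (binomial tail, assuming independent errors)."""
    return sum(comb(depth, i) * error_rate ** i * (1 - error_rate) ** (depth - i)
               for i in range(k, depth + 1))

depth = expected_depth(read_count=1_000_000, read_length=50, genome_size=5_000_000)
per_base_error = error_probability(20)
print(depth, per_base_error)
# With 20 reads over a position, how likely are 5 or more erroneous calls?
print(prob_at_least(5, 20, per_base_error))
```

A deviation seen in a single read is easily explained by error, while the same deviation in several of the reads covering a position is very unlikely to be error alone; this is why depth and quality assessment go together.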

Tools

Tool | NGS pipeline part | RPM? | Link to source
Cgatools | Format conversion - Analysis: SNP detection | No | http://cgatools.sourceforge.net/
FastQC | Quality control | No | http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
Fastxtoolkit | Quality control | Yes [1] and dep. [2] | http://hannonlab.cshl.edu/fastx_toolkit/
Filo | Format conversion | U.D. [3] | https://github.com/arq5x/filo/
HiTec | Quality control | No | http://www.csd.uwo.ca/~ilie/HiTEC/
Picard-tools | Format conversion | U.D. | http://picard.sourceforge.net/command-line-overview.shtml
Pindel | Analysis: SV | No | https://trac.nbic.nl/pindel/
Samstat | Quality control | No | http://samstat.sourceforge.net/
Samtools | Format conversion | Yes (rpmsearch) | http://samtools.sourceforge.net/
Bamtools | Format conversion | No | https://github.com/pezmaster31/bamtools
Bamview | Visualisation | No | http://bamview.sourceforge.net/
Tabix | Format conversion | U.D. [4] | http://sourceforge.net/projects/samtools/files/
Vcftools | Format conversion - Data mining | U.D. [5] | http://vcftools.sourceforge.net/
Genomeanalysis TK | Quality control - Analysis | No | http://www.broadinstitute.org/gsa/wiki/index.php/Main_Page#The_Genome_Analysis_Toolkit_.28GATK.29
pe-asm | Assembling | No | http://code.google.com/p/pe-asm/
perl-Bio-SamTools | Format conversion | Yes [6] | http://code.google.com/p/pe-asm/
prinseq | Quality control | No | http://prinseq.sourceforge.net
Kent Tools | Format conversion | No | http://hgwdev.cse.ucsc.edu/~kent/
Bedtools | Data mining | Yes [7] |
Repeatmasker | Cleaning | No | http://www.repeatmasker.org/

See also this NBIC wiki page

HTS data analysis packages

Using R packages

Manuals for HT Sequence Analysis with R and Bioconductor

  1. ShortRead - quality assessment of the reads, finding duplicates, trimming, string pattern searches [8]
  2. Biostrings - reading sequences in R [9]
  3. BSgenome - reading in complete genomes and BioC annotation data [10]
  4. DEGseq - identifying differentially expressed genes from RNA-seq data [11]
  5. IRanges - infrastructure for positional data
  6. biomaRt - interface to BioMart annotations
  7. rtracklayer - interface to online and other genome browsers
  8. chipseq & ChIPpeakAnno - ChIP-seq analysis

Visualisation

Stand-alone viewing tools for high-throughput sequencing data

Purposes of HT sequencing

Transcriptome sequencing

Also called RNA-seq

SNP discovery

Sometimes referred to as "SNP-seq"

  • GigaBayes
  • SNPExpress, a database enabling researchers to input a SNP, gene, or a genomic region to investigate regions of interest for localized effects of SNPs on exon and gene level expression changes.
  • PolyBayes - SNP discovery from MarthLab

Structural variation

Copy Number Variation

Promoter analysis

  • ChIP-seq - DNA-protein interaction

Methylome sequencing

  • BS-seq - bisulfite treatment and sequencing

Small RNA profiling

mRNA expression profiling

Digital gene expression (DGE)

Other useful information sources