Overview of RNA-seq analysis for differential gene expression

From BITS wiki


work in progress

RNA-seq can serve different purposes, listed below. Set the goal of your RNA-seq analysis before generating the samples.

  • detection of novel genes to complete the transcriptome or assist in genome annotation
  • detection of novel isoforms of genes
  • detection of differentially expressed genes between samples
  • detection of differentially expressed exons between samples
  • detection of differentially spliced genes between samples

In the tutorial below, we demonstrate how RNA-seq analysis for differential gene expression is performed at BITS.

General remarks about quantifying gene expression by RNA-seq

The ultimate purpose is to estimate the differences in transcript concentration between tissues (or cell types), on the premise that a difference in concentration leads to a biological effect in the tissues under investigation.

Between identical samples, there is a natural fluctuation in the concentration of the transcripts originating from a given gene. This concentration is commonly referred to as the 'level of gene expression'. It is assumed that the concentration of transcripts, or the level of gene expression, is normally distributed (a continuous distribution). So, identical cells contain on average an identical concentration of transcripts.

RNA-seq is a random high-throughput sequencing of the pool of RNA molecules, via a cDNA intermediate. The outcome of this process is a collection of 'reads': short strings representing parts of the sequence of the transcripts in the sample. Per sample, RNA-seq typically generates from about 5 million to 100 million reads, but there is no fixed upper limit.

Because sequencing is a random sampling of the RNA pool, transcripts with a higher concentration are represented by more reads in the final catalogue of reads. This is the basis for differential gene expression analysis using RNA-seq, although the raw read numbers have to be corrected before we can draw conclusions. So the first step is to assign every read in our set of reads to one transcript. We can do this if the genome sequence is known, and even if it is not known (de novo assembly). Each transcript is then said to have a certain number of 'reads'. The same applies when we gather all isoforms (different versions of transcripts) per gene: every gene is said to be represented by a certain count of reads or, in short, a 'count'.
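As a minimal sketch of the counting step described above, the Python snippet below tallies reads per transcript and then sums the isoform tallies per gene. The read assignments, the transcript and gene names, and the helper functions are all hypothetical and only serve as illustration; in a real analysis the assignment of reads to transcripts comes from the mapping step and the transcript-to-gene relation comes from the genome annotation.

```python
from collections import Counter

# Hypothetical input: the transcript each read was assigned to (one entry per read).
read_to_transcript = ["tx1a", "tx1a", "tx1b", "tx2a", "tx1b", "tx2a", "tx2a", "tx3a"]

# Hypothetical annotation: which gene each transcript (isoform) belongs to.
transcript_to_gene = {"tx1a": "gene1", "tx1b": "gene1", "tx2a": "gene2", "tx3a": "gene3"}


def count_per_transcript(assignments):
    """Return the number of reads assigned to each transcript."""
    return Counter(assignments)


def count_per_gene(transcript_counts, tx2gene):
    """Sum the read counts of all isoforms that belong to the same gene."""
    gene_counts = Counter()
    for transcript, n_reads in transcript_counts.items():
        gene_counts[tx2gene[transcript]] += n_reads
    return gene_counts


transcript_counts = count_per_transcript(read_to_transcript)
gene_counts = count_per_gene(transcript_counts, transcript_to_gene)

for gene in sorted(gene_counts):
    print(gene, gene_counts[gene])
```

Every read contributes to exactly one transcript and therefore to exactly one gene, so the gene counts add up to the total number of assigned reads.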

Stochastic nature of sampling:

  • Low counts: high variability relative to the mean, lower variance
  • Higher counts: lower variability relative to the mean, higher variance
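The effect of random sampling on low and high counts can be illustrated with a small simulation. The sketch below repeatedly "sequences" the same hypothetical RNA pool by drawing reads at random in proportion to transcript concentration, then compares the variance and the coefficient of variation (standard deviation divided by the mean) of a rare and an abundant transcript. The pool composition, the number of reads and the number of replicates are arbitrary assumptions chosen to keep the example fast; this is not the procedure used in the tutorial.

```python
import random
import statistics


def simulate_counts(fractions, n_reads, n_replicates, seed=0):
    """Sample the same pool n_replicates times; return a list of counts per transcript."""
    rng = random.Random(seed)
    names = list(fractions)
    weights = [fractions[name] for name in names]
    counts = {name: [] for name in names}
    for _ in range(n_replicates):
        # Each read originates from a transcript with probability
        # proportional to that transcript's concentration in the pool.
        reads = rng.choices(names, weights=weights, k=n_reads)
        for name in names:
            counts[name].append(reads.count(name))
    return counts


def summarize(values):
    """Return mean, variance and coefficient of variation of a list of counts."""
    mean = statistics.mean(values)
    variance = statistics.variance(values)
    cv = variance ** 0.5 / mean
    return mean, variance, cv


# Hypothetical pool: one rare transcript, one abundant transcript, and everything else.
fractions = {"low": 0.005, "high": 0.5, "rest": 0.495}
counts = simulate_counts(fractions, n_reads=2000, n_replicates=200)

low_mean, low_var, low_cv = summarize(counts["low"])
high_mean, high_var, high_cv = summarize(counts["high"])

print(f"low : mean={low_mean:.1f} variance={low_var:.1f} CV={low_cv:.3f}")
print(f"high: mean={high_mean:.1f} variance={high_var:.1f} CV={high_cv:.3f}")
```

In such a simulation the abundant transcript shows the larger variance in absolute terms, while the rare transcript fluctuates more relative to its own mean.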

Experimental design

Quality control of the raw data

Mapping

Quality control of the mapping

Count table extraction

Quality control of the count table

Normalization of the count table

  • including QC: sample clustering

Testing for differential expression

Visualisation of the data

Extracting biological meaning from gene expression differences