NGS RNASeq DE Exercise.5

From BITS wiki
Jump to: navigation, search

The rnaseqGene DESeq2 R-workflow


[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.4 | NGS_RNASeq_DE_Exercise.6 ]



The dataset used in this session was also selected by the authors of the DESeq2 package for their vignette[1]. We will therefore reproduce below key steps of the detailed DESeq2 training document to quickly perform them together. It is strongly suggested that you fully read the original document in order to understand each command and the rationale behind it.

Various bioconductor files related to DESeq2 are often updated and should better be retrieved at the time of repeating this training from the bioconductor repository[2].

An RNA-Seq workflow rnaseqGene on the Bioconductor website[3] (formerly the Beginner’s Guide PDF), covers similar material to this vignette but at a slower pace, including the generation of count matrices from FASTQ files.

If you refer to the document, you may jump Starting from SummarizedExperiment to match step into the part related to this exercise.

Technical.png The code below was almost literally copied from the published pages or adapted to match the previously created data


Standard Workflow (simplified!)

Mapping with tophat2 followed by read count using HTseq led to a number of count files (one per sample) that were saved to a local folder. We here take over in RStudio by loading this data in and using it to construct a large count table. The count table contains the number of sequence fragments that were assigned by the mapping to a certain gene and this for all genes in all samples.

Install and load required R-packages

These libraries are required in the current exercise.



Locate and load data


We create a metadata table that will help ordering, merging and comparing the results. Files have been created by HTSeq-count or STAR and can be found on the BITS server.

The goal of the analysis is to find DE genes (genes with different expression levels in the conditions that are being compared. Typically you have two groups of samples, one consisting of bological replicates that have received a control treatment, the other consisting of biological replicates that received a specific treatment. [R] needs to know the metadata of the experiment i.e. which data file belongs to which group. The metadata will then be used to assemble the groups and compute differential expression.

In the experiment that we are analyzing the data consists of four groups (treatments):

  • The dex group: RNA samples from 4 cell lines after treatment with the glucocorticoid dexamethasone (dex), used as astma medication
  • The alb group: RNA samples from the same cell lines after treatment with albuterol (alb), another common astma medication
  • The alb_dex group: RNA samples from the same cell lines after treatment with both astma medications
  • The untreated group: RNA samples from the same four untreated cell lines cultured in parallel.
So all samples come from the same 4 cell lines (cells).

#   run_accession  read_count  samples            cells    treatment
1   SRR1039508     22935521    N61311_untreated   N61311   untreated
2   SRR1039509     21155707    N61311_Dex         N61311   Dex
3   SRR1039510     22852619    N61311_Alb         N61311   Alb
4   SRR1039511     21938637    N61311_Alb_Dex     N61311   Alb_Dex
5   SRR1039512     28136282    N052611_untreated  N052611  untreated
6   SRR1039513     43356464    N052611_Dex        N052611  Dex
7   SRR1039514     40021727    N052611_Alb        N052611  Alb
8   SRR1039515     29454748    N052611_Alb_Dex    N052611  Alb_Dex
9   SRR1039516     30043024    N080611_untreated  N080611  untreated
10  SRR1039517     34298260    N080611_Dex        N080611  Dex
11  SRR1039518     30441200    N080611_Alb        N080611  Alb
12  SRR1039519     31533740    N080611_Alb_Dex    N080611  Alb_Dex
13  SRR1039520     34575286    N061011_untreated  N061011  untreated
14  SRR1039521     41152075    N061011_Dex        N061011  Dex
15  SRR1039522     28406812    N061011_Alb        N061011  Alb
16  SRR1039523     31099538    N061011_Alb_Dex    N061011  Alb_Dex

Metadata for our analysis

We will only do the comparison between dex and untreated samples, so the metadata file on the bits laptops only contains metadata for these samples.

The 'metadata' table is found in the RNASeq folder in the Documents folder on the BITS laptops. It is loaded into RStudio and used to build a design table

#set working directory####
#this is the folder where R will look for data and write files
#please adapt this location to match your own computer
basedir <- "C:/Users/VIB/Documents/RNASeq/"
#import metadata####
#this is a description of the samples and the groups you want to compare
#the metadata file is in the folder you have set as working directory
#you have to specify that the first row contains no data but column labels 
metadata <- read.table("GSE52778_metadata.txt", header = TRUE)
#add a column to the metadata table containing the names of the count table files
metadata$sampleFiles <- paste( metadata$run_accession, "_all_counts.txt", sep="")
#create a sample table in the format DEseq2 wants it####
#the sample table is a data frame (a special type of table in R) consisting of 4 columns:
#1. names of the runs
#2. names of the files containing the counts
#3. names of the cell lines
#4. treatment
sampleTable <- data.frame(sampleName = metadata$run_accession, 
                          fileName = metadata$sampleFiles, 
                          cells = metadata$cells, 
                          treatment = metadata$treatment)

The result is a table with these 4 columns and 8 rows, each row representing one of the samples that we want to compare.

  sampleName                  fileName   cells treatment
1 SRR1039508 SRR1039508_all_counts.txt  N61311 untreated
2 SRR1039509 SRR1039509_all_counts.txt  N61311       Dex
3 SRR1039512 SRR1039512_all_counts.txt N052611 untreated
4 SRR1039513 SRR1039513_all_counts.txt N052611       Dex
5 SRR1039516 SRR1039516_all_counts.txt N080611 untreated
6 SRR1039517 SRR1039517_all_counts.txt N080611       Dex
7 SRR1039520 SRR1039520_all_counts.txt N061011 untreated
8 SRR1039521 SRR1039521_all_counts.txt N061011       Dex


Loading and locating the counts

Load the merged HTSeq data into a DESeq2 object

DESeq2[4] is one of the most popular analysis package for RNASeq data today. Its normalization model is a negative binomial model and differs from the original DESeq. This model is considered well suited for modeling counts (discrete variable) obtained from HTSeq.

Since the 4 samples in each group come from the same 4 cell lines, the data sets are considered paired: the data in each group is related to the data in the other groups. This means that we have to include the pairing in the test for differential expression as this will allow taking into account the genetic background shared across observations of the same line.

The DESeq2 package has a specific method to read in individual HTSeq count files summarized in the sampleTable. We use this method here to be consistent and use the sampleTable created previously. To read in the files you just have to specify the folder that contains them. On the bits laptops the files are in the htseq_counts folder in the working directory.

In the metadata, the pairing was represented in the cells column whereas the treatment was represented in the treatment column.

This is why the design is defined as ~ cells + treatment. This is the way to represent the design of a paired experiment in R: ~ means "is modeled by" and is followed by all factors (both pairing and grouping factors). If you have more than one factor (as we have since we have a pairing and a grouping factor) you combine them using the + sign. In the case of DESeq2 the pairing variable has to be specified as the first factor whereas the grouping variables should go at the end of the design formula.

This design is essentially a matrix that provides a representation of the different RNA samples, telling R which samples you want to compare. If the data was unpaired and you just wanted to compare the two or more treatments the design would have been ~ treatment.

If you have more than one grouping variable, e.g. you measured each sample on two time points then your design would be ~ cells + treatment + time. This model assumes that the effect of the treatment is the same on the two time points. If you think that this is not the case you should add an interaction term to the design: ~ cells + treatment + time + treatment:time.

More info on defining designs can be found here. Defining the design for complex studies with multiple factors can be quite challenging, we advise you to consult a statistician or a bioinformatician to check your design.

#importing the data####
#Define the folder that contains the files so DESeq2 can find them.
folder <- "htseq_counts"
#Define the sampleTable created previously 
#so R DESeq2 knows the names of the files to load and the groups to compare.
#Define the design: tell DESeq2 how you want to do the comparison
#see wiki for more details on the design
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                  directory = folder,
                                  design = ~ cells + treatment)
#view the created object

Inspection of the DESeqDataSet

This creates a DESeqDataSet, a complex data type in DESeq2 that holds both counts table and metadata. The counts table has counts for 57773 genes in 8 samples.

class: DESeqDataSet
dim: 57773 8
metadata(1): version
assays(1): counts
rownames(57773): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(2): cells treatment

To view the annotation of the columns of the count table you can use the colData function.

# column information

DataFrame with 8 rows and 2 columns
              cells treatment
           <factor>  <factor>
SRR1039508   N61311 untreated
SRR1039509   N61311       Dex
SRR1039512  N052611 untreated
SRR1039513  N052611       Dex
SRR1039516  N080611 untreated
SRR1039517  N080611       Dex
SRR1039520  N061011 untreated
SRR1039521  N061011       Dex

Factors are categorical variables (non-numerical variables). They have to be transformed into factors, otherwise R can't process them. By default, R handles the levels (names of the categories) in alphabetical order, meaning that R would use Dex as the reference treatment (since D comes first in the alphabet compared to u). We want to use untreated as reference so we have to relevel the treatment factor.

# relevel to get untreated as reference
dds$treatment <- relevel(dds$treatment, "untreated")


Differential expression analysis

The DESeq function is a wrapper that performs a default analysis through four steps:

  • estimation of size factors
  • estimation of dispersions by Negative Binomial GLM fitting
  • calculation of log fold changes
  • testing for differential expression using the Wald test

Calculation of dispersions

Dispersions are measures of the biological variation in each group. They are estimated by fitting a negative binomial model to the raw counts of biological replicates.

# run the combined DESeq2 analysis
dds <- DESeq(dds)
# estimating size factors
# estimating dispersions
# gene-wise dispersion estimates
# mean-dispersion relationship
# final dispersion estimates
# fitting model and testing
# Dispersion plot and fitting alternatives


In the plot, dispersion estimates are plotted versus the mean normalized count of each gene (black dots). A curve (red) is fit to the resulting data cloud to capture the overall trend of dispersion-mean dependence. This curve is used to 'shrink' the dispersion coefficients towards the consensus represented by the red curve (blue points). The black points circled in blue are detected as dispersion outliers and not shrunk toward the curve. This shrinkage assumes that genes with similar expression levels should have similar dispersion.

Calculation of size factors

Size factors are related to library size normalization. They are estimated as follows:

  • Take the geometric mean over all samples for each gene and use that as the reference expression data set.
  • For each sample, get a list of quotients of each gene expression value to its reference expression.
  • The median of each condition quotient list is the normalization factor (=size factor) for that data set.

# get size factors used for normalization

SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
 1.0123484  0.8951975  1.1747041  0.6739964  1.1729844  1.4049307  0.9193101  0.9527669

Calculation of log fold changes

Next DESeq calculates log2 fold changes (log2FoldChange) using the normalized counts. A problem occurring in all RNA-Seq data sets is the fact that genes with low read counts have unrealistically high log fold changes. This phenomenon is a direct consequence of dealing with count data, in which ratios are inherently noisier when counts are low. DESeq handles this problem by shrinking log fold changes toward zero in a manner such that shrinkage is stronger when counts are low or dispersion is high. This is done by using a regularized logarithm transformation (rlog), which behaves similarly to a log2 transformation for genes with high counts, while shrinking the values for genes with low counts. To do an rlog transformation you should use the rlog() function on the raw dds object (so without running the DESeq() function on it). For a detailed description see section 3 of this page.

rld <- rlog(dds, blind=FALSE)
The DESeq2 developers state that the rlog transformation should be used for applications other than differential testing, for example clustering of samples or other machine learning applications. For differential testing they recommend the DESeq() function applied to raw counts. Since we are interested in DE we are going to do the latter.

Estimation of differential expression

The results file is large containing results for all 55773 genes see it is not a good idea to look at the full results table, use head() instead. Head() only shows the first rows of the file, you can specify the number of rows to show.

#store results into a new object
res <- results(dds)
#results, 4 first rows
head(res, 4)

This shows a file containing 6 columns

log2 fold change (MAP): treatment Dex vs untreated
Wald test p-value: treatment Dex vs untreated
DataFrame with 4 rows and 6 columns
                 baseMean log2FoldChange     lfcSE       stat       pvalue         padj
                <numeric>      <numeric> <numeric>  <numeric>    <numeric>    <numeric>
ENSG00000000003  698.9255    -0.38812002 0.1001493 -3.8754129 0.0001064441 0.0008072674
ENSG00000000005    0.0000             NA        NA         NA           NA           NA
ENSG00000000419  475.4853     0.21128035 0.1131837  1.8667038 0.0619429686 0.1695057180
ENSG00000000457  251.9015     0.01368663 0.1325385  0.1032653 0.9177523856 0.9590952696

The lfcSE column contains the standard errors of the log2FoldChange estimates.

DESeq performs the Wald test for differential expression, one test for each gene. The Wald test divides the estimate of the log fold change (log2FoldChange) by its standard error (lfcSE), resulting in the Wald statistic (stat). The statistic is compared to a standard normal distribution, resulting in a p-value (pvalue).

Since we perform multiple tests (one for each gene) on the same data set, the p-values must be adjusted for multiple testing. Adjustment is done using Benjamini and Hochberg's method to control the false discovery rate (FDR). The meaning of the adjusted p-value (padj) is as follows: if you select all genes with adjusted p-value below 0.05 as DE, then the expected proportion of false discoveries in the selected group is less than 5% (= FDR). So if you have 1000 DE genes in a total of 20000 genes, 50 of them will be false positives (5% of 1000) and will in reality not be DE.

When you report the results (list of DE genes) you should report the adjusted p-values !

Setting thresholds for differential expression

Some genes contain missing values (represented by NA). These are genes with zero counts in one or both groups of samples. When calculating the log fold change this gives an error since you cannot compute the log of zero. You can remove these genes from the results table by checking if the adjusted p=value equals NA using the function.

# keep only data with computed padj
resFilt <- res[!$padj),]

Check the first 4 rows of the filtered results to see if the gene with the missing p-values was removed.

log2 fold change (MAP): treatment Dex vs untreated
Wald test p-value: treatment Dex vs untreated
DataFrame with 4 rows and 6 columns
                 baseMean log2FoldChange     lfcSE       stat       pvalue         padj
                <numeric>      <numeric> <numeric>  <numeric>    <numeric>    <numeric>
ENSG00000000003 698.92549    -0.38812002 0.1001493 -3.8754129 0.0001064441 0.0008072674
ENSG00000000419 475.48532     0.21128035 0.1131837  1.8667038 0.0619429686 0.1695057180
ENSG00000000457 251.90150     0.01368663 0.1325385  0.1032653 0.9177523856 0.9590952696
ENSG00000000460  50.84756    -0.02705074 0.2608112 -0.1037177 0.9173933898 0.9589720870

The results file contains the genes in original order, it might be better to reorder them putting the most significant genes (lowest p-value) at the top.

#reorder the results by decreasing significance
resOrdered <- res[order(res$padj),]

Check the first 6 rows of the ordered results to see if the order of the genes was changed.

                  baseMean log2FoldChange     lfcSE      stat        pvalue          padj
                 <numeric>      <numeric> <numeric> <numeric>     <numeric>     <numeric>
ENSG00000165995   514.2841       3.321662 0.1307366  25.40728 2.095444e-142 3.170407e-138
ENSG00000152583   985.5593       4.340812 0.1760858  24.65169 3.529480e-134 2.670051e-130
ENSG00000120129  3325.4027       2.873150 0.1167734  24.60448 1.131211e-133 5.705074e-130
ENSG00000101347 13616.9348       3.606558 0.1517550  23.76566 7.569343e-125 2.863104e-121
ENSG00000189221  2294.7300       3.231983 0.1396172  23.14888 1.491976e-118 4.514719e-115
ENSG00000211445 12162.4869       3.540681 0.1573440  22.50281 3.895921e-112 9.824214e-109

Setting different thresholds for DE

#select genes with adjusted p-value < 0.1 that are at least two-fold DE
resFilt[abs(resFilt$log2FoldChange)>=1 & (resFilt$padj < 0.1),]

Select genes with adjusted p-value < 0.05 and store in an object called resSig

More information on results columns: information about which variables and tests were used can be found by calling the function mcols() on the results object.

# [1] "mean of normalized counts for all samples"
# [2] "log2 fold change (MAP): treatment Dex vs untreated"
# [3] "standard error: treatment Dex vs untreated"
# [4] "Wald statistic: treatment Dex vs untreated"
# [5] "Wald test p-value: treatment Dex vs untreated"
# [6] "BH adjusted p-values"


Detection outliers and filtering genes with very low counts

# We can summarize some basic statistics using the summary function.

This lists the number of genes that are considered up- and downregulated according to the Wald test.

DESeq uses the Cook's distance to detect outliers that do not fit the distributional assumptions of the model. An example of such an outlier would be a gene with single-digit counts for all samples, except one sample with a count in the thousands. As you can see in our data set no outliers were detected.

Additionally, DESeq filters out genes with very low counts (1 or 2 in a few samples) to reduce the number of tests that needs to be performed. This is why genes that have little or no chance of being detected as DE (genes with very low counts) are omitted from the testing. DESeq2 uses the average expression level of each gene, across all samples, as filter criterion, and it omits all genes with mean normalized counts below a filtering threshold. DESeq2 by default will choose a threshold that maximizes the number of genes found at a user-specified target FDR.

out of 28730 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 2467, 8.6%
LFC < 0 (down)   : 2155, 7.5%
outliers [1]     : 0, 0%
low counts [2]   : 11602, 40%
(mean count < 4)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

As you can see the FDR is set to 10% in the summary. If you think this is too high you can make the summary more stringent.

## being more stringent
summary(res, alpha=0.01)

out of 28730 with nonzero total read count
adjusted p-value < 0.01
LFC > 0 (up)     : 1562, 5.4%
LFC < 0 (down)   : 1284, 4.5%
outliers [1]     : 0, 0%
low counts [2]   : 13600, 47%
(mean count < 11.2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

How many genes are up- and down-regulated when you apply an FDR of 0.1% ?

Plotting results

MA plots using the plotMA command

MA plots show the estimated log fold changes (Y-axis) over average expression level (X-axis) with red points indicating genes with low adjusted p-values (DE genes).

# MA-plot
# legend: Points will be colored red if the adjusted p value is less than 0.1. 
#  Points which fall out of the window are plotted as open triangles pointing either up or down.
plotMA(res, main="MAplot after DESeq2 analysis", ylim=c(-2,2))



MA plots using ggplot2

It is possible to create the MA-plot in ggplot2, this gives you freedom to define the plot exactly as you want it. However, ggplot2 can't handle DESeqResults objects, it uses a simple data frame as an input.

#ggplot2 can't handle DESeqResults objects
ggplot(res, aes(x=baseMean,y=log2FoldChange)) + geom_point()

This gives the following error:

Error: ggplot2 doesn't know how to deal with data of class DESeqResults

To resolve this we need to transform the DESeqResults object into a data frame using the function.

results <-

Create a default scatter plot of the data using ggplot()

This gives a plot that looks nothing like the plot generated by plotMA.


First we'll fix the X-axis. To make it alike the plotMA() plot, we have to use a log10 scale on the X-axis and set its range to 0,1 to 300000.

#default scatter plot of the data
ggplot(results, aes(x=baseMean,y=log2FoldChange)) + geom_point() + scale_x_continuous(trans="log10",limits=c(0.1,300000))

This looks a lot better


Now colour the data points with low adjusted p-values (< 0.1) red

ggplot(results, aes(x=baseMean,y=log2FoldChange)) + geom_point(aes(colour=padj<0.1)) +
  scale_colour_manual(name='padj < 0.1',values=setNames(c('red','black'),c(T,F))) +


You see that you have lost some points especially in the region with small X-values. These are points with missing adjusted p-values. plotMA keeps these points in the plot coloring them black. The easiest way to fix this if to replace the missing p-values by 0.99

Create a new variable resFixed, a copy of results. In resFixed replace missing values by 0.99 and make the plot again

Now the plot looks much better


Downsize the dots to size 0.5

ggplot(resFixed, aes(x=baseMean,y=log2FoldChange)) + geom_point(aes(colour=padj<0.1),size=0.5) +
  scale_colour_manual(name='padj < 0.1',values=setNames(c('red','black'),c(T,F))) +

It's starting to become a very nice plot


The MA plot contains a red line that fits the data points and is obtained via non-linear regression using Loess smoothing. Loess divides the data set in the local subsets of points with similar x-values (i.e. close to each other on the plot). It fits a simple linear model to these subsets overall generating a non-linear curve consisting of small linear parts.
Add a red smoother line to the plot

ggplot(resFixed, aes(x=baseMean,y=log2FoldChange)) + geom_point(aes(colour=padj<0.1),size=0.5) +
  scale_colour_manual(name='padj < 0.1',values=setNames(c('red','black'),c(T,F))) +
  scale_x_continuous(trans="log10",limits=c(0.1,300000)) + geom_smooth(colour="red")


The MA plot contains a blue horizontal line which serves as a reference to the red smoother line. In theory the red curve and the blue line should coincide. Sine you expect that there are equal amounts of genes with high and genes with low log fold changes, the red curve (representing the average gene) should coincide with log fold change = 0 (the blue horizontal line Y = 0).
Add a blue horizonzal line to the plot

ggplot(resFixed, aes(x=baseMean,y=log2FoldChange)) + geom_point(aes(colour=padj<0.1),size=0.5) + 
  scale_colour_manual(name='padj < 0.1',values=setNames(c('red','black'),c(T,F))) +
  scale_x_continuous(trans="log10",limits = c(0.1,300000)) + geom_smooth(colour="red") +


Plot counts using the plotCounts() function

#retrieve the name of the best scoring gene (with the lowest padj)
#gene names are used as rownames in the results table
onegene <- rownames(res)[which.min(res$padj)]
#plot the counts for this gene


You can also plot the counts of a gene by using its gene name

plotCounts(dds, gene="ENSG00000000003", intgroup="treatment", main="ENSG00000000003")

Plot the counts of gene ENSG00000154734

You can also plot counts for each cell line instead of comparing between treatments

plotCounts(dds, gene=which.min(res$padj), intgroup="cells", main=onegene)


Plot the counts for each cell line of gene ENSG00000154734

Plot counts using ggplot2

An alternative way to transform a complex data type like a DESeqResults object (like res) or a DESeqDataSet (like dds) into a data frame that ggplot2 can interpret is to store the results of the plot functions into a variable

d <- plotCounts(dds,gene=which.min(res$padj),intgroup="treatment",main=onegene,returnData=TRUE)

As you can see d is a data frame.

Plot the counts of d in a default scatter plot.


Use a log10 transformed Y-scale.


Set tick marks on the Y-axis at 25 (start plot), 100 and 400

ggplot(d,aes(x=treatment,y=count)) + geom_point() + scale_y_continuous(trans="log10",breaks=c(25,100,400))


Add the name of the gene as the title to the plot

ggplot(d, aes(x=treatment, y=count)) + geom_point() + scale_y_continuous(trans="log10",breaks=c(25,100,400)) + 


If you want to colour the dots according to the cell line they comre from, you need to store this information into the d data frame. When we created d, we only stored info about the treatment by defining intgroup="treatment".

Plot the counts of d in a scatter plot with log transformed Y axis, tick marks and title. Colour the dots according to cell line.


Plot distribution of log fold changes using the hist() function

We will plot the distribution of the log2 fold changes of all genes that are considered DE according to the following threshold: padj < 0.05

#selection of these genes was already done: they were stored in resSig
#hist() plots a frequency distribution 
#xlim sets the range of the X-axis
#breaks specifies number of bins in the histogram
hist(resSig$log2FoldChange,xlim=c(-6,6),breaks=20,main="LFC distribution for genes with padj<0.05")

As you see in the plot, many DE genes have rather low log fold changes. Typically, scientists only report DE genes with minimally 2-fold, 3-fold or 4-fold changes.

Make a histogram of log fold changes of these genes.


Plot distribution of log fold changes using ggplot2

Create a default histogram of the log fold changes of resString

#default histogram
ggplot(resString,aes(x=log2FoldChange)) + geom_histogram()


Make a histogram with 20 bins

#default histogram
ggplot(resString,aes(x=log2FoldChange)) + geom_histogram(bins=20)


Set the range X-axis from -6 to 6


Volcano plot using the plot() function

Next we create a Volcano plot for the DE genes. The Volcano plot arranges the DE genes along dimensions of biological and statistical significance. The X-axis represents the log fold change between the two groups, and the Y-axis represents the adjusted p-value (on an inverse scale so smaller p-values appear higher up). The X-axis indicates biological relevance; the Y-axis indicates the statistical evidence, or reliability of the change in expression. The researcher can make judgements about the most promising candidates for follow-up studies, by trading off both these criteria by eye.

# Volcano plot for the subset with padj < 0.05
plot(resSig$log2FoldChange, 1-resSig$padj,xlim=c(-6,6),main="volcano plot for genes with padj<0.05")

Make the same plot but now for genes with 2-fold changes or higher and padj < 0.01



Volcano plot using ggplot2

Create a default scatter plot of the resString data


Use hollow points

# Volcano plot for the subset with padj < 0.05
ggplot(resString,aes(x=log2FoldChange,y=padj)) + geom_point(shape=1)


Invert the Y-axis

# Volcano plot for the subset with padj < 0.05
ggplot(resString,aes(x=log2FoldChange,y=padj)) + geom_point(shape=1) + scale_y_reverse()


Heat map using the heatmap() function

We will create a heat map showing the normalized counts of the 20 most significant genes (genes with the lowest padj).

#specify a color palette
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
#select normalized counts for 20 most significant genes
#view 20 most significant genes
#view gene names of 20 most significant genes
#select rows from dds of these genes
select <- dds[rownames(resOrdered[1:20,]),]
#show heat map of normalized counts for these genes
heatmap.2(counts(select,normalized=TRUE), col = hmcol, Rowv = FALSE, Colv = FALSE,
          scale="none", dendrogram="none", trace="none", margin=c(10,6))

Make the same plot for the 10 most significant genes.

Heat map using ggplot2

We need to reformat the data to make a tile plot. Now the data is in the following format:

                SRR1039508 SRR1039509  SRR1039512 SRR1039513
ENSG00000165995   76.06077   861.2625   114.07128   1275.971
ENSG00000120129  637.13244  5499.3451  1004.50828   6608.344
ENSG00000152583   57.29253  2267.6560    80.02015   1724.045
ENSG00000101347 1700.00763 20487.0991  1913.67340  31566.934

This is called wide format. However, ggplot expects data in long format like this:

1  ENSG00000165995 SRR1039508    76.06077
2  ENSG00000120129 SRR1039508   637.13244
3  ENSG00000152583 SRR1039508    57.29253
4  ENSG00000101347 SRR1039508  1700.00763

To transform data from wide to long format we can use the melt() function of the reshape2 package:

datamelted <- melt(data)

This indeed gets the data in the right format

> datamelted
              Var1       Var2       value
1  ENSG00000165995 SRR1039508    76.06077
2  ENSG00000120129 SRR1039508   637.13244

Now make the default tile plot with ggplot2

ggplot(datamelted,aes(x=Var2,y=Var1)) + geom_tile(aes(fill=value))


To change the colors of the color gradient, you can define a color for low values and a color for high values:

ggplot(datamelted,aes(x=Var2,y=Var1)) + geom_tile(aes(fill=value)) + scale_fill_gradient(low="white",high="steel blue")


Color the border of the tiles white


The X-axis is still pretty messy with overlapping sample names. To print the sample names vertically use a theme element:

ggplot(datamelted,aes(x=Var2,y=Var1)) + geom_tile(aes(fill=value),colour="white") + 
    scale_fill_gradient(low="white",high="steel blue") + theme(axis.text.x=element_text(angle=90))


Exporting results to HTML or CSV files

We use write.csv() to write a data frame to a csv (comma separated values) file. The file we want to save is the ordered results: resOrdered. We have to check first if this is a data frame using the class() function.

write.csv(, file="Dex_vs_untreated_results.csv")

Since it's not a data frame we transform it into a data frame using Using the file argument we provide a name for the csv file. The write.csv() will now write the resOrdered data frame to a file called Dex_vs_untreated_results.csv in your working directory. If you want to save it elsewhere you have to provide a path on your computer instead of a file name.

Data Transformation and Visualization

Data transformation can be applied in different ways to compensate for the dispersion at low and high expression ranges. These methods were developed after the first wave of RNASeq analysis to try compensate for effects associated with this type of data. Remember that weakly expressed genes show much higher log fold changes than strongly expressed genes. This phenomenon, seen in most RNASeq datasets, is a direct consequence of dealing with count data, in which ratios are inherently noisier when counts are low.

Therefore, it is useful to transform data to diminish this effect. If data are used on the original count scale, the result will be dominated by highly expressed, highly variable genes; if logarithm-transformed data are used, undue weight will be given to weakly expressed genes, which show exaggerated LFCs, as discussed above. Therefore, DESeq2 implements a regularized logarithm transformation (rlog), which behaves similarly to a log2 transformation for genes with high counts, while shrinking the values for genes with low counts.

Count data transformation

We present here two log fold change shrinkage methods used in DESq2:

  • Regularized log transformation
  • Variance stabilizing transformation

Regularized log transformation

# Regularized log transformation
rld <- rlog(dds)

class: SummarizedExperiment
dim: 57773 8
assays(1): ''
rownames(57773): ENSG00000000003 ENSG00000000005 ... ENSG00000273492 ENSG00000273493
rowData metadata column names(47): baseMean baseVar ... dispFit rlogIntercept
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(3): cells treatment sizeFactor

Variance stabilizing transformation

# Variance stabilizing transformation
vsd <- varianceStabilizingTransformation(dds)

class: SummarizedExperiment
dim: 57773 8
assays(1): ''
rownames(57773): ENSG00000000003 ENSG00000000005 ... ENSG00000273492 ENSG00000273493
rowData metadata column names(46): baseMean baseVar ... dispGeneEst dispFit
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(3): cells treatment sizeFactor

# create matrix for saving or further use
rlogMat <- assay(rld)

                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
ENSG00000000003   9.385815   9.111922   9.485295   9.293740   9.745669   9.502458   9.584332
ENSG00000000005   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
ENSG00000000419   8.790046   9.016464   8.887312   8.928876   8.843568   8.958822   8.761465
ENSG00000000457   7.979714   7.978258   7.915768   7.997564   7.953955   7.986459   8.036356
ENSG00000000460   5.702651   5.735228   5.226608   5.562887   5.683232   5.454195   5.909199
ENSG00000000938  -2.199605  -2.199305  -2.164748  -2.198557  -2.164706  -2.200320  -2.199371
ENSG00000000003   9.285416
ENSG00000000005   0.000000
ENSG00000000419   8.921048
ENSG00000000457   7.958014
ENSG00000000460   5.743980
ENSG00000000938  -2.199459

# create matrix for saving or further use
vstMat <- assay(vsd)

                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
ENSG00000000003   9.434437   9.016021   9.582868   9.294533   9.964458   9.608177   9.729533
ENSG00000000005   4.063874   4.063874   4.063874   4.063874   4.063874   4.063874   4.063874
ENSG00000000419   8.837919   9.175165   8.983583   9.045680   8.918437   9.089329   8.794405
ENSG00000000457   8.156567   8.154439   8.062614   8.183129   8.118785   8.166373   8.239793
ENSG00000000460   6.417750   6.465505   5.723628   6.213135   6.389237   6.066048   6.710763
ENSG00000000938   4.063874   4.063874   4.388676   4.063874   4.388913   4.063874   4.063874
ENSG00000000003   9.282765
ENSG00000000005   4.063874
ENSG00000000419   9.033742
ENSG00000000457   8.124620
ENSG00000000460   6.477112
ENSG00000000938   4.063874


Data quality assessment by sample clustering and visualization

Clustering is often used to show sample segregation and identify co-expressed genes across groups. These analyses are done on rlog transformed data as recommended by the developers. Example code below can be adapted to fit your needs and create publication-ready figures with some efforts and patience.

Heatmap of the sample-to-sample distances

Handicon.png Note the 'elegant' R trick to change names from unreadable IDs to something much nicer

# we need to transpose the data first with t()
distsRL <- dist(t(assay(rld)))
# store in to matrix
mat <- as.matrix(distsRL)
# rename samples
rownames(mat) <- colnames(mat) <- with(colData(dds),
          paste(treatment, cells, sep=" : "))
# cluster pairwise
hc <- hclust(distsRL)
# plot heatmap
          col = rev(hmcol),
          margin=c(13, 13)



Principal component plot of the samples

# simple plot without grouping
plotPCA(rld, intgroup=c("treatment", "cells"))

# It is also possible to customize the PCA plot using the ggplot function.
data <- plotPCA(rld, intgroup=c("treatment", "cells"), returnData=TRUE)
percentVar <- round(100 * attr(data, "percentVar"))
ggplot(data, aes(PC1, PC2, color=treatment, shape=cells)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance"))



Filtering and testing to create candidate DE gene lists

It is now time to restrict the transcriptome to the part that shows apparent differential expression under the assumptions defined by the user. Some cases require specific testing introduced below but for most cases, a simple subsetting of the full transformed table will be enough (see ?subset()).

Tests of log2 fold change above or below a threshold

In some cases, one only wants to know about Up-regulated or Down-regulated genes and therefore is interested by single side testing.

Other ways of testing by choosing the minimal effect size and direction (tails) are proposed.

  • greaterAbs - |beta| > lfcThreshold - tests are two-tailed
  • lessAbs - |beta| < lfcThreshold - p values are the maximum of the upper and lower tests
  • greater - beta > lfcThreshold
  • less - beta < lfcThreshold

ddsNoPrior <- DESeq(dds, betaPrior=FALSE)
# compute four tests
resGA <- results(dds, lfcThreshold=.5, altHypothesis="greaterAbs")
resLA <- results(ddsNoPrior, lfcThreshold=.5, altHypothesis="lessAbs")
resG <- results(dds, lfcThreshold=.5, altHypothesis="greater")
resL <- results(dds, lfcThreshold=.5, altHypothesis="less")
# plot MA for each test
yl <- c(-2.5,2.5)
plotMA(resGA, ylim=yl)
plotMA(resLA, ylim=yl)
plotMA(resG, ylim=yl)
plotMA(resL, ylim=yl)



A Magic Moment with RStudio

If all your code validated during the manual run, check what comes after clicking on Knit PDF (PDF)

Mortasecca.png Warning: Now you will not say that the trainer is a command-line addict anymore. RStudio IS a GREAT piece of GUI that 'almost' makes R 'easy' ... !!

Acknowledging R

This exercise was possible thanks to the DESeq2 vignette and to many packages listed below


R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
 [1] gplots_2.16.0             RColorBrewer_1.1-2        vsn_3.34.0              
 [4] Biobase_2.26.0            ggplot2_1.0.0             DESeq2_1.6.3            
 [7] RcppArmadillo_0.4.650.1.1 Rcpp_0.11.5               GenomicRanges_1.18.4    
[10] GenomeInfoDb_1.2.4        IRanges_2.0.1             S4Vectors_0.4.0          
[13] BiocGenerics_0.12.1       stringr_0.6.2             corrgram_1.7            

loaded via a namespace (and not attached):
 [1] acepack_1.3-3.3       affy_1.44.0           affyio_1.34.0         annotate_1.44.0      
 [5] AnnotationDbi_1.28.1  base64enc_0.1-2       BatchJobs_1.5         BBmisc_1.9          
 [9] BiocInstaller_1.16.2  BiocParallel_1.0.3    bitops_1.0-6          brew_1.0-6          
[13] caTools_1.17.1        checkmate_1.5.1       cluster_2.0.1         codetools_0.2-11    
[17] colorspace_1.2-6      DBI_0.3.1             digest_0.6.8          fail_1.2            
[21] foreach_1.4.2         foreign_0.8-63        Formula_1.2-0         gclus_1.3.1          
[25] gdata_2.13.3          genefilter_1.48.1     geneplotter_1.44.0    grid_3.1.3          
[29] gtable_0.1.2          gtools_3.4.1          Hmisc_3.15-0          iterators_1.0.7      
[33] KernSmooth_2.23-14    lattice_0.20-30       latticeExtra_0.6-26   limma_3.22.7        
[37] locfit_1.5-9.1        MASS_7.3-39           munsell_0.4.2         nnet_7.3-9          
[41] plyr_1.8.1            preprocessCore_1.28.0 proto_0.3-10          reshape2_1.4.1      
[45] rpart_4.1-9           RSQLite_1.0.0         scales_0.2.4          sendmailR_1.2-1      
[49] seriation_1.0-14      splines_3.1.3         survival_2.38-1       tools_3.1.3          
[53] TSP_1.0-10            XML_3.98-1.1          xtable_1.7-4          XVector_0.6.0        
[57] zlibbioc_1.12.0

download exercise files

Download exercise files here.

Use the right application to open the files present in ex5-files


[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.4 | NGS_RNASeq_DE_Exercise.6 ]