The rnaseqGene DESeq2 R-workflow

[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.4 | NGS_RNASeq_DE_Exercise.6 ]

1 Introduction
2 Standard Workflow (simplified!)
3 Data Transformation and Visualization
- 3.1 Count data transformation
  - 3.1.1 Regularized log transformation
  - 3.1.2 Variance stabilizing transformation
- 3.2 Data quality assessment by sample clustering and visualization
  - 3.2.1 Heatmap of the sample-to-sample distances
  - 3.2.2 Principal component plot of the samples
4 Filtering and testing to create candidate DE gene lists
- 4.1 Tests of log2 fold change above or below a threshold
5 A Magic Moment with RStudio
6 Acknowledging R
7 download exercise files

Introduction

The dataset used in this session was also selected by the authors of the DESeq2 package for their vignette^[1]. We will therefore reproduce below key steps of the detailed DESeq2 training document to quickly perform them together. It is strongly suggested that you fully read the original document in order to understand each command and the rationale behind it.

Various bioconductor files related to DESeq2 are often updated and should better be retrieved at the time of repeating this training from the bioconductor repository^[2].

An RNA-Seq workflow rnaseqGene on the Bioconductor website^[3] (formerly the Beginner’s Guide PDF), covers similar material to this vignette but at a slower pace, including the generation of count matrices from FASTQ files.

If you refer to the http://www.bioconductor.org/help/workflows/rnaseqGene/ document, you may jump Starting from SummarizedExperiment to match step into the part related to this exercise.

The code below was almost literally copied from the published pages or adapted to match the previously created data

Standard Workflow (simplified!)

Mapping with tophat2 followed by read count using HTseq led to a number of count files (one per sample) that were saved to a local folder. We here take over in RStudio by loading this data in and using it to construct a large count table. The count table contains the number of sequence fragments that were assigned by the mapping to a certain gene and this for all genes in all samples.

Install and load required R-packages

These libraries are required in the current exercise.

library("DESeq2")
library("ggplot2")
library("vsn")
library("RColorBrewer")
library("gplots")
library("reshape2")

Locate and load data

Metadata

We create a metadata table that will help ordering, merging and comparing the results. Files have been created by HTSeq-count or STAR and can be found on the BITS server.

The goal of the analysis is to find DE genes (genes with different expression levels in the conditions that are being compared. Typically you have two groups of samples, one consisting of bological replicates that have received a control treatment, the other consisting of biological replicates that received a specific treatment. [R] needs to know the metadata of the experiment i.e. which data file belongs to which group. The metadata will then be used to assemble the groups and compute differential expression.

In the experiment that we are analyzing the data consists of four groups (treatments):

The dex group: RNA samples from 4 cell lines after treatment with the glucocorticoid dexamethasone (dex), used as astma medication
The alb group: RNA samples from the same cell lines after treatment with albuterol (alb), another common astma medication
The alb_dex group: RNA samples from the same cell lines after treatment with both astma medications
The untreated group: RNA samples from the same four untreated cell lines cultured in parallel.

So all samples come from the same 4 cell lines (cells).

# run_accession read_count samples cells treatment
1 SRR1039508 22935521 N61311_untreated N61311 untreated
2 SRR1039509 21155707 N61311_Dex N61311 Dex
3 SRR1039510 22852619 N61311_Alb N61311 Alb
4 SRR1039511 21938637 N61311_Alb_Dex N61311 Alb_Dex
5 SRR1039512 28136282 N052611_untreated N052611 untreated
6 SRR1039513 43356464 N052611_Dex N052611 Dex
7 SRR1039514 40021727 N052611_Alb N052611 Alb
8 SRR1039515 29454748 N052611_Alb_Dex N052611 Alb_Dex
9 SRR1039516 30043024 N080611_untreated N080611 untreated
10 SRR1039517 34298260 N080611_Dex N080611 Dex
11 SRR1039518 30441200 N080611_Alb N080611 Alb
12 SRR1039519 31533740 N080611_Alb_Dex N080611 Alb_Dex
13 SRR1039520 34575286 N061011_untreated N061011 untreated
14 SRR1039521 41152075 N061011_Dex N061011 Dex
15 SRR1039522 28406812 N061011_Alb N061011 Alb
16 SRR1039523 31099538 N061011_Alb_Dex N061011 Alb_Dex

Metadata for our analysis

We will only do the comparison between dex and untreated samples, so the metadata file on the bits laptops only contains metadata for these samples.

The 'metadata' table is found in the RNASeq folder in the Documents folder on the BITS laptops. It is loaded into RStudio and used to build a design table

#set working directory####
#this is the folder where R will look for data and write files
#please adapt this location to match your own computer
basedir <- "C:/Users/VIB/Documents/RNASeq/"
setwd(basedir)
 
#import metadata####
#this is a description of the samples and the groups you want to compare
#see http://wiki.bits.vib.be/index.php/NGS_RNASeq_DE_Exercise.5#Locate_and_load_data
#the metadata file is in the folder you have set as working directory
#you have to specify that the first row contains no data but column labels 
metadata <- read.table("GSE52778_metadata.txt", header = TRUE)
metadata
 
#add a column to the metadata table containing the names of the count table files
metadata$sampleFiles <- paste( metadata$run_accession, "_all_counts.txt", sep="")
metadata
 
#create a sample table in the format DEseq2 wants it####
#the sample table is a data frame (a special type of table in R) consisting of 4 columns:
#1. names of the runs
#2. names of the files containing the counts
#3. names of the cell lines
#4. treatment
sampleTable <- data.frame(sampleName = metadata$run_accession, 
                          fileName = metadata$sampleFiles, 
                          cells = metadata$cells, 
                          treatment = metadata$treatment)
 
sampleTable

The result is a table with these 4 columns and 8 rows, each row representing one of the samples that we want to compare.

sampleName fileName cells treatment
1 SRR1039508 SRR1039508_all_counts.txt N61311 untreated
2 SRR1039509 SRR1039509_all_counts.txt N61311 Dex
3 SRR1039512 SRR1039512_all_counts.txt N052611 untreated
4 SRR1039513 SRR1039513_all_counts.txt N052611 Dex
5 SRR1039516 SRR1039516_all_counts.txt N080611 untreated
6 SRR1039517 SRR1039517_all_counts.txt N080611 Dex
7 SRR1039520 SRR1039520_all_counts.txt N061011 untreated
8 SRR1039521 SRR1039521_all_counts.txt N061011 Dex

Loading and locating the counts

Load the merged HTSeq data into a DESeq2 object

DESeq2^[4] is one of the most popular analysis package for RNASeq data today. Its normalization model is a negative binomial model and differs from the original DESeq. This model is considered well suited for modeling counts (discrete variable) obtained from HTSeq.

Since the 4 samples in each group come from the same 4 cell lines, the data sets are considered paired: the data in each group is related to the data in the other groups. This means that we have to include the pairing in the test for differential expression as this will allow taking into account the genetic background shared across observations of the same line.

The DESeq2 package has a specific method to read in individual HTSeq count files summarized in the sampleTable. We use this method here to be consistent and use the sampleTable created previously. To read in the files you just have to specify the folder that contains them. On the bits laptops the files are in the htseq_counts folder in the working directory.

In the metadata, the pairing was represented in the cells column whereas the treatment was represented in the treatment column.

This is why the design is defined as ~ cells + treatment. This is the way to represent the design of a paired experiment in R: ~ means "is modeled by" and is followed by all factors (both pairing and grouping factors). If you have more than one factor (as we have since we have a pairing and a grouping factor) you combine them using the + sign. In the case of DESeq2 the pairing variable has to be specified as the first factor whereas the grouping variables should go at the end of the design formula.

This design is essentially a matrix that provides a representation of the different RNA samples, telling R which samples you want to compare. If the data was unpaired and you just wanted to compare the two or more treatments the design would have been ~ treatment.

If you have more than one grouping variable, e.g. you measured each sample on two time points then your design would be ~ cells + treatment + time. This model assumes that the effect of the treatment is the same on the two time points. If you think that this is not the case you should add an interaction term to the design: ~ cells + treatment + time + treatment:time.

More info on defining designs can be found here. Defining the design for complex studies with multiple factors can be quite challenging, we advise you to consult a statistician or a bioinformatician to check your design.

#importing the data####
#Define the folder that contains the files so DESeq2 can find them.
folder <- "htseq_counts"
 
#Define the sampleTable created previously 
#so R DESeq2 knows the names of the files to load and the groups to compare.
#Define the design: tell DESeq2 how you want to do the comparison
#see wiki for more details on the design
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                  directory = folder,
                                  design = ~ cells + treatment)
 
#view the created object
dds

Inspection of the DESeqDataSet

This creates a DESeqDataSet, a complex data type in DESeq2 that holds both counts table and metadata. The counts table has counts for 57773 genes in 8 samples.

class: DESeqDataSet
dim: 57773 8
metadata(1): version
assays(1): counts
rownames(57773): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
ENSG00000273493
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(2): cells treatment

To view the annotation of the columns of the count table you can use the colData function.

# column information
colData(dds)

DataFrame with 8 rows and 2 columns
cells treatment
<factor> <factor>
SRR1039508 N61311 untreated
SRR1039509 N61311 Dex
SRR1039512 N052611 untreated
SRR1039513 N052611 Dex
SRR1039516 N080611 untreated
SRR1039517 N080611 Dex
SRR1039520 N061011 untreated
SRR1039521 N061011 Dex

Factors are categorical variables (non-numerical variables). They have to be transformed into factors, otherwise R can't process them. By default, R handles the levels (names of the categories) in alphabetical order, meaning that R would use Dex as the reference treatment (since D comes first in the alphabet compared to u). We want to use untreated as reference so we have to relevel the treatment factor.

# relevel to get untreated as reference
dds$treatment <- relevel(dds$treatment, "untreated")

Differential expression analysis

The DESeq function is a wrapper that performs a default analysis through four steps:

estimation of size factors
estimation of dispersions by Negative Binomial GLM fitting
calculation of log fold changes
testing for differential expression using the Wald test

Calculation of dispersions

Dispersions are measures of the biological variation in each group. They are estimated by fitting a negative binomial model to the raw counts of biological replicates.

# run the combined DESeq2 analysis
dds <- DESeq(dds)
 
# estimating size factors
# estimating dispersions
# gene-wise dispersion estimates
# mean-dispersion relationship
# final dispersion estimates
# fitting model and testing
 
# Dispersion plot and fitting alternatives
plotDispEsts(dds)

In the plot, dispersion estimates are plotted versus the mean normalized count of each gene (black dots). A curve (red) is fit to the resulting data cloud to capture the overall trend of dispersion-mean dependence. This curve is used to 'shrink' the dispersion coefficients towards the consensus represented by the red curve (blue points). The black points circled in blue are detected as dispersion outliers and not shrunk toward the curve. This shrinkage assumes that genes with similar expression levels should have similar dispersion.

Calculation of size factors

Size factors are related to library size normalization. They are estimated as follows:

Take the geometric mean over all samples for each gene and use that as the reference expression data set.
For each sample, get a list of quotients of each gene expression value to its reference expression.
The median of each condition quotient list is the normalization factor (=size factor) for that data set.

# get size factors used for normalization
sizeFactors(dds)

SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
1.0123484 0.8951975 1.1747041 0.6739964 1.1729844 1.4049307 0.9193101 0.9527669

Calculation of log fold changes

Next DESeq calculates log2 fold changes (log2FoldChange) using the normalized counts. A problem occurring in all RNA-Seq data sets is the fact that genes with low read counts have unrealistically high log fold changes. This phenomenon is a direct consequence of dealing with count data, in which ratios are inherently noisier when counts are low. DESeq handles this problem by shrinking log fold changes toward zero in a manner such that shrinkage is stronger when counts are low or dispersion is high. This is done by using a regularized logarithm transformation (rlog), which behaves similarly to a log2 transformation for genes with high counts, while shrinking the values for genes with low counts. To do an rlog transformation you should use the rlog() function on the raw dds object (so without running the DESeq() function on it). For a detailed description see section 3 of this page.

rld <- rlog(dds, blind=FALSE)

The DESeq2 developers state that the rlog transformation should be used for applications other than differential testing, for example clustering of samples or other machine learning applications. For differential testing they recommend the DESeq() function applied to raw counts. Since we are interested in DE we are going to do the latter.

Estimation of differential expression

The results file is large containing results for all 55773 genes see it is not a good idea to look at the full results table, use head() instead. Head() only shows the first rows of the file, you can specify the number of rows to show.

#store results into a new object
res <- results(dds)
 
#results, 4 first rows
str(res)
head(res, 4)

This shows a file containing 6 columns

log2 fold change (MAP): treatment Dex vs untreated
Wald test p-value: treatment Dex vs untreated
DataFrame with 4 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000000003 698.9255 -0.38812002 0.1001493 -3.8754129 0.0001064441 0.0008072674
ENSG00000000005 0.0000 NA NA NA NA NA
ENSG00000000419 475.4853 0.21128035 0.1131837 1.8667038 0.0619429686 0.1695057180
ENSG00000000457 251.9015 0.01368663 0.1325385 0.1032653 0.9177523856 0.9590952696

The lfcSE column contains the standard errors of the log2FoldChange estimates.

DESeq performs the Wald test for differential expression, one test for each gene. The Wald test divides the estimate of the log fold change (log2FoldChange) by its standard error (lfcSE), resulting in the Wald statistic (stat). The statistic is compared to a standard normal distribution, resulting in a p-value (pvalue).

Since we perform multiple tests (one for each gene) on the same data set, the p-values must be adjusted for multiple testing. Adjustment is done using Benjamini and Hochberg's method to control the false discovery rate (FDR). The meaning of the adjusted p-value (padj) is as follows: if you select all genes with adjusted p-value below 0.05 as DE, then the expected proportion of false discoveries in the selected group is less than 5% (= FDR). So if you have 1000 DE genes in a total of 20000 genes, 50 of them will be false positives (5% of 1000) and will in reality not be DE.

When you report the results (list of DE genes) you should report the adjusted p-values !

Setting thresholds for differential expression

Some genes contain missing values (represented by NA). These are genes with zero counts in one or both groups of samples. When calculating the log fold change this gives an error since you cannot compute the log of zero. You can remove these genes from the results table by checking if the adjusted p=value equals NA using the is.na() function.

# keep only data with computed padj
resFilt <- res[!is.na(res$padj),]

Check the first 4 rows of the filtered results to see if the gene with the missing p-values was removed.

log2 fold change (MAP): treatment Dex vs untreated
Wald test p-value: treatment Dex vs untreated
DataFrame with 4 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000000003 698.92549 -0.38812002 0.1001493 -3.8754129 0.0001064441 0.0008072674
ENSG00000000419 475.48532 0.21128035 0.1131837 1.8667038 0.0619429686 0.1695057180
ENSG00000000457 251.90150 0.01368663 0.1325385 0.1032653 0.9177523856 0.9590952696
ENSG00000000460 50.84756 -0.02705074 0.2608112 -0.1037177 0.9173933898 0.9589720870

The results file contains the genes in original order, it might be better to reorder them putting the most significant genes (lowest p-value) at the top.

#reorder the results by decreasing significance
resOrdered <- res[order(res$padj),]

Check the first 6 rows of the ordered results to see if the order of the genes was changed.

baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000165995 514.2841 3.321662 0.1307366 25.40728 2.095444e-142 3.170407e-138
ENSG00000152583 985.5593 4.340812 0.1760858 24.65169 3.529480e-134 2.670051e-130
ENSG00000120129 3325.4027 2.873150 0.1167734 24.60448 1.131211e-133 5.705074e-130
ENSG00000101347 13616.9348 3.606558 0.1517550 23.76566 7.569343e-125 2.863104e-121
ENSG00000189221 2294.7300 3.231983 0.1396172 23.14888 1.491976e-118 4.514719e-115
ENSG00000211445 12162.4869 3.540681 0.1573440 22.50281 3.895921e-112 9.824214e-109

Setting different thresholds for DE

#select genes with adjusted p-value < 0.1 that are at least two-fold DE
resFilt[abs(resFilt$log2FoldChange)>=1 & (resFilt$padj < 0.1),]

Select genes with adjusted p-value < 0.05 and store in an object called resSig

More information on results columns: information about which variables and tests were used can be found by calling the function mcols() on the results object.

mcols(res)$description
 
# [1] "mean of normalized counts for all samples"
# [2] "log2 fold change (MAP): treatment Dex vs untreated"
# [3] "standard error: treatment Dex vs untreated"
# [4] "Wald statistic: treatment Dex vs untreated"
# [5] "Wald test p-value: treatment Dex vs untreated"
# [6] "BH adjusted p-values"

Detection outliers and filtering genes with very low counts

# We can summarize some basic statistics using the summary function.
summary(res)

This lists the number of genes that are considered up- and downregulated according to the Wald test.

DESeq uses the Cook's distance to detect outliers that do not fit the distributional assumptions of the model. An example of such an outlier would be a gene with single-digit counts for all samples, except one sample with a count in the thousands. As you can see in our data set no outliers were detected.

Additionally, DESeq filters out genes with very low counts (1 or 2 in a few samples) to reduce the number of tests that needs to be performed. This is why genes that have little or no chance of being detected as DE (genes with very low counts) are omitted from the testing. DESeq2 uses the average expression level of each gene, across all samples, as filter criterion, and it omits all genes with mean normalized counts below a filtering threshold. DESeq2 by default will choose a threshold that maximizes the number of genes found at a user-specified target FDR.

out of 28730 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)   : 2467, 8.6%
LFC < 0 (down)   : 2155, 7.5%
outliers [1]   : 0, 0%
low counts [2]   : 11602, 40%
(mean count < 4)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

As you can see the FDR is set to 10% in the summary. If you think this is too high you can make the summary more stringent.

## being more stringent
summary(res, alpha=0.01)

out of 28730 with nonzero total read count
adjusted p-value < 0.01
LFC > 0 (up)   : 1562, 5.4%
LFC < 0 (down)   : 1284, 4.5%
outliers [1]   : 0, 0%
low counts [2]   : 13600, 47%
(mean count < 11.2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

How many genes are up- and down-regulated when you apply an FDR of 0.1% ?

Plotting results

MA plots using the plotMA command

MA plots show the estimated log fold changes (Y-axis) over average expression level (X-axis) with red points indicating genes with low adjusted p-values (DE genes).

# MA-plot
# legend: Points will be colored red if the adjusted p value is less than 0.1. 
#  Points which fall out of the window are plotted as open triangles pointing either up or down.
plotMA(res, main="MAplot after DESeq2 analysis", ylim=c(-2,2))

MA plots using ggplot2

It is possible to create the MA-plot in ggplot2, this gives you freedom to define the plot exactly as you want it. However, ggplot2 can't handle DESeqResults objects, it uses a simple data frame as an input.

#ggplot2 can't handle DESeqResults objects
ggplot(res, aes(x=baseMean,y=log2FoldChange)) + geom_point()

This gives the following error:

Error: ggplot2 doesn't know how to deal with data of class DESeqResults

To resolve this we need to transform the DESeqResults object into a data frame using the as.data.frame() function.

results <- as.data.frame(res)
head(results)

Create a default scatter plot of the data using ggplot()

This gives a plot that looks nothing like the plot generated by plotMA.

First we'll fix the X-axis. To make it alike the plotMA() plot, we have to use a log10 scale on the X-axis and set its range to 0,1 to 300000.

#default scatter plot of the data
ggplot(results, aes(x=baseMean,y=log2FoldChange)) + geom_point() + scale_x_continuous(trans="log10",limits=c(0.1,300000))

This looks a lot better

Now colour the data points with low adjusted p-values (< 0.1) red

ggplot(results, aes(x=baseMean,y=log2FoldChange)) + geom_point(aes(colour=padj<0.1)) +
  scale_colour_manual(name='padj < 0.1',values=setNames(c('red','black'),c(T,F))) +
  scale_x_continuous(trans="log10",limits=c(0.1,300000))

You see that you have lost some points especially in the region with small X-values. These are points with missing adjusted p-values. plotMA keeps these points in the plot coloring them black. The easiest way to fix this if to replace the missing p-values by 0.99

Create a new variable resFixed, a copy of results. In resFixed replace missing values by 0.99 and make the plot again

Now the plot looks much better

Downsize the dots to size 0.5

ggplot(resFixed, aes(x=baseMean,y=log2FoldChange)) + geom_point(aes(colour=padj<0.1),size=0.5) +
  scale_colour_manual(name='padj < 0.1',values=setNames(c('red','black'),c(T,F))) +
  scale_x_continuous(trans="log10",limits=c(0.1,300000))

It's starting to become a very nice plot

The MA plot contains a red line that fits the data points and is obtained via non-linear regression using Loess smoothing. Loess divides the data set in the local subsets of points with similar x-values (i.e. close to each other on the plot). It fits a simple linear model to these subsets overall generating a non-linear curve consisting of small linear parts.
Add a red smoother line to the plot

ggplot(resFixed, aes(x=baseMean,y=log2FoldChange)) + geom_point(aes(colour=padj<0.1),size=0.5) +
  scale_colour_manual(name='padj < 0.1',values=setNames(c('red','black'),c(T,F))) +
  scale_x_continuous(trans="log10",limits=c(0.1,300000)) + geom_smooth(colour="red")

The MA plot contains a blue horizontal line which serves as a reference to the red smoother line. In theory the red curve and the blue line should coincide. Sine you expect that there are equal amounts of genes with high and genes with low log fold changes, the red curve (representing the average gene) should coincide with log fold change = 0 (the blue horizontal line Y = 0).
Add a blue horizonzal line to the plot

ggplot(resFixed, aes(x=baseMean,y=log2FoldChange)) + geom_point(aes(colour=padj<0.1),size=0.5) + 
  scale_colour_manual(name='padj < 0.1',values=setNames(c('red','black'),c(T,F))) +
  scale_x_continuous(trans="log10",limits = c(0.1,300000)) + geom_smooth(colour="red") +
  geom_abline(slope=0,intercept=0,colour="blue")

Plot counts using the plotCounts() function

#retrieve the name of the best scoring gene (with the lowest padj)
#gene names are used as rownames in the results table
onegene <- rownames(res)[which.min(res$padj)]
 
#plot the counts for this gene
plotCounts(dds,gene=which.min(res$padj),intgroup="treatment",main=onegene)

You can also plot the counts of a gene by using its gene name

plotCounts(dds, gene="ENSG00000000003", intgroup="treatment", main="ENSG00000000003")

Plot the counts of gene ENSG00000154734

You can also plot counts for each cell line instead of comparing between treatments

plotCounts(dds, gene=which.min(res$padj), intgroup="cells", main=onegene)

Plot the counts for each cell line of gene ENSG00000154734

Plot counts using ggplot2

An alternative way to transform a complex data type like a DESeqResults object (like res) or a DESeqDataSet (like dds) into a data frame that ggplot2 can interpret is to store the results of the plot functions into a variable

d <- plotCounts(dds,gene=which.min(res$padj),intgroup="treatment",main=onegene,returnData=TRUE)
class(d)

As you can see d is a data frame.

Plot the counts of d in a default scatter plot.

Use a log10 transformed Y-scale.

Set tick marks on the Y-axis at 25 (start plot), 100 and 400

ggplot(d,aes(x=treatment,y=count)) + geom_point() + scale_y_continuous(trans="log10",breaks=c(25,100,400))

Add the name of the gene as the title to the plot

ggplot(d, aes(x=treatment, y=count)) + geom_point() + scale_y_continuous(trans="log10",breaks=c(25,100,400)) + 
  labs(title=onegene)

If you want to colour the dots according to the cell line they comre from, you need to store this information into the d data frame. When we created d, we only stored info about the treatment by defining intgroup="treatment".

Create d but this time include info about treatment and cell line ?
d <- plotCounts(dds,gene=which.min(res$padj),intgroup=c("treatment","cells"),main=onegene,returnData=TRUE)

Plot the counts of d in a scatter plot with log transformed Y axis, tick marks and title. Colour the dots according to cell line.

Plot distribution of log fold changes using the hist() function

We will plot the distribution of the log2 fold changes of all genes that are considered DE according to the following threshold: padj < 0.05

#selection of these genes was already done: they were stored in resSig
#hist() plots a frequency distribution 
#xlim sets the range of the X-axis
#breaks specifies number of bins in the histogram
hist(resSig$log2FoldChange,xlim=c(-6,6),breaks=20,main="LFC distribution for genes with padj<0.05")

As you see in the plot, many DE genes have rather low log fold changes. Typically, scientists only report DE genes with minimally 2-fold, 3-fold or 4-fold changes.

Select genes from resFixed with padj < 0.01 and at least two-fold DE, and store in a variable calles resString.
resString <- resFixed[abs(resFixed$log2FoldChange)>=1 & (resFixed$padj < 0.01),] head(resString)

Make a histogram of log fold changes of these genes.

Plot distribution of log fold changes using ggplot2

Create a default histogram of the log fold changes of resString

#default histogram
ggplot(resString,aes(x=log2FoldChange)) + geom_histogram()

Make a histogram with 20 bins

#default histogram
ggplot(resString,aes(x=log2FoldChange)) + geom_histogram(bins=20)

Set the range X-axis from -6 to 6

Volcano plot using the plot() function

Next we create a Volcano plot for the DE genes. The Volcano plot arranges the DE genes along dimensions of biological and statistical significance. The X-axis represents the log fold change between the two groups, and the Y-axis represents the adjusted p-value (on an inverse scale so smaller p-values appear higher up). The X-axis indicates biological relevance; the Y-axis indicates the statistical evidence, or reliability of the change in expression. The researcher can make judgements about the most promising candidates for follow-up studies, by trading off both these criteria by eye.

# Volcano plot for the subset with padj < 0.05
plot(resSig$log2FoldChange, 1-resSig$padj,xlim=c(-6,6),main="volcano plot for genes with padj<0.05")

Make the same plot but now for genes with 2-fold changes or higher and padj < 0.01

Volcano plot using ggplot2

Create a default scatter plot of the resString data

Use hollow points

# Volcano plot for the subset with padj < 0.05
ggplot(resString,aes(x=log2FoldChange,y=padj)) + geom_point(shape=1)

Invert the Y-axis

# Volcano plot for the subset with padj < 0.05
ggplot(resString,aes(x=log2FoldChange,y=padj)) + geom_point(shape=1) + scale_y_reverse()

Heat map using the heatmap() function

We will create a heat map showing the normalized counts of the 20 most significant genes (genes with the lowest padj).

#specify a color palette
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
 
#select normalized counts for 20 most significant genes
#view 20 most significant genes
resOrdered[1:20,]
 
#view gene names of 20 most significant genes
rownames(resOrdered[1:20,])
 
#select rows from dds of these genes
select <- dds[rownames(resOrdered[1:20,]),]
 
#show heat map of normalized counts for these genes
heatmap.2(counts(select,normalized=TRUE), col = hmcol, Rowv = FALSE, Colv = FALSE,
          scale="none", dendrogram="none", trace="none", margin=c(10,6))

Make the same plot for the 10 most significant genes.

Heat map using ggplot2

Retrieve the counts of the 10 selected genes ?
data <- counts(select,normalized=T)

We need to reformat the data to make a tile plot. Now the data is in the following format:

data

                SRR1039508 SRR1039509  SRR1039512 SRR1039513
ENSG00000165995   76.06077   861.2625   114.07128   1275.971
ENSG00000120129  637.13244  5499.3451  1004.50828   6608.344
ENSG00000152583   57.29253  2267.6560    80.02015   1724.045
ENSG00000101347 1700.00763 20487.0991  1913.67340  31566.934

This is called wide format. However, ggplot expects data in long format like this:

1  ENSG00000165995 SRR1039508    76.06077
2  ENSG00000120129 SRR1039508   637.13244
3  ENSG00000152583 SRR1039508    57.29253
4  ENSG00000101347 SRR1039508  1700.00763

To transform data from wide to long format we can use the melt() function of the reshape2 package:

datamelted <- melt(data)
datamelted

This indeed gets the data in the right format

> datamelted
              Var1       Var2       value
1  ENSG00000165995 SRR1039508    76.06077
2  ENSG00000120129 SRR1039508   637.13244

Now make the default tile plot with ggplot2

ggplot(datamelted,aes(x=Var2,y=Var1)) + geom_tile(aes(fill=value))

To change the colors of the color gradient, you can define a color for low values and a color for high values:

ggplot(datamelted,aes(x=Var2,y=Var1)) + geom_tile(aes(fill=value)) + scale_fill_gradient(low="white",high="steel blue")

Color the border of the tiles white

The X-axis is still pretty messy with overlapping sample names. To print the sample names vertically use a theme element:

ggplot(datamelted,aes(x=Var2,y=Var1)) + geom_tile(aes(fill=value),colour="white") + 
    scale_fill_gradient(low="white",high="steel blue") + theme(axis.text.x=element_text(angle=90))

Exporting results to HTML or CSV files

We use write.csv() to write a data frame to a csv (comma separated values) file. The file we want to save is the ordered results: resOrdered. We have to check first if this is a data frame using the class() function.

class(resOrdered)
write.csv(as.data.frame(resOrdered), file="Dex_vs_untreated_results.csv")

Since it's not a data frame we transform it into a data frame using as.data.frame(). Using the file argument we provide a name for the csv file. The write.csv() will now write the resOrdered data frame to a file called Dex_vs_untreated_results.csv in your working directory. If you want to save it elsewhere you have to provide a path on your computer instead of a file name.

Data Transformation and Visualization

Data transformation can be applied in different ways to compensate for the dispersion at low and high expression ranges. These methods were developed after the first wave of RNASeq analysis to try compensate for effects associated with this type of data. Remember that weakly expressed genes show much higher log fold changes than strongly expressed genes. This phenomenon, seen in most RNASeq datasets, is a direct consequence of dealing with count data, in which ratios are inherently noisier when counts are low.

Therefore, it is useful to transform data to diminish this effect. If data are used on the original count scale, the result will be dominated by highly expressed, highly variable genes; if logarithm-transformed data are used, undue weight will be given to weakly expressed genes, which show exaggerated LFCs, as discussed above. Therefore, DESeq2 implements a regularized logarithm transformation (rlog), which behaves similarly to a log2 transformation for genes with high counts, while shrinking the values for genes with low counts.

Count data transformation

We present here two log fold change shrinkage methods used in DESq2:

Regularized log transformation
Variance stabilizing transformation

Regularized log transformation

# Regularized log transformation
rld <- rlog(dds)
rld

class: SummarizedExperiment
dim: 57773 8
exptData(0):
assays(1): ''
rownames(57773): ENSG00000000003 ENSG00000000005 ... ENSG00000273492 ENSG00000273493
rowData metadata column names(47): baseMean baseVar ... dispFit rlogIntercept
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(3): cells treatment sizeFactor

Variance stabilizing transformation

# Variance stabilizing transformation
vsd <- varianceStabilizingTransformation(dds)
vsd

class: SummarizedExperiment
dim: 57773 8
exptData(0):
assays(1): ''
rownames(57773): ENSG00000000003 ENSG00000000005 ... ENSG00000273492 ENSG00000273493
rowData metadata column names(46): baseMean baseVar ... dispGeneEst dispFit
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(3): cells treatment sizeFactor

# create matrix for saving or further use
rlogMat <- assay(rld)
head(rlogMat)

SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
ENSG00000000003 9.385815 9.111922 9.485295 9.293740 9.745669 9.502458 9.584332
ENSG00000000005 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENSG00000000419 8.790046 9.016464 8.887312 8.928876 8.843568 8.958822 8.761465
ENSG00000000457 7.979714 7.978258 7.915768 7.997564 7.953955 7.986459 8.036356
ENSG00000000460 5.702651 5.735228 5.226608 5.562887 5.683232 5.454195 5.909199
ENSG00000000938 -2.199605 -2.199305 -2.164748 -2.198557 -2.164706 -2.200320 -2.199371
SRR1039521
ENSG00000000003 9.285416
ENSG00000000005 0.000000
ENSG00000000419 8.921048
ENSG00000000457 7.958014
ENSG00000000460 5.743980
ENSG00000000938 -2.199459

# create matrix for saving or further use
vstMat <- assay(vsd)
head(vstMat)

SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
ENSG00000000003 9.434437 9.016021 9.582868 9.294533 9.964458 9.608177 9.729533
ENSG00000000005 4.063874 4.063874 4.063874 4.063874 4.063874 4.063874 4.063874
ENSG00000000419 8.837919 9.175165 8.983583 9.045680 8.918437 9.089329 8.794405
ENSG00000000457 8.156567 8.154439 8.062614 8.183129 8.118785 8.166373 8.239793
ENSG00000000460 6.417750 6.465505 5.723628 6.213135 6.389237 6.066048 6.710763
ENSG00000000938 4.063874 4.063874 4.388676 4.063874 4.388913 4.063874 4.063874
SRR1039521
ENSG00000000003 9.282765
ENSG00000000005 4.063874
ENSG00000000419 9.033742
ENSG00000000457 8.124620
ENSG00000000460 6.477112
ENSG00000000938 4.063874

Data quality assessment by sample clustering and visualization

Clustering is often used to show sample segregation and identify co-expressed genes across groups. These analyses are done on rlog transformed data as recommended by the developers. Example code below can be adapted to fit your needs and create publication-ready figures with some efforts and patience.

Heatmap of the sample-to-sample distances

Error creating thumbnail: Unable to save thumbnail to destination

Note the 'elegant' R trick to change names from unreadable IDs to something much nicer

# we need to transpose the data first with t()
distsRL <- dist(t(assay(rld)))
 
# store in to matrix
mat <- as.matrix(distsRL)
 
# rename samples
rownames(mat) <- colnames(mat) <- with(colData(dds),
          paste(treatment, cells, sep=" : "))
 
# cluster pairwise
hc <- hclust(distsRL)
 
# plot heatmap
heatmap.2(mat,
          Rowv=as.dendrogram(hc),
          symm=TRUE,
          trace="none",
          col = rev(hmcol),
          margin=c(13, 13)
          )

Principal component plot of the samples

# simple plot without grouping
plotPCA(rld, intgroup=c("treatment", "cells"))

# It is also possible to customize the PCA plot using the ggplot function.
data <- plotPCA(rld, intgroup=c("treatment", "cells"), returnData=TRUE)
percentVar <- round(100 * attr(data, "percentVar"))
 
ggplot(data, aes(PC1, PC2, color=treatment, shape=cells)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance"))

Filtering and testing to create candidate DE gene lists

It is now time to restrict the transcriptome to the part that shows apparent differential expression under the assumptions defined by the user. Some cases require specific testing introduced below but for most cases, a simple subsetting of the full transformed table will be enough (see ?subset()).

Tests of log2 fold change above or below a threshold

In some cases, one only wants to know about Up-regulated or Down-regulated genes and therefore is interested by single side testing.

Other ways of testing by choosing the minimal effect size and direction (tails) are proposed.

greaterAbs - |beta| > lfcThreshold - tests are two-tailed
lessAbs - |beta| < lfcThreshold - p values are the maximum of the upper and lower tests
greater - beta > lfcThreshold
less - beta < lfcThreshold

ddsNoPrior <- DESeq(dds, betaPrior=FALSE)
 
# compute four tests
resGA <- results(dds, lfcThreshold=.5, altHypothesis="greaterAbs")
resLA <- results(ddsNoPrior, lfcThreshold=.5, altHypothesis="lessAbs")
resG <- results(dds, lfcThreshold=.5, altHypothesis="greater")
resL <- results(dds, lfcThreshold=.5, altHypothesis="less")
 
# plot MA for each test
par(mfrow=c(2,2),mar=c(2,2,1,1))
yl <- c(-2.5,2.5)
 
plotMA(resGA, ylim=yl)
abline(h=c(-.5,.5),col="dodgerblue",lwd=2)
 
plotMA(resLA, ylim=yl)
abline(h=c(-.5,.5),col="dodgerblue",lwd=2)
 
plotMA(resG, ylim=yl)
abline(h=.5,col="dodgerblue",lwd=2)
 
plotMA(resL, ylim=yl)
abline(h=-.5,col="dodgerblue",lwd=2)

A Magic Moment with RStudio

If all your code validated during the manual run, check what comes after clicking on Knit PDF (PDF)

Warning: Now you will not say that the trainer is a command-line addict anymore. RStudio IS a GREAT piece of GUI that 'almost' makes R 'easy' ... !!

Acknowledging R

This exercise was possible thanks to the DESeq2 vignette and to many packages listed below

sessionInfo()

R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] gplots_2.16.0 RColorBrewer_1.1-2 vsn_3.34.0
[4] Biobase_2.26.0 ggplot2_1.0.0 DESeq2_1.6.3
[7] RcppArmadillo_0.4.650.1.1 Rcpp_0.11.5 GenomicRanges_1.18.4
[10] GenomeInfoDb_1.2.4 IRanges_2.0.1 S4Vectors_0.4.0
[13] BiocGenerics_0.12.1 stringr_0.6.2 corrgram_1.7

loaded via a namespace (and not attached):
[1] acepack_1.3-3.3 affy_1.44.0 affyio_1.34.0 annotate_1.44.0
[5] AnnotationDbi_1.28.1 base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.9
[9] BiocInstaller_1.16.2 BiocParallel_1.0.3 bitops_1.0-6 brew_1.0-6
[13] caTools_1.17.1 checkmate_1.5.1 cluster_2.0.1 codetools_0.2-11
[17] colorspace_1.2-6 DBI_0.3.1 digest_0.6.8 fail_1.2
[21] foreach_1.4.2 foreign_0.8-63 Formula_1.2-0 gclus_1.3.1
[25] gdata_2.13.3 genefilter_1.48.1 geneplotter_1.44.0 grid_3.1.3
[29] gtable_0.1.2 gtools_3.4.1 Hmisc_3.15-0 iterators_1.0.7
[33] KernSmooth_2.23-14 lattice_0.20-30 latticeExtra_0.6-26 limma_3.22.7
[37] locfit_1.5-9.1 MASS_7.3-39 munsell_0.4.2 nnet_7.3-9
[41] plyr_1.8.1 preprocessCore_1.28.0 proto_0.3-10 reshape2_1.4.1
[45] rpart_4.1-9 RSQLite_1.0.0 scales_0.2.4 sendmailR_1.2-1
[49] seriation_1.0-14 splines_3.1.3 survival_2.38-1 tools_3.1.3
[53] TSP_1.0-10 XML_3.98-1.1 xtable_1.7-4 XVector_0.6.0
[57] zlibbioc_1.12.0

download exercise files

Download exercise files here.

Use the right application to open the files present in ex5-files

References: