NGS RNASeq DE Exercise.5
The rnaseqGene DESeq2 R-workflow
[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.4 | NGS_RNASeq_DE_Exercise.6 ]
Contents
- 1 Introduction
- 2 Standard Workflow (simplified!)
- 2.1 Install and load required R-packages
- 2.2 Locate and load data
- 2.3 Loading and locating the counts
- 2.4 Differential expression analysis
- 2.5 Plotting results
- 2.5.1 MA plots using the plotMA command
- 2.5.2 MA plots using ggplot2
- 2.5.3 Plot counts using the plotCounts() function
- 2.5.4 Plot counts using ggplot2
- 2.5.5 Plot distribution of log fold changes using the hist() function
- 2.5.6 Plot distribution of log fold changes using ggplot2
- 2.5.7 Volcano plot using the plot() function
- 2.5.8 Volcano plot using ggplot2
- 2.5.9 Heat map using the heatmap() function
- 2.5.10 Heat map using ggplot2
- 2.6 Exporting results to HTML or CSV files
- 3 Data Transformation and Visualization
- 4 Filtering and testing to create candidate DE gene lists
- 5 A Magic Moment with RStudio
- 6 Acknowledging R
- 7 download exercise files
Introduction
The dataset used in this session was also selected by the authors of the DESeq2 package for their vignette[1]. We will therefore reproduce below key steps of the detailed DESeq2 training document to quickly perform them together. It is strongly suggested that you fully read the original document in order to understand each command and the rationale behind it.
Various bioconductor files related to DESeq2 are often updated and should better be retrieved at the time of repeating this training from the bioconductor repository[2].
An RNA-Seq workflow rnaseqGene on the Bioconductor website[3] (formerly the Beginner’s Guide PDF), covers similar material to this vignette but at a slower pace, including the generation of count matrices from FASTQ files.
If you refer to the http://www.bioconductor.org/help/workflows/rnaseqGene/ document, you may jump Starting from SummarizedExperiment to match step into the part related to this exercise.
The code below was almost literally copied from the published pages or adapted to match the previously created data
Standard Workflow (simplified!)
Mapping with tophat2 followed by read count using HTseq led to a number of count files (one per sample) that were saved to a local folder. We here take over in RStudio by loading this data in and using it to construct a large count table. The count table contains the number of sequence fragments that were assigned by the mapping to a certain gene and this for all genes in all samples.
Install and load required R-packages
These libraries are required in the current exercise.
library("DESeq2") library("ggplot2") library("vsn") library("RColorBrewer") library("gplots") library("reshape2")
Locate and load data
Metadata
We create a metadata table that will help ordering, merging and comparing the results. Files have been created by HTSeq-count or STAR and can be found on the BITS server.
The goal of the analysis is to find DE genes (genes with different expression levels in the conditions that are being compared. Typically you have two groups of samples, one consisting of bological replicates that have received a control treatment, the other consisting of biological replicates that received a specific treatment. [R] needs to know the metadata of the experiment i.e. which data file belongs to which group. The metadata will then be used to assemble the groups and compute differential expression.
In the experiment that we are analyzing the data consists of four groups (treatments):
- The dex group: RNA samples from 4 cell lines after treatment with the glucocorticoid dexamethasone (dex), used as astma medication
- The alb group: RNA samples from the same cell lines after treatment with albuterol (alb), another common astma medication
- The alb_dex group: RNA samples from the same cell lines after treatment with both astma medications
- The untreated group: RNA samples from the same four untreated cell lines cultured in parallel.
1 SRR1039508 22935521 N61311_untreated N61311 untreated
2 SRR1039509 21155707 N61311_Dex N61311 Dex
3 SRR1039510 22852619 N61311_Alb N61311 Alb
4 SRR1039511 21938637 N61311_Alb_Dex N61311 Alb_Dex
5 SRR1039512 28136282 N052611_untreated N052611 untreated
6 SRR1039513 43356464 N052611_Dex N052611 Dex
7 SRR1039514 40021727 N052611_Alb N052611 Alb
8 SRR1039515 29454748 N052611_Alb_Dex N052611 Alb_Dex
9 SRR1039516 30043024 N080611_untreated N080611 untreated
10 SRR1039517 34298260 N080611_Dex N080611 Dex
11 SRR1039518 30441200 N080611_Alb N080611 Alb
12 SRR1039519 31533740 N080611_Alb_Dex N080611 Alb_Dex
13 SRR1039520 34575286 N061011_untreated N061011 untreated
14 SRR1039521 41152075 N061011_Dex N061011 Dex
15 SRR1039522 28406812 N061011_Alb N061011 Alb
16 SRR1039523 31099538 N061011_Alb_Dex N061011 Alb_Dex
Metadata for our analysis
We will only do the comparison between dex and untreated samples, so the metadata file on the bits laptops only contains metadata for these samples.
The 'metadata' table is found in the RNASeq folder in the Documents folder on the BITS laptops. It is loaded into RStudio and used to build a design table
#set working directory#### #this is the folder where R will look for data and write files #please adapt this location to match your own computer basedir <- "C:/Users/VIB/Documents/RNASeq/" setwd(basedir) #import metadata#### #this is a description of the samples and the groups you want to compare #see http://wiki.bits.vib.be/index.php/NGS_RNASeq_DE_Exercise.5#Locate_and_load_data #the metadata file is in the folder you have set as working directory #you have to specify that the first row contains no data but column labels metadata <- read.table("GSE52778_metadata.txt", header = TRUE) metadata #add a column to the metadata table containing the names of the count table files metadata$sampleFiles <- paste( metadata$run_accession, "_all_counts.txt", sep="") metadata #create a sample table in the format DEseq2 wants it#### #the sample table is a data frame (a special type of table in R) consisting of 4 columns: #1. names of the runs #2. names of the files containing the counts #3. names of the cell lines #4. treatment sampleTable <- data.frame(sampleName = metadata$run_accession, fileName = metadata$sampleFiles, cells = metadata$cells, treatment = metadata$treatment) sampleTable
The result is a table with these 4 columns and 8 rows, each row representing one of the samples that we want to compare.
1 SRR1039508 SRR1039508_all_counts.txt N61311 untreated
2 SRR1039509 SRR1039509_all_counts.txt N61311 Dex
3 SRR1039512 SRR1039512_all_counts.txt N052611 untreated
4 SRR1039513 SRR1039513_all_counts.txt N052611 Dex
5 SRR1039516 SRR1039516_all_counts.txt N080611 untreated
6 SRR1039517 SRR1039517_all_counts.txt N080611 Dex
7 SRR1039520 SRR1039520_all_counts.txt N061011 untreated
8 SRR1039521 SRR1039521_all_counts.txt N061011 Dex
Loading and locating the counts
Load the merged HTSeq data into a DESeq2 object
DESeq2[4] is one of the most popular analysis package for RNASeq data today. Its normalization model is a negative binomial model and differs from the original DESeq. This model is considered well suited for modeling counts (discrete variable) obtained from HTSeq.
Since the 4 samples in each group come from the same 4 cell lines, the data sets are considered paired: the data in each group is related to the data in the other groups. This means that we have to include the pairing in the test for differential expression as this will allow taking into account the genetic background shared across observations of the same line.
The DESeq2 package has a specific method to read in individual HTSeq count files summarized in the sampleTable. We use this method here to be consistent and use the sampleTable created previously. To read in the files you just have to specify the folder that contains them. On the bits laptops the files are in the htseq_counts folder in the working directory.
In the metadata, the pairing was represented in the cells column whereas the treatment was represented in the treatment column.
This is why the design is defined as ~ cells + treatment. This is the way to represent the design of a paired experiment in R: ~ means "is modeled by" and is followed by all factors (both pairing and grouping factors). If you have more than one factor (as we have since we have a pairing and a grouping factor) you combine them using the + sign. In the case of DESeq2 the pairing variable has to be specified as the first factor whereas the grouping variables should go at the end of the design formula.
This design is essentially a matrix that provides a representation of the different RNA samples, telling R which samples you want to compare. If the data was unpaired and you just wanted to compare the two or more treatments the design would have been ~ treatment.
If you have more than one grouping variable, e.g. you measured each sample on two time points then your design would be ~ cells + treatment + time. This model assumes that the effect of the treatment is the same on the two time points. If you think that this is not the case you should add an interaction term to the design: ~ cells + treatment + time + treatment:time.
More info on defining designs can be found here. Defining the design for complex studies with multiple factors can be quite challenging, we advise you to consult a statistician or a bioinformatician to check your design.
#importing the data#### #Define the folder that contains the files so DESeq2 can find them. folder <- "htseq_counts" #Define the sampleTable created previously #so R DESeq2 knows the names of the files to load and the groups to compare. #Define the design: tell DESeq2 how you want to do the comparison #see wiki for more details on the design dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = folder, design = ~ cells + treatment) #view the created object dds
Inspection of the DESeqDataSet
This creates a DESeqDataSet, a complex data type in DESeq2 that holds both counts table and metadata. The counts table has counts for 57773 genes in 8 samples.
dim: 57773 8
metadata(1): version
assays(1): counts
rownames(57773): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
ENSG00000273493
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(2): cells treatment
To view the annotation of the columns of the count table you can use the colData function.
# column information colData(dds)
cells treatment
<factor> <factor>
SRR1039508 N61311 untreated
SRR1039509 N61311 Dex
SRR1039512 N052611 untreated
SRR1039513 N052611 Dex
SRR1039516 N080611 untreated
SRR1039517 N080611 Dex
SRR1039520 N061011 untreated
SRR1039521 N061011 Dex
Factors are categorical variables (non-numerical variables). They have to be transformed into factors, otherwise R can't process them. By default, R handles the levels (names of the categories) in alphabetical order, meaning that R would use Dex as the reference treatment (since D comes first in the alphabet compared to u). We want to use untreated as reference so we have to relevel the treatment factor.
# relevel to get untreated as reference dds$treatment <- relevel(dds$treatment, "untreated")
Differential expression analysis
The DESeq function is a wrapper that performs a default analysis through four steps:
- estimation of size factors
- estimation of dispersions by Negative Binomial GLM fitting
- calculation of log fold changes
- testing for differential expression using the Wald test
Calculation of dispersions
Dispersions are measures of the biological variation in each group. They are estimated by fitting a negative binomial model to the raw counts of biological replicates.
# run the combined DESeq2 analysis dds <- DESeq(dds) # estimating size factors # estimating dispersions # gene-wise dispersion estimates # mean-dispersion relationship # final dispersion estimates # fitting model and testing # Dispersion plot and fitting alternatives plotDispEsts(dds)
In the plot, dispersion estimates are plotted versus the mean normalized count of each gene (black dots). A curve (red) is fit to the resulting data cloud to capture the overall trend of dispersion-mean dependence. This curve is used to 'shrink' the dispersion coefficients towards the consensus represented by the red curve (blue points). The black points circled in blue are detected as dispersion outliers and not shrunk toward the curve. This shrinkage assumes that genes with similar expression levels should have similar dispersion.
Calculation of size factors
Size factors are related to library size normalization. They are estimated as follows:
- Take the geometric mean over all samples for each gene and use that as the reference expression data set.
- For each sample, get a list of quotients of each gene expression value to its reference expression.
- The median of each condition quotient list is the normalization factor (=size factor) for that data set.
# get size factors used for normalization sizeFactors(dds)
1.0123484 0.8951975 1.1747041 0.6739964 1.1729844 1.4049307 0.9193101 0.9527669
Calculation of log fold changes
Next DESeq calculates log2 fold changes (log2FoldChange) using the normalized counts. A problem occurring in all RNA-Seq data sets is the fact that genes with low read counts have unrealistically high log fold changes. This phenomenon is a direct consequence of dealing with count data, in which ratios are inherently noisier when counts are low. DESeq handles this problem by shrinking log fold changes toward zero in a manner such that shrinkage is stronger when counts are low or dispersion is high. This is done by using a regularized logarithm transformation (rlog), which behaves similarly to a log2 transformation for genes with high counts, while shrinking the values for genes with low counts. To do an rlog transformation you should use the rlog() function on the raw dds object (so without running the DESeq() function on it). For a detailed description see section 3 of this page.
rld <- rlog(dds, blind=FALSE)
Estimation of differential expression
The results file is large containing results for all 55773 genes see it is not a good idea to look at the full results table, use head() instead. Head() only shows the first rows of the file, you can specify the number of rows to show.
#store results into a new object res <- results(dds) #results, 4 first rows str(res) head(res, 4)
This shows a file containing 6 columns
Wald test p-value: treatment Dex vs untreated
DataFrame with 4 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000000003 698.9255 -0.38812002 0.1001493 -3.8754129 0.0001064441 0.0008072674
ENSG00000000005 0.0000 NA NA NA NA NA
ENSG00000000419 475.4853 0.21128035 0.1131837 1.8667038 0.0619429686 0.1695057180
ENSG00000000457 251.9015 0.01368663 0.1325385 0.1032653 0.9177523856 0.9590952696
The lfcSE column contains the standard errors of the log2FoldChange estimates.
DESeq performs the Wald test for differential expression, one test for each gene. The Wald test divides the estimate of the log fold change (log2FoldChange) by its standard error (lfcSE), resulting in the Wald statistic (stat). The statistic is compared to a standard normal distribution, resulting in a p-value (pvalue).
Since we perform multiple tests (one for each gene) on the same data set, the p-values must be adjusted for multiple testing. Adjustment is done using Benjamini and Hochberg's method to control the false discovery rate (FDR). The meaning of the adjusted p-value (padj) is as follows: if you select all genes with adjusted p-value below 0.05 as DE, then the expected proportion of false discoveries in the selected group is less than 5% (= FDR). So if you have 1000 DE genes in a total of 20000 genes, 50 of them will be false positives (5% of 1000) and will in reality not be DE.
When you report the results (list of DE genes) you should report the adjusted p-values !
Setting thresholds for differential expression
Some genes contain missing values (represented by NA). These are genes with zero counts in one or both groups of samples. When calculating the log fold change this gives an error since you cannot compute the log of zero. You can remove these genes from the results table by checking if the adjusted p=value equals NA using the is.na() function.
# keep only data with computed padj resFilt <- res[!is.na(res$padj),]
Check the first 4 rows of the filtered results to see if the gene with the missing p-values was removed.
Wald test p-value: treatment Dex vs untreated
DataFrame with 4 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000000003 698.92549 -0.38812002 0.1001493 -3.8754129 0.0001064441 0.0008072674
ENSG00000000419 475.48532 0.21128035 0.1131837 1.8667038 0.0619429686 0.1695057180
ENSG00000000457 251.90150 0.01368663 0.1325385 0.1032653 0.9177523856 0.9590952696
ENSG00000000460 50.84756 -0.02705074 0.2608112 -0.1037177 0.9173933898 0.9589720870
The results file contains the genes in original order, it might be better to reorder them putting the most significant genes (lowest p-value) at the top.
#reorder the results by decreasing significance resOrdered <- res[order(res$padj),]
Check the first 6 rows of the ordered results to see if the order of the genes was changed.
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000165995 514.2841 3.321662 0.1307366 25.40728 2.095444e-142 3.170407e-138
ENSG00000152583 985.5593 4.340812 0.1760858 24.65169 3.529480e-134 2.670051e-130
ENSG00000120129 3325.4027 2.873150 0.1167734 24.60448 1.131211e-133 5.705074e-130
ENSG00000101347 13616.9348 3.606558 0.1517550 23.76566 7.569343e-125 2.863104e-121
ENSG00000189221 2294.7300 3.231983 0.1396172 23.14888 1.491976e-118 4.514719e-115
ENSG00000211445 12162.4869 3.540681 0.1573440 22.50281 3.895921e-112 9.824214e-109
Setting different thresholds for DE
#select genes with adjusted p-value < 0.1 that are at least two-fold DE resFilt[abs(resFilt$log2FoldChange)>=1 & (resFilt$padj < 0.1),]
Select genes with adjusted p-value < 0.05 and store in an object called resSig
More information on results columns: information about which variables and tests were used can be found by calling the function mcols() on the results object.
mcols(res)$description # [1] "mean of normalized counts for all samples" # [2] "log2 fold change (MAP): treatment Dex vs untreated" # [3] "standard error: treatment Dex vs untreated" # [4] "Wald statistic: treatment Dex vs untreated" # [5] "Wald test p-value: treatment Dex vs untreated" # [6] "BH adjusted p-values"
Detection outliers and filtering genes with very low counts
# We can summarize some basic statistics using the summary function. summary(res)
This lists the number of genes that are considered up- and downregulated according to the Wald test.
DESeq uses the Cook's distance to detect outliers that do not fit the distributional assumptions of the model. An example of such an outlier would be a gene with single-digit counts for all samples, except one sample with a count in the thousands. As you can see in our data set no outliers were detected.
Additionally, DESeq filters out genes with very low counts (1 or 2 in a few samples) to reduce the number of tests that needs to be performed. This is why genes that have little or no chance of being detected as DE (genes with very low counts) are omitted from the testing. DESeq2 uses the average expression level of each gene, across all samples, as filter criterion, and it omits all genes with mean normalized counts below a filtering threshold. DESeq2 by default will choose a threshold that maximizes the number of genes found at a user-specified target FDR.
adjusted p-value < 0.1
LFC > 0 (up) : 2467, 8.6%
LFC < 0 (down) : 2155, 7.5%
outliers [1] : 0, 0%
low counts [2] : 11602, 40%
(mean count < 4)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
As you can see the FDR is set to 10% in the summary. If you think this is too high you can make the summary more stringent.
## being more stringent summary(res, alpha=0.01)
adjusted p-value < 0.01
LFC > 0 (up) : 1562, 5.4%
LFC < 0 (down) : 1284, 4.5%
outliers [1] : 0, 0%
low counts [2] : 13600, 47%
(mean count < 11.2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
How many genes are up- and down-regulated when you apply an FDR of 0.1% ?
Plotting results
MA plots using the plotMA command
MA plots show the estimated log fold changes (Y-axis) over average expression level (X-axis) with red points indicating genes with low adjusted p-values (DE genes).
# MA-plot # legend: Points will be colored red if the adjusted p value is less than 0.1. # Points which fall out of the window are plotted as open triangles pointing either up or down. plotMA(res, main="MAplot after DESeq2 analysis", ylim=c(-2,2))
MA plots using ggplot2
It is possible to create the MA-plot in ggplot2, this gives you freedom to define the plot exactly as you want it. However, ggplot2 can't handle DESeqResults objects, it uses a simple data frame as an input.
#ggplot2 can't handle DESeqResults objects ggplot(res, aes(x=baseMean,y=log2FoldChange)) + geom_point()
This gives the following error:
Error: ggplot2 doesn't know how to deal with data of class DESeqResults
To resolve this we need to transform the DESeqResults object into a data frame using the as.data.frame() function.
results <- as.data.frame(res) head(results)
Create a default scatter plot of the data using ggplot()
This gives a plot that looks nothing like the plot generated by plotMA.
First we'll fix the X-axis. To make it alike the plotMA() plot, we have to use a log10 scale on the X-axis and set its range to 0,1 to 300000.
#default scatter plot of the data ggplot(results, aes(x=baseMean,y=log2FoldChange)) + geom_point() + scale_x_continuous(trans="log10",limits=c(0.1,300000))
This looks a lot better
Now colour the data points with low adjusted p-values (< 0.1) red
ggplot(results, aes(x=baseMean,y=log2FoldChange)) + geom_point(aes(colour=padj<0.1)) + scale_colour_manual(name='padj < 0.1',values=setNames(c('red','black'),c(T,F))) + scale_x_continuous(trans="log10",limits=c(0.1,300000))
You see that you have lost some points especially in the region with small X-values. These are points with missing adjusted p-values. plotMA keeps these points in the plot coloring them black. The easiest way to fix this if to replace the missing p-values by 0.99
Create a new variable resFixed, a copy of results. In resFixed replace missing values by 0.99 and make the plot again
Now the plot looks much better
Downsize the dots to size 0.5
ggplot(resFixed, aes(x=baseMean,y=log2FoldChange)) + geom_point(aes(colour=padj<0.1),size=0.5) + scale_colour_manual(name='padj < 0.1',values=setNames(c('red','black'),c(T,F))) + scale_x_continuous(trans="log10",limits=c(0.1,300000))
It's starting to become a very nice plot
The MA plot contains a red line that fits the data points and is obtained via non-linear regression using Loess smoothing. Loess divides the data set in the local subsets of points with similar x-values (i.e. close to each other on the plot). It fits a simple linear model to these subsets overall generating a non-linear curve consisting of small linear parts.
Add a red smoother line to the plot
ggplot(resFixed, aes(x=baseMean,y=log2FoldChange)) + geom_point(aes(colour=padj<0.1),size=0.5) + scale_colour_manual(name='padj < 0.1',values=setNames(c('red','black'),c(T,F))) + scale_x_continuous(trans="log10",limits=c(0.1,300000)) + geom_smooth(colour="red")
The MA plot contains a blue horizontal line which serves as a reference to the red smoother line. In theory the red curve and the blue line should coincide. Sine you expect that there are equal amounts of genes with high and genes with low log fold changes, the red curve (representing the average gene) should coincide with log fold change = 0 (the blue horizontal line Y = 0).
Add a blue horizonzal line to the plot
ggplot(resFixed, aes(x=baseMean,y=log2FoldChange)) + geom_point(aes(colour=padj<0.1),size=0.5) + scale_colour_manual(name='padj < 0.1',values=setNames(c('red','black'),c(T,F))) + scale_x_continuous(trans="log10",limits = c(0.1,300000)) + geom_smooth(colour="red") + geom_abline(slope=0,intercept=0,colour="blue")
Plot counts using the plotCounts() function
#retrieve the name of the best scoring gene (with the lowest padj) #gene names are used as rownames in the results table onegene <- rownames(res)[which.min(res$padj)] #plot the counts for this gene plotCounts(dds,gene=which.min(res$padj),intgroup="treatment",main=onegene)
You can also plot the counts of a gene by using its gene name
plotCounts(dds, gene="ENSG00000000003", intgroup="treatment", main="ENSG00000000003")
Plot the counts of gene ENSG00000154734
You can also plot counts for each cell line instead of comparing between treatments
plotCounts(dds, gene=which.min(res$padj), intgroup="cells", main=onegene)
Plot the counts for each cell line of gene ENSG00000154734
Plot counts using ggplot2
An alternative way to transform a complex data type like a DESeqResults object (like res) or a DESeqDataSet (like dds) into a data frame that ggplot2 can interpret is to store the results of the plot functions into a variable
d <- plotCounts(dds,gene=which.min(res$padj),intgroup="treatment",main=onegene,returnData=TRUE) class(d)
As you can see d is a data frame.
Plot the counts of d in a default scatter plot.
Use a log10 transformed Y-scale.
Set tick marks on the Y-axis at 25 (start plot), 100 and 400
ggplot(d,aes(x=treatment,y=count)) + geom_point() + scale_y_continuous(trans="log10",breaks=c(25,100,400))
Add the name of the gene as the title to the plot
ggplot(d, aes(x=treatment, y=count)) + geom_point() + scale_y_continuous(trans="log10",breaks=c(25,100,400)) + labs(title=onegene)
If you want to colour the dots according to the cell line they comre from, you need to store this information into the d data frame. When we created d, we only stored info about the treatment by defining intgroup="treatment".
Create d but this time include info about treatment and cell line ? |
---|
d <- plotCounts(dds,gene=which.min(res$padj),intgroup=c("treatment","cells"),main=onegene,returnData=TRUE) |
Plot the counts of d in a scatter plot with log transformed Y axis, tick marks and title. Colour the dots according to cell line.
Plot distribution of log fold changes using the hist() function
We will plot the distribution of the log2 fold changes of all genes that are considered DE according to the following threshold: padj < 0.05
#selection of these genes was already done: they were stored in resSig #hist() plots a frequency distribution #xlim sets the range of the X-axis #breaks specifies number of bins in the histogram hist(resSig$log2FoldChange,xlim=c(-6,6),breaks=20,main="LFC distribution for genes with padj<0.05")
As you see in the plot, many DE genes have rather low log fold changes. Typically, scientists only report DE genes with minimally 2-fold, 3-fold or 4-fold changes.
Select genes from resFixed with padj < 0.01 and at least two-fold DE, and store in a variable calles resString. |
---|
resString <- resFixed[abs(resFixed$log2FoldChange)>=1 & (resFixed$padj < 0.01),] head(resString) |
Make a histogram of log fold changes of these genes.
Plot distribution of log fold changes using ggplot2
Create a default histogram of the log fold changes of resString
#default histogram ggplot(resString,aes(x=log2FoldChange)) + geom_histogram()
Make a histogram with 20 bins
#default histogram ggplot(resString,aes(x=log2FoldChange)) + geom_histogram(bins=20)
Set the range X-axis from -6 to 6
Volcano plot using the plot() function
Next we create a Volcano plot for the DE genes. The Volcano plot arranges the DE genes along dimensions of biological and statistical significance. The X-axis represents the log fold change between the two groups, and the Y-axis represents the adjusted p-value (on an inverse scale so smaller p-values appear higher up). The X-axis indicates biological relevance; the Y-axis indicates the statistical evidence, or reliability of the change in expression. The researcher can make judgements about the most promising candidates for follow-up studies, by trading off both these criteria by eye.
# Volcano plot for the subset with padj < 0.05 plot(resSig$log2FoldChange, 1-resSig$padj,xlim=c(-6,6),main="volcano plot for genes with padj<0.05")
Make the same plot but now for genes with 2-fold changes or higher and padj < 0.01
Volcano plot using ggplot2
Create a default scatter plot of the resString data
Use hollow points
# Volcano plot for the subset with padj < 0.05 ggplot(resString,aes(x=log2FoldChange,y=padj)) + geom_point(shape=1)
Invert the Y-axis
# Volcano plot for the subset with padj < 0.05 ggplot(resString,aes(x=log2FoldChange,y=padj)) + geom_point(shape=1) + scale_y_reverse()
Heat map using the heatmap() function
We will create a heat map showing the normalized counts of the 20 most significant genes (genes with the lowest padj).
#specify a color palette hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) #select normalized counts for 20 most significant genes #view 20 most significant genes resOrdered[1:20,] #view gene names of 20 most significant genes rownames(resOrdered[1:20,]) #select rows from dds of these genes select <- dds[rownames(resOrdered[1:20,]),] #show heat map of normalized counts for these genes heatmap.2(counts(select,normalized=TRUE), col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10,6))
Make the same plot for the 10 most significant genes.
Heat map using ggplot2
Retrieve the counts of the 10 selected genes ? |
---|
data <- counts(select,normalized=T) |
We need to reformat the data to make a tile plot. Now the data is in the following format:
data
SRR1039508 SRR1039509 SRR1039512 SRR1039513 ENSG00000165995 76.06077 861.2625 114.07128 1275.971 ENSG00000120129 637.13244 5499.3451 1004.50828 6608.344 ENSG00000152583 57.29253 2267.6560 80.02015 1724.045 ENSG00000101347 1700.00763 20487.0991 1913.67340 31566.934
This is called wide format. However, ggplot expects data in long format like this:
1 ENSG00000165995 SRR1039508 76.06077 2 ENSG00000120129 SRR1039508 637.13244 3 ENSG00000152583 SRR1039508 57.29253 4 ENSG00000101347 SRR1039508 1700.00763
To transform data from wide to long format we can use the melt() function of the reshape2 package:
datamelted <- melt(data) datamelted
This indeed gets the data in the right format
> datamelted Var1 Var2 value 1 ENSG00000165995 SRR1039508 76.06077 2 ENSG00000120129 SRR1039508 637.13244
Now make the default tile plot with ggplot2
ggplot(datamelted,aes(x=Var2,y=Var1)) + geom_tile(aes(fill=value))
To change the colors of the color gradient, you can define a color for low values and a color for high values:
ggplot(datamelted,aes(x=Var2,y=Var1)) + geom_tile(aes(fill=value)) + scale_fill_gradient(low="white",high="steel blue")
Color the border of the tiles white
The X-axis is still pretty messy with overlapping sample names. To print the sample names vertically use a theme element:
ggplot(datamelted,aes(x=Var2,y=Var1)) + geom_tile(aes(fill=value),colour="white") + scale_fill_gradient(low="white",high="steel blue") + theme(axis.text.x=element_text(angle=90))
Exporting results to HTML or CSV files
We use write.csv() to write a data frame to a csv (comma separated values) file. The file we want to save is the ordered results: resOrdered. We have to check first if this is a data frame using the class() function.
class(resOrdered) write.csv(as.data.frame(resOrdered), file="Dex_vs_untreated_results.csv")
Since it's not a data frame we transform it into a data frame using as.data.frame(). Using the file argument we provide a name for the csv file. The write.csv() will now write the resOrdered data frame to a file called Dex_vs_untreated_results.csv in your working directory. If you want to save it elsewhere you have to provide a path on your computer instead of a file name.
Data Transformation and Visualization
Data transformation can be applied in different ways to compensate for the dispersion at low and high expression ranges. These methods were developed after the first wave of RNASeq analysis to try compensate for effects associated with this type of data. Remember that weakly expressed genes show much higher log fold changes than strongly expressed genes. This phenomenon, seen in most RNASeq datasets, is a direct consequence of dealing with count data, in which ratios are inherently noisier when counts are low.
Therefore, it is useful to transform data to diminish this effect. If data are used on the original count scale, the result will be dominated by highly expressed, highly variable genes; if logarithm-transformed data are used, undue weight will be given to weakly expressed genes, which show exaggerated LFCs, as discussed above. Therefore, DESeq2 implements a regularized logarithm transformation (rlog), which behaves similarly to a log2 transformation for genes with high counts, while shrinking the values for genes with low counts.
Count data transformation
We present here two log fold change shrinkage methods used in DESq2:
- Regularized log transformation
- Variance stabilizing transformation
Regularized log transformation
# Regularized log transformation rld <- rlog(dds) rld
dim: 57773 8
exptData(0):
assays(1): ''
rownames(57773): ENSG00000000003 ENSG00000000005 ... ENSG00000273492 ENSG00000273493
rowData metadata column names(47): baseMean baseVar ... dispFit rlogIntercept
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(3): cells treatment sizeFactor
Variance stabilizing transformation
# Variance stabilizing transformation vsd <- varianceStabilizingTransformation(dds) vsd
dim: 57773 8
exptData(0):
assays(1): ''
rownames(57773): ENSG00000000003 ENSG00000000005 ... ENSG00000273492 ENSG00000273493
rowData metadata column names(46): baseMean baseVar ... dispGeneEst dispFit
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(3): cells treatment sizeFactor
# create matrix for saving or further use rlogMat <- assay(rld) head(rlogMat)
ENSG00000000003 9.385815 9.111922 9.485295 9.293740 9.745669 9.502458 9.584332
ENSG00000000005 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ENSG00000000419 8.790046 9.016464 8.887312 8.928876 8.843568 8.958822 8.761465
ENSG00000000457 7.979714 7.978258 7.915768 7.997564 7.953955 7.986459 8.036356
ENSG00000000460 5.702651 5.735228 5.226608 5.562887 5.683232 5.454195 5.909199
ENSG00000000938 -2.199605 -2.199305 -2.164748 -2.198557 -2.164706 -2.200320 -2.199371
SRR1039521
ENSG00000000003 9.285416
ENSG00000000005 0.000000
ENSG00000000419 8.921048
ENSG00000000457 7.958014
ENSG00000000460 5.743980
ENSG00000000938 -2.199459
# create matrix for saving or further use vstMat <- assay(vsd) head(vstMat)
ENSG00000000003 9.434437 9.016021 9.582868 9.294533 9.964458 9.608177 9.729533
ENSG00000000005 4.063874 4.063874 4.063874 4.063874 4.063874 4.063874 4.063874
ENSG00000000419 8.837919 9.175165 8.983583 9.045680 8.918437 9.089329 8.794405
ENSG00000000457 8.156567 8.154439 8.062614 8.183129 8.118785 8.166373 8.239793
ENSG00000000460 6.417750 6.465505 5.723628 6.213135 6.389237 6.066048 6.710763
ENSG00000000938 4.063874 4.063874 4.388676 4.063874 4.388913 4.063874 4.063874
SRR1039521
ENSG00000000003 9.282765
ENSG00000000005 4.063874
ENSG00000000419 9.033742
ENSG00000000457 8.124620
ENSG00000000460 6.477112
ENSG00000000938 4.063874
Data quality assessment by sample clustering and visualization
Clustering is often used to show sample segregation and identify co-expressed genes across groups. These analyses are done on rlog transformed data as recommended by the developers. Example code below can be adapted to fit your needs and create publication-ready figures with some efforts and patience.
Heatmap of the sample-to-sample distances
# we need to transpose the data first with t() distsRL <- dist(t(assay(rld))) # store in to matrix mat <- as.matrix(distsRL) # rename samples rownames(mat) <- colnames(mat) <- with(colData(dds), paste(treatment, cells, sep=" : ")) # cluster pairwise hc <- hclust(distsRL) # plot heatmap heatmap.2(mat, Rowv=as.dendrogram(hc), symm=TRUE, trace="none", col = rev(hmcol), margin=c(13, 13) )
Principal component plot of the samples
# simple plot without grouping plotPCA(rld, intgroup=c("treatment", "cells"))
# It is also possible to customize the PCA plot using the ggplot function. data <- plotPCA(rld, intgroup=c("treatment", "cells"), returnData=TRUE) percentVar <- round(100 * attr(data, "percentVar")) ggplot(data, aes(PC1, PC2, color=treatment, shape=cells)) + geom_point(size=3) + xlab(paste0("PC1: ", percentVar[1], "% variance")) + ylab(paste0("PC2: ", percentVar[2], "% variance"))
Filtering and testing to create candidate DE gene lists
It is now time to restrict the transcriptome to the part that shows apparent differential expression under the assumptions defined by the user. Some cases require specific testing introduced below but for most cases, a simple subsetting of the full transformed table will be enough (see ?subset()).
Tests of log2 fold change above or below a threshold
In some cases, one only wants to know about Up-regulated or Down-regulated genes and therefore is interested by single side testing.
Other ways of testing by choosing the minimal effect size and direction (tails) are proposed.
- greaterAbs - |beta| > lfcThreshold - tests are two-tailed
- lessAbs - |beta| < lfcThreshold - p values are the maximum of the upper and lower tests
- greater - beta > lfcThreshold
- less - beta < lfcThreshold
ddsNoPrior <- DESeq(dds, betaPrior=FALSE) # compute four tests resGA <- results(dds, lfcThreshold=.5, altHypothesis="greaterAbs") resLA <- results(ddsNoPrior, lfcThreshold=.5, altHypothesis="lessAbs") resG <- results(dds, lfcThreshold=.5, altHypothesis="greater") resL <- results(dds, lfcThreshold=.5, altHypothesis="less") # plot MA for each test par(mfrow=c(2,2),mar=c(2,2,1,1)) yl <- c(-2.5,2.5) plotMA(resGA, ylim=yl) abline(h=c(-.5,.5),col="dodgerblue",lwd=2) plotMA(resLA, ylim=yl) abline(h=c(-.5,.5),col="dodgerblue",lwd=2) plotMA(resG, ylim=yl) abline(h=.5,col="dodgerblue",lwd=2) plotMA(resL, ylim=yl) abline(h=-.5,col="dodgerblue",lwd=2)
A Magic Moment with RStudio
If all your code validated during the manual run, check what comes after clicking on Knit PDF (PDF)
Warning: Now you will not say that the trainer is a command-line addict anymore. RStudio IS a GREAT piece of GUI that 'almost' makes R 'easy' ... !!
Acknowledging R
This exercise was possible thanks to the DESeq2 vignette and to many packages listed below
sessionInfo()
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] gplots_2.16.0 RColorBrewer_1.1-2 vsn_3.34.0
[4] Biobase_2.26.0 ggplot2_1.0.0 DESeq2_1.6.3
[7] RcppArmadillo_0.4.650.1.1 Rcpp_0.11.5 GenomicRanges_1.18.4
[10] GenomeInfoDb_1.2.4 IRanges_2.0.1 S4Vectors_0.4.0
[13] BiocGenerics_0.12.1 stringr_0.6.2 corrgram_1.7
loaded via a namespace (and not attached):
[1] acepack_1.3-3.3 affy_1.44.0 affyio_1.34.0 annotate_1.44.0
[5] AnnotationDbi_1.28.1 base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.9
[9] BiocInstaller_1.16.2 BiocParallel_1.0.3 bitops_1.0-6 brew_1.0-6
[13] caTools_1.17.1 checkmate_1.5.1 cluster_2.0.1 codetools_0.2-11
[17] colorspace_1.2-6 DBI_0.3.1 digest_0.6.8 fail_1.2
[21] foreach_1.4.2 foreign_0.8-63 Formula_1.2-0 gclus_1.3.1
[25] gdata_2.13.3 genefilter_1.48.1 geneplotter_1.44.0 grid_3.1.3
[29] gtable_0.1.2 gtools_3.4.1 Hmisc_3.15-0 iterators_1.0.7
[33] KernSmooth_2.23-14 lattice_0.20-30 latticeExtra_0.6-26 limma_3.22.7
[37] locfit_1.5-9.1 MASS_7.3-39 munsell_0.4.2 nnet_7.3-9
[41] plyr_1.8.1 preprocessCore_1.28.0 proto_0.3-10 reshape2_1.4.1
[45] rpart_4.1-9 RSQLite_1.0.0 scales_0.2.4 sendmailR_1.2-1
[49] seriation_1.0-14 splines_3.1.3 survival_2.38-1 tools_3.1.3
[53] TSP_1.0-10 XML_3.98-1.1 xtable_1.7-4 XVector_0.6.0
[57] zlibbioc_1.12.0
download exercise files
Download exercise files here.
References:
- ↑ http://master.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
- ↑ http://master.bioconductor.org/packages/release/bioc/html/DESeq2.html
- ↑ http://www.bioconductor.org/help/workflows/rnaseqGene/
- ↑ http://bioconductor.org/packages/release/bioc/html/DESeq2.html
[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.4 | NGS_RNASeq_DE_Exercise.6 ]