Differential expression for more than two groups

From BITS wiki
Jump to: navigation, search
Go to parent Introduction to R/Bioconductor for analysis of microarray data#Training Units

Background

Often samples come in more than two different groups. Limma offers fairly painless ways of making all relevant comparisons and to summarize the results.

Example

Experiment and data

Researchers compared gene expression between 28 mice with different diabetes status: diabetic, diabetes-resistant, and non-diabetic. The resulting data has been uploaded to the GEO repository with reference GDS10.

We directly download the probeset-level expression matrix from GEO, using the GEOquery package. We take care to download log-transformed data:

require(GEOquery)
gds  = getGEO("GDS10")
eset = GDS2eSet(gds, do.log2 = TRUE) 

The resulting expression set has much meta data included, including all of the annotation data, because this is a non-standard chip. fvarMetadata shows the annotation which is available.

eset
fvarMetadata(eset)

We can go directly to the standard Limma analysis, where the fact that we have three instead of two groups plays no role. We define the design through the model matrix as before, except that we use slightly friendlier column names:

require(limma)
design = model.matrix(~disease.state, data=pData(eset))
colnames(design) = c("diabetic", "resist-diab", "nondiab-diab")
head(design)

Here we see that the reference level for the comparison is diabetic, and we have two coefficients of interest: one describing the difference between diabetes-resistant mice and the reference group, and one describing the difference with non-diabetic mice.

We can do the standard limma analysis and get two gene lists:

fit1  = lmFit(eset, design)
efit1 = eBayes(fit1)        
topTable(efit1, coef="resist-diab", genelist=fit1$genes[,7], sort.by="p")
topTable(efit1, coef="nondiab-diab", genelist=fit1$genes[,7], sort.by="p")

Note two things: that we can refer to the coefficients by name and not just by number (easier to remember), and we specify a genelist argument. The latter is just a consequence of having a non-standard chip, so that we have to tell R where it can find the Affymetrix IDs.

We also note however that we only get information for two instead of all three possible comparisons. However, with only a little extra effort we get those, too. We start with a new design matrix:

design2 = model.matrix( ~ 0+disease.state, data=pData(eset))
colnames(design2) = c("diab", "resist", "nondiab")
head(design2)

Now we no longer have a fixed reference level, but all three groups are on the same footing, and we can specify the comparisons that we want to make. This requires a re-fitting of the model with the newly defined contrasts (comparisons):

fit2 = lmFit(eset, design2)
contrast.matrix = makeContrasts(diab-resist, diab-nondiab, resist-nondiab, levels=design2)
fit2 = contrasts.fit(fit2, contrast.matrix)

The second fitting step is the same as before:

efit2 = eBayes(fit2)

This makes no difference for the comparisons that we got already from the standard analysis before, but now we can look at the third possible comparison directly:

topTable(efit2, coef="resist - nondiab", genelist=fit2$genes[,7], sort.by="p")

It would be useful to see which genes are actually differentially expressed for each comparison. The function decideTests generates a matrix of 0s and 1s, where 0 indicates no DE for that specific probeset and comparison, and 1 means DE (by default at the 0.05 significance level, but that can be changed, see ?decideTests:

results = decideTests(efit2)

We count all possible combinations and find that only some actually exist:

  
vennCounts(results)

We also get a nice graphical representation:

vennDiagram(results)