# Differential expression for more than two groups

Go to parent Introduction to R/Bioconductor for analysis of microarray data#Training Units

## Contents

## Background

Often samples come in more than two different groups. **Limma** offers fairly painless ways of making all relevant comparisons and to summarize the results.

## Example

### Experiment and data

Researchers compared gene expression between 28 mice with different diabetes status: diabetic, diabetes-resistant, and non-diabetic. The resulting data has been uploaded to the GEO repository with reference GDS10.

We directly download the probeset-level expression matrix from GEO, using the `GEOquery`

package. We take care to download log-transformed data:

require(GEOquery) gds = getGEO("GDS10") eset = GDS2eSet(gds, do.log2 = TRUE)

The resulting expression set has much meta data included, including all of the annotation data, because this is a non-standard chip. `fvarMetadata`

shows the annotation which is available.

eset fvarMetadata(eset)

We can go directly to the standard Limma analysis, where the fact that we have three instead of two groups plays no role. We define the design through the model matrix as before, except that we use slightly friendlier column names:

require(limma) design = model.matrix(~disease.state, data=pData(eset)) colnames(design) = c("diabetic", "resist-diab", "nondiab-diab") head(design)

Here we see that the reference level for the comparison is `diabetic`

, and we have two coefficients of interest: one describing the difference between diabetes-resistant mice and the reference group, and one describing the difference with non-diabetic mice.

We can do the standard limma analysis and get two gene lists:

fit1 = lmFit(eset, design) efit1 = eBayes(fit1) topTable(efit1, coef="resist-diab", genelist=fit1$genes[,7], sort.by="p") topTable(efit1, coef="nondiab-diab", genelist=fit1$genes[,7], sort.by="p")

Note two things: that we can refer to the coefficients by name and not just by number (easier to remember), and we specify a `genelist`

argument. The latter is just a consequence of having a non-standard chip, so that we have to tell `R`

where it can find the Affymetrix IDs.

We also note however that we only get information for two instead of all three possible comparisons. However, with only a little extra effort we get those, too. We start with a new design matrix:

design2 = model.matrix( ~ 0+disease.state, data=pData(eset)) colnames(design2) = c("diab", "resist", "nondiab") head(design2)

Now we no longer have a fixed reference level, but all three groups are on the same footing, and we can specify the comparisons that we want to make. This requires a re-fitting of the model with the newly defined contrasts (comparisons):

fit2 = lmFit(eset, design2) contrast.matrix = makeContrasts(diab-resist, diab-nondiab, resist-nondiab, levels=design2) fit2 = contrasts.fit(fit2, contrast.matrix)

The second fitting step is the same as before:

efit2 = eBayes(fit2)

This makes no difference for the comparisons that we got already from the standard analysis before, but now we can look at the third possible comparison directly:

topTable(efit2, coef="resist - nondiab", genelist=fit2$genes[,7], sort.by="p")

It would be useful to see which genes are actually differentially expressed for each comparison. The function `decideTests`

generates a matrix of 0s and 1s, where 0 indicates no DE for that specific probeset and comparison, and 1 means DE (by default at the 0.05 significance level, but that can be changed, see `?decideTests`

:

results = decideTests(efit2)

We count all possible combinations and find that only some actually exist:

vennCounts(results)

We also get a nice graphical representation:

vennDiagram(results)