Tutorial: Testing for differential expression I

From BITS wiki
Jump to: navigation, search
Go to parent Introduction to R/Bioconductor for analysis of microarray data#Tutorials

We start by comparing two groups of samples.

Some background

The classical statistical tool for comparing the means of two groups of samples is the t-test. In theory, all we need to do is calculate a t-test for each gene (often represented by a combination of probes, the probeset) in our expression data set and pick only the ones that are significant. This, however, raises two important questions:

  1. What is significant? If we use the usual 5% significance level and reject the null hypothesis of equal means if the p-value is below 0.05, we will drown in false positive findings.
  2. Can we combine some information across genes, or do we have to consider all genes/probesets in isolation?

Limma (linear models for microarray analysis) is a powerful Bioconductor package that can easily generate lists of differentially expressed genes with p-values properly adjusted for multiple testing while borrowing strength by pooling variance estimates across similar genes. This works nicely for simple comparisons of two groups, but also for more complicated designs, as we will see below and tomorrow.

Multiple testing adjustments

Need some additional introduction

The two best known ones are

  • Bonferroni-Holm: very strict - controls the familywise error probability, i.e. the probability of making one mistake among all the many tests required for the probesets on a chip.
  • Benjamini-Hochberg or False discovery rate (FDR): controls the expected proportion of false positives among the list of probesets, so e.g. a cutoff of 5% for the FDR implies that we would expect to find on average 5% false positives among the probesets that we select with this criterion. This is almost always more interesting than the classical p-value adjustment.

Pooling information across genes

This is the idea of so-called moderated t-statistics, which compare mean differences between groups for one probeset with a standard error that is based on a weighted mean of the variance of the probeset itself and the variances of similar probesets. This protects to some degree against probesets with artificially small variance (e.g. due to saturation) appearing hugely significant. For a nice and short summary (as well as a reasonable discussion of p-value adjustments) see PubMed ID 16369572.

Example: Estrogen

For our small example data set, we want to identify genes differentially expressed between samples that received estrogen stimulation and those that did not (ignoring for now the time factor).

Specifying the comparison

Before we can test for differential expression, we have to specify which groups of samples we want to compare. We could just point at a useful grouping variable among the phenotypic data, but Limma requires a slightly more precise formulation that will come in handy for more complicated designs. Specifically, we need to define a model matrix. This is easily done using the standard R function model.matrix:

> design = model.matrix(~estrogen, data=pData(estroRMA))

The tilde notation is very common with modelling functions in R. The idea is that whatwever is left of the tilde depends in some kind of manner on what is on the right. In our case, we say that an unspecified left hand side (which will eventually be every single probeset in our data) is a function of some variable estrogen, and the argument data tells us that this variable is to be taken from the phenotypic information accompanying the estrogen expression data - just what we wanted from the beginning.

Fitting the model

With this out of the way, we can load the package and fit the linear models, one for each probeset in the expression set. Note that the fitting process is split into two function calls, lmFit and eBayes, which are both necessary.

> library(limma)  
> fit  = lmFit(estroRMA, design)  
> efit = eBayes(fit)   

The object efit that we have generated is a list containing all the information from the two fitting steps:

> names(efit)
 [1] "coefficients"     "rank"             "assign"           "qr"              
 [5] "df.residual"      "sigma"            "cov.coefficients" "stdev.unscaled"  
 [9] "pivot"            "genes"            "Amean"            "method"          
[13] "design"           "df.prior"         "s2.prior"         "var.prior"       
[17] "proportion"       "s2.post"          "t"                "p.value"         
[21] "lods"             "F"                "F.p.value"<

We see some interesting entries, like t and p.value, but we cannot easily interpret them like this.

Displaying the results

Thank goodness we don't have to, because that is the job of the utility function topTable:

> topTable(efit, coef=2)
             ID     logFC   AveExpr         t      P.Value    adj.P.Val        B
11508  41400_at  2.312275 10.041553  19.95216 9.200440e-09 0.0001161556 9.823671
921   1824_s_at  1.778002  9.238870  17.31254 3.205819e-08 0.0002023673 8.959536
9848   39755_at  1.722290 12.131839  15.89884 6.752797e-08 0.0002797447 8.402636
6193   36134_at  2.466614  8.275734  15.41038 8.863200e-08 0.0002797447 8.192108
7110   37043_at -1.496948 11.529434 -13.98447 2.058008e-07 0.0004936556 7.516791
6899   36833_at  1.581543  8.709900  13.77401 2.346086e-07 0.0004936556 7.408756
952     1854_at  2.924256  8.532099  13.44452 2.890951e-07 0.0005214037 7.234940
3282   33252_at  1.559903  8.000356  12.86931 4.210700e-07 0.0005570207 6.917134
1813   31798_at  3.198764 12.115778  12.75333 4.550775e-07 0.0005570207 6.850743
10524  40425_at -1.376234  9.691118 -12.72879 4.626569e-07 0.0005570207 6.836591

By default, this function lists the ten top-regulated genes, showing from left to right

  • a unique identifier,
  • the log2-fold change associated with the comparison we are interested in,
  • the average intensity of the probeset/gene across all chips,
  • the (moderated) t-statistic for the hypothesis that the log2-fold change is zero (or equivalently, that the fold change is one),
  • the associated raw and adjusted p-values for the t-statistic,
  • finally an estimated log-odds ratio for DE (which most people, myself included, ignore).

By default, topTable displays only the top 10 genes, sorted by the log-odds ratio B (in practice, this means generally sorted by increasing p-value). However, we can specify any number of genes, and

topTable(efit, coef=2, number=Inf)

will list all genes/probeset in the data set. We can limit the display by (adjusted) p.value and (absolute) log2-fold change, using the arguments p.value and lfc. E.g. if we want the among the top-20 genes only those with at least a log2-fold change of 2 (i.e. a fold change of 4), we use

topTable(efit, coef=2, number=20, lfc=2)

We can also sort the table by different criteria, see ?topTable

Model matrix revisited

What is the meaning of the argument coef=2 that we used for topTable above? This is again due to the way that the comparison/model is specified. Let's look at the model matrix:

> design
             (Intercept) estrogenpresent
high10-1.cel           1               1
high10-2.cel           1               1
high48-1.cel           1               1
high48-2.cel           1               1
low10-1.cel            1               0
low10-2.cel            1               0
low48-1.cel            1               0
low48-2.cel            1               0

The model as two parameters or coefficients. The Intercept is a constant that is one throughout, estrogenpresent is a 0/1 marker for estrogen stimulation. Together, they describe the mean expression level in each group: the first coefficient, the Intercept, estimates the mean in the group without estrogen stimulation (the reference group); the sum of the first and the second coefficient describes the mean expression level in the estrogen stimulated group. Ergo, the second coefficient describes the difference between first and the second group of samples. In order to find DE probesets, we simply test whether the second coefficient is zero. Hence the coef=2 argument to topTable above.

Note that if the second coefficient is smaller than zero, the mean for the estrogen present-group is smaller; it indicates down-regulation compared to the estrogen absen group. If the second coefficient is bigger than zero, than the probeset is by the same logic upregulated in the estrogen present-group compared to the reference group.

Plotting regulated probesets

There are two useful ways of putting the regulated probesets into context relative to the unregulated probesets in a comprehensive graphical manner: the MA-plot and the volcano plot.

MA plots of DE genes

MA-plots are versatile graphical tools with multiple applications in gene expression analysis and beyond. The general idea is that to plot some kind of difference ("minus" or M) against some kind of average (A) to see whether the differences are constant or vary systematically. Here, we plot for each probeset the effect size (the log2 fold change, which is just the mean difference between groups on the log scale) against the average intensity of the probesets. Not coincidentally, both these quantities are returned by topTable. We start therefore with storing the full table as an object:

> estroAll = topTable(efit, coef=2, number=Inf)
> nrow(estroAll)
[1] 12625

This we can plot, using again a formula notation:

plot(logFC ~ AveExpr, data=estroAll, pch=".")

i.e. plot the quantity logFC as a function of AveExpr (both taken from estroAll) and use as plotting character pch a small dot (to reduce crowding). This shows a fairly symmetrical scatter around zero, with a number of probesets with larger or smaller log fold changes.

We now add the DE genes to the plot, to see where they fall. We decide for now that all genes with an adjusted p-value of less than 0.05 are interesting candidates for further research. We get the list of these genes, too:

> estroTop = topTable(efit, coef=2, num=Inf, p.value=0.05)
> nrow(estroTop)
[1] 451

We now highlight these probesets by adding a bigger plotting symbol in green on top of the original small dot:

> points(logFC ~ AveExpr, data=estroTop, pch=19, col="green")

We find that the DE probesets are quite equally distributed along the intensity axis, with the majority falling between 5 and 12. This is quite acceptable. Critical are situations where a large part of the DE probesets fall at the extreme end of the spectrum (very bright or very dark probesets), or where the balance between up- and down-regulated probes shifts with probeset intensity along the horizontal axis (e.g. low-intensity DE probesets more likely upregulated and high-intensity DE probesets more likely to be downregulated, or vice versa).

(Note also that while none of the DE probesets has really tiny log fold changes, there are still a number of probesets with large log fold changes that do not fulfill the criteria for our candidate list. This is because the t-statistic and corresponding p-value reflect both effect size/log fold change and variability across chips.)

Volcano plot

Here, we plot -log10 of the (adjusted) p-values on the vertical axis against the log2 fold change on the horizontal axis. Again, we plot first all probesets and add then the DE candidates in a more striking color and size:

> plot(-log10(adj.P.Val) ~ logFC, data=estroAll, pch=".")
> points(-log10(adj.P.Val) ~ logFC, data=estroTop, pch=19, col="red")

This produces a typical picture of an "exploding volcano", hence the name. Note that the -log10 here is just a visual device: the log10 stretches out the space for the smallest p-values so that the don't crowd together on top of each other, and the minus just puts the most significant probesets at the top of the plot.

As the MA-plot, the volcano plot indicates no big problems. What would be highly suspicious here would be a central column of probesets with very large significance, but very small fold changes, which are typically artifacts without biological relevance.

(Where would these have shown up in the MA-plot?)