Tutorial: Testing for differential expression I
Go to parent Introduction to R/Bioconductor for analysis of microarray data#Tutorials
Contents
We start by comparing two groups of samples.
Some background
The classical statistical tool for comparing the means of two groups of samples is the t-test. In theory, all we need to do is calculate a t-test for each gene (often represented by a combination of probes, the probeset) in our expression data set and pick only the ones that are significant. This, however, raises two important questions:
- What is significant? If we use the usual 5% significance level and reject the null hypothesis of equal means if the p-value is below 0.05, we will drown in false positive findings.
- Can we combine some information across genes, or do we have to consider all genes/probesets in isolation?
Limma
(linear models for microarray analysis) is a powerful Bioconductor package that can easily generate lists of differentially expressed genes with p-values properly adjusted for multiple testing while borrowing strength by pooling variance estimates across similar genes. This works nicely for simple comparisons of two groups, but also for more complicated designs, as we will see below and tomorrow.
Multiple testing adjustments
Need some additional introduction
The two best known ones are
- Bonferroni-Holm: very strict - controls the familywise error probability, i.e. the probability of making one mistake among all the many tests required for the probesets on a chip.
- Benjamini-Hochberg or False discovery rate (FDR): controls the expected proportion of false positives among the list of probesets, so e.g. a cutoff of 5% for the FDR implies that we would expect to find on average 5% false positives among the probesets that we select with this criterion. This is almost always more interesting than the classical p-value adjustment.
Pooling information across genes
This is the idea of so-called moderated t-statistics, which compare mean differences between groups for one probeset with a standard error that is based on a weighted mean of the variance of the probeset itself and the variances of similar probesets. This protects to some degree against probesets with artificially small variance (e.g. due to saturation) appearing hugely significant. For a nice and short summary (as well as a reasonable discussion of p-value adjustments) see PubMed ID 16369572.
Example: Estrogen
For our small example data set, we want to identify genes differentially expressed between samples that received estrogen stimulation and those that did not (ignoring for now the time factor).
Specifying the comparison
Before we can test for differential expression, we have to specify which groups of samples we want to compare. We could just point at a useful grouping variable among the phenotypic data, but Limma
requires a slightly more precise formulation that will come in handy for more complicated designs. Specifically, we need to define a model matrix. This is easily done using the standard R
function model.matrix
:
> design = model.matrix(~estrogen, data=pData(estroRMA))
The tilde notation is very common with modelling functions in R
. The idea is that whatwever is left of the tilde depends in some kind of manner on what is on the right. In our case, we say that an unspecified left hand side (which will eventually be every single probeset in our data) is a function of some variable estrogen
, and the argument data
tells us that this variable is to be taken from the phenotypic information accompanying the estrogen expression data - just what we wanted from the beginning.
Fitting the model
With this out of the way, we can load the package and fit the linear models, one for each probeset in the expression set. Note that the fitting process is split into two function calls, lmFit
and eBayes
, which are both necessary.
> library(limma) > fit = lmFit(estroRMA, design) > efit = eBayes(fit)
The object efit
that we have generated is a list containing all the information from the two fitting steps:
> names(efit) [1] "coefficients" "rank" "assign" "qr" [5] "df.residual" "sigma" "cov.coefficients" "stdev.unscaled" [9] "pivot" "genes" "Amean" "method" [13] "design" "df.prior" "s2.prior" "var.prior" [17] "proportion" "s2.post" "t" "p.value" [21] "lods" "F" "F.p.value"<
We see some interesting entries, like t
and p.value
, but we cannot easily interpret them like this.
Displaying the results
Thank goodness we don't have to, because that is the job of the utility function topTable
:
> topTable(efit, coef=2) ID logFC AveExpr t P.Value adj.P.Val B 11508 41400_at 2.312275 10.041553 19.95216 9.200440e-09 0.0001161556 9.823671 921 1824_s_at 1.778002 9.238870 17.31254 3.205819e-08 0.0002023673 8.959536 9848 39755_at 1.722290 12.131839 15.89884 6.752797e-08 0.0002797447 8.402636 6193 36134_at 2.466614 8.275734 15.41038 8.863200e-08 0.0002797447 8.192108 7110 37043_at -1.496948 11.529434 -13.98447 2.058008e-07 0.0004936556 7.516791 6899 36833_at 1.581543 8.709900 13.77401 2.346086e-07 0.0004936556 7.408756 952 1854_at 2.924256 8.532099 13.44452 2.890951e-07 0.0005214037 7.234940 3282 33252_at 1.559903 8.000356 12.86931 4.210700e-07 0.0005570207 6.917134 1813 31798_at 3.198764 12.115778 12.75333 4.550775e-07 0.0005570207 6.850743 10524 40425_at -1.376234 9.691118 -12.72879 4.626569e-07 0.0005570207 6.836591
By default, this function lists the ten top-regulated genes, showing from left to right
- a unique identifier,
- the log2-fold change associated with the comparison we are interested in,
- the average intensity of the probeset/gene across all chips,
- the (moderated) t-statistic for the hypothesis that the log2-fold change is zero (or equivalently, that the fold change is one),
- the associated raw and adjusted p-values for the t-statistic,
- finally an estimated log-odds ratio for DE (which most people, myself included, ignore).
By default, topTable
displays only the top 10 genes, sorted by the log-odds ratio B
(in practice, this means generally sorted by increasing p-value). However, we can specify any number of genes, and
topTable(efit, coef=2, number=Inf)
will list all genes/probeset in the data set. We can limit the display by (adjusted) p.value and (absolute) log2-fold change, using the arguments p.value
and lfc
. E.g. if we want the among the top-20 genes only those with at least a log2-fold change of 2 (i.e. a fold change of 4), we use
topTable(efit, coef=2, number=20, lfc=2)
We can also sort the table by different criteria, see ?topTable
Model matrix revisited
What is the meaning of the argument coef=2
that we used for topTable
above? This is again due to the way that the comparison/model is specified. Let's look at the model matrix:
> design (Intercept) estrogenpresent high10-1.cel 1 1 high10-2.cel 1 1 high48-1.cel 1 1 high48-2.cel 1 1 low10-1.cel 1 0 low10-2.cel 1 0 low48-1.cel 1 0 low48-2.cel 1 0
The model as two parameters or coefficients. The Intercept
is a constant that is one throughout, estrogenpresent
is a 0/1 marker for estrogen stimulation. Together, they describe the mean expression level in each group: the first coefficient, the Intercept
, estimates the mean in the group without estrogen stimulation (the reference group); the sum of the first and the second coefficient describes the mean expression level in the estrogen stimulated group. Ergo, the second coefficient describes the difference between first and the second group of samples. In order to find DE probesets, we simply test whether the second coefficient is zero. Hence the coef=2
argument to topTable
above.
Note that if the second coefficient is smaller than zero, the mean for the estrogen present-group is smaller; it indicates down-regulation compared to the estrogen absen group. If the second coefficient is bigger than zero, than the probeset is by the same logic upregulated in the estrogen present-group compared to the reference group.
Plotting regulated probesets
There are two useful ways of putting the regulated probesets into context relative to the unregulated probesets in a comprehensive graphical manner: the MA-plot and the volcano plot.
MA plots of DE genes
MA-plots are versatile graphical tools with multiple applications in gene expression analysis and beyond. The general idea is that to plot some kind of difference ("minus" or M) against some kind of average (A) to see whether the differences are constant or vary systematically. Here, we plot for each probeset the effect size (the log2 fold change, which is just the mean difference between groups on the log scale) against the average intensity of the probesets. Not coincidentally, both these quantities are returned by topTable
. We start therefore with storing the full table as an object:
> estroAll = topTable(efit, coef=2, number=Inf) > nrow(estroAll) [1] 12625
This we can plot, using again a formula notation:
plot(logFC ~ AveExpr, data=estroAll, pch=".")
i.e. plot the quantity logFC
as a function of AveExpr
(both taken from estroAll
) and use as plotting character pch
a small dot (to reduce crowding). This shows a fairly symmetrical scatter around zero, with a number of probesets with larger or smaller log fold changes.
We now add the DE genes to the plot, to see where they fall. We decide for now that all genes with an adjusted p-value of less than 0.05 are interesting candidates for further research. We get the list of these genes, too:
> estroTop = topTable(efit, coef=2, num=Inf, p.value=0.05) > nrow(estroTop) [1] 451
We now highlight these probesets by adding a bigger plotting symbol in green on top of the original small dot:
> points(logFC ~ AveExpr, data=estroTop, pch=19, col="green")
We find that the DE probesets are quite equally distributed along the intensity axis, with the majority falling between 5 and 12. This is quite acceptable. Critical are situations where a large part of the DE probesets fall at the extreme end of the spectrum (very bright or very dark probesets), or where the balance between up- and down-regulated probes shifts with probeset intensity along the horizontal axis (e.g. low-intensity DE probesets more likely upregulated and high-intensity DE probesets more likely to be downregulated, or vice versa).
(Note also that while none of the DE probesets has really tiny log fold changes, there are still a number of probesets with large log fold changes that do not fulfill the criteria for our candidate list. This is because the t-statistic and corresponding p-value reflect both effect size/log fold change and variability across chips.)
Volcano plot
Here, we plot -log10 of the (adjusted) p-values on the vertical axis against the log2 fold change on the horizontal axis. Again, we plot first all probesets and add then the DE candidates in a more striking color and size:
> plot(-log10(adj.P.Val) ~ logFC, data=estroAll, pch=".") > points(-log10(adj.P.Val) ~ logFC, data=estroTop, pch=19, col="red")
This produces a typical picture of an "exploding volcano", hence the name. Note that the -log10 here is just a visual device: the log10 stretches out the space for the smallest p-values so that the don't crowd together on top of each other, and the minus just puts the most significant probesets at the top of the plot.
As the MA-plot, the volcano plot indicates no big problems. What would be highly suspicious here would be a central column of probesets with very large significance, but very small fold changes, which are typically artifacts without biological relevance.
(Where would these have shown up in the MA-plot?)