How to define a table of DE genes in a microarray experiment


Let's go back to the simple example of the single comparison of 3 wild-type and 3 mutant samples.

How to adjust for multiple testing for a single comparison?
The topTable() function returns a table ranking the genes according to evidence for differential expression. The coef argument specifies the column of data.fit.eb that should be used. In our Arabidopsis example coef=2, since the second column of data.fit.eb contains the results of the comparison between mutant and control plants.

In our example of paired data, data.fit.eb contains four columns:
- the second contains the results of the comparison between patient2 and patient1
- the third contains the results of the comparison between patient3 and patient1
- the fourth contains the results of the comparison between before and after the treatment

So in this example you set coef=2 or coef=3 if you want to find genes that are DE between patients, or coef=4 if you want to find genes that are DE as a result of the treatment.

The number of genes to return is specified by the number argument. In most cases, a ranking of genes according to evidence for DE is sufficient because only a limited number of DE genes can be used for further study.

Additionally, topTable() adjusts the p-values obtained from the moderated t-test for multiple testing. The adjustment method is defined by the adjust.method argument. In this case, the adjustment is done using BH, which is Benjamini and Hochberg's method to control the false discovery rate (FDR). The meaning of the adjusted p-value is as follows: if you select all genes with an adjusted p-value below 0.05 as DE, then the expected proportion of false positives in the selected group is at most 5%. So if you select 100 DE genes at a false discovery rate of 0.05, you expect on average no more than about 5 of them to be false positives.

    options(digits=2)
    tab = topTable(data.fit.eb,coef=2,number=200,adjust.method="BH")

By default topTable() ranks the genes according to the last column (B = log odds scores), but you can specify the criterion you want to use in the sort.by argument. The choices are:
- logFC to sort by the (absolute) coefficient representing the log-fold-change
- A to sort by average expression level (over all arrays) in descending order
- T for absolute t-statistics
- P for p-values
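To make the meaning of an adjusted p-value more concrete, here is a minimal sketch in base R of what the BH adjustment does. The p-values are made up for illustration and do not come from any experiment on this page. The sketch uses p.adjust(), the base R function that the limma documentation refers to for the list of adjustment methods.

```r
# Hypothetical raw p-values for 10 genes (illustration only)
p_raw <- c(0.0001, 0.0004, 0.0019, 0.0095, 0.0201,
           0.0278, 0.0298, 0.0344, 0.0459, 0.3240)
n <- length(p_raw)

# BH by hand: scale each p-value by n / rank, then enforce monotonicity
# by taking the running minimum from the largest p-value downwards.
ord_desc <- order(p_raw, decreasing = TRUE)
bh_manual <- numeric(n)
bh_manual[ord_desc] <- pmin(1, cummin(n / (n:1) * p_raw[ord_desc]))

# The same adjustment with the built-in function
bh_builtin <- p.adjust(p_raw, method = "BH")

# Number of genes selected at a cutoff of 0.05, before and after adjustment
n_raw <- sum(p_raw < 0.05)
n_bh <- sum(bh_builtin < 0.05)

print(data.frame(p_raw, bh_manual, bh_builtin))
```

Adjusted p-values are never smaller than the raw ones, so the same cutoff selects fewer genes after adjustment.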

Many scientific papers quote non-adjusted p-values, but this is not a good idea given the massive number of comparisons you make when identifying DE genes. Adjusted p-values, reported together with the FDR you used as a cutoff, are much more informative. As yet, no conventions have been established for the false discovery rate in published work, but an FDR of 5% or less should be acceptable for journal publication of gene lists.
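The problem with non-adjusted p-values can be illustrated with a small simulation in base R. This is only a sketch under stated assumptions: the number of genes (20000) and the seed are arbitrary choices, and the p-values are drawn from a uniform distribution to mimic an experiment in which no gene is truly DE.

```r
# Simulate p-values for an experiment without any truly DE gene:
# under the null hypothesis p-values are uniformly distributed.
set.seed(42)          # arbitrary seed, for reproducibility only
n_genes <- 20000      # hypothetical number of probe sets
p_null <- runif(n_genes)

# Genes called DE at 0.05 without adjustment: all are false positives
raw_hits <- sum(p_null < 0.05)

# Genes called DE at an FDR of 0.05 after BH adjustment
bh_hits <- sum(p.adjust(p_null, method = "BH") < 0.05)

cat("raw p < 0.05:", raw_hits, "\n")
cat("BH adjusted p < 0.05:", bh_hits, "\n")
```

Without adjustment roughly 5% of the genes (around 1000 here) pass the cutoff by chance alone, while the BH adjustment removes most or all of them.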

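As you can see, sort.by has no option for the adjusted p-value, although sorting by P gives the same ranking (up to ties) because the BH adjustment preserves the order of the p-values. In most cases you also want to select the genes with an adjusted p-value below a certain cutoff. topTable() accepts p.value and lfc arguments for this purpose, but creating the subset yourself makes the selection explicit.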

How to select a set of genes with adjusted p-values below a threshold?
You have to create a subset of the table generated by topTable() that contains the rows with an adjusted p-value (the column of tab called adj.P.Val) below a threshold (in this example the threshold is set at 0.001):

    topgenes = tab[tab[, "adj.P.Val"] < 0.001, ]
    dim(topgenes)

The dim() function will tell you how many genes fulfill this criterion. Note that tab holds only the 200 top-ranked genes requested from topTable(), so the subset can never contain more than 200 genes; set number=Inf in topTable() if you want to consider all genes. If you want to distinguish between up- and downregulated genes and you want to include a log fold change threshold, you need to create further subsets of the remaining table:

    topups = topgenes[topgenes[, "logFC"] > 1, ]
    dim(topups)
    topdowns = topgenes[topgenes[, "logFC"] < -1, ]
    dim(topdowns)

A large number of the genes that have an adjusted p-value below the threshold (adj. p-value < 0.001) have fold changes below two-fold, which corresponds to an absolute logFC below 1 on the log2 scale. Although the changes of these genes are statistically significant (since the adjusted p-value is so low), most people do not select genes with changes in gene expression below two-fold.
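The selection steps can be tried out without a real data set. The sketch below uses a small made-up table (tab_demo, with hypothetical probe IDs and values) that only mimics the logFC and adj.P.Val columns of a topTable() result; it is not output from limma.

```r
# Made-up table with the two topTable() columns used for the selection
tab_demo <- data.frame(
  logFC     = c(2.3, -1.8, 0.4, 1.2, -0.6, 0.9),
  adj.P.Val = c(1e-5, 2e-4, 5e-4, 8e-4, 3e-3, 0.2),
  row.names = c("probe_A", "probe_B", "probe_C",
                "probe_D", "probe_E", "probe_F")
)

# Step 1: keep genes with an adjusted p-value below the threshold
top_demo <- tab_demo[tab_demo[, "adj.P.Val"] < 0.001, ]

# Step 2: split by direction, requiring at least a two-fold change
# (a logFC of 1 on the log2 scale equals a fold change of 2^1 = 2)
ups_demo   <- top_demo[top_demo[, "logFC"] > 1, ]
downs_demo <- top_demo[top_demo[, "logFC"] < -1, ]

# Significant genes that change less than two-fold
small_demo <- top_demo[abs(top_demo[, "logFC"]) <= 1, ]

dim(top_demo)
dim(ups_demo)
dim(downs_demo)
```

In this toy table probe_C passes the adjusted p-value threshold but changes less than two-fold, so it ends up in neither ups_demo nor downs_demo.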

How to check if a specific gene is DE according to the criteria that we have set?
The tab table that is generated by the topTable() function, and all the tables that are derived from it like topups and topdowns, contain the probe set IDs of the selected genes as row names. So if you want to check if a gene is DE, you can ask for the row in topups or topdowns that contains the probe set ID of that gene as a row name, e.g. to check if 256852_at is upregulated:

    topups["256852_at",]

If the gene is included in the list of upregulated genes, the command returns its row of statistics. If it is not, R returns a row filled with NA values instead of an error.
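A more explicit way to do this check is to test whether the probe set ID occurs among the row names. The sketch below uses a made-up table (ups_demo) with hypothetical probe IDs, and is_in_table() is a helper introduced here for illustration, not a limma function.

```r
# Made-up table of upregulated genes with probe IDs as row names
ups_demo <- data.frame(
  logFC     = c(2.3, 1.2),
  adj.P.Val = c(1e-5, 8e-4),
  row.names = c("probe_A", "probe_D")
)

# Hypothetical helper: TRUE if the ID is a row name of the table.
# %in% requires an exact match of the full ID.
is_in_table <- function(id, tbl) {
  id %in% rownames(tbl)
}

is_in_table("probe_A", ups_demo)   # gene present in the table
is_in_table("probe_X", ups_demo)   # gene absent from the table

# Row indexing with an absent ID gives a row of NA values, not an error
missing_row <- ups_demo["probe_X", ]
all(is.na(missing_row))
```

Testing with %in% has the advantage that it returns a plain TRUE or FALSE. Row indexing of a data frame by name also allows partial matches of row names, so an incomplete ID may return a row.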

There is an alternative function for finding DE genes, decideTests(), that allows you to specify a false discovery rate and a threshold for the log fold change in a single command.
It is especially interesting when you do multiple comparisons, as in the example of the 3 groups of mice samples. If you use topTable() for this, you have to perform the adjustment three times, once for each comparison, each time changing the value of the coef argument.

How to adjust for multiple testing for multiple comparisons?
In the example of the 3 groups of mice samples, the eBayes() function returned an object data.fit.eb containing the results of the moderated ANOVA and the subsequent pairwise comparisons. Its component called p.value contains the p-values of the pairwise comparisons. The decideTests() function will perform multiple testing adjustment on these p-values. Additionally, it will evaluate for each gene whether the results in data.fit.eb fulfill the criteria for differential expression that you specify.

The adjust.method argument specifies which method is used to adjust the p-values for multiple testing. The value BH means that Benjamini-Hochberg correction will be used. The p.value argument specifies the FDR cutoff and the lfc argument specifies the minimum absolute log2 fold change that is required for a gene to be considered DE. The method argument specifies how the adjustment is applied across contrasts:
- separate (default) looks at all the contrasts individually. It is equivalent to using topTable() separately for each contrast, and will give the same lists of DE genes if the value of the adjust.method argument is the same. It does multiple testing adjustment within each individual contrast but not between contrasts, although you should adjust in both directions if you perform multiple comparisons. As a result, the raw p-value cutoff can be very different for different contrasts. Only use this method if you have a few contrasts and you want to use the simplest method. It will generate more false positives than the other two methods.
- global considers all contrasts independent. It treats the entire matrix of t-statistics as a single vector of independent tests. It is the simplest and most obvious choice if you want to do multiple testing adjustment in both directions simultaneously. The p-value cutoff will be consistent across all contrasts.
- nestedF adjusts the p-values associated with the F-statistic. This returns a list of genes that are DE, but you don't know for which contrast(s) that is true. The t-statistics associated with these genes are therefore inspected and the largest one (in absolute value) is considered significant. There may be other contrasts that are significant as well, so the largest t-statistic is set to the same absolute value as the second largest t-statistic, and the F-statistic is calculated again. If the F-statistic is still significant, the contrast with the second largest t-statistic is also considered significant. This procedure is continued until the F-statistic is no longer significant. This method gives good results when you want to focus on genes that are DE in multiple comparisons, but it is the least powerful for picking up genes that are DE in only one contrast.

    DEresults = decideTests(data.fit.eb,method='global',adjust.method="BH",p.value=0.05,lfc=1)
    DEresults[1:10,]

For each gene and each comparison it will generate one of the following values:
  -1: significantly downregulated
  0: no significant evidence of differential expression
  1: significantly upregulated
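The difference between separate and global adjustment can be sketched in base R. This is an illustration of the idea only, not limma's implementation: the p-values and log fold changes are made up, and call_de() is a hypothetical helper that mimics the -1/0/1 coding described above.

```r
# Made-up raw p-values and log2 fold changes: 5 genes, 2 contrasts
genes <- paste0("gene", 1:5)
p_mat <- cbind(contrast1 = c(0.011, 0.3, 0.5, 0.7, 0.9),
               contrast2 = c(0.0001, 0.0002, 0.0003, 0.012, 0.6))
lfc_mat <- cbind(contrast1 = c(1.5, 0.2, -0.1, 0.3, 0.0),
                 contrast2 = c(-2.0, 1.8, 0.4, -1.2, 0.1))
rownames(p_mat) <- genes
rownames(lfc_mat) <- genes

# "separate": adjust the p-values of each contrast (column) on its own
adj_separate <- apply(p_mat, 2, p.adjust, method = "BH")

# "global": adjust all p-values of the matrix as one long vector
adj_global <- p_mat
adj_global[] <- p.adjust(p_mat, method = "BH")

# Hypothetical helper: -1 / 0 / 1 calls from adjusted p-values and logFC
call_de <- function(adj, lfc, fdr = 0.05, min_lfc = 1) {
  sign(lfc) * (adj < fdr & abs(lfc) > min_lfc)
}

calls_separate <- call_de(adj_separate, lfc_mat)
calls_global <- call_de(adj_global, lfc_mat)

print(calls_separate)
print(calls_global)
```

In this toy example gene1 has a raw p-value of 0.011 in contrast1 and gene4 has 0.012 in contrast2. With separate adjustment only gene4 is called DE, because contrast2 contains many small p-values and contrast1 does not. With global adjustment both are called DE, since the raw p-value cutoff is the same for both contrasts.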
