How to define a table of DE genes in a microarray experiment
Go to parent Analyze your own microarray data in R/Bioconductor
Let's go back to the simple example of the single comparison of 3 wild-type and 3 mutant samples.
How to adjust for multiple testing for a single comparison ? |
---|
The topTable() method returns a table ranking the genes according to evidence for differential expression. The coef parameter specifies the column of data.fit.eb that should be used. In our example in Arabidopsis coef=2 since the second column of data.fit.eb contains the results of the comparison between mutant and control plants.
So in this example you set coef=2 or coef=3 if you want to find genes that are DE between patients or coef=4 if you want to find genes that are DE as a result of the treatment. The number of genes that is to be returned is specified by the number argument. In most cases, a ranking of genes according to evidence for DE is sufficient because only a limited number of DE genes can be used for further study. Additionally, the topTable() method will adjust the p-value obtained from the moderated t-test for multiple testing. The adjustment method is defined by the adjust.method argument. In this case, the adjustment is done using BH which is Benjamini and Hochberg's method to control the FDR. The meaning of the adjusted p-value is as follows: if you select all genes with adjusted p-value below 0.05 as DE, then the expected proportion of false positives in the selected group should be less than 5%. So if you select 100 DE genes at a false discovery rate of 0.05, only 5 of them will be false positives. options(digits=2) tab = topTable(data.fit.eb,coef=2,number=200,adjust.method=”BH”)
|
Many scientific papers quote the non-adjusted p-values, however this is not a good idea for the massive number of comparisons you make for the identification of DE genes. Adjusted p-values accompanied by the FDR you used as a cutoff is much more accurate. As of yet no conventions have been established for false discovery rate in published work. An FDR of 5% or less should be acceptable for journal publication of gene lists.
As you can see topTable() does not allow you to sort on adjusted p-value or to select the genes with an adjusted p-value below a certain cutoff and in most cases, this is exactly what you want to do.
How to select a set of genes with adjusted p-values below a threshold ? |
---|
You have create a subset of the table generated by topTable() with adjusted p-values (last column of tab called adj.P.Val) below a threshold (in this example the threshold is set at 0.001):
topgenes = tab[tab[, "adj.P.Val"] < 0.001, ] dim(topgenes) The dim() method will tell you how many genes fulfill this criterium. topups = topgenes[topgenes[, "logFC"] > 1, ] dim(topups) topdowns = topgenes[topgenes[, "logFC"] < -1, ] dim(topdowns) |
A large number of the genes that have an adjusted p-value below the threshold (adj. p-value < 0.001) have fold changes below two-fold. Although the changes of these genes are significant (since the adjusted p-value is so low), most people do not select genes with changes in gene expression below two-fold.
How to check if a specific gene is DE according to the criteria that we have set ? |
---|
The tab table that is generated by the topTable() method and all the tables that are derived from it like topups and topdowns contain the probe set IDs of the selected genes as row names:
topgenes = tab[tab[, "adj.P.Val"] < 0.001, ] dim(topgenes) The dim() method will tell you how many genes fulfill this criterium. topups["256852_at",] If the command returns a result the gene is included in the list of upregulated genes.
|
There's an alternative method for finding DE genes that allows you to specify a false discovery rate and a threshold for the log fold change in a single command.
It is especially interesting when you do multiple comparisons, as in the example of the 3 groups of mice samples. If you use topTable() for this, you have to perform the adjustment thrice, once for each comparison, each time changing the value of the coef parameter.
How to adjust for multiple testing for multiple comparisons ? |
---|
In the example of the 3 groups of mice samples, the eBayes() method returned a data frame data.fit.eb containing the results of the moderated ANOVA and the subsequent pairwise comparisons. The slot called p.value contains the p-values of the pairwise comparisons. The decideTests() method will perform multiple testing adjustment on these p-values. Additionally, it will evaluate for each gene whether the results data.fit.eb fulfill the criteria for differential expression that you specify. The adjust.method argument specifies which method is used to adjust the p-values for multiple testing. The value BH means that Benjamini-Hochberg correction will be used.
DEresults = decideTests(data.fit.eb,method='global',adjust.method="BH",p.value=0.05,lfc=1) DEresults[1:10,] For each gene and each comparison it will generate the following output:
|