Analyze GEO data with GEO2R
Find differentially expressed (DE) genes in a selected GEO dataset using GEO2R online
Use GEO2R to compare two or more groups of Samples (within a single GEO experiment) in order to identify genes that are differentially expressed across experimental conditions. Results are presented as a table of genes ordered by significance. Full instructions are given in [1]; a tutorial video is also available.
GEO2R does not perform sophisticated statistical analysis of microarray (MA) data. Please read about the limitations of GEO2R [2].
Limitations and caveats
The GEO database is a public repository that archives thousands of original high-throughput functional genomic studies submitted by the scientific community. These studies represent a large diversity of experimental types and designs, and contain data that are processed and normalized using a wide variety of methods. GEO2R can access and analyze almost any GEO Series, regardless of data type and quality, so the user must be aware of the following limitations and caveats.
Check that Sample values are comparable: GEO2R operates on Series Matrix files which contain data extracted directly from the VALUE column of Sample tables. Submitters are asked to supply normalized data in the VALUE column, rendering the Samples cross-comparable. The majority of GEO data do conform to this rule. GEO applies no further processing other than to perform a log2 transformation on values determined not to be in log space (see Options section). However, some studies, such as dual channel loop design data, may generate values that do not have a common reference and are not directly comparable. Some studies may contain Sample value data that are not normalized, or have a design such that the Samples were never intended to be directly compared. Yet other studies do not have sufficient replicate Samples to perform a robust statistical analysis. Users should examine the original Series to understand the experimental design, and check the 'Data processing' field or VALUE description in the original Sample records for information on what the values represent. The box plot feature on the Value distribution tab is provided to help users assess whether the distributions of values across Samples are median-centered, which is generally indicative that the data are normalized and cross-comparable.
Data type restriction: GEO2R operates on data in Series Matrix files which contain data extracted directly from the VALUE column of Sample tables. Some categories of GEO Samples do not have data tables (e.g., high-throughput sequencing or genome tiling arrays) and thus cannot be analyzed using GEO2R.
Within-Series restriction: GEO2R operates on Series Matrix files. Thus, analyses are restricted to Samples that occur within one Series; it is not possible to perform cross-Series comparisons.
Failed jobs: Occasionally, a GEO2R analysis will fail because some aspect of the input data is not compatible with the GEOquery or limma packages. In such cases, native BioConductor errors are reported.
255 Sample limit: GEO2R operates on data in Series Matrix files. These files contain a maximum of 255 Samples; Series containing more than 255 Samples therefore cannot currently be examined using GEO2R.
10 minute timeout: GEO2R currently has a 10 minute cutoff imposed on job processing. If the Series you are attempting to analyze has a large number of Samples and/or genes, the analysis may not run to completion.
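The log2 transformation mentioned under "Check that Sample values are comparable" can be illustrated with a small sketch. The rule below is a hypothetical heuristic written for this tutorial, not the rule GEO2R itself applies: it assumes that log2 microarray intensities rarely exceed about 20, so a 99th percentile above an arbitrary threshold of 100 suggests raw-scale values. The function names and the threshold are illustrative assumptions.

```python
import math

def looks_log_transformed(values, threshold=100.0):
    """Guess whether expression values are already in log space.

    Hypothetical heuristic: a 99th percentile above `threshold`
    suggests the values are still on the raw intensity scale.
    """
    clean = sorted(v for v in values if not math.isnan(v))
    if not clean:
        return True  # nothing to transform
    # nearest-rank 99th percentile
    idx = max(0, math.ceil(0.99 * len(clean)) - 1)
    return clean[idx] <= threshold

def log2_if_needed(values):
    """Return values unchanged if they look logged, else log2-transform them."""
    if looks_log_transformed(values):
        return list(values)
    # non-positive or missing values cannot be logged; mark them as NaN
    return [math.log2(v) if v > 0 else float("nan") for v in values]

raw = [64.0, 128.0, 1024.0, 4096.0]      # made-up raw intensities
print(looks_log_transformed(raw))         # False
print(log2_if_needed(raw))                # [6.0, 7.0, 10.0, 12.0]
```

Whatever rule is used, it remains a guess; the 'Data processing' field of the Sample records is the authoritative description of what the values represent.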
Contents
- 1 Data used for this tutorial
- 2 Review data annotations and identify interesting variables
- 3 The GEO2R interface
- 4 GEO2R sample definition
- 5 Visualize the distribution of log-transformed expression values
- 6 Search for the top 250 differentially expressed transcripts
- 7 Saving the Rscript for further use in RStudio
Data used for this tutorial
To stay consistent with the introductory video, we use the same GSE18388 dataset [3] that the NCBI trainer selected. The dataset information can be accessed at GEO [4].
Review data annotations and identify interesting variables
<...>
The GEO2R interface
The initial window shows several tabs that are reviewed in the remainder of this tutorial.
GEO2R sample definition
The first step in the GEO2R analysis is to click on Define groups to set up sample groups from the available samples and label them. These groups will be used to define contrasts and compute pairwise differential expression analyses. Here, two groups named space flown and control are created, and samples are then assigned to each group with the mouse.
The list of samples in each group can be reviewed by clicking on List in the group definition popup window.
The order in which you define the groups is important. First define the treated group (it will be colored in blue), then define the control group (it will be colored in pink). This order determines the sign of the log fold changes calculated later in the analysis. If you reverse the order, genes that are upregulated according to the publication that supports the data will appear downregulated in your results, and vice versa.
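The effect of group order on the sign of the log fold change can be sketched as follows. This is a simplified illustration, not the limma computation itself: it assumes a plain two-group comparison in which the log fold change amounts to the difference between the group means of log2 values, with the first-defined group minus the second-defined group. The expression values are made up.

```python
from statistics import mean

def log_fold_change(first_group, second_group):
    """Mean log2 expression of the first-defined group minus the second."""
    return mean(first_group) - mean(second_group)

# made-up log2 expression values for a single gene
space_flown = [9.0, 9.2, 8.8]
control = [8.0, 8.1, 7.9]

# treated group defined first: positive value, gene reported as upregulated
print(log_fold_change(space_flown, control))

# groups defined in reverse order: same magnitude, opposite sign
print(log_fold_change(control, space_flown))
```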
Visualize the distribution of log-transformed expression values
Before proceeding with DE analysis, it is very important to compare the data distributions of the different samples in the Value distribution tab. On this tab you can generate a box plot by clicking the View button.
Since the data are supposed to be normalized, you expect comparable boxes for all samples. Large divergence between box plots suggests that the data in the Series Matrix file were not normalized. Unfortunately, you cannot perform normalization in GEO2R. If the boxes are very different, it is not possible to compare the samples.
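The visual check above can also be expressed numerically by comparing the median of each sample. The sketch below is only an illustration: the sample names and values are made up, and the tolerance of 0.5 log2 units is an arbitrary assumption, not a threshold used by GEO2R.

```python
from statistics import median

def sample_medians(samples):
    """Median log2 expression value per sample."""
    return {name: median(values) for name, values in samples.items()}

def medians_are_comparable(samples, tolerance=0.5):
    """True if all sample medians lie within `tolerance` of each other.

    `tolerance` is a hypothetical cutoff chosen for this example.
    """
    meds = list(sample_medians(samples).values())
    return max(meds) - min(meds) <= tolerance

# made-up log2 values; sample_C is shifted upwards, as unnormalized data might be
samples = {
    "sample_A": [6.0, 7.5, 9.0],
    "sample_B": [6.1, 7.6, 9.2],
    "sample_C": [9.0, 11.0, 13.5],
}

print(sample_medians(samples))
print(medians_are_comparable(samples))  # False: sample_C stands out
```

A box plot shows more than the median (spread and outliers as well), so this check complements the Value distribution tab rather than replacing it.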
Search for the top 250 differentially expressed transcripts
Since the box plots show that the data have been normalized, we can now proceed with finding DE genes between the two groups (the top 250 being a good proxy for downstream analysis).
The default settings on the Options tab are the best choice for most data sets.
When satisfied, go to the GEO2R tab and click the Top250 button to run a limma analysis for identifying DE genes.
When more than two groups are defined, GEO2R selects pairwise contrasts in a triangular/circular way (depending on the number of groups). These contrasts are labelled with arbitrary names (G0, G1, ... Gn) and do not always match what the user expects. Unfortunately, there is little that can be done in GEO2R to control this choice; more can be done when post-processing the code in RStudio, as will be shown in the dedicated tutorial.
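To keep track of what the arbitrary labels stand for, it can help to write down every possible pairwise contrast before looking at the results. The sketch below enumerates all pairs; it does not reproduce the subset or the order that GEO2R actually selects. The group names are hypothetical, and the assumption that G0, G1, ... follow the order in which groups were defined should be verified against the generated R script.

```python
from itertools import combinations

def pairwise_contrasts(group_names):
    """List every pairwise contrast with its G-label and a readable name.

    Assumes (hypothetically) that labels G0, G1, ... follow definition order.
    """
    labels = {f"G{i}": name for i, name in enumerate(group_names)}
    contrasts = []
    for a, b in combinations(labels, 2):
        contrasts.append((f"{a}-{b}", f"{labels[a]} vs {labels[b]}"))
    return contrasts

# hypothetical three-group design
for label, meaning in pairwise_contrasts(["treated_high", "treated_low", "control"]):
    print(label, "=", meaning)
```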
The analysis ends by showing the 250 transcripts with the lowest p-values (ranked by increasing p-value). The table contains the following columns:
- adj.P.Val: p-value after correction for multiple testing.
This column is the primary statistic by which to interpret results. Genes with the smallest adjusted p-values are the most reliable. Selecting all transcripts with adjusted p-values < 0.05 is equivalent to setting the False Discovery Rate (FDR) to 0.05, meaning that about 5% of the selected DE genes are expected to be false positives.
As you can see, GEO2R simply shows the 250 genes with the lowest p-values, regardless of whether those p-values are significant. So only a small fraction of these 250 genes is really DE.
- P.Value: raw p-value before multiple testing correction
- t: moderated t-statistic (from the shrunken t-test)
- B: B-statistic or log-odds that the gene is differentially expressed
- logFC: log2 fold change between the two experimental conditions
This table contains links through which detailed expression information can be retrieved for interesting genes (not further detailed here).
Additional columns can be added to the table through the 'Select columns' page.
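The difference between raw and adjusted p-values in the table above can be illustrated with a small sketch. It assumes the Benjamini-Hochberg procedure, a common choice for FDR control; the adjustment actually applied by GEO2R depends on the settings chosen on the Options tab. The p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest to the smallest p-value so that adjusted
    # values never increase as the raw p-values decrease
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# made-up raw p-values for five genes
raw_p = [0.001, 0.008, 0.039, 0.041, 0.6]
adj_p = benjamini_hochberg(raw_p)

raw_hits = [i for i, p in enumerate(raw_p) if p < 0.05]
adj_hits = [i for i, p in enumerate(adj_p) if p < 0.05]
print(len(raw_hits), "genes with raw p < 0.05")
print(len(adj_hits), "genes with adjusted p < 0.05")
```

In this toy example, four genes pass the raw threshold but only two survive the correction, which is why the adj.P.Val column, and not the position in the top 250, should drive the selection.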
Saving the Rscript for further use in RStudio
This is the last step of this tutorial and the first step of the follow-up page, Analyse GEO2R data with R and Bioconductor, where we will produce more QC from the analyzed data and prepare for more advanced microarray analyses.
Simply copy the code and paste it into a blank RStudio script (or, better, an R Markdown document).
Download Howto files here
References:
- [1] https://www.ncbi.nlm.nih.gov/geo/info/geo2r.html
- [2] https://www.ncbi.nlm.nih.gov/geo/info/geo2r.html#limitations
- [3] Ty W Lebsack, Vuna Fa, Chris C Woods, Raphael Gruener, Ann M Manziello, Michael J Pecaut, Daila S Gridley, Louis S Stodieck, Virginia L Ferguson, Dominick Deluca. Microarray analysis of spaceflown murine thymus tissue reveals changes in gene expression regulating stress and glucocorticoid receptors. J Cell Biochem: 2010, 110(2);372-81. [PubMed:20213684]
- [4] https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE18388