Analyze GEO data with GEO2R



geo_main.gif

Find DE genes in a selected GEO dataset using GEO2R online



 

Use GEO2R to compare two or more groups of Samples (within a single GEO experiment) in order to identify genes that are differentially expressed (DE) across experimental conditions. Results are presented as a table of genes ordered by significance. Full instructions are available at [1], together with a tutorial video.

GEO2R does not perform sophisticated statistical analysis of microarray (MA) data. Please read about the GEO2R limitations at [2].

Limitations and caveats

(Reproduced from the GEO documentation)

The GEO database is a public repository that archives thousands of original high-throughput functional genomic studies submitted by the scientific community. These studies represent a large diversity of experimental types and designs, and contain data that are processed and normalized using a wide variety of methods. GEO2R can access and analyze almost any GEO Series, regardless of data type and quality, so the user must be aware of the following limitations and caveats.

Check that Sample values are comparable: GEO2R operates on Series Matrix files which contain data extracted directly from the VALUE column of Sample tables. Submitters are asked to supply normalized data in the VALUE column, rendering the Samples cross-comparable. The majority of GEO data do conform to this rule. GEO applies no further processing other than to perform a log2 transformation on values determined not to be in log space (see Options section). However, some studies, such as dual channel loop design data, may generate values that do not have a common reference and are not directly comparable. Some studies may contain Sample value data that are not normalized, or have a design such that the Samples were never intended to be directly compared. Yet other studies do not have sufficient replicate Samples to perform a robust statistical analysis. Users should examine the original Series to understand the experimental design, and check the 'Data processing' field or VALUE description in the original Sample records for information on what the values represent. The box plot feature on the Value distribution tab is provided to help users assess whether the distributions of values across Samples are median-centered, which is generally indicative that the data are normalized and cross-comparable.

Data type restriction: GEO2R operates on data in Series Matrix files which contain data extracted directly from the VALUE column of Sample tables. Some categories of GEO Samples do not have data tables (e.g., high-throughput sequencing or genome tiling arrays) and thus cannot be analyzed using GEO2R.

Within-Series restriction: GEO2R operates on Series Matrix files. Thus, analyses are restricted to Samples that occur within one Series; it is not possible to perform cross-Series comparisons.

Failed jobs: Occasionally, a GEO2R analysis will fail because some aspect of the input data is not compatible with the GEOquery or limma packages. In such cases, native BioConductor errors are reported.

255 Sample limit: GEO2R operates on data in Series Matrix files. These files contain a maximum of 255 Samples, thus, Series containing more than 255 Samples cannot currently be examined using GEO2R.

10 minute timeout: GEO2R currently has a 10 minute cutoff imposed on job processing. If the Series you are attempting to analyze has a large number of Samples and/or genes, the analysis may not run to completion.

 

Data used for this tutorial

A separate short tutorial, Find GEO datasets, describes how to find GEO datasets relevant to your research focus.

To stay consistent with the introductory video, we use the same GSE18388 dataset [3] that was selected by the NCBI trainer. The dataset information can be accessed at GEO [4].

To complete this short GEO2R tutorial, we will export all DE results using the 'Save all results' link and store them for later use in pathway analysis tools such as IPA. We will not run the pathway analysis here and will proceed with the RStudio workflow instead.

 

Review data annotations and identify interesting variables

GSE18388_top.png

<...>


GSE18388_bottom.png

Note the link to Analyze with GEO2R near the bottom of the GEO page. Click on this link to open the GEO2R web tool and select the current experiment for analysis.

 

The GEO2R interface

The initial window shows several tabs, which are reviewed in the remainder of this tutorial.


geo2r_start.png

 

GEO2R sample definition

The first step in the GEO2R analysis is to click on Define groups, set up sample groups based on the available samples, and label them. These groups will be used to define contrasts and compute pairwise differential expression analyses. Here, two groups are created with the names 'space flown' and 'control', and samples are then assigned to each group using the mouse.


GEO2R1a.png


The list of samples in each group can be reviewed by clicking on List in the group definition popup window.

The order in which you assign the groups is important. First define the treated group (it will be colored in blue), then define the control group (it will be colored in pink). This order determines the sign of the log fold changes computed later in the analysis: if you reverse it, genes that are upregulated according to the publication that supports the data will appear downregulated in your results, and vice versa.
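The effect of group order on the sign can be sketched in a few lines of Python. This is only an illustration and not the code that GEO2R runs: the expression values are made up, the function name log_fc is hypothetical, and we assume that the values are already on a log2 scale so that a log fold change is simply a difference of group means.

```python
from statistics import mean


def log_fc(first_group, second_group):
    """Log2 fold change of the first group relative to the second.

    Assumes both lists hold log2-scale expression values for one gene.
    """
    return mean(first_group) - mean(second_group)


# Made-up log2 expression values for a single gene (illustration only).
space_flown = [8.9, 9.1, 9.0]
control = [8.0, 8.1, 7.9]

# Treated group defined first: a positive value means "up in space flown".
fc = log_fc(space_flown, control)

# Groups defined in the opposite order: same magnitude, opposite sign.
fc_reversed = log_fc(control, space_flown)

# A log2 fold change converts back to a linear fold change with 2 ** logFC.
linear_fold = 2 ** fc

print(fc, fc_reversed, linear_fold)
```

With these toy values the gene appears upregulated when the treated group is listed first and downregulated when the order is swapped, which is exactly the inversion described above.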

 

Visualize the distribution of log-transformed expression values

Before proceeding with DE analysis, it is very important to compare the data distributions of the different samples in the Value distribution tab. On this tab you can generate a box plot by clicking the View button.


GEO2R2a.png

Since the data are supposed to be normalized, you expect comparable boxes for all samples. When the box plots diverge strongly, the data in the Series Matrix file were probably not normalized. Unfortunately, you cannot perform normalization in GEO2R, so if the boxes are very different, the samples cannot be compared in this tool.
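The visual check made on the box plot can also be expressed numerically. The following Python sketch computes the five numbers drawn in a box plot for each sample and tests whether the sample medians lie close together. The sample names, the values, and the tolerance of 0.5 log2 units are all assumptions chosen for illustration; GEO2R itself only draws the plot and does not apply such a threshold.

```python
from statistics import median, quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot displays."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def medians_comparable(samples, tolerance=0.5):
    """True when all sample medians lie within `tolerance` of each other.

    `samples` maps a sample name to its list of log2 expression values.
    The tolerance is an arbitrary illustrative choice.
    """
    medians = [median(values) for values in samples.values()]
    return max(medians) - min(medians) <= tolerance


# Made-up log2 values; sample_C is shifted as if it had not been normalized.
samples = {
    "sample_A": [6.0, 7.0, 8.0, 9.0, 10.0],
    "sample_B": [6.1, 7.2, 8.1, 9.0, 9.9],
    "sample_C": [9.0, 10.0, 11.0, 12.0, 13.0],
}

for name, values in samples.items():
    print(name, five_number_summary(values))

print("all comparable:", medians_comparable(samples))
```

Dropping the shifted sample from the dictionary makes the check pass, which mirrors the situation where all boxes line up in the GEO2R plot.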

 

Search for the top 250 differentially expressed transcripts

Since the box plots show that the data have been normalized, we can now proceed with finding DE genes between the two groups. The top 250 genes are a reasonable starting set for downstream analysis.

The Options tab controls the log transformation and the multiple testing correction that are applied to the data.

The default options are shown below and are the best choice for most data sets.

options.png
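To give an idea of what an automatic log transformation decision can look like, here is a minimal Python sketch. It is loosely modelled on the kind of quantile check found in R scripts generated by GEO2R, but the function names and the exact thresholds used here (a 99th percentile above 100, or a range above 50 with a positive first quartile) should be treated as assumptions and not as a specification of the tool.

```python
import math


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    position = q * (len(sorted_values) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def looks_unlogged(values):
    """Heuristic guess that values are NOT yet on a log scale (assumed thresholds)."""
    clean = sorted(v for v in values if not math.isnan(v))
    q0, q25, q99, q100 = (quantile(clean, q) for q in (0.0, 0.25, 0.99, 1.0))
    return q99 > 100 or (q100 - q0 > 50 and q25 > 0)


def log2_if_needed(values):
    """Apply log2 only when the heuristic says the data are on a linear scale."""
    if not looks_unlogged(values):
        return list(values)
    # Non-positive intensities have no logarithm and become NaN.
    return [math.log2(v) if v > 0 else float("nan") for v in values]


raw_intensities = [50.0, 200.0, 1500.0, 8000.0, 32000.0]   # made-up linear values
log_intensities = [5.6, 7.6, 10.5, 13.0, 15.0]             # made-up log2 values

print(looks_unlogged(raw_intensities), looks_unlogged(log_intensities))
print(log2_if_needed(raw_intensities))
```

The point of the sketch is only that data already in log space have a small range and small upper quantiles, so a simple rule can usually tell the two cases apart.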

When satisfied, go to the GEO2R tab and click the Top250 button to run a limma analysis for identifying DE genes.

When more than two groups are defined, GEO2R selects pairwise contrasts in a triangular or circular way (depending on the number of groups). These contrasts are labelled with arbitrary names (G0, G1, ... Gn) and do not always reflect what the user expects. Unfortunately, there is little that can be done in GEO2R to control this choice; more can be done by post-processing the code in RStudio, as will be shown in the dedicated tutorial.


GEO2R3a.png

The analysis ends with showing the 250 transcripts with the lowest p-values (ranked by increasing p-value). The table contains the following columns:

  • adj.P.Val: p-value after correction for multiple testing.
    This column is the primary statistic by which to interpret results. Genes with the smallest adjusted p-values are the most reliable. Selecting all transcripts with adjusted p-values < 0.05 is equivalent to setting the False Discovery Rate (FDR) to 0.05, meaning that about 5% of the selected DE genes are expected to be false positives.
    Note that GEO2R simply shows the 250 genes with the lowest p-values, regardless of whether these p-values are significant, so possibly only a small fraction of these 250 genes is really DE.
  • P.Value: raw p-value before multiple testing correction
  • t: moderated t-statistic (the t-test with shrunken variance used by limma)
  • B: B-statistic, the log-odds that the gene is differentially expressed
  • logFC: log2 fold change between the two experimental conditions
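The relationship between P.Value and adj.P.Val can be illustrated with a small Python sketch of the Benjamini and Hochberg procedure. We assume here that this FDR adjustment is the one selected in the Options tab; the function name and the p-values are made up for illustration, and the real computation is done by limma in R.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up raw p-values for four genes (illustration only).
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)

# Genes kept at an FDR of 0.05.
selected = [i for i, p in enumerate(adj_p) if p < 0.05]

print(adj_p)
print(selected)
```

In this toy example three genes have a raw p-value below 0.05, but only one survives the correction, which is why adj.P.Val and not P.Value should be used to select DE genes.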

This table contains links through which detailed expression information can be retrieved for interesting genes (not further detailed here).

Clicking 'Save all results' opens a new window with the full table, which can be saved to disk as a tab-separated text file using the browser's File > Save option.

In addition, columns can be added to or removed from the table on the 'Select columns' page.

select-columns.png

If you wish to upload this table to Ingenuity Pathway Analysis (IPA), consider first opening it in Microsoft Excel and saving it as a '.xls' file. This removes the double quotes around fields and allows IPA to recognize your data more reliably.
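As an alternative to the Excel round trip, the quotes can be removed and the table filtered with a short script. The Python sketch below parses a tab-separated table with quoted fields, keeps the rows with adj.P.Val below a cutoff, and writes them back without quotes. The two data rows are invented, the 'ID' column name and the function names are assumptions, and a real export may contain additional columns.

```python
import csv
import io


def filter_significant(tsv_text, alpha=0.05):
    """Parse quoted tab-separated text and keep rows with adj.P.Val < alpha."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quotechar='"')
    rows = [row for row in reader if float(row["adj.P.Val"]) < alpha]
    return reader.fieldnames, rows


def to_unquoted_tsv(fieldnames, rows):
    """Serialize rows as tab-separated text without surrounding quotes."""
    lines = ["\t".join(fieldnames)]
    for row in rows:
        lines.append("\t".join(row[name] for name in fieldnames))
    return "\n".join(lines) + "\n"


# Invented example in the shape of a GEO2R export (not real results).
EXAMPLE = (
    '"ID"\t"adj.P.Val"\t"P.Value"\t"t"\t"B"\t"logFC"\n'
    '"probe_1"\t"0.012"\t"0.00004"\t"9.1"\t"2.3"\t"1.4"\n'
    '"probe_2"\t"0.310"\t"0.0210"\t"-2.9"\t"-3.8"\t"-0.6"\n'
)

fieldnames, significant = filter_significant(EXAMPLE)
cleaned = to_unquoted_tsv(fieldnames, significant)
print(cleaned)
```

To apply this to a saved file, read its content into a string first (for example with open(path).read()) and write the returned text to a new file.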

 

Saving the Rscript for further use in RStudio

This is the last step of this tutorial and the first step of the follow-up page, Analyse GEO2R data with R and Bioconductor, where we will produce more QC plots from the analyzed data and prepare for more advanced microarray analyses.

Rscript.png

Simply copy the code and paste it into a blank RStudio script (or, better, an R Markdown document).

RStudio_start.png

 



Download Howto files here

Full table of DE transcripts obtained in GEO2R GSE18388_full-results.txt

References:
  1. https://www.ncbi.nlm.nih.gov/geo/info/geo2r.html
  2. https://www.ncbi.nlm.nih.gov/geo/info/geo2r.html#limitations
  3. Ty W Lebsack, Vuna Fa, Chris C Woods, Raphael Gruener, Ann M Manziello, Michael J Pecaut, Daila S Gridley, Louis S Stodieck, Virginia L Ferguson, Dominick Deluca
    Microarray analysis of spaceflown murine thymus tissue reveals changes in gene expression regulating stress and glucocorticoid receptors.
    J. Cell. Biochem.: 2010, 110(2);372-81
    [PubMed:20213684]

  4. https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE18388

