Find GEO datasets



geo_main.gif

Find public GEO datasets relevant to your research




The NCBI GEO[1] database stores curated gene expression DataSets, as well as original Series and Platform records in the Gene Expression Omnibus (GEO) repository. Enter search terms to locate experiments of interest. DataSet records contain additional resources including cluster tools and differential expression queries.

Quick introduction to GEO datasets


1_ncbi-search.png



Search for relevant datasets by keyword

In this short example, we look for mouse experiments related to 'innate immunity'. Type in keywords that narrow the results to biologically relevant experiments, and inspect the results of each search until you are satisfied.
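The same keyword search can also be expressed as a URL for the NCBI E-utilities `esearch` service, which is convenient when a search has to be repeated or documented. The sketch below only builds the URL and does not contact NCBI. It assumes that GEO DataSets is queried with `db=gds` and that the `[Organism]` field tag is accepted there; the function name and its parameters are illustrative and do not come from this tutorial.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(keywords, organism=None, max_results=20):
    """Return an esearch URL that queries GEO DataSets (db=gds).

    Hypothetical helper: nothing is sent over the network.
    """
    # Quote the keywords so that they are searched as one phrase.
    term = f'"{keywords}"'
    if organism:
        # Assumption: the [Organism] field tag restricts hits to one species.
        term += f' AND "{organism}"[Organism]'
    params = {"db": "gds", "term": term, "retmax": max_results}
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_geo_search_url("innate immunity", organism="Mus musculus")
print(url)
```

Pasting the printed URL into a browser should return a list of matching record identifiers, provided the assumptions above hold.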


2_geo_dataset_search.png


If you are interested in specific technologies, restrict your search to your favorite(s) using the 'Study Type' filter.

First select types that are relevant to your needs


2b_study_type.png


Select one or more technologies to restrict the results to those types (clicking a selected type again clears it).


2c_restrict_study_type_MA.png


Technical note: selecting 'Expression profiling by array' in the left menu restricts the search to microarray experiments. Conversely, selecting 'Expression profiling by high throughput sequencing' returns RNA-Seq data.
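The study-type restriction can also be written directly into a search term. The sketch below appends such a clause to an existing term. It assumes that the `[DataSet Type]` field tag of GEO DataSets accepts the same labels as the 'Study Type' filter; the short keys `microarray` and `rnaseq` and the function name are made up for this illustration.

```python
# Labels as they appear in the 'Study Type' filter of the GEO DataSets page.
# The short keys on the left are hypothetical conveniences.
STUDY_TYPES = {
    "microarray": "Expression profiling by array",
    "rnaseq": "Expression profiling by high throughput sequencing",
}


def restrict_to_study_type(term, technology):
    """Append a study-type restriction to an existing search term."""
    label = STUDY_TYPES[technology]
    # Assumption: [DataSet Type] is the field tag behind the 'Study Type' filter.
    return f'({term}) AND "{label}"[DataSet Type]'


base_term = '"innate immunity" AND "Mus musculus"[Organism]'
print(restrict_to_study_type(base_term, "microarray"))
```

The printed term can be pasted into the GEO DataSets search box to check that it returns the same hits as the menu selection.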

You can add further filters to the left menu to refine your search and better evaluate the supporting data (e.g. sample count).


3_add_content.png


We often wish to analyze data in a tissue-specific context. Selecting the tissue filter in the left menu restricts the search to experiments for which tissue annotation is available, that is, experiments whose authors mentioned the tissue they used in the experiment description.


4_add_tissue_requirement.png


Tip: perhaps the most important action here is to sort the results by decreasing sample count. The more samples a dataset contains, the more confidence you can have in the computed differential expression.
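When search results have been collected into a table outside the browser, the same ranking takes one line. The sketch below is a minimal illustration with invented records: the accessions, the sample counts and the `n_samples` key are all placeholders, not real GEO entries.

```python
# Hypothetical search hits; accessions and counts are invented for illustration.
hits = [
    {"accession": "GSE_EXAMPLE_A", "n_samples": 12},
    {"accession": "GSE_EXAMPLE_B", "n_samples": 96},
    {"accession": "GSE_EXAMPLE_C", "n_samples": 6},
]


def sort_by_sample_count(records):
    """Return the records ordered from the largest to the smallest study."""
    return sorted(records, key=lambda record: record["n_samples"], reverse=True)


ranked = sort_by_sample_count(hits)
for record in ranked:
    print(record["accession"], record["n_samples"])
```

`sorted` returns a new list, so the original order of `hits` remains available for comparison.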


5_sort_decreasing_sample-count.png


Tip: be curious and inspect the other optional filters. They may be very relevant to your own focus and could further improve the search.



Select a winner and proceed

The GSE14675 dataset seems to fit the chosen biological focus, so we select it.


6_choose_one.png


Click the link 'Identification of Hedgehog Signaling and Novel Transcription Factors Involved in Regulation of Systemic Response to LPS' to inspect the annotations associated with this dataset and choose the samples to use in the next step.


7a_show_dataset_page.png



Get more info from the publication

The page also links to PubMed when the GEO data was part of a publication.


7b_citation.png


Downloading the original paper [4] is often a good idea. Reading it shows what was deduced from the data and helps make sure there is no hidden information that would make this dataset unfit for our aims.


7c_download_publication.png


Note that you can also download the data. GEO offers microarray data in two formats:

  • a set of CEL files containing the raw data
  • a single Series Matrix (.txt) file containing the normalized data for all samples. Many microarray analysis packages, such as GEO2R and the CLC Main Workbench, start from these files.

    Tip: Series Matrix data are supposed to be normalized and log-transformed already. However, this is not always the case: some submitters provide raw data in these files, and some experimental designs do not allow for regular normalization.
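A quick look at the value range helps decide whether a Series Matrix file still holds raw intensities. The sketch below reads the data table, which sits between the `!series_matrix_table_begin` and `!series_matrix_table_end` lines, and applies a rough rule of thumb: log2 intensities rarely exceed about 20, so a 99th percentile above 100 suggests untransformed data. The threshold and both function names are assumptions made for this illustration, and the check is a hint rather than a proof.

```python
import io
import math
import statistics


def read_series_matrix_values(handle):
    """Collect the numeric expression values from a Series Matrix file."""
    values = []
    in_table = False
    header_seen = False
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
            continue
        if line.startswith("!series_matrix_table_end"):
            break
        if not in_table:
            continue  # metadata lines before the table start with '!'
        if not header_seen:
            header_seen = True  # first table row: ID_REF and sample IDs
            continue
        for field in line.split("\t")[1:]:  # skip the probe ID column
            try:
                value = float(field.strip('"'))
            except ValueError:
                continue  # empty or non-numeric cell
            if not math.isnan(value):
                values.append(value)
    return values


def looks_log_transformed(values, threshold=100.0):
    """Rough heuristic (assumed threshold): is the 99th percentile small?"""
    p99 = statistics.quantiles(values, n=100, method="inclusive")[98]
    return p99 <= threshold


# Tiny made-up file used only to exercise the two functions.
demo = io.StringIO(
    '!Series_title\t"demo"\n'
    "!series_matrix_table_begin\n"
    '"ID_REF"\t"GSM_EXAMPLE_1"\t"GSM_EXAMPLE_2"\n'
    '"probe_1"\t7.2\t8.1\n'
    '"probe_2"\t10.4\t\n'
    "!series_matrix_table_end\n"
)
demo_values = read_series_matrix_values(demo)
print(len(demo_values), looks_log_transformed(demo_values))
```

For a real download, open the file with `open(path, encoding="utf-8")` (or `gzip.open(path, "rt")` for a compressed file) and pass the handle to `read_series_matrix_values`.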


GEOData.png


Link Out to GEO2R and perform differential expression analysis online

Further down the dataset page, a link to GEO2R lets you proceed with online differential expression analysis.


7d_bottom_geo2R-link.png


The GEO2R page opens with sample descriptions organized in a table.


8_start_GEO2R.png


Clicking a column header re-sorts the table by that annotation, which brings related samples together. We can also see that several organs are available (remember that we required tissue annotation in the datasets), as well as several mouse genetic backgrounds. If you have reasons to prefer one organ or background over another, you can now restrict your selection to it.
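Sorting by a column is a way of grouping samples that share an annotation. The sketch below shows the same idea outside the browser on a small invented sample table. The sample names, tissues and strains are placeholders and are not taken from GSE14675.

```python
from collections import defaultdict

# Hypothetical sample annotations; none of these values come from GSE14675.
samples = [
    {"sample": "GSM_EXAMPLE_1", "tissue": "tissue_1", "strain": "strain_A"},
    {"sample": "GSM_EXAMPLE_2", "tissue": "tissue_1", "strain": "strain_B"},
    {"sample": "GSM_EXAMPLE_3", "tissue": "tissue_2", "strain": "strain_A"},
    {"sample": "GSM_EXAMPLE_4", "tissue": "tissue_1", "strain": "strain_A"},
]


def group_samples(rows, *columns):
    """Group sample names by the values found in the given columns."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in columns)
        groups[key].append(row["sample"])
    return dict(groups)


by_tissue_and_strain = group_samples(samples, "tissue", "strain")
for key, names in sorted(by_tissue_and_strain.items()):
    print(key, names)
```

Grouping on several columns at once makes it easy to spot combinations with too few samples before defining the groups to compare.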


9_sort_columns.png


We are now ready to apply GEO2R to some or all samples. The use of GEO2R and post-processing in RStudio are detailed in follow-up tutorials on this wiki.



References:
  1. Tanya Barrett, Stephen E Wilhite, Pierre Ledoux, Carlos Evangelista, Irene F Kim, Maxim Tomashevsky, Kimberly A Marshall, Katherine H Phillippy, Patti M Sherman, Michelle Holko, Andrey Yefanov, Hyeseung Lee, Naigong Zhang, Cynthia L Robertson, Nadezhda Serova, Sean Davis, Alexandra Soboleva
    NCBI GEO: archive for functional genomics data sets--update.
    Nucleic Acids Res.: 2013, 41(Database issue);D991-5
    [PubMed:23193258]

  2. http://www.ncbi.nlm.nih.gov/geo/info/
  3. http://www.ncbi.nlm.nih.gov/gds/
  4. Ivana V Yang, Scott Alper, Brad Lackford, Holly Rutledge, Laura A Warg, Lauranell H Burch, David A Schwartz
    Novel regulators of the systemic response to lipopolysaccharide.
    Am. J. Respir. Cell Mol. Biol.: 2011, 45(2);393-402
    [PubMed:21131441]


