PubMA Exercise.5


Functional enrichment of the obtained lists to identify key biological functions

[ Main_Page | Hands-on Analysis of public microarray datasets | PubMA_Exercise.4 | PubMA_Exercise.5 |
| PubMA_Exercise.6 ]


Why we are not yet done!

Once upon a time, scientists worked for a lifetime on one or a few genes. They read everything published about their favorite proteins and did not need a computer to help them follow and understand the published data.

That time is over: modern biologists must cope with a publication rate far higher than their reading speed. Happy or not, they have to rely on computers for some of the tasks they used to do manually.

Analysis of microarray (MA) data, like that of any other high-throughput technology, generates thousands of lines of results, of which hundreds are statistically significant. It is therefore very unlikely that a biologist can evaluate each line and identify the proteins (gene products) whose expression is significantly altered and that may be responsible for the biology under investigation.

Picking out familiar genes from the list and selecting them for validation may seem appropriate, but it is unlikely to lead to any discovery. Since publication requires novelty, this method is of little use.

A better way to analyze and prioritize targets from a screen is to consider the biological functions and pathways that are enriched in differentially expressed (DE) genes. This can be done by adding ontology annotations to the data and using these added columns to identify 'functions', 'diseases', 'pathways' or any other ontology terms that are enriched in the DE set as compared to the full set of genes measured by the platform. This apparently straightforward statistical testing can become quite lengthy when you consider the hundreds of available ontologies and the hundreds to thousands of genes to annotate.

The hypergeometric test (equivalent to a one-sided Fisher exact test) and gene set enrichment analysis (GSEA) are the two most widely used statistical approaches to identify enrichment based on gene lists. Many standalone and web tools implement these methods, and it falls beyond the scope of this training to list them all or to argue for one or another. We instead present a few alternative tools that accept the data obtained in the previous exercises and process it to find enriched ontology terms.
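
To make the hypergeometric approach concrete, here is a minimal Python sketch (standard library only) of the calculation for a single term. The function name is ours. The array size (15924 probes) and DE list size (301 probes) are the ones obtained later in this exercise, while the number of probes annotated to the term (200) and the overlap (15) are made-up values for illustration.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_annotated, n_de, n_overlap):
    """Probability of seeing at least n_overlap annotated genes when
    drawing n_de genes without replacement from n_background genes,
    n_annotated of which carry the term (upper tail of the
    hypergeometric distribution)."""
    upper = min(n_annotated, n_de)
    total = comb(n_background, n_de)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical term: 200 of the 15924 probes carry it,
# and 15 of the 301 DE probes carry it as well.
p = hypergeom_enrichment_p(15924, 200, 301, 15)
print(f"enrichment p-value: {p:.3g}")
```

By chance alone one would expect about 301 x 200 / 15924, or roughly 4, annotated probes in the DE list, so an overlap of 15 yields a small p-value. The web tools below repeat this kind of test for every term of every annotation source.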

Functional enrichment Analysis of the RobiNA DE-data

To continue with the most complete and easiest workflow, we use here the RobiNA table obtained by de novo analysis of 11 of the 12 samples (one CEL file being damaged on the GEO repository).

Preparing probe lists for enrichment testing

Web tools require probe or gene lists to compute enrichment. They do not take into account the degree of DE or the confidence in that DE; both are left to the user to filter.

We can produce the two lists (the DE probes and the remaining probes) using Excel (R would be better) in a few easy steps:

  • import the table into Excel, taking care to protect gene symbols from automatic conversion (for example to dates)
  • add a column with the absolute value of logFC
  • filter on abs(logFC) with a minimal cutoff of 2 (four-fold DE)
abs-logFC-filter.png
  • filter on the 'adj.P.Val' column with a maximum of 0.001
fdr-filter.png

  The list of 15924 probes is reduced to 301 by this double filtering

DE-probes.png
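
For those who prefer a script to a spreadsheet, the double filter above can be sketched in a few lines of Python (standard library only). This is an illustration rather than the workflow used to produce the screenshots: the column names 'logFC' and 'adj.P.Val' follow the RobiNA table, while the 'ID' column name and the four demo rows are assumptions made up for the example.

```python
import csv
import io

def filter_de(rows, min_abs_logfc=2.0, max_adj_p=0.001):
    """Keep rows with |logFC| >= min_abs_logfc and adj.P.Val <= max_adj_p."""
    kept = []
    for row in rows:
        if (abs(float(row["logFC"])) >= min_abs_logfc
                and float(row["adj.P.Val"]) <= max_adj_p):
            kept.append(row)
    return kept

# Tiny made-up table standing in for the RobiNA export (tab-separated).
# Replace io.StringIO(...) with open("your_table.txt") for a real file.
demo = io.StringIO(
    "ID\tlogFC\tadj.P.Val\n"
    "probe_a\t2.5\t0.0001\n"
    "probe_b\t-3.1\t0.0005\n"
    "probe_c\t1.2\t0.00001\n"
    "probe_d\t-2.4\t0.02\n"
)
rows = list(csv.DictReader(demo, delimiter="\t"))
de_rows = filter_de(rows)
print([r["ID"] for r in de_rows])  # ['probe_a', 'probe_b']
```

Reading the values as text and converting them explicitly also avoids the symbol-mangling problem mentioned for Excel.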

DAVID: father of enrichment tools

The canonical web tool is DAVID[1]. DAVID has been around since the early 2000s and is still very popular, although its interface is quite outdated. A Nature Protocols paper[2] will help you start using DAVID.

In order to use DAVID, we need to split our data in two groups:

  • genes (probe IDs) considered differentially expressed (TEST)
  • the remaining genes (probes) present on the platform (BACKGROUND). Note that the DAVID 'Background' tab allows selecting actual MA chips.

Note: the choice of background is important to obtain good results. It ensures that only functions represented by genes measured in the experiment are considered, although this matters less with modern platforms, where almost all known genes are present.

On the homepage select the Functional annotation tool.

If you use gene symbols, the software cannot know which organism to choose, since the same gene symbols are used in human, mouse and rat, so you will have to specify the organism. If you use other IDs, DAVID will automatically determine the source of the genes.

By default, DAVID chooses the full genome as background and not the genes that are represented on a microarray.

DAVID uses not only GO but also pathway annotation, annotation from Swiss-Prot, protein domains and motifs, and more. You can get a summary of all the results by clicking the Functional annotation clustering button at the bottom of the page.
Results are then combined into groups of related terms (regardless of the ontology they come from).


DAVID7b.png

If you want to view the genes from your list that are associated with one of the enriched annotations, simply click the blue bar:


DAVID7d.png


Current web-based Enrichment-tools

We chose to show you only two recent tools in this section. Many others are available and you are welcome to try them and share your experience back with us.

Enrichr

Enrichr [3] and its associated tool Lists2Networks [4] use a respectable number of sources to compute enrichment (source-list: http://amp.pharm.mssm.edu/Enrichr/index.html#stats). To learn more about Enrichr, please refer to their FAQ page.

Enrichr uses a list of Entrez gene symbols as input. Each symbol in the input must be on its own line. You can upload the list either by selecting the text file that contains it or by pasting it into the text box. We will use the gene symbols extracted from the RobiNA file, cleaned to remove duplicates, and where double-ID lines (a probeset mapping to two distinct genes) were expanded. This input file is available on the BITS server (link).
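
The cleaning described above (removing duplicates and expanding double-ID lines) can be sketched in Python as follows. The separator used for double-ID lines is an assumption: we use ' /// ', the convention found in Affymetrix annotation files, and the actual RobiNA output may differ. The gene symbols in the demo are arbitrary examples.

```python
def clean_symbols(raw_symbols, sep="///"):
    """Expand multi-gene entries, drop empty values and duplicates,
    and keep the order of first appearance."""
    seen = {}
    for entry in raw_symbols:
        for symbol in entry.split(sep):
            symbol = symbol.strip()
            if symbol and symbol not in seen:
                seen[symbol] = None
    return list(seen)

# Made-up symbol column: one duplicate, one double-ID line, one empty cell.
raw = ["Fos", "Jun /// Junb", "Fos", "", "Egr1"]

# Enrichr expects one symbol per line.
print("\n".join(clean_symbols(raw)))
```

The printed text can be pasted directly into the Enrichr text box or saved to a file for upload.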

The results page consists of multiple tabs, each tab giving an overview of a specific type of annotation (Transcription, Pathways, Ontologies...).

The bar charts are interactive and provide much more info than these pictures. Hover your mouse over the bars to see the enrichment scores. Clicking the bars will order the terms according to different scores.

The length of the bar represents the significance of that specific gene-set or term. In addition, the brighter the color, the more significant that term is.
As you can see, Enrichr implements three approaches to compute enrichment scores:

  • The p-value comes from a standard method implemented in most enrichment analysis tools: the Fisher exact test.
  • The z-score comes from a correction to the Fisher exact test. Enrichr first computes enrichment using the Fisher exact test for many random gene sets in order to obtain the mean and standard deviation of the expected rank for each term in the annotation library. It then computes a z-score reflecting the deviation of the actual rank from this expected rank.
  • The combined score merges the p-value of the Fisher exact test with the z-score of the second test into a single value.
Enrichr provides all three options for sorting enriched terms.
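
As an illustration of how the third score relates to the first two, the sketch below assumes the definition given in the Enrichr publication [3], where the two values are combined as the product of the natural logarithm of the p-value and the z-score. The term names and values are made up.

```python
from math import log

def combined_score(p_value, z_score):
    """Combine the two scores as ln(p) * z (assumed definition, see lead-in)."""
    return log(p_value) * z_score

# Made-up (p-value, z-score) pairs for three hypothetical terms.
terms = {
    "term A": (1e-6, -1.8),
    "term B": (1e-3, -2.0),
    "term C": (0.2, -0.5),
}

# Sort from the highest to the lowest combined score.
ranked = sorted(terms, key=lambda t: combined_score(*terms[t]), reverse=True)
for name in ranked:
    print(name, round(combined_score(*terms[name]), 2))
```

Because ln(p) is negative for p < 1 and the z-score of a well-ranked term is also negative, strongly enriched terms end up with large positive values.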

The adjusted p-value is the p-value of the Fisher exact test with multiple testing correction according to Benjamini-Hochberg.
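
For readers curious about what this correction does, here is a minimal Python sketch of the Benjamini-Hochberg procedure. It is a generic textbook implementation, not Enrichr's own code, and the four input p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each raw p-value is multiplied by the number of tests and divided by its rank, so the more terms are tested, the stronger the evidence needed for a term to remain significant.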

Look at the results for GO Biological processes, OMIM disease, and TargetScan miRNAs.

In the network view, each node represents a term (in this case a TF) and a link between two nodes means that the two terms (TFs) have some genes from your list in common, that is, some genes are linked to both TFs. In this example, the network points to TFs that share target genes from your list, so they might interact to regulate the process you are studying.
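
The idea behind such a network can be sketched in Python: two terms are linked when they share at least one gene from the input list. This is only an illustration of the principle, not Enrichr's implementation, and the TF names and target sets are made up.

```python
from itertools import combinations

def shared_gene_edges(term_to_genes, gene_list):
    """Link two terms when they share at least one gene of the input list.
    Returns {(term_a, term_b): [shared genes]}."""
    genes = set(gene_list)
    # Restrict each term to the genes present in the input list.
    hits = {term: set(members) & genes for term, members in term_to_genes.items()}
    edges = {}
    for a, b in combinations(sorted(hits), 2):
        shared = hits[a] & hits[b]
        if shared:
            edges[(a, b)] = sorted(shared)
    return edges

# Hypothetical TF target sets and input gene list.
tf_targets = {
    "TF1": ["Fos", "Jun", "Egr1"],
    "TF2": ["Jun", "Myc"],
    "TF3": ["Atf3"],
}
my_genes = ["Fos", "Jun", "Atf3"]
print(shared_gene_edges(tf_targets, my_genes))  # {('TF1', 'TF2'): ['Jun']}
```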

WebGestalt

This second tool largely overlaps with Enrichr in its data sources, although its tabular reporting format makes it a little less attractive to my eyes. WebGestalt accepts many ID types and allows selecting the exact background based on the array, which is a plus compared to Enrichr and puts it on par with DAVID in that respect.

WebGestalt [5] is a "WEB-based GEne SeT AnaLysis Toolkit". It is designed for functional genomic, proteomic and large-scale genetic studies from which large numbers of gene lists (e.g. differentially expressed gene sets, co-expressed gene sets, etc.) are continuously generated. WebGestalt incorporates information from different public resources and provides an easy way for biologists to make sense of gene lists.

Please read the full manual for a good introduction to this tool.

WebGestalt accepts probe IDs from microarrays. The probe-list obtained from the RobiNA results of the affy_rae230a rat microarray is available on the BITS server (link) and its content can be used on the WebGestalt page.

The results are shown in different tabs. The Enrichment results tab shows a table with annotations and scores, as well as the list of genes responsible for the enrichment of each enriched KEGG pathway.

Repeat the enrichment analysis on WikiPathways.

Again, many more tables can be generated in WebGestalt and you should choose the type of enrichment that fits your experimental needs. Data can be saved back to disk for further use.

Conclusion

A more complete analysis can be performed by those few who can program in R/Bioconductor. As this session is aimed at biologists, this option is not discussed further. Users can also consider commercial packages such as Ingenuity Pathway Analysis, which provide much more detailed and richer information than free tools can offer.

Download exercise files

Download exercise files here.

Use the appropriate application to open the files present in ex5-files.

References:
  1. http://david.abcc.ncifcrf.gov
  2. http://www.nature.com/nprot/journal/v4/n1/abs/nprot.2008.211.html
  3. http://amp.pharm.mssm.edu/Enrichr/

    Edward Y Chen, Christopher M Tan, Yan Kou, Qiaonan Duan, Zichen Wang, Gabriela Vaz Meirelles, Neil R Clark, Avi Ma'ayan
    Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool.
    BMC Bioinformatics: 2013, 14;128
    [PubMed:23586463]

  4. Alexander Lachmann, Avi Ma'ayan
    Lists2Networks: integrated analysis of gene/protein lists.
    BMC Bioinformatics: 2010, 11;87
    [PubMed:20152038]

  5. http://bioinfo.vanderbilt.edu/webgestalt/

    Stefan Kirov, Ruiru Ji, Jing Wang, Bing Zhang
    Functional annotation of differentially regulated gene set using WebGestalt: a gene set predictive of response to ipilimumab in tumor biopsies.
    Methods Mol Biol: 2014, 1101;31-42
    [PubMed:24233776]

    Jing Wang, Dexter Duncan, Zhiao Shi, Bing Zhang
    WEB-based GEne SeT AnaLysis Toolkit (WebGestalt): update 2013.
    Nucleic Acids Res: 2013, 41(Web Server issue);W77-83
    [PubMed:23703215]

    Bing Zhang, Stefan Kirov, Jay Snoddy
    WebGestalt: an integrated system for exploring gene sets in various biological contexts.
    Nucleic Acids Res: 2005, 33(Web Server issue);W741-8
    [PubMed:15980575]

    http://bioinfo.vanderbilt.edu/webgestalt/WebGestalt_manual_2013_04_12.pdf
