Functional annotation and enrichment analysis
- 1 Gene Ontology
- 2 Pathway information
- 3 ToppGene: the best tool for functional enrichment analysis
- 4 DAVID: using GO / pathway annotation to interpret gene lists
- 5 Enrichr: new tool for enrichment analysis
- 6 WebGestalt: new tool for enrichment analysis
- 7 IPA: network analysis using highly curated data
- 8 Cytoscape: free tool for network and pathway analysis
The Gene Ontology (GO) is a collaboration between several databases. Its purpose is to provide a set of standardized terms (terms that everyone in the field agrees on and uses) and descriptions of biological processes, protein functions and cellular locations. A collection of such standardized terms and the relations that exist between them is called an "ontology". GO is structured as a hierarchy with IS-A and PART-OF relations; because a term can have more than one parent, it is a directed acyclic graph rather than a strict tree. You can access GO via dedicated tools (QuickGO, AmiGO...) or via other databases (UniProt...). GO is used to interpret the results of high-throughput analyses (clusters, lists of coregulated genes...). By looking at the GO terms associated with the genes in a list, you can detect GO terms that are overrepresented in the list compared to the complete genome, and from these infer the likely function of the genes in the list.
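The idea of overrepresentation can be made concrete with a small calculation. The sketch below uses a one-sided hypergeometric test, which is one common way to score enrichment; it is not necessarily the statistic used by any of the tools on this page. The function name and the toy numbers are assumptions chosen for illustration.

```python
from math import comb


def enrichment_p_value(genome_size, genome_with_term, list_size, list_with_term):
    """Probability of seeing at least `list_with_term` annotated genes
    when `list_size` genes are drawn at random from the genome."""
    upper = min(genome_with_term, list_size)
    total = comb(genome_size, list_size)
    tail = sum(
        comb(genome_with_term, k)
        * comb(genome_size - genome_with_term, list_size - k)
        for k in range(list_with_term, upper + 1)
    )
    return tail / total


# Toy numbers (assumed): 20000 genes in the genome, 200 carry the GO term,
# and 8 of the 100 genes in our list carry it (1 would be expected by chance).
p = enrichment_p_value(20000, 200, 100, 8)
print(f"p = {p:.3g}")
```

A small p-value means the term occurs in the list more often than expected by chance. When many GO terms are tested at once, the p-values should also be corrected for multiple testing.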
Searching GO using dedicated tools
The two main dedicated tools for accessing GO are QuickGO from EBI and AmiGO from GO itself.
Go to http://www.ebi.ac.uk/QuickGO/. Search for the GO biological process apoptosis. On the page containing the summary of the search results you see that the ontologies are continuously changing since a number of listed terms are obsolete.
Click the GO accession number of the first obsolete GO term.
|Which term replaced the obsolete term|
To know which term replaced the outdated one, scroll down to the Replaced by section.
Go to the GO record of execution phase of apoptosis.
|What are the relationships that exist between execution phase of apoptosis and its child terms ?|
|Scroll down to the Child Terms section. These are all more specific terms that are linked to execution phase of apoptosis.|
There are 3 main relationships in GO:
- IS-A, e.g. a term is a more specific form of another term.
- PART-OF, e.g. phosphatidylserine exposure on apoptotic cell surface is a part of the complete execution phase of apoptosis.
- REGULATES, to indicate which processes regulate the execution phase of apoptosis. This relation is further subdivided into "positively regulates" and "negatively regulates".
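The relations above can be pictured as labelled edges from a child term to its parent terms. The sketch below stores a tiny, hand-written excerpt of such a graph and walks it upwards to collect all ancestors of a term. Only the PART-OF edge mentioned in the text comes from this page; the other edges and the function name are assumptions made for illustration, and a real analysis would load the full ontology file instead.

```python
# Toy excerpt (assumed): child term -> list of (relation, parent term).
PARENTS = {
    "phosphatidylserine exposure on apoptotic cell surface": [
        ("part_of", "execution phase of apoptosis"),
    ],
    "execution phase of apoptosis": [
        ("part_of", "apoptotic process"),
    ],
    "apoptotic process": [
        ("is_a", "programmed cell death"),
    ],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following relations upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


for name in sorted(ancestors("phosphatidylserine exposure on apoptotic cell surface")):
    print(name)
```

Walking the graph in this direction is what allows a search for a term to also return proteins annotated to its more specific descendant terms.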
|How many proteins are linked to execution phase of apoptosis and its descendant terms ?|
|At the top of the page you see that there are 4422 annotations (proteins) linked to execution phase of apoptosis and its child terms. Click the link. GO provides evidence for linking proteins to terms and references to where it has found this evidence.|
Fortunately, you can filter the list of proteins.
|How many human proteins are linked to execution phase of apoptosis and its descendant terms ?|
You now see the number of human annotations and the number of human proteins linked to execution phase of apoptosis and its child terms. There should be around 130 proteins linked to execution phase of apoptosis.
Look at the first hit in the results table.
|What does its evidence code mean ?|
To see what it means click the code.
|Retrieve only manually asserted results ?|
You now see the number of human annotations with experimental evidence. There should be around 90 proteins now.
|Download these final results ?|
Click the Export button. This opens a file where you can select the format you want to save in.
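Once the annotations are exported, the same evidence filter can be reproduced offline. The sketch below keeps only rows whose evidence code is one of the GO experimental codes (EXP, IDA, IPI, IMP, IGI, IEP). The column names and the rows are hypothetical; the real export may use different headers, so adjust them to the file you actually downloaded.

```python
import csv
import io

# GO evidence codes that denote experimental evidence.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

# Hypothetical export: made-up identifiers and assumed column names.
EXPORT = "\n".join([
    "\t".join(["protein", "go_term", "evidence"]),
    "\t".join(["PROT_A", "execution phase of apoptosis", "IDA"]),
    "\t".join(["PROT_B", "execution phase of apoptosis", "IEA"]),
    "\t".join(["PROT_A", "execution phase of apoptosis", "IMP"]),
    "\t".join(["PROT_C", "execution phase of apoptosis", "IPI"]),
])


def experimental_proteins(tsv_text):
    """Return the sorted, de-duplicated proteins with experimental evidence."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return sorted(
        {row["protein"] for row in reader if row["evidence"] in EXPERIMENTAL_CODES}
    )


print(experimental_proteins(EXPORT))
```

Counting distinct proteins rather than rows matters here, because one protein can carry several annotations to the same term.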
Go to AmiGO. Here you can:
- enter a gene or protein name and look for the corresponding GO terms
- enter a GO term and search for the proteins that are linked to that term
We will first search for the GO terms of a protein.
|Search for cyclin PCL5|
In the search box type PCL5 and click Search.
- Genes and gene products links to a list of GO terms and protein descriptions that contain the search term
- Annotations links to a list of GO annotations of these proteins
|Find the GO annotations of cyclin PCL5|
Click the Annotations button.
You can filter the results.
|Find the GO annotations of yeast cyclin PCL5|
Expand the Organism select box and click the + sign after Saccharomyces cerevisiae.
You see the 9 term associations of yeast PCL5 in the center window.
Now we'll do the reverse search: we will search for proteins that are involved in a GO term.
|Retrieve all gene products related to execution phase of apoptosis|
Searching GO via other databases
We will search for all Uniprot entries with electron transfer activity.
|Get the GO accession number of electron transfer activity|
|Go to AmiGO and fetch the GO accession number of electron transfer activity:|
Type electron transfer activity in the search box. The GO accession number of electron transfer activity (formerly named electron carrier activity) is GO:0009055.
Now go to UniProt and find all related proteins.
|Search for all Uniprot entries with electron transfer activity.|
This returns an enormous number of proteins with electron carrier activity. The number is this high because:
- You also see the results from the highly redundant TrEMBL database
- There are many electron carrier proteins
- You get results from all species
- The keyword is recognized by Uniprot as a GO term and not only this term but also all child terms are included in the search
|Go to the UniProt record of the first hit and check if it has electron carrier activity|
Click the UniProt ID in the search results to go to the Uniprot record of this protein.
You are redirected to the QuickGO record for this function.
Fortunately, there are many ways to filter the UniProt results. The first is to distinguish high-quality, manually reviewed proteins from Swiss-Prot from proteins obtained by automatically translating all coding sequences in the EMBL nucleotide database (TrEMBL: translated EMBL).
|Search all high quality Uniprot-SwissProt entries with electron carrier activity|
In the Filter section you can choose to only view the Reviewed proteins which are obtained from SwissProt.
|Search for all human high quality Uniprot-SwissProt entries with electron carrier activity|
In the Filter section you can choose to only view the Human proteins which are obtained from SwissProt. Note that the Reviewed filter is still active.
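The filters chosen in the web interface correspond to clauses in a UniProt query string, so the same search can be written down in one line. The sketch below only builds the query and a search URL; it does not contact the server. The endpoint and the field names (`go`, `reviewed`, `organism_id`) are assumptions based on the UniProt website search syntax and should be checked against the current UniProt help pages. 9606 is the NCBI taxonomy identifier for human.

```python
from urllib.parse import urlencode

# Assumed endpoint of the UniProt search service.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def uniprot_go_query(go_id, reviewed_only=False, taxon_id=None):
    """Combine the GO term with the optional Reviewed and organism filters."""
    clauses = ["go:" + go_id.split(":")[-1]]
    if reviewed_only:
        clauses.append("reviewed:true")
    if taxon_id is not None:
        clauses.append(f"organism_id:{taxon_id}")
    return " AND ".join(clauses)


def uniprot_search_url(query, fmt="tsv"):
    """Return a URL-encoded search URL for the given query string."""
    return UNIPROT_SEARCH + "?" + urlencode({"query": query, "format": fmt})


query = uniprot_go_query("GO:0009055", reviewed_only=True, taxon_id=9606)
print(query)
print(uniprot_search_url(query))
```

Writing the query down this way also documents exactly which filters were active, which is easy to lose track of in the web interface.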
In the previous exercise you have seen that Uniprot records contain GO annotation, which means that you can find the GO terms of proteins via Uniprot.
There are also cross references between the InterPro database of protein domains and GO: Interpro records contain GO annotation.
|Get the terms assigned to the retinol-binding domain (IPR002449).|
The domain is involved in transport and can bind to retinoid.
|What are the biological processes that the protein encoded by this gene is involved in ?|
|On the gene page, go to the left menu and click GO: Biological process in the Ontologies section. This opens the GO annotation on the gene page. When you scroll down you see that the protein is involved in proteolysis and blood coagulation.|
|What are the molecular functions of the protein encoded by this gene ?|
|Click GO: Molecular function in the Ontologies section. You see that the protein can bind other proteins and calcium ions and has endopeptidase activity.|
KEGG is a set of 16 linked databases. The basic building blocks of KEGG are proteins and chemical substances. These building blocks are combined into modules (e.g. protein complexes) and pathways. Components, modules and pathways are linked to diseases and to drugs used to cure the diseases.
|Look at a map of the prion disease pathway from KEGG.|
|On the KEGG home page, you see a simple text search box: type prion in the box and click Search.|
You are redirected to a results overview page showing the results of your search in each of KEGG's 16 databases. Click the result from the KEGG Pathways database.
|What are the names of the chemical compounds that are related to the prion disease pathway according to KEGG ?|
|Go back to the KEGG Pathway record and click "KEGG COMPOUND (3)" in the right menu.|
STRING is a database containing known and predicted protein-protein interactions. It offers clear visualizations of interaction networks and provides evidence for predicted interactions.
Go to the STRING website.
|How to find the interaction network of a protein ?|
At the top of the results page, the interaction network is visualized. The network nodes are proteins. The edges represent the predicted functional associations. The color of the edges reflects the evidence:
- Red line - indicates the presence of fusion evidence
- Green line - neighborhood evidence
- Blue line - cooccurrence evidence
- Purple line - experimental evidence
- Yellow line - textmining evidence
- Light blue line - database evidence
- Black line - coexpression evidence.
Initially, only the top 10 interactors will be shown. You can increase this via the Settings. Confidence scores can be interpreted as follows:
- low confidence - 0.15 (or better)
- medium confidence - 0.4
- high confidence - 0.7
- highest confidence - 0.9
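The cutoffs listed above can be applied to any list of scored interactions. The sketch below maps a score to the matching confidence label; the function name and the label used for scores below 0.15 are assumptions.

```python
# Cutoffs from the list above, ordered from strictest to most lenient.
THRESHOLDS = [
    (0.9, "highest"),
    (0.7, "high"),
    (0.4, "medium"),
    (0.15, "low"),
]


def confidence_label(score):
    """Return the confidence class of a STRING-style score between 0 and 1."""
    for cutoff, label in THRESHOLDS:
        if score >= cutoff:
            return label
    return "below low-confidence cutoff"


for score in (0.95, 0.7, 0.5, 0.2, 0.1):
    print(score, confidence_label(score))
```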
We want to know the interaction partners of human BRCA2. Searching for BRCA2 as such would result in an overview of interactions for all BRCA2, not only from human but also from mouse, rat... To make the search specific we first go to UniProt and fetch the UniProt ID of human BRCA2.
The best place to search is PSICQUIC, since it gives access to a multitude of interaction databases. It is not really a database itself: it provides on-the-fly access to the interaction databases via a software program called a web service, so you always get the most up-to-date results. The interaction databases install the web service and thereby make their content accessible via PSICQUIC. Most PSICQUIC providers use UniProt IDs to describe their proteins, which makes the results from different databases easy to combine. This is another reason to use UniProt IDs.
|What is the UniProt identifier of human BRCA2?|
|Search for BRCA2 in UniProt: the identifier of the human copy is P51587.|
Go to PSICQUIC. You can see the list of databases you will search and their current status (online or offline).
As you can see in the help file, PSICQUIC offers the opportunity to search in specific fields of the interaction database records.
|Search for records in all online databases with BRCA2 in one of the two identifier fields.|
|Search for id: P51587 in PSICQUIC.|
This returns many interactions, but since you search many databases at the same time, many of them will be identical. This is why PSICQUIC can cluster the results, removing the redundancy. Details on the clustering can be found on the help page.
|View the unique interactions.|
|On the bottom of the results page click the Cluster this query button. Clustering will take a few moments, once it is done you can click the view link.|
You now see a list of interactions in which human BRCA2 participates: you can download this list and visualize the interaction network in Cytoscape.
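To illustrate what removing redundancy means, the sketch below collapses duplicate records from a tab-separated interaction list into unique partner pairs. It assumes a MITAB-style layout in which the first two columns hold the interactor identifiers in the form `uniprotkb:ACCESSION`. The rows are illustrative and truncated to three columns; they are not actual query output. PSICQUIC's own clustering is more elaborate than this.

```python
# Illustrative rows (assumed): interactor A, interactor B, source database.
MITAB = "\n".join([
    "\t".join(["uniprotkb:P51587", "uniprotkb:Q06609", "database_1"]),
    "\t".join(["uniprotkb:Q06609", "uniprotkb:P51587", "database_2"]),
    "\t".join(["uniprotkb:P51587", "uniprotkb:Q86YC2", "database_1"]),
    "\t".join(["uniprotkb:P51587", "uniprotkb:Q06609", "database_3"]),
])


def unique_interactions(mitab_text, query_id):
    """Return sorted (query, partner) pairs, ignoring order and duplicates."""
    partners = set()
    for line in mitab_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        columns = line.split("\t")
        a = columns[0].split(":", 1)[-1]
        b = columns[1].split(":", 1)[-1]
        if a == query_id:
            partners.add(b)
        elif b == query_id:
            partners.add(a)
    return [(query_id, partner) for partner in sorted(partners)]


for source, target in unique_interactions(MITAB, "P51587"):
    print(source, target)
```

The resulting two-column list has one source and one target per row, which is the shape a network import typically expects.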
ToppGene: the best tool for functional enrichment analysis
ToppGene is the most up-to-date portal for gene list functional enrichment.
The ToppFun tool returns enriched terms from GO, Mouse Phenotype, Pathways, Protein Interactions, Protein Domains, transcription factor binding sites, miRNA-target genes, disease-gene associations, drug-gene interactions, and gene expression sets, compiled from various data sources...
|How to do functional enrichment analysis with ToppFun ?|
|How to visualize the results of the functional enrichment analysis with ToppFun|
Click the Display chart link
DAVID: using GO / pathway annotation to interpret gene lists
Download the following file with gene names of DE genes from a microarray experiment in rat:
Microarrays were used to compare gene expression between left heart ventricle and diaphragm. So you expect to see enrichment of heart-related processes.
Follow the instructions on our wiki section on DAVID.
Enrichr: new tool for enrichment analysis
We will use the same file as in the previous exercise:
Follow the instructions on our wiki section on Enrichr.
WebGestalt: new tool for enrichment analysis
We will use a file containing Affymetrix IDs from the same microarray experiment as in the previous exercise:
Follow the instructions on our wiki section on WebGestalt.
IPA: network analysis using highly curated data
Only for VIB members and holders of an IPA license.
Follow the instructions on our wiki page on IPA
Cytoscape: free tool for network and pathway analysis
Cytoscape is a free tool for visualizing and analyzing interaction networks and pathways.
In Cytoscape, the two basic elements of a network are nodes and edges. A node represents an individual entity in a network, such as a protein. An edge represents a relationship between two nodes. Each edge has a source and a target: the source is the node from which the edge originates, and the target is the node where the edge terminates.
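The node and edge vocabulary maps directly onto a simple data structure: a list of (source, target) pairs. The sketch below derives the node set from such a list and counts how many edges touch each node. The edges are illustrative examples rather than curated data, and the code does not use the Cytoscape API.

```python
from collections import Counter

# Illustrative edge list (assumed): each edge is a (source, target) pair.
EDGES = [
    ("BRCA2", "RAD51"),
    ("BRCA2", "PALB2"),
    ("PALB2", "BRCA1"),
]


def nodes_of(edges):
    """Every entity that appears as a source or a target."""
    return sorted({node for edge in edges for node in edge})


def degree(edges):
    """Number of edges touching each node, ignoring edge direction."""
    counts = Counter()
    for source, target in edges:
        counts[source] += 1
        counts[target] += 1
    return counts


print(nodes_of(EDGES))
print(degree(EDGES).most_common())
```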
* Visualizing the BRCA2 interaction network
Open Cytoscape and import the file of the BRCA2 interactions that you downloaded from PSICQUIC. To load interaction data in Cytoscape you need to indicate which column contains the source nodes and which column contains the target nodes.
|Import the data of the BRCA2 interactions that you downloaded from PSICQUIC into Cytoscape|
The data is opened and visualized as a network.
iRegulon detects regulatory networks in a set of genes
Additional features are available as apps and one of the apps we're going to explore is iRegulon (see slides).
As an input, iRegulon needs a set of coregulated genes. We are going to use genes upregulated under hypoxia. You can download the set of genes from the Broad website.
Note that you need to register to download the gene set in text format.
To use the file in Cytoscape you have to remove the header lines (e.g. in WordPad) so that what remains is just a list of gene symbols.
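Removing the header lines can also be scripted instead of done by hand. The sketch below assumes the downloaded text file starts with two header lines (a set name and a description line) followed by one gene symbol per line. Check your file and adjust `header_lines` if it differs. The set name and the gene symbols in the example are placeholders.

```python
def gene_symbols(text, header_lines=2):
    """Drop the leading header lines and return the non-empty gene symbols."""
    lines = [line.strip() for line in text.splitlines()]
    return [line for line in lines[header_lines:] if line]


# Placeholder content (assumed layout): name, description, then gene symbols.
EXAMPLE = "\n".join([
    "EXAMPLE_HYPOXIA_SET",
    "> hypothetical description line",
    "VEGFA",
    "ADM",
    "",
])

symbols = gene_symbols(EXAMPLE)
print("\n".join(symbols))
```

To process a real download, read the file into a string first (for example with `open(path).read()`) and write the returned symbols to a new text file, one per line.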
|Open the data in Cytoscape|
A network of 171 unconnected nodes is created.
|Predict regulators and their targets|
The results are found in the Results panel: a list of enriched motifs.
Similar motifs are represented in the same color. You can view details of the motif by clicking a row in the table.
The Transcription factors tab contains a list of candidate TFs that can bind to the enriched motifs.
Using Cytoscape for network visualization
Note the typical Cytoscape vocabulary: the term "node" refers to genes and/or proteins, whereas the term "edge" is used to describe interactions between nodes in a network regardless of their type. More info in the Cytoscape documentation.