Functional annotation and enrichment analysis

Gene Ontology

The Gene Ontology (GO) is a collaboration between several databases. Its purpose is to provide a set of standardized terms and descriptions (terms that everyone in the field agrees on and uses) for biological processes, protein functions and cellular locations. A collection of such standardized terms, together with the relations that exist between them, is called an "ontology". GO is organized as a hierarchy with IS-A and PART-OF relations; strictly speaking it is a directed acyclic graph rather than a tree, because a term can have more than one parent. You can access GO via dedicated tools (QuickGO, AmiGO...) or via other databases (UniProt...). GO is used to interpret the results of high-throughput analyses (clusters, lists of coregulated genes...). By looking at the GO terms associated with the genes in a list, you can detect GO terms that are overrepresented in the list compared to the complete genome, and from these infer the likely function of the genes in the list.
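
The idea of overrepresentation can be made concrete with a small calculation. The sketch below is one common way to score it, a one-sided hypergeometric test; it is not necessarily the statistic used by any of the tools in these exercises. The function name and all numbers are invented for illustration.

```python
from math import comb


def enrichment_p_value(genome_size, genome_with_term, list_size, list_with_term):
    """Probability of seeing at least `list_with_term` annotated genes
    when `list_size` genes are drawn at random from the genome
    (one-sided hypergeometric test)."""
    max_hits = min(genome_with_term, list_size)
    total = comb(genome_size, list_size)
    tail = sum(
        comb(genome_with_term, k)
        * comb(genome_size - genome_with_term, list_size - k)
        for k in range(list_with_term, max_hits + 1)
    )
    return tail / total


# Invented example: 200 of 20000 genes carry the GO term,
# and 10 of the 100 genes in our list carry it.
p = enrichment_p_value(
    genome_size=20000, genome_with_term=200, list_size=100, list_with_term=10
)
print(f"p = {p:.3g}")
```

With these numbers only one annotated gene is expected by chance, so finding ten gives a very small p-value. In a real analysis many GO terms are tested at once, so the p-values would also need a multiple-testing correction.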

Searching GO using dedicated tools

The two main dedicated tools for accessing GO are QuickGO from EBI and AmiGO from GO itself.

Via QuickGO

Go to QuickGO and search for the GO biological process apoptosis. On the page containing the summary of the search results you can see that the ontologies are continuously changing, since a number of the listed terms are obsolete.

Click the GO accession number of the first obsolete GO term.

Go to the GO record of execution phase of apoptosis.

There are 3 main relationships in GO:

  • IS-A, e.g. a term is a special form of another term.
  • PART-OF, e.g. phosphatidylserine exposure on apoptotic cell surface is a part of the complete execution phase of apoptosis.
  • REGULATES, to indicate which processes regulate the execution phase of apoptosis. This relation is further subdivided into "positively regulates" and "negatively regulates".
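
To see how these relationships link terms together, the sketch below walks upward from a term through its IS-A and PART-OF parents. The dictionary is a small hand-written and simplified fragment, not a copy of the ontology, so check QuickGO for the actual relations; the function name is hypothetical.

```python
# Toy fragment: child term -> list of (relation, parent term).
# Hand-written and simplified for illustration only.
TOY_GO = {
    "phosphatidylserine exposure on apoptotic cell surface": [
        ("part_of", "execution phase of apoptosis"),
    ],
    "execution phase of apoptosis": [
        ("part_of", "apoptotic process"),
    ],
    "apoptotic process": [
        ("is_a", "programmed cell death"),
    ],
}


def ancestors(term, graph):
    """Return every term reachable from `term` by following
    is_a / part_of edges towards the root."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in graph.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


for name in sorted(
    ancestors("phosphatidylserine exposure on apoptotic cell surface", TOY_GO)
):
    print(name)
```

A protein annotated with the most specific term is implicitly annotated with all of these ancestors, which is why a search for a general term also returns proteins annotated with its child terms.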

Fortunately, you can filter the list of proteins.

Look at the first hit in the results table.

You now see the number of human annotations with experimental evidence. There should be around 90 proteins now.

Via AmiGO

Go to AmiGO. Here you can:

  • enter a gene or protein name and look for the corresponding GO terms
  • enter a GO term and search for the proteins that are linked to that term

We will first search for the GO terms of a protein.

  • Genes and gene products links to a list of GO terms and protein descriptions that contain the search term
  • Annotations links to a list of GO annotations of these proteins

You can filter the results.

You see the 9 term associations of yeast PCL5 in the center window.

Now we'll do the reverse search: we will search for proteins that are involved in a GO term.

Searching GO via other databases

*via UniProt

We will search for all UniProt entries with electron transfer activity.

Now go to UniProt and find all related proteins.

This returns an enormous number of proteins with electron carrier activity. The number is this high because:

  • You also see the results from the highly redundant TrEMBL database
  • There are many electron carrier proteins
  • You get results from all species
  • The keyword is recognized by UniProt as a GO term, and not only this term but also all its child terms are included in the search

Fortunately, there are many ways to filter the UniProt results. The first way is to distinguish between the high-quality, manually reviewed proteins from Swiss-Prot and the proteins obtained by translating all coding sequences in the EMBL nucleotide database (TrEMBL, translated EMBL).
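
The same filters can also be expressed as a query string instead of clicks. The sketch below only builds a URL and does not contact UniProt. It assumes the UniProt REST search endpoint and the query fields go, reviewed and organism_id, and it assumes that GO:0009055 is the accession for electron transfer activity; verify all of these against the UniProt help pages and QuickGO before relying on them. The function name is hypothetical.

```python
from urllib.parse import urlencode, unquote_plus

# Assumed endpoint of the UniProt REST search service.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def uniprot_go_query_url(go_id, reviewed_only=True, taxon_id=None):
    """Build (but do not send) a UniProt search URL for a GO term."""
    number = go_id.split(":")[-1]  # "GO:0009055" -> "0009055"
    clauses = [f"(go:{number})"]
    if reviewed_only:
        # Keep Swiss-Prot entries only, dropping TrEMBL.
        clauses.append("(reviewed:true)")
    if taxon_id is not None:
        # Restrict to one species by NCBI taxonomy ID.
        clauses.append(f"(organism_id:{taxon_id})")
    params = {
        "query": " AND ".join(clauses),
        "format": "tsv",
        "fields": "accession,id,protein_name,organism_name",
    }
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


# Assumption: GO:0009055 = electron transfer activity; 9606 = human.
url = uniprot_go_query_url("GO:0009055", reviewed_only=True, taxon_id=9606)
print(unquote_plus(url))
```

Each clause corresponds to one of the reasons for the large result set: the reviewed clause removes TrEMBL and the organism clause removes other species.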

In the previous exercise you have seen that UniProt records contain GO annotation, which means that you can find the GO terms of proteins via UniProt.

*via InterPro

There are also cross-references between the InterPro database of protein domains and GO: InterPro records contain GO annotation.

*via Ensembl

Go to Ensembl. Search the human F9 (Coagulation factor IX Precursor) gene and go to the gene page (see exercises on Ensembl for details).

Pathway information


KEGG is a set of 16 linked databases. The basic building blocks of KEGG are proteins and chemical substances. These building blocks are combined into modules (e.g. protein complexes) and pathways. Components, modules and pathways are linked to diseases and to the drugs used to treat those diseases.


STRING is a database containing known and predicted protein-protein interactions. It offers a clear visualization of interaction networks and shows the evidence behind predicted interactions.

Go to the STRING website.

At the top of the results page, the interaction network is visualized. The network nodes are proteins. The edges represent the predicted functional associations. The color of the edges reflects the evidence:

  • Red line - fusion evidence
  • Green line - neighborhood evidence
  • Blue line - cooccurrence evidence
  • Purple line - experimental evidence
  • Yellow line - textmining evidence
  • Light blue line - database evidence
  • Black line - coexpression evidence

When you scroll down you see the evidence table. You can click the dots in the table to get more details.

Initially, only the top 10 interactors will be shown. You can increase this via the Settings. Confidence scores can be interpreted as follows:

  • low confidence - 0.15 (or better)
  • medium confidence - 0.4
  • high confidence - 0.7
  • highest confidence - 0.9

There are also options to set how many interactions are shown: the 1st shell sets the number of interactors that connect directly to your input, and the 2nd shell sets the number of indirect interactors that connect to a protein in the first shell.
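
The confidence bands listed above can be written as a small lookup, which is handy when filtering a downloaded interaction table. This is a minimal sketch that assumes scores are on a 0 to 1 scale, as in the list; the function name and the label for scores below 0.15 are our own.

```python
def confidence_label(score):
    """Map a STRING-style confidence score (0-1) to the band names
    used in the list above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= 0.9:
        return "highest"
    if score >= 0.7:
        return "high"
    if score >= 0.4:
        return "medium"
    if score >= 0.15:
        return "low"
    return "below low-confidence cutoff"


for s in (0.95, 0.72, 0.40, 0.20, 0.05):
    print(s, confidence_label(s))
```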


We want to know the interaction partners of human BRCA2. Searching for BRCA2 as such would result in an overview of interactions for BRCA2 from all species: not only from human but also from mouse, rat... To make the search specific we first go to UniProt and fetch the UniProt ID of human BRCA2.

The best place to search is PSICQUIC, since it gives access to a multitude of interaction databases. It is not really a database itself: it provides on-the-fly access to the interaction databases via a software program called a web service, so you always get the most up-to-date results. The interaction databases install the web service and thereby make their content accessible via PSICQUIC. Most PSICQUIC providers use UniProt IDs to describe their proteins, which makes the results from different databases easy to combine. This is another reason to use UniProt IDs.

Go to PSICQUIC. You can see the list of databases you will search and their current status (online or offline).

As you can see in the help file, PSICQUIC offers the opportunity to search in specific fields of the interaction database records.
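
A field-specific search is written as field:value pairs joined by AND. The sketch below assembles such a query string. It assumes that identifier and species are valid field names in the PSICQUIC query language and that P51587 is the UniProt accession of human BRCA2; confirm both in the PSICQUIC help file and in UniProt. The helper function is hypothetical.

```python
def miql_query(**field_terms):
    """Join field:value pairs with AND; values containing spaces are quoted."""
    clauses = []
    for field, term in field_terms.items():
        text = str(term)
        value = f'"{text}"' if " " in text else text
        clauses.append(f"{field}:{value}")
    return " AND ".join(clauses)


# Assumed field names and accession, see the note above.
query = miql_query(identifier="P51587", species="human")
print(query)  # identifier:P51587 AND species:human
```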

This returns many interactions, but since you search many databases at the same time, many of them will be identical. This is why PSICQUIC can cluster the results, removing the redundancy. Details on the clustering can be found on the help page.
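
The effect of clustering can be illustrated with a few lines of code. The sketch below is a simplified stand-in, not PSICQUIC's actual clustering method: it treats A-B and B-A as the same interaction and records which databases reported it. The partner and database names are placeholders.

```python
def cluster_interactions(records):
    """Merge duplicate interactions.

    `records` is an iterable of (protein_a, protein_b, source_database).
    Returns a dict mapping an order-independent protein pair to the set
    of databases that reported it.
    """
    clustered = {}
    for protein_a, protein_b, source in records:
        pair = tuple(sorted((protein_a, protein_b)))
        clustered.setdefault(pair, set()).add(source)
    return clustered


# Placeholder records: three databases, partly reporting the same pair.
records = [
    ("P51587", "PARTNER_A", "db1"),
    ("PARTNER_A", "P51587", "db2"),
    ("P51587", "PARTNER_B", "db1"),
    ("P51587", "PARTNER_A", "db3"),
]

for pair, sources in cluster_interactions(records).items():
    print(pair, sorted(sources))
```

Four raw records collapse into two distinct interactions, and the number of supporting databases per interaction is kept as a rough indication of how well it is supported.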

You now see a list of interactions in which human BRCA2 participates: you can download this list and visualize the interaction network in Cytoscape.

ToppGene: the best tool for functional enrichment analysis

ToppGene is the most up-to-date portal for gene list functional enrichment.

The ToppFun tool returns enriched terms from GO, Mouse Phenotype, Pathways, Protein Interactions, Protein Domains, transcription factor binding sites, miRNA-target genes, disease-gene associations, drug-gene interactions, and gene expression sets, compiled from various data sources...

DAVID: using GO / pathway annotation to interpret gene lists

Download the following file with the gene names of differentially expressed (DE) genes from a microarray experiment in rat:

Microarrays were used to compare gene expression between left heart ventricle and diaphragm. So you expect to see enrichment of heart-related processes. Follow the instructions on our wiki section on DAVID.

Enrichr: new tool for enrichment analysis

We will use the same file as in the previous exercise:

Follow the instructions on our wiki section on Enrichr.

WebGestalt: new tool for enrichment analysis

We will use a file containing Affymetrix IDs from the same microarray experiment as in the previous exercise:

Follow the instructions on our wiki section on WebGestalt.

IPA: network analysis using highly curated data

Only for VIB members and holders of an IPA license.
Follow the instructions on our wiki page on IPA.

Cytoscape: free tool for network and pathway analysis

Cytoscape is a free tool for visualizing and analyzing interaction networks and pathways.

In Cytoscape, the two basic elements of a network are nodes and edges. A node represents an individual entity in a network, such as a protein. An edge represents a relationship between two nodes. Each edge has a source and a target: the source is the node from which the edge originates, and the target is the node where the edge terminates.

* Visualizing the BRCA2 interaction network

Open Cytoscape and import the file of the BRCA2 interactions that you downloaded from PSICQUIC. To load interaction data in Cytoscape you need to indicate which column contains the source nodes and which column contains the target nodes.
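
If you prefer to prepare the two columns yourself, the sketch below reduces a tab-separated download to source and target pairs. It assumes the file is in a MITAB-like layout where the first two columns hold the interactor IDs in the form database:identifier; check the header of your own download. The function name and the sample lines are invented.

```python
def mitab_to_edges(mitab_text):
    """Extract (source, target) pairs from tab-separated interaction lines,
    stripping a leading 'database:' prefix from each identifier."""
    edges = []
    for line in mitab_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        columns = line.split("\t")
        if len(columns) < 2:
            continue  # not an interaction line
        source = columns[0].split(":", 1)[-1]
        target = columns[1].split(":", 1)[-1]
        edges.append((source, target))
    return edges


# Invented sample lines in the assumed layout.
sample = (
    "#ID(s) interactor A\tID(s) interactor B\tother columns\n"
    "uniprotkb:P51587\tuniprotkb:PARTNER_A\tmore data\n"
    "uniprotkb:P51587\tuniprotkb:PARTNER_B\tmore data\n"
)

for source, target in mitab_to_edges(sample):
    print(source, target, sep="\t")
```

The printed two-column table can be saved and imported in Cytoscape, with the first column as source node and the second as target node.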

The data is opened and visualized as a network.


iRegulon detects regulatory networks in a set of genes

Additional features are available as apps and one of the apps we're going to explore is iRegulon (see slides).

As an input, iRegulon needs a set of coregulated genes. We are going to use genes upregulated under hypoxia. You can download the set of genes from the Broad website.
Note that you need to register to download the gene set in text format.
To use the file in Cytoscape you have to remove the header lines (e.g. in WordPad) so that what remains is just a list of gene symbols.
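
Removing the header can also be scripted instead of done by hand. The sketch below assumes that the header consists of the first two lines of the file (for example a gene set name and a description line); adjust header_lines if your download looks different. The function name and the sample text are invented.

```python
def gene_symbols_only(text, header_lines=2):
    """Drop the first `header_lines` lines and any blank lines,
    returning a plain list of gene symbols."""
    lines = [line.strip() for line in text.splitlines()]
    return [line for line in lines[header_lines:] if line]


# Invented sample in the assumed layout: two header lines, then symbols.
sample = "EXAMPLE_HYPOXIA_GENE_SET\n> description of the gene set\nGENE1\nGENE2\n\nGENE3\n"

symbols = gene_symbols_only(sample)
print("\n".join(symbols))
```

Writing the printed list to a text file gives one gene symbol per line, which is the form needed for import.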

A network of 171 unconnected nodes is created.

The results are found in the Results panel: a list of enriched motifs.


Similar motifs are represented in the same color. You can view details of the motif by clicking a row in the table.


The Transcription factors tab contains a list of candidate TFs that can bind to the enriched motifs.


Using Cytoscape for network visualization

Note the typical Cytoscape vocabulary: the term "node" refers to genes and/or proteins, whereas the term "edge" is used to describe interactions between nodes in a network regardless of their type. More info in the Cytoscape documentation.