Functional annotation and enrichment analysis

From BITS wiki

Gene Ontology

The Gene Ontology (GO) is a collaboration between several databases. The purpose of GO is to provide a set of standardized (everyone in the field agrees on them and uses them) terms and descriptions of biological processes, protein functions and cellular locations. A collection of such standardized terms and the relations that exist between them is called an "ontology". GO has a hierarchical structure with IS-A and PART-OF relations (strictly speaking a directed acyclic graph rather than a tree, since a term can have more than one parent). You can access GO via dedicated tools (QuickGO, AmiGO...) or via other databases (UniProt...). It is used to interpret the results of high-throughput analyses (clusters, lists of coregulated genes...). By looking at the GO terms associated with the genes in a list, you can detect GO terms that are overrepresented in the list compared to the complete genome, and from these infer the likely function of the genes in the list.

Searching GO using dedicated tools

The two main dedicated tools for accessing GO are QuickGO from EBI and AmiGO from GO itself.

Via QuickGO

Go to http://www.ebi.ac.uk/QuickGO/. Search for the GO biological process apoptosis. On the page containing the summary of the search results you see that the ontologies are continuously changing since a number of listed terms are obsolete.

Click the GO accession number of the first obsolete GO term.

Go to the GO record of execution phase of apoptosis.

There are 3 main relationships in GO:

  • IS-A, e.g. a term is a special form of another term.
  • PART-OF, e.g. phosphatidylserine exposure on apoptotic cell surface is a part of the complete execution phase of apoptosis.
  • REGULATES, to indicate which processes regulate the execution phase of apoptosis. This relation is further subdivided into "positively regulates" and "negatively regulates".
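The relations listed above can be followed upward to collect all ancestors of a term, which is what tools do when they include parent terms in a search. The sketch below uses a hand-written toy graph: the term names are taken from GO, but the dictionary `TOY_GO` and the function `ancestors` are illustrative assumptions, not an official GO data structure or API.

```python
from collections import deque

# Toy fragment of an ontology: child term -> list of (relation, parent term).
# Hand-written for illustration only; not downloaded from GO.
TOY_GO = {
    "phosphatidylserine exposure on apoptotic cell surface": [
        ("part_of", "execution phase of apoptosis"),
    ],
    "execution phase of apoptosis": [("part_of", "apoptotic process")],
    "apoptotic process": [("is_a", "programmed cell death")],
    "programmed cell death": [],
}


def ancestors(term, graph, relations=("is_a", "part_of")):
    """Return all terms reachable from `term` by following `relations` upward."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for relation, parent in graph.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(ancestors("execution phase of apoptosis", TOY_GO)))
```

Restricting `relations` to `("is_a",)` shows how the result depends on which relation types you choose to follow.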

Fortunately, you can filter the list of proteins.

Look at the first hit in the results table.

You now see the number of human annotations with experimental evidence. There should be around 90 proteins now.


Via AmiGO

Go to AmiGO. Here you can:

  • enter a gene or protein name and look for the corresponding GO terms
  • enter a GO term and search for the proteins that are linked to that term

We will first search for the GO terms of a protein

  • Genes and gene products links to a list of GO terms and protein descriptions that contain the search term
  • Annotations links to a list of GO annotations of these proteins

You can filter the results.

You see the 9 term associations of yeast PCL5 in the center window.

Now we'll do the reverse search: we will search for proteins that are involved in a GO term.


Searching GO via other databases

Via UniProt

We will search for all UniProt entries with electron transfer activity.

Now go to UniProt and find all related proteins.

This returns an enormous number of proteins with electron transfer activity. This high number is because:

  • You also see the results from the highly redundant TrEMBL database
  • There are many electron carrier proteins
  • You get results from all species
  • The keyword is recognized by UniProt as a GO term, and not only this term but also all its child terms are included in the search

Fortunately, there are many ways to filter the UniProt results. The first way is to distinguish the high-quality, manually reviewed proteins from Swiss-Prot from the proteins obtained by translating all coding sequences in the EMBL nucleotide database (TrEMBL: translated EMBL).
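The same filters can be expressed as a query against the UniProt REST service. The sketch below only builds the URL and does not contact the server. It assumes the query fields `go`, `reviewed` and `organism_id` and the GO identifier 0009055 for electron transfer activity; the helper name `uniprot_search_url` is made up for this example, so check the current UniProt documentation before relying on it.

```python
from urllib.parse import urlencode


def uniprot_search_url(go_id, reviewed_only=True, taxon_id=None):
    """Build a UniProtKB search URL for a GO term, optionally restricted
    to reviewed (Swiss-Prot) entries and to one organism."""
    terms = [f"(go:{go_id})"]
    if reviewed_only:
        terms.append("(reviewed:true)")  # Swiss-Prot only, no TrEMBL
    if taxon_id is not None:
        terms.append(f"(organism_id:{taxon_id})")  # NCBI taxonomy ID
    params = {
        "query": " AND ".join(terms),
        "format": "tsv",
        "fields": "accession,gene_names,organism_name",
    }
    return "https://rest.uniprot.org/uniprotkb/search?" + urlencode(params)


# Human (taxonomy ID 9606), reviewed entries only.
print(uniprot_search_url("0009055", taxon_id=9606))
```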

In the previous exercise you have seen that UniProt records contain GO annotation, which means that you can find the GO terms of proteins via UniProt.


Via InterPro

There are also cross references between the InterPro database of protein domains and GO: InterPro records contain GO annotation.


Via Ensembl

Go to Ensembl. Search the human F9 (Coagulation factor IX Precursor) gene and go to the gene page (see exercises on Ensembl for details).


Pathway information

KEGG

KEGG is a set of 16 linked databases. The basic building blocks of KEGG are proteins and chemical substances. These building blocks are combined into modules (e.g. protein complexes) and pathways. Components, modules and pathways are linked to diseases and to drugs used to cure the diseases.


STRING

STRING is a database containing known and predicted protein-protein interactions. It offers clear visualization of interaction networks and provides the evidence behind predicted interactions.

Go to the STRING website.

At the top of the results page, the interaction network is visualized. The network nodes are proteins. The edges represent the predicted functional associations. The color of the edges reflects the evidence:

  • Red line - indicates the presence of fusion evidence
  • Green line - neighborhood evidence
  • Blue line - cooccurrence evidence
  • Purple line - experimental evidence
  • Yellow line - textmining evidence
  • Light blue line - database evidence
  • Black line - coexpression evidence.
When you scroll down you see the evidence table. You can click the dots in the table to get more details.

Initially, only the top 10 interactors will be shown. You can increase this via the Settings. Confidence scores can be interpreted as follows:

  • low confidence - 0.15 (or better)
  • medium confidence - 0.4
  • high confidence - 0.7
  • highest confidence - 0.9
There are options to set how many interactors are shown: the 1st shell sets the number of proteins that interact directly with your input, and the 2nd shell sets the number of proteins that interact indirectly, via a protein in the first shell.
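The confidence bands listed above can be written down as a small lookup, which is handy when you filter a downloaded interaction table. This is a minimal sketch that uses the cutoffs from this page; the function name `confidence_band` is made up for the example and is not part of STRING.

```python
def confidence_band(score):
    """Map a combined confidence score (0 to 1) to the band names used above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= 0.9:
        return "highest"
    if score >= 0.7:
        return "high"
    if score >= 0.4:
        return "medium"
    if score >= 0.15:
        return "low"
    return "below low-confidence cutoff"


for value in (0.95, 0.7, 0.5, 0.2, 0.1):
    print(value, confidence_band(value))
```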


PSICQUIC

We want to know the interaction partners of human BRCA2. Searching for BRCA2 as such would result in an overview of interactions for all BRCA2, not only from human but also from mouse, rat... To make the search specific we first go to UniProt and fetch the UniProt ID of human BRCA2.

The best place to search is PSICQUIC, since it gives access to a multitude of interaction databases. It is not really a database itself: it provides on-the-fly access to the interaction databases via a software program called a web service, so you always get the most up-to-date results. The interaction databases install the web service and thereby make their content accessible via PSICQUIC. Most PSICQUIC providers use UniProt IDs to describe their proteins, which makes the results from different databases easy to combine. This is another reason to use UniProt IDs.

Go to PSICQUIC. You can see the list of databases you will search and their current status (online or offline).

As you can see in the help file, PSICQUIC offers the opportunity to search in specific fields of the interaction database records.

This returns many interactions but since you search many databases at the same time many of them will be identical. This is why PSICQUIC can cluster the results, removing the redundancy. Details on the clustering can be found on the help page.
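To illustrate what removing redundancy means here, the sketch below merges records that describe the same pair of proteins, regardless of the order in which the two partners are listed, and remembers which databases reported the pair. It is a simplified stand-in, not the clustering procedure PSICQUIC itself uses. P51587 is the UniProt accession of human BRCA2; the partner IDs and the records are hypothetical.

```python
def cluster_interactions(records):
    """Merge records that describe the same unordered protein pair.

    `records` is an iterable of (id_a, id_b, source_database) tuples.
    Returns a dict mapping each sorted pair to the sorted list of sources.
    """
    clustered = {}
    for id_a, id_b, source in records:
        pair = tuple(sorted((id_a, id_b)))  # A-B and B-A are the same pair
        clustered.setdefault(pair, set()).add(source)
    return {pair: sorted(sources) for pair, sources in clustered.items()}


# Hypothetical records; PARTNER_1 and PARTNER_2 are placeholder IDs.
records = [
    ("P51587", "PARTNER_1", "IntAct"),
    ("PARTNER_1", "P51587", "MINT"),
    ("P51587", "PARTNER_2", "BioGRID"),
    ("P51587", "PARTNER_1", "IntAct"),
]
for pair, sources in cluster_interactions(records).items():
    print(pair, sources)
```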

You now see a list of interactions in which human BRCA2 participates: you can download this list and visualize the interaction network in Cytoscape.

What’s the biology behind a list of genes?

Omics experiments typically generate lists of hundreds of interesting genes:

  • up- or downregulated genes identified in an RNA-Seq experiment
  • somatically mutated genes in a tumor identified by exome sequencing
  • proteins that interact with a bait identified in a proteomics experiment

Over-representation analysis

Since it’s impractical to evaluate each gene individually, the most meaningful approach is to see what functional annotations the genes in the list have in common, e.g. are many of them involved in the same pathway?

Functional characterization of a gene list involves the following steps:

  1. Add functional annotations to the genes in the list
  2. Define a background: typically the full set of all genes that were detected in the experiment
  3. Perform a statistical test to identify enriched functions, diseases, pathways

Enriched means over-represented, occurring more frequently in the list than expected by chance based on the background data.
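A common choice for the statistical test in step 3 is the one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test). The sketch below is an assumption about the kind of test used, not a description of what any particular tool on this page implements; the function and parameter names are made up for the example.

```python
from math import comb


def enrichment_p_value(overlap, list_size, annotated_in_background, background_size):
    """Probability of seeing at least `overlap` annotated genes when drawing
    `list_size` genes at random from a background of `background_size` genes,
    of which `annotated_in_background` carry the annotation."""
    total = comb(background_size, list_size)
    upper = min(list_size, annotated_in_background)
    tail = sum(
        comb(annotated_in_background, k)
        * comb(background_size - annotated_in_background, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# 200 genes in the list, 15 of them in a pathway that has 100 members
# among 12000 background genes (all numbers invented for illustration).
print(enrichment_p_value(15, 200, 100, 12000))
```

With these invented numbers the expected overlap by chance is 200 × 100 / 12000, about 1.7 genes, so an overlap of 15 gives a very small p-value.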

It is recommended to characterize up- and downregulated genes separately.

!! Thousands of pathways are tested for enrichment, which could lead to false positives. Multiple testing correction adjusts the p-values from the individual enrichment tests to reduce the chance of false positives !!
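One widely used correction is the Benjamini-Hochberg procedure, which controls the false discovery rate. The sketch below is a plain implementation for illustration; the tools discussed on this page may use this or another correction method, so check each tool's settings.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```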

ToppGene: a very up-to-date tool for functional enrichment analysis

Yays for ToppGene

ToppGene is the most up-to-date portal for gene list functional enrichment. See this overview of their resources of functional annotations and their last update date.

The ToppFun tool returns enriched terms from GO, Mouse Phenotype, Pathways, Protein Interactions, Protein Domains, transcription factor binding sites, miRNA-target genes, disease-gene associations, drug-gene interactions, and gene expression sets, compiled from various data sources...

It supports gene symbols, Ensembl, Entrez, RefSeq and UniProt IDs from human. However, since gene symbols for human, mouse and rat are identical the tool can also be used for mouse and rat.

Nays for ToppGene

In ToppGene the background is by default the full set of all annotated genes in the genome. As a result the analysis will not control for experimental bias (bias of diseases, functions, pathways, and upstream regulators results toward genes expressed in the tissue that you are studying) and you will see enriched pathways... that are just related to the tissue or cell type you work in and not to the conditions you study.

How to do functional enrichment analysis with ToppFun?

We will use the upregulated genes of our RNASeq training for enrichment analysis.


DAVID: most outdated but allows setting a custom background

Yays for DAVID

In DAVID the background can be specified by the user. As a result the user can submit the genes that show some evidence of expression in her experiment thereby controlling for experimental bias (bias of diseases, functions, pathways, and upstream regulators results toward genes expressed in the tissue that was studied). So you will not see enriched pathways... that are just related to the tissue or cell type you work in but only those that were affected by the conditions you study.

It supports various IDs: gene symbols, Ensembl, Entrez, RefSeq and UniProt IDs from an extensive list of organisms (they claim to support 65,000 species).

Nays for DAVID

DAVID is the least up-to-date portal for gene list functional enrichment. See this overview to see when the last update was performed.

How to do enrichment analysis in DAVID

We will use the Ensembl IDs of the upregulated genes of our RNASeq training for enrichment analysis.

As a background we will use the Ensembl IDs of the genes that showed at least minimal expression in our experiment, obtained by removing genes with fewer than 10 counts over all samples.
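The background filter described above can be sketched as follows. The threshold of 10 counts summed over all samples comes from this page; the gene IDs and count values are hypothetical, and `expressed_genes` is a name made up for the example.

```python
def expressed_genes(counts_by_gene, min_total=10):
    """Return the IDs of genes whose counts, summed over all samples,
    reach at least `min_total`."""
    return [
        gene_id
        for gene_id, counts in counts_by_gene.items()
        if sum(counts) >= min_total
    ]


# Hypothetical raw counts for four genes in three samples.
counts = {
    "GENE_A": [0, 1, 2],      # total 3, removed
    "GENE_B": [5, 3, 2],      # total 10, kept
    "GENE_C": [120, 98, 143],
    "GENE_D": [0, 0, 0],
}
print(expressed_genes(counts))
```

The resulting list is what you would paste or upload as the background in a tool that accepts one.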


WebGestalt: all organisms but one resource at a time

Yays for WebGestalt

In WebGestalt the background can be specified by the user. As a result the user can submit the genes that show some evidence of expression in her experiment thereby controlling for experimental bias (bias of diseases, functions, pathways, and upstream regulators results toward genes expressed in the tissue that was studied). So you will not see enriched pathways... that are just related to the tissue or cell type you work in but only those that were affected by the conditions you study.

This tool largely overlaps with DAVID in its data sources but updates them more regularly; the last update was done in 2019.

It supports various IDs: gene symbols, Ensembl, Entrez, RefSeq and UniProt IDs from 12 different model organisms. For other organisms it allows you to upload your own functional annotation database (see section 3.1 of the manual of this tool).

How to do enrichment analysis in WebGestalt

We will use the upregulated genes of our RNASeq training for enrichment analysis.

As a background we will use the Ensembl IDs of the genes that showed at least minimal expression in our experiment, obtained by removing genes with fewer than 10 counts over all samples.

Repeat the enrichment analysis on Wiki pathways. Many more tables can be generated in WebGestalt and you should choose the type of enrichment that fits your experimental needs. Data can be saved back to disk for further use.


g:Profiler: many organisms but limited resources

Yays for g:Profiler

g:Profiler supports a long list of organisms and allows you to upload your own background file.

It is very regularly updated.

Nays for g:Profiler

It has fewer resources than the other tools, since it retrieves its functional annotations (GO terms, pathways, networks, regulatory motifs, and disease phenotypes) from Ensembl.

How to do enrichment analysis in g:Profiler

We will use the upregulated genes of our RNASeq training for enrichment analysis.

As a background we will use the Ensembl IDs of the genes that showed at least minimal expression in our experiment, obtained by removing genes with fewer than 10 counts over all samples.


Gene set enrichment analysis

Some omics experiments generate a ranked list of genes:

  • genes ranked by differential expression score from an RNA-Seq experiment
  • genes ranked by sensitivity in a genome-wide CRISPR screen
  • mutated genes ranked by a score from a cancer driver prediction method

To analyze these lists, the following steps are taken:

  1. The genes are divided into groups based on functional annotation (gene sets)
  2. For every group enrichment of high or low scores is calculated

Groups of related genes are called gene sets: a pathway gene set includes all genes in a pathway.

This is why this type of analysis is called GSEA, Gene Set Enrichment Analysis. It assumes a whole ranked list (after filtering genes with very low counts) as input.
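The idea of scoring a gene set against a ranked list can be sketched with a running sum: walk down the ranking, step up at every gene that belongs to the set and step down otherwise, and report the largest deviation from zero. This is a simplified, unweighted variant for illustration only. The GSEA software from the Broad Institute weights the steps by the ranking metric and assesses significance by permutation, neither of which is shown here.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score of `gene_set` in `ranked_genes`.

    Positive values mean the set is concentrated at the top of the ranking,
    negative values mean it is concentrated at the bottom.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_misses = len(hits) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("ranking must contain both members and non-members of the set")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_misses
        if abs(running) > abs(best):
            best = running
    return best


ranking = ["g1", "g2", "g3", "g4", "g5", "g6"]  # hypothetical gene IDs
print(enrichment_score(ranking, {"g1", "g2"}))  # set at the top
print(enrichment_score(ranking, {"g5", "g6"}))  # set at the bottom
```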

GSEA

GSEA is most often done in R or via software that you install on your computer like GSEA from the Broad Institute.

GSEA is recommended when ranks are available for most of the genes in the genome (e.g. RNA-Seq data). It is not suitable when only a small portion of genes have ranks available (e.g. an experiment that identifies mutated cancer genes).

You have to install the tool on your computer. An icon will appear on your desktop.

Input files

The format of the input file is very important. It should be a tab-delimited text file where:

  • column 1 should contain gene IDs
  • column 2 should contain descriptions but may be NAs
  • next columns should contain normalized counts (one column/sample).

Columns must have headers:

  • NAME for column 1
  • Description for column 2
  • Sample names for the next columns

The first line of the file should be: #1.2
The second line should be: number_of_genes tab number_of_samples


FunGSEA_FF.png

Save the file as .gct !
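If your normalized counts are already in a table, a file with the layout described above can be generated rather than edited by hand. The sketch below follows the layout given on this page; the function name `format_gct`, the gene IDs and the values are hypothetical.

```python
def format_gct(sample_names, rows):
    """Return the text of a .gct file.

    `rows` is a list of (gene_id, description, values) tuples, where
    `values` holds one normalized count per sample.
    """
    lines = ["#1.2", f"{len(rows)}\t{len(sample_names)}"]
    lines.append("\t".join(["NAME", "Description", *sample_names]))
    for gene_id, description, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"{gene_id}: expected {len(sample_names)} values")
        lines.append("\t".join([gene_id, description, *(str(v) for v in values)]))
    return "\n".join(lines) + "\n"


text = format_gct(
    ["ctrl_1", "ctrl_2", "treat_1", "treat_2"],
    [("GENE_A", "NA", [10.5, 11.0, 30.2, 28.9]),
     ("GENE_B", "NA", [5.0, 4.8, 5.1, 5.3])],
)
print(text)
# To save it: open("expression.gct", "w").write(text)
```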

Apart from these data you also need a .cls file with the metadata (grouping info of the samples). This is a space delimited text file:

  • line 1: number_of_samples space number_of_groups space 1
  • line 2: # space class0_name space class1_name
  • line 3: for every sample 0 or 1 separated by spaces
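The three lines described above can be generated from a list of group labels, one per sample, in the same order as the sample columns of the .gct file. This is a minimal sketch; `format_cls` and the group names are made up for the example.

```python
def format_cls(sample_groups):
    """Return the text of a .cls file for the given per-sample group labels."""
    class_names = []
    for group in sample_groups:
        if group not in class_names:
            class_names.append(group)  # classes are numbered in order of appearance
    indices = [str(class_names.index(group)) for group in sample_groups]
    lines = [
        f"{len(sample_groups)} {len(class_names)} 1",
        "# " + " ".join(class_names),
        " ".join(indices),
    ]
    return "\n".join(lines) + "\n"


print(format_cls(["control", "control", "treated", "treated"]))
```

Group names must not contain spaces, since the file is space delimited.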


FunGSEA_FF2.png

Analysis

Originally GSEA was created to analyze microarray results but you can use it for analyzing RNA-Seq data, albeit with some tweaking of the parameter settings.

We will use the filtered genes of our RNASeq training for gene set enrichment analysis. You also need the corresponding cls file.

g:Profiler

If you only have scores for a subset of the genome you should analyze the data using g:Profiler with the Ordered query option.


FungP_Ranked.png

Your list should consist of gene IDs ordered according to decreasing importance (in this case increasing corrected p-value for differential expression).

g:Profiler performs enrichment analysis with increasingly larger numbers of genes starting from the top of the list. This procedure identifies functional annotations that associate to the most dramatic changes, as well as broader terms that characterize the gene set as a whole.
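The incremental procedure can be illustrated by testing every prefix of the ordered list against one functional term and keeping the prefix with the smallest p-value. This is a simplified sketch under stated assumptions: it uses a one-sided hypergeometric test, assumes every gene of the term is part of the background, and ignores multiple testing. It is not g:Profiler's actual implementation, and all names are made up for the example.

```python
from math import comb


def tail_p(overlap, list_size, annotated, background_size):
    """One-sided hypergeometric p-value: P(X >= overlap)."""
    total = comb(background_size, list_size)
    tail = sum(
        comb(annotated, k) * comb(background_size - annotated, list_size - k)
        for k in range(overlap, min(list_size, annotated) + 1)
    )
    return tail / total


def best_prefix(ordered_genes, term_genes, background_size):
    """Return (prefix_length, p_value) of the most significant prefix."""
    annotated = len(term_genes)
    best = (0, 1.0)
    overlap = 0
    for length, gene in enumerate(ordered_genes, start=1):
        if gene in term_genes:
            overlap += 1
        p_value = tail_p(overlap, length, annotated, background_size)
        if p_value < best[1]:
            best = (length, p_value)
    return best


ordered = ["g1", "g2", "g3", "g4", "g5", "g6"]  # hypothetical, most important first
print(best_prefix(ordered, {"g1", "g2"}, background_size=20))
```

In this toy case the term is most significant for the first two genes, which is how a term tied to the most dramatic changes would show up.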

Repeat the enrichment analysis with the ranked gene list.


Resources of functional annotation

Functional annotations can be very diverse: molecular functions, pathways (genes that work together to carry out a biological process), interactions, gene regulation, involvement in disease…

Online enrichment analysis tools often have functional annotation built in for a limited set of organisms, but some tools like WebGestalt also allow you to upload your own annotation.

Pathguide contains info about hundreds of pathway and molecular interaction related resources. It allows organism-based searches to find resources that contain functional info on the organism you work on.

MSigDB, maintained by the GSEA team, contains gene sets based on GO, pathways, omics studies, sequence motifs, chromosomal position, oncogenic and immunological expression signatures, and various computational analyses. The GSEA tool from Broad will use this database by default.


Choosing the right background

Functional enrichment methods require the definition of background genes for comparison. All annotated protein-coding genes are often used as default. This leads to false positives if the experiment measured only a subset of all genes. So you should use a custom background if the tool allows it: e.g. the filtered genes from an RNASeq experiment or all proteins that were detected in a proteomics experiment.


IPA: network analysis using highly curated data

Only for VIB members and holders of an IPA license.
Follow the instructions on our wiki page on IPA

Cytoscape: free tool for network and pathway analysis

Cytoscape is a free tool for visualizing and analyzing interaction networks and pathways.

In Cytoscape, the two basic elements of a network are nodes and edges. A node represents an individual entity in a network, such as a protein. An edge represents a relationship between two nodes. Each edge has a source and a target: the source is the node from which the edge originates, and the target is the node where the edge terminates.

* Visualizing the BRCA2 interaction network

Open Cytoscape and import the file of the BRCA2 interactions that you downloaded from PSICQUIC. To load interaction data in Cytoscape you need to indicate which column contains the source nodes and which column contains the target nodes.
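To see what choosing a source and a target column amounts to, the sketch below reads a tab-delimited interaction table into a list of edges and builds a neighbour lookup from it. It assumes the first two columns hold the two interactors, as in the tab-delimited MITAB format; the partner IDs and the function names are hypothetical, and nothing here uses Cytoscape itself.

```python
import csv
import io


def read_edges(text, source_column=0, target_column=1):
    """Parse tab-delimited text into a list of (source, target) edges."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    edges = []
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment/header lines
        edges.append((row[source_column], row[target_column]))
    return edges


def neighbours(edges):
    """Return a dict mapping each node to the set of nodes it is connected to."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


# Hypothetical table: P51587 is human BRCA2, the partners are placeholders.
table = (
    "#ID A\tID B\tsource\n"
    "uniprotkb:P51587\tuniprotkb:PARTNER_1\tIntAct\n"
    "uniprotkb:PARTNER_2\tuniprotkb:P51587\tMINT\n"
)
print(neighbours(read_edges(table)))
```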

The data is opened and visualized as a network.


CytoC.png

iRegulon detects regulatory networks in a set of genes

Additional features are available as apps and one of the apps we're going to explore is iRegulon (see slides).

As an input, iRegulon needs a set of coregulated genes. We are going to use genes upregulated under hypoxia. You can download the set of genes from the Broad website.
Note that you need to register to download the gene set in text format.
To use the file in Cytoscape you have to remove the header lines (e.g. in WordPad) so that what remains is just a list of gene symbols.
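Removing the header lines can also be scripted. The sketch below assumes the downloaded text file starts with two header lines (the gene set name and a description line) followed by one gene symbol per line; check your own file and adjust `header_lines` if it differs. The set name and gene symbols shown are placeholders.

```python
def gene_symbols(text, header_lines=2):
    """Drop the first `header_lines` lines and return the remaining
    non-empty lines as a list of gene symbols."""
    lines = [line.strip() for line in text.splitlines()]
    return [line for line in lines[header_lines:] if line]


# Placeholder content standing in for the downloaded gene set file.
downloaded = "EXAMPLE_GENE_SET\n> description of the set\nGENE_A\nGENE_B\n\nGENE_C\n"
print("\n".join(gene_symbols(downloaded)))
```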

A network of 171 unconnected nodes is created.

The results are found in the Results panel: a list of enriched motifs.


Cyto4.png

Similar motifs are represented in the same color. You can view details of the motif by clicking a row in the table.


Cyto6.png

The Transcription factors tab contains a list of candidate TFs that can bind to the enriched motifs.


Cyto5.png

Using Cytoscape for network visualization

Note the typical Cytoscape vocabulary: the term "node" refers to genes and/or proteins, whereas the term "edge" is used to describe interactions between nodes in a network regardless of their type. More info in the Cytoscape documentation.