Gene set enrichment analysis

From BITS wiki

Biologically interpreting a list of genes, obtained with any method, is the major aim of gene set analysis, also called gene set enrichment analysis. As an alternative to sifting through the list manually, the researcher tests whether a set of genes is overrepresented in the list. The genes in such a set can share any property of biological significance, such as belonging to a certain gene ontology category or being located on the same chromosome. It is up to the researcher to define meaningful sets to test against his/her list.

Three major classes of methods exist to test for overrepresented gene sets.

Contingency tables


This method requires a list of gene ids and a background against which to test the gene list (e.g. all genes of that species). Typically, the gene id list is the result of a statistical test to which a predefined cut-off is applied (e.g. alpha of 0.05). Broadly speaking, a contingency table method compares the fraction of the gene list that belongs to a gene set (e.g. 10 lipid catabolism genes in a gene list of 250 genes) with the fraction of this gene set in the background (e.g. the genome contains 21 000 genes, of which 560 are lipid catabolism genes). Different statistical tests, e.g. Fisher's exact test, can compare these fractions and report whether the difference is significant.
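
The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any of the tools on this page: it computes the one-sided Fisher's exact test as the upper tail of the hypergeometric distribution, using the example numbers from the text. The function name `enrichment_p_value` is made up for this sketch, and the gene list is assumed to be a subset of the background.

```python
from math import comb


def enrichment_p_value(hits, list_size, set_size, background_size):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Returns P(X >= hits), where X is the number of gene-set members found
    when drawing `list_size` genes at random from the background.
    """
    max_hits = min(list_size, set_size)
    # Number of ways to draw a list with k set members, summed over k >= hits.
    numerator = sum(
        comb(set_size, k) * comb(background_size - set_size, list_size - k)
        for k in range(hits, max_hits + 1)
    )
    # Python divides arbitrarily large integers exactly before rounding to float.
    return numerator / comb(background_size, list_size)


# Example numbers from the text: 10 of 250 listed genes are lipid catabolism
# genes, against 560 lipid catabolism genes among 21 000 genes in the genome.
fold = (10 / 250) / (560 / 21000)
p_value = enrichment_p_value(hits=10, list_size=250, set_size=560, background_size=21000)
print(f"fold enrichment: {fold:.2f}, one-sided p-value: {p_value:.3f}")
```

The fold enrichment alone (fraction in the list divided by fraction in the background) does not say whether the difference could arise by chance; the p-value adds that information.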




Network view

  • BiNGO - Cytoscape plugin
  • ClueGO - Cytoscape plugin, integrates Gene Ontology (GO) terms as well as KEGG/BioCarta pathways


This method depends on a predefined threshold for generating the 'significantly differing' gene list on which to perform the analysis. This requirement for an arbitrary, user-defined threshold (for example, genes with a p-value > 0.01 are called not significant) causes a loss of information and influences the final result of the contingency table based gene set analysis.

To demonstrate this, we uploaded a dataset into Ingenuity Pathway Analysis (IPA). From this dataset we extracted 7 different gene lists, one per p-value cut-off: genes with a p-value < 10^-8, genes with a p-value < 10^-7, and so on up to genes with a p-value < 10^-2. Accordingly, the gene lists differ in size, ranging from 126 ids (10^-8 cut-off list) to ~1600 ids (10^-2 cut-off list). The dataset can be downloaded from the BITS website (dyslipidemia microarray data set).

IPA can detect differential regulation of pathways. When comparing the 7 gene lists, the successive inclusion of more genes changed which pathways were reported as regulated. The screenshots below show this by sorting the pathways by p-value for each of the lists.

[Screenshots Ipa cpc1.png to Ipa cpc7.png: pathways sorted by p-value, one screenshot per gene list]

There is no good solution to this problem. Comparing different thresholds, as done above, can however shed light on the impact of the arbitrarily chosen cut-off used to generate the gene list.
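
Such a threshold comparison can also be scripted. The sketch below is only an illustration under stated assumptions: the gene-level p-values are simulated, the gene ids and the gene set are hypothetical, and the enrichment test is a plain hypergeometric upper tail rather than the statistic computed by IPA. It applies the same 7 cut-offs (10^-8 to 10^-2) and reports how the list size and the enrichment p-value of one gene set change with the cut-off.

```python
import random
from math import comb


def upper_tail(hits, list_size, set_size, background_size):
    """P(X >= hits) under the hypergeometric distribution."""
    numerator = sum(
        comb(set_size, k) * comb(background_size - set_size, list_size - k)
        for k in range(hits, min(list_size, set_size) + 1)
    )
    return numerator / comb(background_size, list_size)


# Hypothetical toy data: 2000 genes with simulated gene-level p-values.
rng = random.Random(0)
p_values = {f"gene{i}": 10 ** -rng.uniform(0, 10) for i in range(2000)}
# Hypothetical gene set: every 20th gene, 100 members in total.
gene_set = {f"gene{i}" for i in range(0, 2000, 20)}

cutoffs = [10 ** -e for e in range(8, 1, -1)]  # 1e-8, 1e-7, ..., 1e-2
results = []
for cutoff in cutoffs:
    # Each looser cut-off yields a larger gene list that contains the previous one.
    gene_list = {gene for gene, p in p_values.items() if p < cutoff}
    hits = len(gene_list & gene_set)
    p_enrich = upper_tail(hits, len(gene_list), len(gene_set), len(p_values))
    results.append((cutoff, len(gene_list), hits, p_enrich))
    print(f"cut-off {cutoff:.0e}: {len(gene_list)} genes, {hits} in set, p = {p_enrich:.3f}")
```

Running the same gene set through every cut-off makes it visible when a conclusion holds for one threshold only.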

Using raw expression data

  • Goeman's global test
  • Hotelling's T²

Using gene-level statistics


These methods take as input all measured genes with their associated statistic, significance result (e.g. p-value) or expression values. In this way, they avoid having to set a predefined cut-off, as is the case with contingency tables (see above).


  • GSEA
  • GAGE - detection of sets with up- and down-regulated genes simultaneously
  • PIANO - a metapackage in R that combines many different methods into consensus results - top!
  • GSVA - Bioconductor package
  • PGSEA - Bioconductor package
  • SeqGSEA - Bioconductor, with features for RNA-seq data
  • PathwayProcessor - web interface to determine differentially regulated pathways
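
The idea of using all gene-level statistics without a cut-off can be illustrated with a very simple test. The sketch below is a gene-sampling permutation test on the mean statistic of a set. It is not the GSEA algorithm and does not reproduce any of the packages listed above. The data are simulated, and the names `gene_sampling_p_value`, `stats` and `shifted_set` are hypothetical.

```python
import random


def gene_sampling_p_value(stats, gene_set, n_perm=2000, seed=0):
    """Permutation p-value for a gene set having a higher mean statistic
    than randomly sampled gene sets of the same size."""
    rng = random.Random(seed)
    genes = sorted(stats)
    members = [gene for gene in genes if gene in gene_set]
    observed = sum(stats[gene] for gene in members) / len(members)
    exceed = 0
    for _ in range(n_perm):
        # Null model: a random set of equal size drawn from all measured genes.
        sample = rng.sample(genes, len(members))
        if sum(stats[gene] for gene in sample) / len(sample) >= observed:
            exceed += 1
    # Add-one correction so the estimate is never exactly zero.
    return (exceed + 1) / (n_perm + 1)


# Hypothetical toy data: 500 genes with a simulated gene-level statistic
# (for example a t-statistic); the first 25 genes are shifted upwards.
rng = random.Random(1)
stats = {f"gene{i}": rng.gauss(0, 1) for i in range(500)}
shifted_set = {f"gene{i}" for i in range(25)}
for gene in shifted_set:
    stats[gene] += 1.5

p_shifted = gene_sampling_p_value(stats, shifted_set)
print(f"permutation p-value for the shifted set: {p_shifted:.4f}")
```

Every gene contributes its statistic here, so a set of many moderately changed genes can be detected even when none of its members would pass a strict cut-off.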

Pathway knowledge databases

You can fetch gene sets from different pathway databases.