The purpose of BioMart is to provide uniform access to a set of different biological databases.

You can run your searches on the web portal, called Bio Portal, or you can download and install the software on your own computer. We will use the web portal in these exercises, so go to the BioMart home page.
A simple BioMart query involves:

  • choosing a dataset to search in
  • setting filters to restrict the search space
  • specifying the type of data you want to retrieve
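The same three ingredients also appear when a query is written out as BioMart XML, which the portal can show for a query you built by clicking. The sketch below only builds such an XML string and does not contact any server. The helper `build_query` is a hypothetical name, and the dataset, filter and attribute names (`hsapiens_gene_ensembl`, `chromosome_name`, `ensembl_gene_id`, `hgnc_symbol`) are assumptions to be checked against the names the portal lists for your database version.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes):
    """Return a BioMart-style XML query string for a single dataset.

    dataset    -- internal dataset name (assumed, check in the portal)
    filters    -- dict of filter name -> value, restricts the search space
    attributes -- list of attribute names, the columns to retrieve
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",       # tab-separated output
        "header": "0",            # no header row
        "uniqueRows": "1",        # collapse duplicate rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# 1. dataset, 2. filters, 3. attributes (all names are assumptions)
xml_query = build_query(
    dataset="hsapiens_gene_ensembl",
    filters={"chromosome_name": "Y"},
    attributes=["ensembl_gene_id", "hgnc_symbol"],
)
print(xml_query)
```

Keeping the three parts as separate arguments mirrors the left menu of the web interface: one dataset, any number of filters, any number of attributes.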

*Exercise 1: human proteins with a retinol binding domain

Suppose we want to retrieve all human coding sequences of proteins with a retinol binding domain (IPR002449).
To start the search choose a database. For these exercises we will use ENSEMBL GENES 91.
Once you have selected a database, you can select a dataset from this database.

Next click Filters in the left menu to set filters on the search space. You can select filters by choosing or entering a value/option or by clicking a checkbox.

Next click Attributes in the left menu to choose the information that you want to retrieve. The Attributes (output types) are arranged into multiple sections which can be expanded. To choose an attribute simply click the checkbox next to its description.

When you are happy with the query you can preview the results by clicking the Results button in the top panel.

Exercise 2: proteins from human chromosome Y

Retrieve the HGNC gene symbols of the proteins from chromosome Y for further functional annotation.

We want to use the results for an enrichment analysis, so we need HGNC symbols only (enrichment tools like DAVID accept only one column of IDs as input).
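If the exported results contain empty cells (genes without a symbol) or repeated symbols, they can be cleaned up before pasting them into an enrichment tool. The following is a minimal sketch, assuming the results were saved as tab-separated text with the symbol in the first column. The function name `symbols_only` is hypothetical and the three symbols are only a toy example, not the real query result.

```python
def symbols_only(tsv_text, column=0):
    """Return the unique, non-empty values of one column, in first-seen order."""
    seen = set()
    symbols = []
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if column >= len(fields):
            continue  # row is too short to contain the requested column
        symbol = fields[column].strip()
        if symbol and symbol not in seen:
            seen.add(symbol)
            symbols.append(symbol)
    return symbols


# Toy input: one empty row and one repeated symbol
example = "SRY\nZFY\n\nSRY\nUTY\n"
id_list = "\n".join(symbols_only(example))
print(id_list)  # one symbol per line, ready to paste
```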

Copy the results and go to DAVID to do an enrichment analysis.

DAVID will now add annotations to these genes. It counts the number of times each annotation occurs in your list and compares this number to the frequency of that annotation over the complete genome. In this way it identifies annotations that occur more frequently in your list than expected from the genome as a whole (= enriched).
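To make the idea of "more frequent than expected" concrete, here is a small sketch of an over-representation test for a single annotation. It uses a plain one-sided hypergeometric test, which is an assumption made for illustration: DAVID reports its own modified Fisher exact score, so the numbers will not match DAVID's output exactly. All counts below are made up.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the annotation.

    k -- genes in your list with the annotation
    n -- size of your list
    K -- genes in the background (genome) with the annotation
    N -- size of the background
    """
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, upper + 1))
    return favourable / total


# Hypothetical counts: 8 of 50 list genes vs 400 of 20000 genome genes
k, n, K, N = 8, 50, 400, 20000
fold = (k / n) / (K / N)   # how much more frequent than in the background
p = enrichment_p(k, n, K, N)
print(f"fold enrichment: {fold:.1f}, p-value: {p:.2e}")
```

Because many annotations are tested at once, the raw p-values still need a multiple-testing correction, which is why the adjusted p-values are the ones to look at.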

It fetches annotations from different sources and you can look at the enriched annotations from each source separately. However, you can also look at the results of all sources combined, which is more informative in my opinion.

Relevant enriched annotations include:

  • spermatogenesis and related annotations
  • sexual differentiation and related annotations
These results were more or less expected for proteins encoded on a sex chromosome.

However, more striking is the enrichment of proteins involved in regulation of transcription linked to the following enriched annotations:

  • RNA binding proteins and related annotations
  • chromatin organisation and related annotations
The remaining two clusters have very high adjusted p-values, so I would not consider them.

*Exercise 3: ID conversion in BioMart

Remember the potential human TP53 targets from the exercises on finding TF binding motifs in DNA sequences. Originally, the ChIP-Seq experiment generated a list of gene names of potential TP53 targets. To use them in one of the RSAT tools, I had to convert the gene names to Ensembl Gene IDs.
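Once a two-column table of gene names and Ensembl Gene IDs has been exported, applying it to a list of names is a simple lookup. The sketch below assumes a tab-separated export with the gene name in the first column and the Ensembl Gene ID in the second. The function names are hypothetical, and the two mapping rows are illustrative and should be verified against BioMart rather than trusted as given.

```python
import csv
import io


def load_mapping(tsv_text):
    """Parse 'gene name<TAB>Ensembl Gene ID' rows into name -> list of IDs."""
    mapping = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) >= 2 and row[0] and row[1]:
            # one gene name can map to more than one Ensembl Gene ID
            mapping.setdefault(row[0], []).append(row[1])
    return mapping


def convert(names, mapping):
    """Return (converted names, names without any ID in the mapping)."""
    converted = {name: mapping[name] for name in names if name in mapping}
    missing = [name for name in names if name not in mapping]
    return converted, missing


# Illustrative export; verify the IDs in BioMart before use
export = "MDM2\tENSG00000135679\nCDKN1A\tENSG00000124762\n"
converted, missing = convert(["MDM2", "CDKN1A", "NOT_A_GENE"],
                             load_mapping(export))
print(converted)
print("not converted:", missing)
```

Reporting the names that could not be converted matters in practice, because outdated or alternative gene names silently drop out of the list otherwise.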