Uniprot


The UniProt database

UniProt consists of two parts. TrEMBL stores protein sequences obtained by translating regions of primary nucleotide sequence records that are annotated as coding sequence (CDS); these records are annotated automatically and have not been reviewed. Swiss-Prot is the other part of UniProt and stores manually curated, reviewed, high-quality protein records.

Warning: Information is liquid. Records change all the time: information is removed and added. Therefore, screenshots may not be up-to-date.


Many databases, such as GenBank, RefSeq protein and Ensembl, contain links to UniProt.

Searching UniProt directly is very similar to searching GenBank.
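The same kind of search can also be scripted. The sketch below only builds a search URL for the UniProt REST service; it does not contact the server. The function name `build_search_url` and its parameters are made up for this illustration, and the endpoint, query fields (`gene`, `organism_id`, `reviewed`) and column names are assumptions that should be checked against the current UniProt help pages before use.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed base URL of the UniProtKB search endpoint.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(gene, organism_id=None, reviewed_only=False, size=10):
    """Build a UniProtKB search URL for a gene name (illustrative helper)."""
    terms = [f"gene:{gene}"]
    if organism_id is not None:
        # NCBI taxonomy identifier, e.g. 9606 for human.
        terms.append(f"organism_id:{organism_id}")
    if reviewed_only:
        # Restrict the search to Swiss-Prot records.
        terms.append("reviewed:true")
    params = {
        "query": " AND ".join(terms),
        "format": "tsv",
        "fields": "accession,id,reviewed,gene_names,organism_name",
        "size": size,
    }
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


url = build_search_url("BRCA2", organism_id=9606)
print(url)
```

Pasting the printed URL into a browser should return a tab-separated table comparable to the search results page, provided the assumed field names are still valid.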

On the search results page it is very easy to distinguish Swiss-Prot and TrEMBL records: each record is marked as reviewed (Swiss-Prot) or unreviewed (TrEMBL).
Click the UniProt ID of human BRCA2 to go to the UniProt record. The left menu gives access to the different sections of the record, each containing a specific type of information.
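The same distinction is visible in sequences downloaded in FASTA format, where the header starts with `sp` for Swiss-Prot and `tr` for TrEMBL. A minimal sketch of reading that prefix follows; the function name `classify_fasta_header` is invented for this example, the headers are shortened, and the TrEMBL accession is a made-up placeholder.

```python
def classify_fasta_header(header):
    """Return (section, accession, entry_name) for a UniProtKB FASTA header."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header line")
    # Header layout: >db|ACCESSION|ENTRY_NAME free-text description
    db, accession, rest = header[1:].split("|", 2)
    entry_name = rest.split(" ", 1)[0]
    sections = {"sp": "Swiss-Prot (reviewed)", "tr": "TrEMBL (unreviewed)"}
    if db not in sections:
        raise ValueError(f"unknown database prefix: {db!r}")
    return sections[db], accession, entry_name


# Shortened header for human BRCA2.
print(classify_fasta_header(">sp|P51587|BRCA2_HUMAN Breast cancer type 2 susceptibility protein"))
# Hypothetical TrEMBL header; X0X0X0 is a placeholder, not a real record.
print(classify_fasta_header(">tr|X0X0X0|X0X0X0_HUMAN Placeholder protein"))
```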

Most items in the left menu lead to information that is displayed in the UniProt record itself; publications are shown on a separate page. Everyone knows the link between BRCA2 and breast cancer, but BRCA2 is also implicated in pancreatic cancer.

You can download protein sequences from UniProt. Many proteins have multiple splice variants, so these proteins are represented by multiple sequences that are identical over large regions. For most applications you simply need one sequence to represent a protein and this is why UniProt chooses one splice variant as the canonical one (in most cases the longest or the most prevalent).

When only one isoform is known, as is the case for BRCA2, its sequence is automatically considered the canonical one.

Go back to the results page of our search for BRCA2 and go to the record of human BRCC3 (the fourth link).

Suppose you are interested in splice variants.
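When isoform sequences are downloaded, each isoform is identified by the accession of the record followed by a dash and a number. The sketch below groups such identifiers per protein. The helper names are invented for this example, and P46736 is assumed here to be the accession of human BRCC3; the isoform numbers are illustrative inputs, not a statement of how many isoforms the record lists. The suffix alone does not tell you which isoform is canonical; that is stated in the record.

```python
import re
from collections import defaultdict

# Primary accession, optionally followed by "-<isoform number>".
ISOFORM_RE = re.compile(r"^(?P<primary>[A-Z0-9]+)(?:-(?P<iso>\d+))?$")


def split_isoform(accession):
    """Split 'P46736-2' into ('P46736', 2); a bare accession gives (acc, None)."""
    match = ISOFORM_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    iso = match.group("iso")
    return match.group("primary"), (int(iso) if iso else None)


def group_by_protein(accessions):
    """Collect the isoform numbers seen for each primary accession."""
    groups = defaultdict(list)
    for acc in accessions:
        primary, iso = split_isoform(acc)
        groups[primary].append(iso)
    return dict(groups)


# Illustrative input: one bare accession and two isoform identifiers.
print(group_by_protein(["P51587", "P46736-1", "P46736-2"]))
```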

You see that UniProt records contain an enormous amount of annotation. For the Swiss-Prot part of UniProt (the part this protein belongs to) these annotations are all curated.

Now it's up to you. Try to answer the next question using UniProt without peeking at the solution.


ID conversion

We have seen several sequence identifiers (accession numbers, gi numbers, UniProt IDs, ...). Often, we need to convert one ID into another, e.g. accession numbers of RefSeq nucleotide records to UniProt IDs. Luckily, UniProt has an ID mapping tool.

Suppose you are studying gene ATPA2 in different species. You have a list of gene accession numbers from GenBank:

X99952
AY028628
AY421754
HE963806
AY096713
AY519360
BT008584
AY423440
AY521529

Now, you want to analyse the corresponding protein products for these genes.

Click the Retrieve/ID mapping tab on the UniProt home page.
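For longer lists the mapping can also be submitted from a script. The following is a minimal sketch, not an official client: the function names are invented, and the endpoint URL and the database names `EMBL-GenBank-DDBJ` and `UniProtKB` are assumptions that should be verified against the list of databases currently supported by the ID mapping service. Only the form is built when the code runs; `submit_mapping_job` contacts the server and is defined but not called.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint for submitting an ID mapping job.
IDMAPPING_RUN = "https://rest.uniprot.org/idmapping/run"

# The GenBank accession numbers from the exercise.
GENBANK_ACCESSIONS = """
X99952 AY028628 AY421754 HE963806 AY096713
AY519360 BT008584 AY423440 AY521529
""".split()


def build_mapping_form(ids, source="EMBL-GenBank-DDBJ", target="UniProtKB"):
    """Clean the ID list and build the form fields for a mapping job."""
    cleaned = []
    for raw in ids:
        acc = raw.strip().upper()
        if acc and acc not in cleaned:  # drop blanks and duplicates, keep order
            cleaned.append(acc)
    return {"from": source, "to": target, "ids": ",".join(cleaned)}


def submit_mapping_job(form):
    """POST the form and return the job identifier (needs network access)."""
    data = urlencode(form).encode("ascii")
    with urlopen(Request(IDMAPPING_RUN, data=data), timeout=30) as response:
        return json.load(response)["jobId"]


form = build_mapping_form(GENBANK_ACCESSIONS)
print(form["ids"])
```

Calling `submit_mapping_job(form)` would return a job identifier; retrieving the results for that job is a separate request and is left out of this sketch.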

Via CLC Bio Main Workbench (VIB only)