UniProt
The UniProt database
UniProt consists of two parts. The TrEMBL database stores protein sequences translated from primary nucleotide sequence data that are annotated as coding sequence (CDS); these records are annotated automatically and are not reviewed. The Swiss-Prot database is the other part of UniProt: it stores manually curated (reviewed), high-quality protein records.
Warning: Information is liquid. Records change all the time: info is removed and added. Therefore, screenshots may not be up-to-date.
Many databases, such as GenBank, RefSeq protein and Ensembl, contain links to UniProt.
Searching UniProt directly is very similar to searching GenBank.
Search for the Swiss-Prot record of human BRCA2 |
---|
Type BRCA2 in the search box and click Search
The first link on the Results page is the Reviewed (aka SwissProt) record of human BRCA2. |
So on the search results page it's very easy to distinguish SwissProt and TrEMBL records.
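The same search can also be expressed as a URL for UniProt's REST service. The snippet below is a minimal sketch, not a description of what the website does internally: the helper name `build_search_url` is hypothetical, and the query fields `reviewed` and `organism_id` (with 9606 as the taxonomy ID for human) are not mentioned in this exercise and are assumptions about the current UniProt query syntax. It only builds the URL and sends nothing over the network.

```python
from urllib.parse import urlencode

# Assumed endpoint of the UniProt REST search service.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(term, reviewed=None, organism_id=None, fmt="tsv", size=25):
    """Build a UniProtKB search URL (hypothetical helper, no network access)."""
    clauses = [term]
    if reviewed is not None:
        # reviewed:true  -> Swiss-Prot records only
        # reviewed:false -> TrEMBL records only
        clauses.append("reviewed:true" if reviewed else "reviewed:false")
    if organism_id is not None:
        clauses.append(f"organism_id:{organism_id}")
    query = " AND ".join(f"({clause})" for clause in clauses)
    return UNIPROT_SEARCH + "?" + urlencode(
        {"query": query, "format": fmt, "size": size}
    )


# Reviewed (Swiss-Prot) records matching BRCA2 in human (taxonomy ID 9606).
url = build_search_url("BRCA2", reviewed=True, organism_id=9606)
print(url)
```

Pasting the printed URL into a browser should return a tab-separated table of matching records. As on the website, the free-text term matches any record that mentions BRCA2, not only the BRCA2 record itself.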
Click the UniProt ID of human BRCA2 to go to the UniProt record. The left menu gives access to different sections of the record each containing a specific type of information.
How many scientific papers are linked to BRCA2 in UniProt ? |
---|
Scroll the left menu and click to view the Publications field.
There are over 1500 publications on this protein |
Publications are shown on a separate page, but the left menu also gives access to information that is displayed within the UniProt record itself. The link between BRCA2 and breast cancer is well known, but BRCA2 is also implicated in pancreatic cancer.
What is the mutation that is linked to pancreatic cancer? |
---|
The information that we are looking for is located in the Pathology and Biotech section.
When you scroll down the table of variants you see more info on the link BRCA2 - pancreatic cancer. |
You can download protein sequences from UniProt. Many proteins have multiple splice variants, so these proteins are represented by multiple sequences that are identical over large regions. For most applications you simply need one sequence to represent a protein and this is why UniProt chooses one splice variant as the canonical one (in most cases the longest or the most prevalent).
Download the canonical sequence in FASTA format |
---|
You can download the sequence via your Basket but you can also go to the Sequences section, where you can use the FASTA button:
|
When only one isoform is known, as is the case for BRCA2, its sequence is automatically considered the canonical one.
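Once a FASTA file has been downloaded, its header lines already show whether a record comes from Swiss-Prot or TrEMBL: UniProt headers start with `>sp|` for Swiss-Prot and `>tr|` for TrEMBL, followed by the accession and the entry name. The sketch below parses such a file. The function name `parse_uniprot_fasta` is hypothetical, and the two example records are invented for illustration; they are not real UniProt entries.

```python
def parse_uniprot_fasta(text):
    """Parse UniProt-style FASTA text into a list of record dicts."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header layout: >db|accession|entry_name description
            ident, _, description = line[1:].partition(" ")
            db, accession, entry_name = ident.split("|", 2)
            records.append({
                "reviewed": db == "sp",  # sp = Swiss-Prot, tr = TrEMBL
                "accession": accession,
                "entry_name": entry_name,
                "description": description,
                "sequence": "",
            })
        elif records:
            # Sequence lines are concatenated onto the most recent record.
            records[-1]["sequence"] += line
    return records


# Invented records: the header layout follows UniProt, the content does not
# correspond to any real protein.
example = (
    ">sp|P00000|DEMO_HUMAN Hypothetical demo protein "
    "OS=Homo sapiens OX=9606 GN=DEMO PE=1 SV=1\n"
    "MKTAYIAKQR\n"
    "QISFVK\n"
    ">tr|A0A000XXX0|A0A000XXX0_HUMAN Hypothetical unreviewed protein "
    "OS=Homo sapiens OX=9606 GN=DEMO PE=4 SV=1\n"
    "MKTAYI\n"
)

for record in parse_uniprot_fasta(example):
    source = "Swiss-Prot" if record["reviewed"] else "TrEMBL"
    print(record["accession"], source, len(record["sequence"]))
```

For anything beyond a quick check, an established parser such as the one in Biopython is usually a better choice than a hand-written loop.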
Go back to the results page of our search for BRCA2 and go to the record of human BRCC3 (the fourth link).
Why do we get this record when we search for BRCA2 ? |
---|
When you search for BRCA2 in the record (with Find) you see that BRCC3 interacts with BRCA2 and BRCA1. The record thus contains the word BRCA2, and this is what you searched for. |
Download the canonical sequence in FASTA format ? |
---|
Go to the Sequences section. You see that BRCC3 has 5 isoforms and thus 5 sequences. Isoform 2 was chosen as the canonical one. |
Suppose you are interested in splice variants.
Would it be possible to download the sequences of all isoforms of a protein from SwissProt ? |
---|
Yes, you save the protein in your basket and select "FASTA (canonical & isoforms)" as download format. |
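In a "canonical & isoforms" download, the additional isoforms are typically identified by the entry's accession followed by a hyphen and the isoform number, while the canonical sequence carries the plain accession. The sketch below groups a list of accessions on that basis. It assumes this naming convention holds for the file at hand; the function name `group_isoforms` and the accession `P00000` are made up for illustration.

```python
import re
from collections import defaultdict

# Plain accession, optionally followed by "-<isoform number>".
ISOFORM_RE = re.compile(r"^(?P<base>[A-Z0-9]+)(?:-(?P<number>\d+))?$")


def group_isoforms(accessions):
    """Group accessions by entry; record which isoform numbers are present."""
    groups = defaultdict(lambda: {"canonical": False, "isoforms": []})
    for accession in accessions:
        match = ISOFORM_RE.match(accession)
        if match is None:
            raise ValueError(f"Unexpected accession format: {accession!r}")
        entry = groups[match["base"]]
        if match["number"] is None:
            # No suffix: this is the canonical sequence of the entry.
            entry["canonical"] = True
        else:
            entry["isoforms"].append(int(match["number"]))
    return dict(groups)


# Hypothetical accessions, as they might appear in the FASTA headers.
print(group_isoforms(["P00000", "P00000-1", "P00000-3"]))
```

Note that the canonical sequence is not necessarily isoform 1, so the isoform numbers found with a suffix may skip the number of the canonical isoform.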
Would it be possible to download the corresponding mRNA isoform sequences from SwissProt ? |
---|
No, SwissProt only contains protein sequences. For this you need to go to RefSeq. RefSeq contains a single record for each gene sequence, links this gene record to the records of all transcript sequences (splice variants) the gene gives rise to and links each transcript record to the corresponding protein record. |
You see that UniProt records contain an enormous amount of annotation. For the Swiss-Prot part of UniProt (the part this protein belongs to) these annotations are all curated.
Now it's up to you. Try to answer the next question using UniProt without peeking at the solution.
According to UniProt, what is the link between human F9 and hemophilia? |
---|
Go to the UniProt/SwissProt record. In the Pathology & biotech field you can find the info. |
ID conversion
We have seen several sequence identifiers (accession numbers, gi numbers, UniProt IDs, ...). Often, we need to convert one ID into another, e.g. accession numbers of RefSeq nucleotide records to UniProt IDs. Luckily, UniProt has an ID mapping tool.
Suppose you are studying gene ATPA2 in different species. You have a list of gene accession numbers from GenBank.
X99952 AY028628 AY421754 HE963806 AY096713 AY519360 BT008584 AY423440 AY521529
Click the Retrieve/ID mapping tab on the UniProt home page.
Download the corresponding UniProt protein sequences for these gene accessions ? |
---|
On top of the results table click the Download link. Select FASTA (canonical) format and there you go... |
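For longer lists, the mapping can also be submitted programmatically. The following is a rough sketch under several assumptions that are not part of this exercise: that the ID mapping service is reachable at `https://rest.uniprot.org/idmapping/run`, that it accepts the form fields `from`, `to` and `ids`, that `EMBL-GenBank-DDBJ` and `UniProtKB` are valid database names for those fields, and that the response is JSON containing a `jobId`. The helper names are hypothetical. Only the request body is built when the code runs; `submit_job` is defined but not called.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint of the UniProt ID mapping service.
IDMAPPING_RUN = "https://rest.uniprot.org/idmapping/run"

# The GenBank accessions from the exercise.
GENBANK_ACCESSIONS = (
    "X99952 AY028628 AY421754 HE963806 AY096713 "
    "AY519360 BT008584 AY423440 AY521529"
)


def build_idmapping_form(accessions, source_db="EMBL-GenBank-DDBJ",
                         target_db="UniProtKB"):
    """Build the form fields for an ID mapping job (no network access)."""
    ids = accessions.split() if isinstance(accessions, str) else list(accessions)
    unique_ids = list(dict.fromkeys(ids))  # drop duplicates, keep order
    return {"from": source_db, "to": target_db, "ids": ",".join(unique_ids)}


def submit_job(form):
    """POST the form and return the job ID (assumes a JSON reply with 'jobId')."""
    request = Request(IDMAPPING_RUN, data=urlencode(form).encode("ascii"))
    with urlopen(request) as response:
        return json.load(response)["jobId"]


form = build_idmapping_form(GENBANK_ACCESSIONS)
print(form)
# job_id = submit_job(form)  # uncomment to contact the service
```

A submitted job runs asynchronously, so a real script would still have to wait for it to finish and then fetch the results, for example in FASTA format, before the sequences are available.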
Via CLC Bio Main Workbench (VIB only)