Searching NCBI's Proteins database

From BITS wiki

Warning: Information is liquid. Records change all the time as information is added and removed, so screenshots may not be up to date.

The Protein database is a collection of protein sequences from several sources, including translations of annotated coding regions in GenBank and RefSeq, as well as records from curated, high-quality protein databases such as Swiss-Prot. Because it contains translations of GenBank sequences, all sequence errors in GenBank are propagated to the Protein database. Fortunately, you can always check the source of the information in a Protein record.
NCBI's Protein database also contains protein sequences from the PDB database, a protein structure database that holds protein annotation, sequences and 3D structures. Although PDB is a curated, high-quality database, it contains a lot of data on synthetic proteins. In many cases protein structures are determined from crystallized forms of the protein, and small mutations are introduced to improve crystallization. As a result, the protein sequences found in PDB can deviate slightly from the sequences of the proteins as they occur in nature.
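
Because the origin of a record matters so much, it can help to restrict a search to one source database. The sketch below is only an illustration and is not part of the original exercise: it builds an Entrez search term and the matching E-utilities ESearch URL. The helper names (build_term, esearch_url), the example query and the short keys in SOURCE_FILTERS are hypothetical; the srcdb_...[PROP] property terms are the ones we believe Entrez uses for the Protein database, so verify them in the Entrez help before relying on them.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumed Entrez property terms for the source database of a Protein record.
# The dictionary keys are hypothetical short names chosen for this sketch.
SOURCE_FILTERS = {
    "refseq": "srcdb_refseq[PROP]",
    "swissprot": "srcdb_swiss-prot[PROP]",
    "pdb": "srcdb_pdb[PROP]",
}


def build_term(query, source=None):
    """Return the query, optionally restricted to one source database."""
    if source is None:
        return query
    return f"({query}) AND {SOURCE_FILTERS[source]}"


def esearch_url(term, retmax=20):
    """Build an ESearch URL that searches the Protein database for term."""
    params = {"db": "protein", "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Hypothetical example query: human insulin records that come from PDB.
term = build_term("insulin AND human[ORGN]", source="pdb")
url = esearch_url(term)
```

Pasting the resulting term into the search box of the Protein database should give the same restriction as the URL does, which makes it easy to compare, for instance, the PDB-derived and Swiss-Prot-derived records for one protein.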

Exercise 1: heaviest human protein

Go to the GenBank record of the heaviest isoform.

Gene is a database that contains more information on the function of the protein.

Exercise 2: downloading multiple sequences in FASTA format

Make sure that all filters are cleared before you start the exercise.

This returns a set of 10 transcription factors.
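
The same download can also be scripted. The following is a minimal sketch, assuming the accessions of the records are already known: it builds an E-utilities EFetch URL that requests several Protein records in FASTA format and splits the returned text into individual records. The function names (efetch_fasta_url, parse_fasta, download_fasta) are hypothetical helpers written for this example, and the accessions passed in are placeholders that you would replace with the ones from your own search.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities EFetch service.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(ids):
    """Build an EFetch URL requesting the given Protein records as FASTA text."""
    params = {
        "db": "protein",
        "id": ",".join(ids),   # several records are requested as a comma-separated list
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EUTILS_EFETCH}?{urlencode(params)}"


def parse_fasta(text):
    """Split multi-FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def download_fasta(ids):
    """Fetch the records over the network and return the parsed FASTA entries."""
    with urlopen(efetch_fasta_url(ids)) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

Calling download_fasta with the list of accessions from your result set should return one (header, sequence) pair per record, so for the set of 10 transcription factors you would expect 10 pairs. NCBI asks users to limit the rate of E-utilities requests, so avoid calling it in a tight loop.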