Sequence similarity searches: BLAST, OrthoMCL...

From BITS wiki
Jump to: navigation, search
Go to parent Basic bioinformatics concepts, databases and tools#Exercises_during_the_training


We will now demonstrate the use of NCBI's BLAST tool. Although many websites offer the BLAST tool, like UniProt, the version from NCBI is the most complete, offers the largest variety of databases to search in, the largest set of parameters to refine your search.

Exercise 1: peptide from poorly growing E. coli

A colleague at VIB has an E. coli culture growing very poorly. Can you find whether this novel peptide, that was found in the cell lysate of the culture, might have something to do with the poor growth?
This is the protein sequence:
>supernatant peptide

A good strategy is to first search the most curated protein database: SwissProt (which contains only well annotated proteins) and see if the best hits give us a clue about what the peptide could be.

Go to the protein Blast form.

Depending on your parameter choices and the usage of the BLAST server, results might appear quicker or slower. Blastp is in general fast.

The Alignments section provides all the details for each hit: the alignments of all the matching parts with scores...

Click the accession number to go to the Protein record of this protein. You see that this protein is coded by a gene of bacteriophage lambda. The protein is incorporated into the outer membrane of the host during infection.

The hit is not 100% identical to the query. So we know that the query is not derived from bacteriophage lambda. However, we might assume with a good degree of confidence that the query peptide is a fragment of a protein encoded by a phage and that we are confronted with a contaminated E. coli culture.

Exercise 2: Locust essential genes

Locusta (grass hopper or locust) can be really devastating to crops (See this video for an example). Every time a swarm lands, a real disaster happens to the farmers. While chatting at the bar the other evening, we suddenly have this idea. There is so much known of the model insect Drosophila : could we use this information to develop something to prevent locusts from causing such damage? For example, by silencing some essential gene in Locusta?
The hypothesis: a transgene in the crop produces double-stranded RNA against an essential gene of Locusta. By eating the crop, we hope that the double-stranded RNA enters the cells of the grass hopper, where it silences the gene. Silencing this gene should disturb the physiological processes to such an extent that vitality of the insects is reduced, leading to less damage.
First we need info on crucial genes in Drosophila: are there any crucial genes? I did some research and came up with this quote:

The epigenetic regulation of gene expression by the covalent modification of histones is a fundamental mechanism required for the proper differentiation of germ line cells during development. Trimethylation of histone 3 lysine 9 (H3K9me3) leads to chromatin silencing and the formation of heterochromatin by recruitment of heterochromatin protein 1 (HP1). SETDB1/Eggless (Egg) is the only essential H3K9 methyltransferase in Drosophila.'

Perhaps this information is worth to continue with?

The next part is for you. We need to identify this SETDB1/Eggless (Egg) gene in Locusts.

Copy the Refseq accession to use as input in BLAST.

Go to the protein Blast form.

As you can see you don't have to use a sequence or fasta file as input, accession numbers also work. That being said, it only works with Genbank/RefSeq accession numbers not with accession numbers of external databases like UniProt.

So we didn't find any homologs in grasshopper. Does it mean that there are no homologs in grasshopper?

Not necessarily, the search results do show that we might need to search for a homolog in other sequence data from Locusta, not contained in the nr database!

First of all, we need to check if a genome sequence is available for Locusta. Genome sequences are stored in NCBI's Genome database.

Our search generates a record of the locust genome sequencing project.

This means that in the future, the genome sequence will become available but in the meantime we have to use other sequence information for Locusta.

We have seen that ESTs (expressed sequence tags) represent parts of the transcriptome of an organism. Due to the relative ease of sequencing ESTs, they were often referred to as the poor man's genome, in the time when sequencing was still very expensive. We also know that EST sequences are not part of the nr database so we haven't searched them in our previous BLAST search.

So these ESTs could perhaps contain sequence information of the SETDB1/Eggless gene. To test this we have to repeat our BLAST search but now on the EST database.

You get back one hit with a very high E-value. So your academic future looks bleak.

However, you realise that NCBI has recently increased the default seed length for protein sequence searches from 3 to 6 increasing the stringency of their searches. In a final attempt to save your research topic, you decide to give it a try with the "old" seed length of 3.

Joy! One hit found with a relatively good E-value of 2E-07.


Now we can retrieve the EST sequence, develop primers and clone the complete gene. The start of a beautiful story...

Exercise 3: low complexity protein

Download the protein sequence of chicken Histone H1. The amino acid composition of this protein is biased towards A, P and K. Such proteins are called low complexity proteins. These low complexity proteins are abundant in nature because their low complexity allows them to bind to many different interaction partners.

The highest scoring hit is of course the chicken histone H1 sequence itself.

The alignment only shows the aligned regions. To look at the neighbouring amino acids you need to download the full sequence from the RefSeq record.

In the Refseq record we search the next 30 amino acids in the zebra finch sequence. These are positions 124-154:

taaakpk kpaakkpasa akkpkkaaav kks

Compare this sequence to the corresponding region in the chicken sequence:


You see that these sequences are still very similar but because of the low complexity in this region (it consists almost solely of A, P and K) alignment was not performed here.

Because of their repetitive nature low complexity proteins are very difficult to align. Highly repetitive regions can give rise to false positives in aligments. BLAST therefore has a filter that excludes low complexity regions from alignments. The filter is on by default in nucleotide BLAST and off by default in protein BLAST.


Now the Taeniopygia guttata sequence has a higher bits score than the original chicken sequence. Remarkably, the orginal chicken H1.11L sequence is only the fourth hit in the list.

Go to the actual alignments. The region between position 44 and 121 is the only part of the chicken sequence that is used for the alignment. The other parts are too repetitive and have been filtered out. Even in the region that was withheld for alignment a part was filtered out (from position 60 to 75). This part is shown in small grey letters (red box in figure).


Sequence similarity info in Ensembl

It's not always necessary to do a sequence similarity search to find homologs of your favourite gene. Many databases have already done this for you and present the results.

Go to Ensembl. Search the human F9 (Coagulation factor IX Precursor) gene (see exercises on Ensembl for details).

The Location page

Click the Location tab to go to the Location page. The most detailed view of the Location page has many tracks that represent conservation.

The Gene page

Click the Gene tab to go to the Gene page.

You want to investigate this further in mouse, so the first thing you need to check is whether a mouse ortholog for the human F9 gene exists.

You can choose orthologs from many species. All F9 orthologs as identified by Ensembl are shown in the Selected orthologues table at the bottom of the page. The table can be exported as an Excel spreadsheet by clicking on the Excel icon. You can also specify the species of interest in he Summary of orthologues of this genetable or by clicking the Configure this page button in the left menu


The Selected orthologues section at the bottom of the page now displays rodent orthologs. There is indeed a mouse orthologue with Ensembl ID ENSMUSG00000031138.


The number of species for each ortholog type is shown in the Summary table. Ortholog types are assigned by comparing two species:

  • 1-to-1 orthologs: only one copy is found in each species
  • 1-to-many orthologs: one gene in one species is orthologous to multiple genes in another species
  • Many-to-many orthologs: multiple orthologs are found in both species

The scores shown in the table:

  • Target %ID in the Selected orthologues section: percent of identical amino acids in the ortholog compared with the gene of interest (human F9)
  • Query %ID: percent of identical amino acids in the gene of interest (human F9) compared with the ortholog.

Note that Ensembl uses ClustalW to make alignments.

The light green bars between the two graphs indicate the position of conserved regions. You see that there is also a lot of conservation in the intronic regions.


Go back to the Selected orthologues table on the Gene page.

The view on the Location page showed strong conservation in the introns. You will not see this on a protein or cDNA alignment, for this you need a genomic algnment.

Via Configure this page in the left menu you can change the appearance of the alignment.

In the next exercise we will use the OrthoMCL database (a database of orthologs) to find paralogs and orthologs but many genes do not have an entry in OrthoMCL. The most complete alternative collection of precomputed orthologs is Ensembl so it's a good idea to start here.

Dedicated databases of groups of orthologues

OrthoMCL database

OrthoMCL is a database that contains groups of orthologs. You can search for orthologs by doing a text-based (red box on figure) or a sequence-based (green box on figure) search.


For multiple sequence alignment, we need a FASTA file containing sequences of orthologs. Note that you should take sequences from organisms closely related to your protein of interest. To improve the value of the alignment (to make it more generally applicable) you should incorporate one outgroup (= a sequence from a more distantly related organism).

If you have sequenced an organism, you can use the OrthoMCL website to upload a full proteome (a list of all protein sequences in an organism) and map the uploaded proteins to existing groups in OrthoMCL.