Sequence similarity searches: BLAST, OrthoMCL...

Go to parent Basic bioinformatics concepts, databases and tools#Exercises_during_the_training

BLAST

We will now demonstrate the use of NCBI's BLAST tool. Although many websites offer the BLAST tool, like UniProt, the version from NCBI is the most complete, offers the largest variety of databases to search in, the largest set of parameters to refine your search.

Exercise 1: peptide from poorly growing E. coli

A colleague at VIB has an E. coli culture growing very poorly. Can you find whether this novel peptide, that was found in the cell lysate of the culture, might have something to do with the poor growth?
This is the protein sequence:
>supernatant peptide
AASVAVDIAYEGDWKTDGFMIGVGYKF

A good strategy is to first search the most curated protein database: SwissProt (which contains only well annotated proteins) and see if the best hits give us a clue about what the peptide could be.

Which BLAST version are you going to use ?
Since we want to compare a protein sequence against a protein database: we have to use the Protein Blast.

Go to the protein Blast form.

Where are you going to paste the sequence of the cell lysate peptide ?
Provide the sequence in the Enter Query Sequence section.

Define the database you want to search in.
In the Choose Search Set section, several parameters are available to specify the database you want to search in: Database- you can choose a database from the drop down list: for our search we select UniProtKB/Swiss-Prot (swissprot) Organism - you can provide an organism name to restrict the database that you selected above to sequences from this organism only. We are not going to use this for our search. Entrez (GQuery) query - you can enter an Entrez (GQuery) query here to restrict the search only to the sequences that are returned by this query. We will not use this for our search

Which Blast algorithm are you going to use?
In the Program Selection section we want to use the normal blastp (protein-protein BLAST) algorithm. This is the default. Remember you can click the ? for help.

Run Blast to check where the peptide comes from
Run BLAST by clicking .

Depending on your parameter choices and the usage of the BLAST server, results might appear quicker or slower. Blastp is in general fast.

Look at the Graphic summary on the Blast results page, how many good hits do you have ?
The BLAST result page is a single page with all information. In the Graphics Summary section a graphical overview of the BLAST results is displayed, with each hit from the database represented by a colored line. The color represents the bit score (Color Key above the table). We have a good hit, represented by a green bar. The query sequence is schematically represented by the blue bar at the top, to show where the hits match the query sequence. If you click a colored line, a description line and scores appear.

Look at the Desciptions section, does the best hit match the entire sequence of the query?
The Descriptions section contains a table summary of the hits. The description of the hit is linked to the actual alignment of hit and query sequence. The last column contains the accession number and links to the record of the hit in the Protein database. In the center columns, you find: maximum score: each hit is displayed only once but a hit can have multiple parts that match the query. The maximum score displays the score of the highest scoring part. total score: the scores of all parts of a hit that match the query are summed and reported query coverage: the percentage of the query sequence that is covered by the alignment E-value: a statistical score to assess whether the hit is significant ident: percentage of identical residues in the alignment So the alignment is made over the complete sequence of the query (coverage = 100%), but that doesn't mean that the hit and the query are completely identical (ident = 74%).

The Alignments section provides all the details for each hit: the alignments of all the matching parts with scores...

Go to the Alignment of the seventh hit, how many regions match between the hit and the query?
The query sequence is indicated by Query, the hit as Sbjct. The seventh hit has four regions that are similar to the query sequence. That's why the total bit score is much higher than the max bit score.

Go to the Description of the seventh hit, how many matches do you see here between the hit and the query?
In the Descriptions section, only the highest score is reported under Max score. You see that there have to be additional matches because the total score is higher than the max score.

Go to the Description of the best hit, what is the protein accession of this hit?
The best hit has protein accession P03701.

Click the accession number to go to the Protein record of this protein. You see that this protein is coded by a gene of bacteriophage lambda. The protein is incorporated into the outer membrane of the host during infection.

Go to the Alignment of the best hit, which part of the hit matches the query?
The query matches the COOH-terminal part of the hit. You can see this by comparing the length of the hit with the numbering on the alignment.

The hit is not 100% identical to the query. So we know that the query is not derived from bacteriophage lambda. However, we might assume with a good degree of confidence that the query peptide is a fragment of a protein encoded by a phage and that we are confronted with a contaminated E. coli culture.

Exercise 2: Locust essential genes

Locusta (grass hopper or locust) can be really devastating to crops (See this video for an example). Every time a swarm lands, a real disaster happens to the farmers. While chatting at the bar the other evening, we suddenly have this idea. There is so much known of the model insect Drosophila : could we use this information to develop something to prevent locusts from causing such damage? For example, by silencing some essential gene in Locusta?
The hypothesis: a transgene in the crop produces double-stranded RNA against an essential gene of Locusta. By eating the crop, we hope that the double-stranded RNA enters the cells of the grass hopper, where it silences the gene. Silencing this gene should disturb the physiological processes to such an extent that vitality of the insects is reduced, leading to less damage.
First we need info on crucial genes in Drosophila: are there any crucial genes? I did some research and came up with this quote:

The epigenetic regulation of gene expression by the covalent modification of histones is a fundamental mechanism required for the proper differentiation of germ line cells during development. Trimethylation of histone 3 lysine 9 (H3K9me3) leads to chromatin silencing and the formation of heterochromatin by recruitment of heterochromatin protein 1 (HP1). SETDB1/Eggless (Egg) is the only essential H3K9 methyltransferase in Drosophila. http://www.ncbi.nlm.nih.gov/pubmed/19750210.'

Perhaps this information is worth to continue with?

The next part is for you. We need to identify this SETDB1/Eggless (Egg) gene in Locusts.

Retrieve the RefSeq accession number of the SETDB1/Eggless protein from Drosophila.
We will query the Protein database: Go to the NCBI home page. Set database to Protein Paste SETDB1/Eggless in the search term field The first result is the Refseq Drosophila sequence (recognizable by the underscore in the accession number).

Copy the Refseq accession to use as input in BLAST.

Which BLAST version are you going to use ?
Since we want to compare a protein sequence against a protein database: we will use the Protein Blast.

Go to the protein Blast form.

Where are you going to paste the accession ?
Provide the accession in the Enter Query Sequence section.

As you can see you don't have to use a sequence or fasta file as input, accession numbers also work. That being said, it only works with Genbank/RefSeq accession numbers not with accession numbers of external databases like UniProt.

Define the database you want to search in.
We will use the nr database (the default). nr is a database of all non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from Whole Genome Sequencing projects. In other words this is the most complete protein sequence database we can search in. However, since we are only interested in grasshopper sequences we are going to limit the search to grasshopper sequences by typing grasshoppers and locusts in the Organism box (if you start typing grasshoppers the form should automatically give suggestions).

Which BLAST algorithm are you going to use?
In the Program Selection section we want to use the normal blastp (protein-protein BLAST) algorithm. This is the default.

What is the result of the BLAST search?
Run BLAST by clicking . The Blast results page notifies: No significant similarity found.

So we didn't find any homologs in grasshopper. Does it mean that there are no homologs in grasshopper?

How many protein sequences are available for grasshopper(Locusta)?
Go to the home page of the Protein database: Go to the NCBI home page. Set database to Protein Leave the search term field empty Click Search Do an Advanced search for locusta in the Organism field. This search returns more than 3000 results, which is a pretty limited set of proteins (repeating the search with drosophila gives almost a million records).

Not necessarily, the search results do show that we might need to search for a homolog in other sequence data from Locusta, not contained in the nr database!

First of all, we need to check if a genome sequence is available for Locusta. Genome sequences are stored in NCBI's Genome database.

Find the Locusta record in the Genome database
Go to the home page of the Genome database: Go to the NCBI home page. Set database to Genome Leave the search term field empty Click Search To answer this question, do an Advanced search for Locusta[Organism].

Our search generates a record of the locust genome sequencing project.

Check in the Genome record if the sequence of the genome is available.
Clicking the link to the corresponding Genbank record we find that although the sequencing project has been initiated, the sequences are not yet in Genbank: The sequence is stored in the Trace archive, the NCBI database that holds incomplete genomes that are still in the process of being sequenced/assembled.

This means that in the future, the genome sequence will become available but in the meantime we have to use other sequence information for Locusta.

Do a GQuery search to check if other sources of sequence information are available for Locusta
To solve this, you could perform a GQuery search for Locusta[Organism] by changing the Genome database into All databases and clicking Search. In the Genes section on the results page you do see that there are more than 45000 EST sequences available so maybe we can use these.

We have seen that ESTs (expressed sequence tags) represent parts of the transcriptome of an organism. Due to the relative ease of sequencing ESTs, they were often referred to as the poor man's genome, in the time when sequencing was still very expensive. We also know that EST sequences are not part of the nr database so we haven't searched them in our previous BLAST search.

So these ESTs could perhaps contain sequence information of the SETDB1/Eggless gene. To test this we have to repeat our BLAST search but now on the EST database.

Which BLAST algorithm are you going to use ?
We want to compare a protein sequence, eggless of Drosophila, against the EST sequences of Locusta. So we have to use tblastn to search a translated nucleotide database with a protein sequence.

Can you find an EST sequence in Locusta that is similar to the eggless protein ?
Paste the eggless accession number (NP_611966.3) in the Enter Query Sequence field Select as database Expressed sequence tags (est) Add as organism Locusta Click BLAST

You get back one hit with a very high E-value. So your academic future looks bleak.

However, you realise that NCBI has recently increased the default seed length for protein sequence searches from 3 to 6 increasing the stringency of their searches. In a final attempt to save your research topic, you decide to give it a try with the "old" seed length of 3.

Repeat the search setting the seed length to 3 ?
So you go back to the BLAST form. Enter the accession number NP_611966.3 as the query sequence Select as database Expressed sequence tags (est) Add as organism Locusta Click Algorithm parameters to expand the parameters of the BLAST Go to word size and set it to 3 Click BLAST

Joy! One hit found with a relatively good E-value of 2E-07.

Now we can retrieve the EST sequence, develop primers and clone the complete gene. The start of a beautiful story...

Exercise 3: low complexity protein

Download the protein sequence of chicken Histone H1. The amino acid composition of this protein is biased towards A, P and K. Such proteins are called low complexity proteins. These low complexity proteins are abundant in nature because their low complexity allows them to bind to many different interaction partners.

What is the highest scoring BLAST hit in the RefSeq database ?
We are going to blast a protein sequence to a protein database so we are going to use the protein blast blastp upload the chicken histone sequence select the refseq_protein database to search in select to exclude model sequences

The highest scoring hit is of course the chicken histone H1 sequence itself.

What is the coverage of the second best hit ?
Alignment to the uncharacterized protein of zebra finch is done over 41% (red box on figure) of the chicken histone sequence.

Go to the actual alignment with the second best hit.
Go to the alignment by clicking on the uncharacterized protein LOC100220586 [Taeniopygia guttata] link (green box on figure) in the BLAST Descriptions table:

The alignment only shows the aligned regions. To look at the neighbouring amino acids you need to download the full sequence from the RefSeq record.

Download the full sequence of the second best hit.
Click the RefSeq link: ref\|NP_001232807.1

In the Refseq record we search the next 30 amino acids in the zebra finch sequence. These are positions 124-154:

taaakpk kpaakkpasa akkpkkaaav kks

Compare this sequence to the corresponding region in the chicken sequence:

You see that these sequences are still very similar but because of the low complexity in this region (it consists almost solely of A, P and K) alignment was not performed here.

Because of their repetitive nature low complexity proteins are very difficult to align. Highly repetitive regions can give rise to false positives in aligments. BLAST therefore has a filter that excludes low complexity regions from alignments. The filter is on by default in nucleotide BLAST and off by default in protein BLAST.

Turn the low complexity filter turned on and redo the blast
To turn on the filter: Expand the Algorithm parameters section Select Low complexity regions in the Filters and Masking section

Now the Taeniopygia guttata sequence has a higher bits score than the original chicken sequence. Remarkably, the orginal chicken H1.11L sequence is only the fourth hit in the list.

Go to the actual alignments. The region between position 44 and 121 is the only part of the chicken sequence that is used for the alignment. The other parts are too repetitive and have been filtered out. Even in the region that was withheld for alignment a part was filtered out (from position 60 to 75). This part is shown in small grey letters (red box in figure).

Why is the bit score of the Taeniopygia sequence higher than that of the highest scoring chicken sequence ?
At first sight this seems very strange. Bit scores are calculated based on: the alignment the scoring system that was used We can clearly see that both alignments are exactly the same: they have the same identities (78/78) positives (78/78) gaps (0/78) But they do result in different bit scores. So the difference must lie in the scoring system. Both alignments were scored by the BLOSUM62 matrix, but compositional matrix adjustment was performed (green box in figure). To compare proteins with biased compositions, the scores in the BLOSUM matrices are not optimal. The scores must be adjusted taking into account the amino acid composition of the two sequences. To compare proteins with markedly biased compositions, the adjusted BLOSUM matrices yield alignments that are in better agreement with structural evidence. So differences in compsition between the chicken and the Taeniopygia sequence lead to different adjusted scores in the BLOSUM62 matrix which in its turn leads to different bit scores.

Sequence similarity info in Ensembl

It's not always necessary to do a sequence similarity search to find homologs of your favourite gene. Many databases have already done this for you and present the results.

Go to Ensembl. Search the human F9 (Coagulation factor IX Precursor) gene (see exercises on Ensembl for details).

The Location page

Click the Location tab to go to the Location page. The most detailed view of the Location page has many tracks that represent conservation.

How to display the conservation tracks?
Click Configure this page In the Comparative genomics section click Conservation regions These tracks will show regions that are conserved across 31 vertebrate species or 67 mammal species. Conserved regions are displayed as dark bands. You can click these bands to see the conservation score and the exact location of the region. Of course, the exons are the most conserved parts of the gene but there are also conserved regions upstream and in the introns reflecting a possible regulatory function.

The Gene page

Click the Gene tab to go to the Gene page.

Find paralogs of F9
In the left menu go to the Comparative Genomics section and click Paralogues The human genome contains an elaborate set of paralogs of F9. This makes sense: click the Phenotypes link in the left menu, mutations in F9 are linked to hemophilia B, a monogenic disease. According to Chen et al., 2013 human monogenic disease genes are known to have many functionally redundant paralogs.

How to download the protein sequences of the paralogs ?
Click the Download paralogues button. This opens a window where you can specify what you want to download and what kind of output you want. Choose FASTA format (a standard format for sequences) Choose to download unaligned protein sequences Click Download

You want to investigate this further in mouse, so the first thing you need to check is whether a mouse ortholog for the human F9 gene exists.

Find the link in the left menu that will show you the info you need.
Go to the left menu and click Orthologues in the Comparative Genomics section.

You can choose orthologs from many species. All F9 orthologs as identified by Ensembl are shown in the Selected orthologues table at the bottom of the page. The table can be exported as an Excel spreadsheet by clicking on the Excel icon. You can also specify the species of interest in he Summary of orthologues of this genetable or by clicking the Configure this page button in the left menu

.

Show the orthologs in rodents
In the Orthologues section go to the Summary of orthologues of this gene. Check the Show details box for Rodents to display the orthologs for rodents in the Selected orthologues section.

The Selected orthologues section at the bottom of the page now displays rodent orthologs. There is indeed a mouse orthologue with Ensembl ID ENSMUSG00000031138.

The number of species for each ortholog type is shown in the Summary table. Ortholog types are assigned by comparing two species:

1-to-1 orthologs: only one copy is found in each species
1-to-many orthologs: one gene in one species is orthologous to multiple genes in another species
Many-to-many orthologs: multiple orthologs are found in both species

The scores shown in the table:

Target %ID in the Selected orthologues section: percent of identical amino acids in the ortholog compared with the gene of interest (human F9)
Query %ID: percent of identical amino acids in the gene of interest (human F9) compared with the ortholog.

Note that Ensembl uses ClustalW to make alignments.

How to display the orthologous mouse and human region underneath each other?
In the Selected orthologues section click Compare regions in the row of the mouse ortholog to put the graphical displays for mouse and human underneath each other on the Locations page.

The light green bars between the two graphs indicate the position of conserved regions. You see that there is also a lot of conservation in the intronic regions.

Go back to the Selected orthologues table on the Gene page.

Find the link in the table to display the cDNA or protein alignment of human F9 and its mouse ortholog.
You can ask for an alignment of the cDNAs or the proteins via the View Sequence Alignment link.

The view on the Location page showed strong conservation in the introns. You will not see this on a protein or cDNA alignment, for this you need a genomic algnment.

Find the link in the left menu of the Gene page to display genomic alignments
To view alignments of genomic sequences: Go to the Gene page of F9 Click Genomic alignments in the Comparative Genomics section of the left menu to load the alignments info in the main window

View the pairwise alignment of human F9 and the mouse ortholog
Click the Select an alignment button in the bottom section of the page. This opens a window where you can choose between multiple and pairwise alignments. Click the PARWISE button. This opens a menu where you can make your choice Select RODENTS & LAGOMORPHS. Select MICE. Select Mouse (reference). Click Apply to display the alignment of the human F9 genomic sequence and that of the mouse in the main window.

Via Configure this page in the left menu you can change the appearance of the alignment.

Color start and stop codon, color the conservation, display 60bp per row and genomic coordinates.
In the Codons section select START/STOP codons. Select Show conservation regions. In the Line numbering section select Relative to coordinate systems. In the Number of base pairs per row section select 60bps.

In the next exercise we will use the OrthoMCL database (a database of orthologs) to find paralogs and orthologs but many genes do not have an entry in OrthoMCL. The most complete alternative collection of precomputed orthologs is Ensembl so it's a good idea to start here.

Dedicated databases of groups of orthologues

OrthoMCL database

OrthoMCL is a database that contains groups of orthologs. You can search for orthologs by doing a text-based (red box on figure) or a sequence-based (green box on figure) search.

For multiple sequence alignment, we need a FASTA file containing sequences of orthologs. Note that you should take sequences from organisms closely related to your protein of interest. To improve the value of the alignment (to make it more generally applicable) you should incorporate one outgroup (= a sequence from a more distantly related organism).

How to download sequences as FASTA file ?
Register to be able to easily download the sequences that you need from this website. In the List of sequences, click the basket icon of the sequences you want to download. When a sequence is added to your basket, the basket icon is colored green. Click the My Basket link at the top of the page (red box on figure): To download the sequences, click the Download link (green box on figure). This opens the download page. Save the sequences in FASTA protein sequence format.

If you have sequenced an organism, you can use the OrthoMCL website to upload a full proteome (a list of all protein sequences in an organism) and map the uploaded proteins to existing groups in OrthoMCL.

How to upload a full proteome for mapping to OrthoMCL groups ?
In the top menu expand Tools Select Assign your proteins to groups Upload a FASTA file containing your protein sequences. More information can be found in Fisher et al., 2012.

Sequence similarity searches: BLAST, OrthoMCL...

Contents

BLAST

Exercise 1: peptide from poorly growing E. coli

Exercise 2: Locust essential genes

Exercise 3: low complexity protein

Sequence similarity info in Ensembl

The Location page

The Gene page

Dedicated databases of groups of orthologues

OrthoMCL database

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox