Multiple sequence alignment

From BITS wiki

Generating alignments

Exercise 1: toy example comparing several algorithms

Several multiple sequence alignment (MSA) algorithms exist, each with its own program and output format. Fortunately, you can also find tools to convert between MSA formats.

Let's discover how some of these programs behave, and which suit your needs best! As always, EBI provides a nice collection of web-based alignment tools but you can also make MSAs in Ugene. Some MSA tools allow manual editing of alignments, which will be discussed in the next exercise.

Open Ugene. We are first going to use different tools for the toy example below:

>Sequence1
GARFIELDTHELASTFATCAT
>Sequence2
GARFIELDTHEFASTCAT
>Sequence3
GARFIELDTHEVERYFASTCAT
>Sequence4
THEFATCAT
>Sequence5
GARFIELDTHEVASTCAT
You can download the sequences in fasta format. Do not open the text file in Ugene. The tools for multiple sequence alignment can be found in the Tools menu.
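
If you want to inspect the toy sequences outside Ugene, the FASTA format is simple enough to read with a few lines of Python. The sketch below is only an illustration: `parse_fasta` and `TOY_FASTA` are names made up for this example, and the parser assumes plain FASTA with one header line per record.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict {header: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            name = line[1:]      # header line without the '>'
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequence lines may be wrapped
    return records


# The toy example from this exercise
TOY_FASTA = """\
>Sequence1
GARFIELDTHELASTFATCAT
>Sequence2
GARFIELDTHEFASTCAT
>Sequence3
GARFIELDTHEVERYFASTCAT
>Sequence4
THEFATCAT
>Sequence5
GARFIELDTHEVASTCAT
"""

toy = parse_fasta(TOY_FASTA)
for name, seq in toy.items():
    print(name, len(seq), seq)
```

The printed lengths (between 9 and 22 residues) show why gaps are needed: the sequences must be padded to a common length before they can be compared column by column.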

This results in the following alignment:


UgeneMSA3b.png

Ugene colours the amino acids according to percentage identity: residues that are conserved in many sequences are coloured a darker blue than residues that occur in only a few sequences.

You can change the colouring scheme by clicking the Highlighting button in the right menu.

The Zappo scheme colors the sequences according to their biophysical properties:

  • negatively charged amino acids (D and E) are coloured red
  • positively charged amino acids (R, H and K) are coloured blue
  • amino acids with polar uncharged side chains (S, T, N and Q) are coloured green
  • amino acids with aromatic side chains (F, Y and W) are coloured orange
  • ...
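
The groups listed above can be written down as a small lookup table. This is a minimal sketch that covers only the four groups named in the list; the remaining Zappo groups are deliberately left out, and the names `ZAPPO_SUBSET` and `colour_of` are made up for this example.

```python
# Only the four groups listed above; the other Zappo groups are not included.
ZAPPO_SUBSET = {
    "red": "DE",      # negatively charged
    "blue": "RHK",    # positively charged
    "green": "STNQ",  # polar, uncharged side chains
    "orange": "FYW",  # aromatic side chains
}

# Invert the table: one entry per amino acid
RESIDUE_TO_COLOUR = {
    aa: colour for colour, group in ZAPPO_SUBSET.items() for aa in group
}


def colour_of(residue):
    """Return the colour for a residue, or None if it is in a group not listed here."""
    return RESIDUE_TO_COLOUR.get(residue.upper())


print([(aa, colour_of(aa)) for aa in "THEFATCAT"])
```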

At the top of the alignment you see the consensus sequence giving a one line representation of the alignment:

  • if each sequence contains the same amino acid on a position: the amino acid is printed in capitals
  • if the majority of the sequences contain the same amino acid on a position: the amino acid is printed in small letters
  • if all amino acids on a position are similar: a + is printed
  • if the amino acids on a position are different: a - is printed
Above the consensus sequence you see the conservation scores that represent the similarity on each position: the higher the score the more similar the amino acids on that position are.
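
The four consensus rules can be sketched in Python as follows. This is an illustration of the rules as stated above, not Ugene's actual implementation. Three things are assumptions of this sketch: "majority" is taken to mean more than half of the sequences, the rules are applied in the order listed, and the similarity groups in `SIMILAR_GROUPS` are hypothetical (Ugene may group amino acids differently).

```python
from collections import Counter

# ASSUMPTION: illustrative similarity groups, not taken from Ugene.
SIMILAR_GROUPS = [set("DE"), set("RHK"), set("STNQ"), set("FYW"), set("AVLIM")]


def consensus_symbol(column):
    """Apply the four consensus rules to one alignment column (a string)."""
    residue, n = Counter(column).most_common(1)[0]
    if residue != "-" and n == len(column):
        return residue.upper()   # identical in every sequence
    if residue != "-" and n > len(column) / 2:
        return residue.lower()   # identical in the majority of sequences
    if any(set(column) <= group for group in SIMILAR_GROUPS):
        return "+"               # all residues fall in one similarity group
    return "-"                   # residues differ


def consensus(alignment):
    """Build the consensus line for a list of equal-length aligned sequences."""
    return "".join(consensus_symbol("".join(col)) for col in zip(*alignment))


print(consensus(["THEFAT", "THEFAS", "THEVAT"]))  # prints THEfAt
print(consensus_symbol("STN"))                    # prints +
print(consensus_symbol("ADW"))                    # prints -
```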

Now we will make the alignment using ClustalW2.

Although EBI advises using Clustal Omega (see the Please Note message on the ClustalW2 page), you can see that for this example the ClustalW2 alignment is better.

The MUSCLE alignment looks even better than the ClustalW2 alignment.

Another alignment algorithm is MAFFT, whose developers state that it is one of the most accurate multiple sequence alignment methods currently available.

From the examples above, we see that these four algorithms give three different alignments. We can see this easily with this small toy example, but when you want to align large sets of sequences it's not so easy to see which algorithm performs best.

If you have hundreds of sequences to align, you have to take processing speed into account. In this case, your best options would be MAFFT (best quality, a bit slower) and Clustal Omega (very fast, good quality).

The most popular algorithm is ClustalW, which uses the progressive alignment algorithm described in the slides and can add iterations for refinement. However, it is slow compared to the other algorithms.

A lot of multiple sequence alignment programs exist. Make your selection of MSA programs based on:
1. what you have access to
2. the number of sequences
3. the type of sequence (DNA/protein)


Changing and editing alignments

Most of the time, you will not be perfectly happy with an MSA generated by an MSA tool and you will want to change the alignment yourself. You can use these free alignment editors:

  • MEGA (very powerful, for generating, visualizing and editing alignments)
  • SeaView
  • BioEdit
Or you can edit the alignment in Ugene.

Navigating the alignment

At the bottom of an alignment you see the overview, which shows the coverage of the complete alignment. Using the overview, you can see which regions of the alignment contain many gaps and which contain none. You can navigate to these parts of the alignment by clicking a region in the overview.


UgeneMSA20.png

You can change the information shown by the overview. By right-clicking the overview you can choose to show the "Simple" overview, which is a bird's-eye view of your alignment in the selected colour scheme.


UgeneMSA21.png

Alternatively, you can navigate the alignment by dragging the sliding window to move across the alignment in the main window.

Changing the colors

Selecting a part of the alignment

Editing the alignment

Editing is done in two directions:

  • You delete divergent sequences from the alignment.
  • You also remove uninformative positions: these are positions that do not contain information on the evolutionary relation between the sequences. These positions do not contain phylogenetic information since you don't have a sequence for the other organisms there. The only thing it tells us is that these residues exist in one organism but not in the others. So you have to remove positions where you only have a sequence for one organism and not for the others.
If you have sequences that clearly diverge a lot from the rest of the sequences, containing large regions that do not match the others, you should remove them from the alignment before you make a tree. When you are doing this for real (because you want to include a phylogenetic tree in your publication for instance), you should first try to find the reason why the sequences are so different from the rest. In many cases, it will be because of errors in the annotation e.g. an intron that was not correctly annotated. A wrongly annotated intron can have a major impact on the resulting protein sequence. So first check the genomic sequence before you actually remove sequences from the alignment. For time's sake, we skip this step and simply remove divergent sequences from the MSA.

Smaller regions with differences can be tolerated.

The position will disappear. You see that editing the alignment can be a lot of work, especially for proteins that are not very conserved. Fortunately, more and more tools for constructing phylogenetic trees will remove these positions automatically.

In the same way you could remove positions where all sequences agree since they are also not informative for constructing a phylogenetic tree. However, in practice this is never done because an alignment without fully conserved positions looks strange.
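
The rule for uninformative positions described above (a column in which only one sequence has a residue and all the others have a gap) can be sketched in Python. This is a minimal illustration, not what Ugene or any tree-building tool does internally. The function name `drop_single_sequence_columns` and the example alignment are made up, and `-` is assumed to be the gap character.

```python
def drop_single_sequence_columns(alignment, gap="-"):
    """Remove columns in which at most one sequence has a residue.

    alignment: list of equal-length aligned sequences (strings).
    Returns a new list of aligned sequences without those columns.
    """
    columns = list(zip(*alignment))
    kept = [col for col in columns if sum(c != gap for c in col) > 1]
    if not kept:
        return ["" for _ in alignment]
    # Transpose the kept columns back into rows
    return ["".join(row) for row in zip(*kept)]


# Hypothetical alignment: only the first sequence has residues in columns 4-7
example = [
    "THEVERYFAST",
    "THE----FAST",
    "THE----FAST",
]
for row in drop_single_sequence_columns(example):
    print(row)
```

In this example the four columns that spell VERY are removed, because they contain a residue in only one sequence and therefore say nothing about how the sequences relate to each other.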

It is always better to use multiple tools for constructing the MSA and to compare their results. We are not going to do this for the sake of time, but you should when you want to make a phylogenetic tree for your research.


After you have removed all divergent sequences and uninformative positions, you can use the alignment for phylogenetic tree construction. Save the project by clicking the Save All button in the top toolbar.


Constructing a phylogenetic tree

Ugene has three methods to calculate a tree:

  • one based on maximum likelihood
  • one based on neighbour joining
  • one based on Bayesian statistics

The NJ method (Phylip) does not use a real model of evolution (only a score matrix). It simply calculates the distance between the sequences and assumes that these sequence distances reflect the genetic distances between the species.
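
To make the idea of a distance between aligned sequences concrete, here is a minimal Python sketch. It computes the simplest possible distance: the fraction of differing residues over the columns where neither sequence has a gap. This is a deliberate simplification and not what Phylip does, since Phylip uses a score matrix as described above. The function names and the example alignment are made up for illustration.

```python
def p_distance(a, b, gap="-"):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no shared ungapped positions")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Symmetric matrix of pairwise distances for a list of aligned sequences."""
    n = len(alignment)
    return [
        [p_distance(alignment[i], alignment[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical three-sequence alignment
aln = ["THEFATCAT", "THEFASCAT", "THEVAT-AT"]
for row in distance_matrix(aln):
    print(["%.3f" % d for d in row])
```

A neighbour-joining program takes a matrix like this as input and repeatedly joins the pair of sequences (or clusters) that are closest to each other.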

PhyML and MrBayes use a model of evolution in their calculations.

The numbers on the tree reflect evolutionary distances, expressed in expected number of substitutions per position in the alignment. The number on a branch is based on the combined scores of all possible trees that contain that specific branch.

Improving the visualisation of the tree with PhyD3

The tree is automatically saved in Newick (.nwk) format. This format is accepted by PhyD3, so you can play with the tree in PhyD3. PhyD3 is a cool interactive tree viewer developed at VIB.
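
To get a feel for what a Newick file contains, the sketch below builds a Newick string from a small nested Python structure. The tree, its branch lengths and the function name `to_newick` are invented for illustration; they are not the result of this exercise.

```python
def to_newick(node):
    """Convert a nested (label_or_children, branch_length) tuple to Newick text.

    A leaf is (name, length); an internal node is ([child, ...], length).
    Use None as the length for the root.
    """
    label, length = node
    if isinstance(label, list):
        text = "(" + ",".join(to_newick(child) for child in label) + ")"
    else:
        text = label
    return text if length is None else f"{text}:{length}"


# Made-up tree: Sequence2 and Sequence5 are grouped, Sequence4 branches off earlier
tree = (
    [
        ([("Sequence2", 0.1), ("Sequence5", 0.1)], 0.05),
        ("Sequence4", 0.3),
    ],
    None,
)
print(to_newick(tree) + ";")  # prints ((Sequence2:0.1,Sequence5:0.1):0.05,Sequence4:0.3);
```

Parentheses group sequences that share a branch, the number after each colon is a branch length, and the semicolon ends the tree.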

The tree that is displayed is a phylogram: the length of the branches represents evolutionary distances. If you want to display a cladogram, deselect Show phylogram.

Newick is a pretty basic format but if you use PhyloXML files you can include sequences, taxonomy and domain annotations. PhyD3 can incorporate this info in the phylogenetic tree image:

indy10.png


Sequence logos

A good way to visualise alignments with lower sequence similarity (like the one of the histone proteins) is with sequence logos. Good tools for generating sequence logos are:

  • iceLogo: a VIB tool that generates a logo by calculating frequencies of amino acids in each position of the MSA and comparing them to a reference set (typically the full proteome of an organism). In this way you can account for the fact that amino acids that occur frequently in the proteome will by nature also occur frequently in your MSA and are 'less relevant' than amino acids that occur rarely in the proteome.
  • WebLogo: does not use a reference set, so it assumes that all amino acids occur equally often in the proteome
  • enoLOGOS (Energy NOrmalized logos): corrects for biases in amino acid distribution.


Weblogo2b.png

The Y-axis shows scores in bits and the X-axis shows the position in the alignment. Bits are calculated on a log2 scale: since there are 20 amino acids, the maximum value is log2(20) = 4.32 bits. In a sequence logo, the height of an amino acid reflects its frequency in the alignment and is expressed in bits.
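
The bit scores can be reproduced for a single alignment column with a short Python sketch. It uses the standard information-content calculation: the total height of a column is log2(20) minus the Shannon entropy of its residue frequencies, and each letter gets a share of that height proportional to its frequency. This is a simplified illustration under two stated assumptions: gaps are ignored, and the small-sample correction that WebLogo applies is left out. The function name `column_bits` is made up.

```python
import math
from collections import Counter


def column_bits(column, alphabet_size=20, gap="-"):
    """Return {residue: letter height in bits} for one alignment column.

    ASSUMPTIONS: gaps are ignored and no small-sample correction is applied.
    """
    residues = [c for c in column if c != gap]
    if not residues:
        return {}
    total = len(residues)
    freqs = {aa: n / total for aa, n in Counter(residues).items()}
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    information = math.log2(alphabet_size) - entropy  # total column height
    return {aa: p * information for aa, p in freqs.items()}


print(column_bits("AAAA"))  # fully conserved: A reaches the maximum of log2(20) bits
print(column_bits("AAVV"))  # two residues, equally frequent: lower letters
```

A fully conserved column reaches the maximum of about 4.32 bits, while a column split evenly between two amino acids has a total height of log2(20) - 1, or about 3.32 bits, shared equally by the two letters.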