Multiple sequence alignment
- 1 FAQ
- 2 Generating alignments
- 3 Changing and editing alignments
- 4 Constructing a phylogenetic tree
- 5 Improving the visualisation of the tree with PhyD3
- 6 Sequence logos
Exercise 1: toy example comparing several algorithms
Several multiple sequence alignment (MSA) algorithms exist, each with its own program and format. Fortunately, you can also find tools to convert between MSA formats.
Let's discover how some of these programs behave, and which suit your needs best! As always, EBI provides a nice collection of web-based alignment tools but you can also make MSAs in Ugene. Some MSA tools allow manual editing of alignments, which will be discussed in the next exercise.
Open Ugene. We are first going to use different tools for the toy example below:
>Sequence1
GARFIELDTHELASTFATCAT
>Sequence2
GARFIELDTHEFASTCAT
>Sequence3
GARFIELDTHEVERYFASTCAT
>Sequence4
THEFATCAT
>Sequence5
GARFIELDTHEVASTCAT
You can download the sequences in FASTA format. Do not open the text file in Ugene. The tools for multiple sequence alignment can be found in the Tools menu.
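If you want to inspect the toy sequences outside Ugene, the FASTA layout is simple enough to read with a few lines of Python. The sketch below is only an illustration: the names `TOY_FASTA` and `parse_fasta` are made up for this example and are not part of Ugene or any alignment tool.

```python
# Toy sequences from the exercise, in FASTA format.
TOY_FASTA = """\
>Sequence1
GARFIELDTHELASTFATCAT
>Sequence2
GARFIELDTHEFASTCAT
>Sequence3
GARFIELDTHEVERYFASTCAT
>Sequence4
THEFATCAT
>Sequence5
GARFIELDTHEVASTCAT
"""


def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequences may span several lines
    return records


sequences = parse_fasta(TOY_FASTA)
for name, seq in sequences.items():
    print(name, len(seq), seq)
```

The lengths differ between the sequences, which is exactly why an alignment program has to insert gaps.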
|Parameters of Clustal Omega.|
|Clustal Omega has the following parameters:
Iterations come at a cost: they increase the run time. Each iteration adds one to three times the time it takes to make the alignment without iterations. That's why the default setting is to not use iterations. The sequences in this example are too short and simple to show the impact of iterations, so we are going to use the default settings.
|Align these sequences with Clustal Omega using default parameters.|
This opens the Clustal Omega parameters window:
This results in the following alignment:
Ugene colours the amino acids according to percentage identity. Residues that are conserved in many sequences are colored darker blue than residues that only occur in a few sequences.
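To make the idea of percentage identity concrete, here is a minimal sketch that computes, for each column, the percentage of sequences that share the most common residue. The exact formula Ugene uses is not given here, so treat this definition as an assumption. The small alignment is hand-made for illustration and is not the output of any program.

```python
from collections import Counter


def column_identity(alignment):
    """Percentage of sequences sharing the most common residue, per column.

    ASSUMPTION: gaps ('-') are never counted as the most common residue,
    but they do count in the total number of sequences.
    """
    n = len(alignment)
    result = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            result.append(0.0)  # column with gaps only
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        result.append(100.0 * top_count / n)
    return result


# Hypothetical hand-made alignment (all rows have the same length).
alignment = ["THEFAT", "THECAT", "THE-AT"]
identity = column_identity(alignment)
print([round(value, 1) for value in identity])
```

Columns with a high value would be drawn in darker blue, columns with a low value in lighter blue.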
You can change the coloring scheme by clicking the Highlighting button in the right menu.
|Color the alignment according to the Zappo scheme.|
|Change the coloring scheme by clicking the Highlighting button (red) in the right menu.
The Zappo scheme colors the sequences according to their biophysical properties:
- negatively charged amino acids (D and E) are coloured red
- positively charged amino acids (R, H and K) are coloured blue
- amino acids with polar uncharged side chains (S, T, N and Q) are coloured green
- amino acids with aromatic side chains (F, Y and W) are coloured orange
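The groups listed above can be written down as a simple lookup table. This sketch only covers the four groups named in the list; residues outside those groups are reported as "unlisted" rather than guessed. The names `ZAPPO_GROUPS` and `colour_sequence` are invented for this example.

```python
# Colour groups exactly as listed in the text above.
ZAPPO_GROUPS = {
    "red": "DE",       # negatively charged
    "blue": "RHK",     # positively charged
    "green": "STNQ",   # polar uncharged
    "orange": "FYW",   # aromatic
}

# Invert the table: one entry per amino acid.
ZAPPO_COLOUR = {
    aa: colour for colour, residues in ZAPPO_GROUPS.items() for aa in residues
}


def colour_sequence(seq):
    """Return (residue, colour) pairs; residues not in the list are 'unlisted'."""
    return [(aa, ZAPPO_COLOUR.get(aa, "unlisted")) for aa in seq]


print(colour_sequence("THEFATCAT"))
```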
At the top of the alignment you see the consensus sequence giving a one line representation of the alignment:
- if each sequence contains the same amino acid on a position: the amino acid is printed in capitals
- if the majority of the sequences contain the same amino acid on a position: the amino acid is printed in small letters
- if all amino acids on a position are similar: a + is printed
- if the amino acids on a position are different: a - is printed
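The four rules above can be sketched in Python as follows. Several details are assumptions made for this illustration, not Ugene's actual behaviour: the rules are applied in the order listed, "majority" means more than half of the sequences, a gap never becomes the consensus letter, and the similarity groups are hypothetical (borrowed from the colour groups mentioned earlier). The alignment is hand-made.

```python
from collections import Counter

# ASSUMPTION: hypothetical similarity groups, for illustration only.
SIMILAR_GROUPS = [set("DE"), set("RHK"), set("STNQ"), set("FYW")]


def consensus_symbol(column):
    """Return the consensus character for one alignment column."""
    residue, count = Counter(column).most_common(1)[0]
    if residue != "-" and count == len(column):
        return residue.upper()       # identical in every sequence
    if residue != "-" and count > len(column) / 2:
        return residue.lower()       # identical in the majority
    if any(set(column) <= group for group in SIMILAR_GROUPS):
        return "+"                   # all residues are similar
    return "-"                       # residues differ


def consensus(alignment):
    """Build the one-line consensus for a list of aligned sequences."""
    return "".join(consensus_symbol(col) for col in zip(*alignment))


# Hypothetical hand-made alignment (not program output).
alignment = ["THEFATG", "THEFATK", "SHEYATL", "THDWA-P"]
consensus_line = consensus(alignment)
print(consensus_line)
```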
Now we will make the alignment using ClustalW2.
|Parameters of ClustalW.|
In contrast to Clustal Omega, which replaces amino acids by numbers to speed up pairwise similarity score calculations, ClustalW builds the guide tree based on real pairwise alignments.
|Make the alignment using default parameters.|
|Using the default parameters, ClustalW2 returns the following alignment:
Although EBI advises using Clustal Omega (see the Please Note message on the ClustalW2 page), you can see that for this example the ClustalW2 alignment is better.
|Parameters of MUSCLE.|
Another increasingly popular alignment algorithm is Muscle. It has the following parameters:
|Align the Garfield sequences with MUSCLE.|
Using the default settings, the following alignment is generated:
The MUSCLE alignment looks even better than the ClustalW2 alignment.
Another alignment algorithm is MAFFT, which states that it is one of the most accurate multiple sequence alignment methods currently available.
|Align the Garfield sequences with MAFFT using the default parameters.|
|MAFFT parameters refer to calculating similarity scores from pairwise alignments and the number of iterations you want to perform. It gives exactly the same alignment as Muscle.|
From the examples above, we see that these four algorithms give three different alignments. We can see this easily with this small toy example, but when you want to align large sets of sequences it's not so easy to see which algorithm performs best.
If you have hundreds of sequences to align, you have to take processing speed into account. In this case, your best options would be MAFFT (best quality, a bit slower) and Clustal Omega (very fast, good quality).
The most popular algorithm is ClustalW, which makes use of the progressive alignment algorithm that was described in the slides but can add iterations for refinement. However, this alignment algorithm is slow, compared to the other algorithms.
A lot of multiple sequence alignment programs exist. Make your selection of MSA programs based on:
1. what you have access to
2. the number of sequences
3. the type of sequence (DNA/protein)
Changing and editing alignments
Most of the time, you are not perfectly happy with an MSA that is generated by an MSA tool and you want to change the alignment yourself. You can use these free alignment editors:
At the bottom of an alignment you see the overview, which shows the coverage of the complete alignment. Using the overview, you can see the regions of the alignment with many gaps and those without gaps. You can navigate to these parts of the alignment by clicking a region in the overview.
You can change the information shown by the overview. By right-clicking the overview you can choose to show the "Simple" overview, which is a bird's-eye view of your alignment in the selected color scheme.
Alternatively, you can navigate the alignment by dragging the sliding window to move across the alignment in the main window.
Changing the colors
|How to colour the alignment according to conservation using percentage identity as a measure?|
In the Right menu:
Now each position in the alignment is coloured according to this colour scheme: the darker the blue, the more conserved the position is.
Selecting a part of the alignment
|How to select a part of an MSA?|
Editing the alignment
Editing is done in two directions:
- You delete divergent sequences from the alignment.
- You also remove uninformative positions: these are positions that do not contain information on the evolutionary relation between the sequences. These positions do not contain phylogenetic information since you don't have a sequence for the other organisms there. The only thing it tells us is that these residues exist in one organism but not in the others. So you have to remove positions where you only have a sequence for one organism and not for the others.
|How to remove divergent sequences from the alignment?|
To remove the divergent sequences from the alignment:
Smaller regions with differences can be tolerated.
|How to remove uninformative positions from the alignment?|
You can do this by selecting a subalignment (the part in the alignment you want to remove):
The position will disappear. You see that editing the alignment can be a lot of work, especially for proteins that are not very conserved. Fortunately, more and more tools for constructing phylogenetic trees will remove these positions automatically.
In the same way you could remove positions where all sequences agree since they are also not informative for constructing a phylogenetic tree. However, in practice this is never done because an alignment without fully conserved positions looks strange.
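The rule for uninformative positions (only one sequence has a residue, all others have a gap) is easy to express in code. The sketch below is a hand-rolled illustration, not what Ugene or any tree-building tool does internally. The function name and the small alignment are made up for this example.

```python
def remove_single_sequence_columns(alignment):
    """Drop columns in which at most one sequence has a residue.

    Such columns only tell us that a residue exists in one organism,
    so they carry no information about how the sequences are related.
    """
    kept = [col for col in zip(*alignment) if sum(c != "-" for c in col) > 1]
    if not kept:
        return ["" for _ in alignment]  # every column was removed
    return ["".join(row) for row in zip(*kept)]


# Hypothetical hand-made alignment: only the first sequence has "VERY".
alignment = [
    "THEVERYFAT",
    "THE----FAT",
    "THE----CAT",
]
trimmed = remove_single_sequence_columns(alignment)
print("\n".join(trimmed))
```

Fully conserved columns are deliberately left in place, in line with the common practice described above.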
After you have removed all divergent sequences and uninformative positions, you can use the alignment for phylogenetic tree construction. Save the project by clicking the Save All button in the top toolbar.
Constructing a phylogenetic tree
Ugene has three methods to calculate a tree:
- one based on maximum likelihood
- one based on neighbour joining
- one based on Bayesian statistics
|How to create a tree in Ugene?|
|In the top menu:
This opens the Build Phylogenetic Tree window
Set the Tree building method to the algorithm you want to use.
The NJ method (Phylip) does not use a real model of evolution (only a score matrix). It simply calculates the distance between the sequences and assumes that these sequence distances reflect the genetic distances between the species.
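To illustrate what "calculating the distance between the sequences" means, the sketch below builds a pairwise distance matrix from an alignment. Note the swap: it uses the simple p-distance (fraction of differing positions), not the score-matrix-based distance that Phylip computes. The alignment is hand-made. A neighbour-joining program would take a matrix like this as its input.

```python
def p_distance(a, b):
    """Fraction of positions that differ between two aligned sequences.

    ASSUMPTION: positions with a gap in either sequence are ignored.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # nothing to compare: treat as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Symmetric matrix of pairwise p-distances."""
    return [[p_distance(a, b) for b in alignment] for a in alignment]


# Hypothetical hand-made alignment (not program output).
alignment = ["THEFAT", "THECAT", "THE-AT"]
matrix = distance_matrix(alignment)
for row in matrix:
    print(["%.3f" % d for d in row])
```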
|Parameters of Phylip.|
To enable bootstrapping go to the Bootstrapping and Consensus Tree tab. The following bootstrapping parameters are available:
The Display options tab specifies how to display the tree.
PhyML and MrBayes use a model of evolution in their calculations.
|Parameters of PhyML.|
On the Branch support tab you select the method that is used to measure branch support: a fast likelihood method or bootstrapping.
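Bootstrapping measures branch support by rebuilding the tree many times from resampled alignments. The resampling step can be sketched as follows: columns are drawn at random, with replacement, until the replicate is as long as the original alignment. This is a generic illustration with a hand-made alignment and invented names. It does not reproduce the PhyML or Phylip implementation, and the tree-building step itself is left out.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in columns) for seq in alignment]


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical hand-made alignment (not program output).
alignment = ["THEFAT", "THECAT", "THE-AT"]
replicates = [bootstrap_replicate(alignment, rng) for _ in range(3)]
for replicate in replicates:
    print(replicate)
```

A tree is then built from every replicate, and the support of a branch is the fraction of replicate trees that contain it.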
The Display options tab specifies how to display the tree.
|Parameters of MrBayes.|
|How to change the width of the branches to get a nicer looking tree.|
Red box on the figure:
The numbers on the tree reflect evolutionary distances, expressed in expected number of substitutions per position in the alignment. The number on a branch is based on the combined scores of all possible trees that contain that specific branch.
Improving the visualisation of the tree with PhyD3
Ugene automatically saves trees in Newick (.nwk) format. This format is accepted by PhyD3, so you can play with the tree in PhyD3. PhyD3 is a cool interactive tree viewer developed at VIB.
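A Newick file is plain text: nested parentheses describe the branching, and a number after a colon gives a branch length. The sketch below pulls the leaf names and their branch lengths out of a small hypothetical tree (not one produced by Ugene). It is deliberately minimal: quoted names, comments, and support values, which real Newick files may contain, are not handled.

```python
import re

# Hypothetical tree with four leaves, for illustration only.
NEWICK = "(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);"

# A leaf is a name that directly follows "(" or "," and is followed by
# ":" and a branch length.
LEAF_PATTERN = re.compile(r"[(,]([^():,;]+):([0-9.]+)")


def leaf_branch_lengths(newick):
    """Return a dict mapping each leaf name to its branch length."""
    return {name: float(length) for name, length in LEAF_PATTERN.findall(newick)}


leaves = leaf_branch_lengths(NEWICK)
print(leaves)
```

The internal branch length (0.5) is skipped on purpose, because it follows a closing parenthesis rather than a leaf name.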
|How to load the tree in PhyD3?|
The tree that is displayed is a phylogram: the length of the branches represents evolutionary distances. If you want to display a cladogram, deselect Show phylogram.
Newick is a pretty basic format but if you use PhyloXML files you can include sequences, taxonomy and domain annotations. PhyD3 can incorporate this info in the phylogenetic tree image:
Sequence logos
A good way to visualise alignments with lesser sequence similarity (like the one of the histone proteins) is by sequence logos. Good tools for generating sequence logos are
- iceLogo: a VIB tool that generates a logo by calculating frequencies of amino acids in each position of the MSA and comparing them to a reference set (typically the full proteome of an organism). In this way you can account for the fact that amino acids that occur frequently in the proteome will by nature also occur frequently in your MSA and are 'less relevant' than amino acids that occur rarely in the proteome.
- WebLogo: does not use a reference set, so it assumes all amino acids occur equally often in the proteome
- enoLOGOS (Energy NOrmalized logos): corrects for biases in amino acid distribution.
On the Y-axis scores are shown in bits; the X-axis shows the position in the alignment. Bits are calculated on a log2 scale. Since there are 20 amino acids, the maximum value is log2(20) = 4.32 bits. In a sequence logo the height of an amino acid reflects its frequency in the alignment and is presented in bits.
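The bit values can be checked with a short calculation. In the sketch below the information content of a column is log2(20) minus the Shannon entropy of its residue frequencies, and each letter gets a share of that height proportional to its frequency. This is the textbook formula. The small-sample correction that logo tools may apply is left out, and gaps are simply ignored, both of which are assumptions of this example.

```python
import math
from collections import Counter

MAX_BITS = math.log2(20)  # 20 amino acids -> about 4.32 bits


def column_information(column):
    """Information content of one alignment column, in bits."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in Counter(residues).values()
    )
    return MAX_BITS - entropy


def letter_heights(column):
    """Height of each letter: its frequency times the column information."""
    residues = [c for c in column if c != "-"]
    info = column_information(column)
    return {aa: (n / len(residues)) * info for aa, n in Counter(residues).items()}


print(round(MAX_BITS, 2))
print(round(column_information("AAAA"), 2))  # fully conserved column
print(letter_heights("AACC"))                # two residues, equally frequent
```

A fully conserved column reaches the maximum height, while a column split evenly between two residues loses exactly one bit.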
|How to create a logo with WebLogo|
|Paste the selection in the Multiple sequence alignment box in WebLogo, increase the height of the logo to 8 cm and click Create Logo: