Pairwise sequence alignments

From BITS wiki
Jump to: navigation, search
Go to parent Basic bioinformatics concepts, databases and tools#Exercises_during_the_training

Pairwise sequence comparison is widely used in sequence analysis. We can discern three approaches:

  • graphical techniques (fast dotplots)
  • dynamic programming techniques (slow)
  • heuristic methods (fast)

    Dynamic programming: global and local approaches

    An optimal alignment can be calculated, if you provide a scoring scheme (for matches and mismatches) and a way to deal with gaps (gap penalty and gap extension scores). Two popular algorithms for this differ in the way they align the sequences: either they make the ends of the sequences match (global alignment) or they don't require the ends of the sequences to match (local alignment).

    Exercise 1: MYOD1

    We will compare the MyoD1 proteins of several species: human, mouse and fruitfly. We can compare them pairwise. The MyoD1 gene plays a role in the differentiation of myoblasts and its protein product contains a DNA-binding helix-loop-helix motif (see module 3).
    Go to the EBI's pairwise alignment tools website. This page gives you access to the EMBOSS programs needle and water (which implement, as you might guess, the Needleman-Wunsch and the Smith-Waterman algorithms).
    Use needle (global alignment) to align the human and mouse MyoD1.

    • Upload or paste human MYOD1 and mouse MYOD1 into the upload boxes
      For the moment, we will go with the default settings for scoring matrix, gap opening penalty and gap extension penalty.
    • Click Submit

    The results are what we expected: the two homologous sequences are nicely aligned over the complete length.


    Important parameters can be found in the header of the file:

    • The scoring scheme that was used (substitution matrix, gap opening penalty, gap extension penalty)
    • The length, number of identical and similar amino acids, number of inserted gaps, and similarity score of the obtained alignment

    The BLOSUM62 matrix is used, although EMBOSS calls it EBLOSUM62. EMBOSS considers two amino acids similar if their corresponding value in the BLOSUM62 matrix is positive (identical amino acids are also counted as similar).
    Below the header the report shows the best alignment that it could make with our parameters settings. You see that this alignment was not really difficult.

    EMBOSS uses an output style where aligned amino acids are connected by: | if they are identical, : if they are positive and . otherwise.

    Of course, it gets more challenging with more divergent sequences. Download the fruitfly MyoD1 sequence.

    Let's try to get a more compact alignment by fine-tuning the parameters.

    From the global alignments, it is clear that the central portion of the sequence aligns best.
    Let's see what water (local alignment) tells us about these sequences.

    So, local alignments can help you to align only the best matching portions of a sequence.

    The needle and water algorithms can also be used to align DNA molecules: click Nucleotide on EBI's pairwise alignment tools website. Aligning DNA causes the programs to use a +5/-4 match/mismatch scoring scheme (EDNAFULL matrix of EMBOSS).

    Perform a global alignment of the human and the fruitfly MyoD1 CDS.
    Reminiscing the results of the global protein sequence alignment, it might be a good idea to use a gap extension penalty of 5 again.

    One more important thing: local alignment will align the best aligning portion of your sequence. Sometimes, multiple similar regions exist in your sequences, but local alignment will only align the best and/or longest match, hiding the other similar regions...
    Keep this in mind! Always try multiple tools and multiple parameter settings for aligning your sequences.

    The graphical way of detecting similarity: dotplots

    Exercise 1: mouse embryonic zeta-globin

    We will use Ugene to investigate the usefulness of dotplots.

    We want to see the location of the introns so we need a dot plot with a small window. The window size is called minimum repeat length in Ugene: we will use 10 nt.

    You can minimize the windows with the sequences using the buttons indicated in the figure.


    On the dot plot you see that a region near the end of the mRNA sequence is repeated in the gene sequence (green on the figure):


    You see that you can easily identify introns and exons using dot plots of mRNA and corresponding gene sequences.

    Exercise 2: human TPA

    Human apolipoprotein A (UniProtKB/Swiss-Prot ID: P08519) contains 38 kringle domains. Kringles are protein domains (see later in this module) that have a typical structure and are involved in the interaction between blood proteins and cell membranes.

    Create a dot plot of the protein against itself in the same way as in the previous exercise (word size = 10; threshold = 100%)

    This creates a rather spectacular plot with a perfect diagonal line as we expect from aligning a sequence to itself and a very high number of repeats.
    By clicking a repeat near the upper left corner of the plot we can see the location and the sequence of the repeat. When you open the human apolipoprotein A sequence in a text editor and you look for this sequence, you see that it is indeed repeated many times just as the dot plot shows you:


    So we have visualized the kringle but the plot looks a bit messy especially at the right and bottom edges.

    When you look in the sequence you see that kringle domains are very similar but not entirely identical so it might be a good idea to choose a lower threshold.


    Apparently, the kringles at the end of the sequence are less conserved, which is indeed true when you look at the sequence:


    Exercise 3: Compare chromosomes

    One of the most useful applications of dot plots is that they give an easily interpretable representation of the alignment of very long sequences like full chromosomes.

    We will compare the complete RefSeq sequence of Chromosome I and Chromosome VIII of Saccharomyces cerevisiae S288c.

    we will create a dot plot of the two sequences using word size = 50 and threshold = 100%.

    Three of these inserted sequences lie parallel to the main diagonal. The fourth one is perpendicular to the main diagonal,