Archive for Module 2

From BITS wiki
Jump to: navigation, search

FastA at the EMBL-EBI

Introduction of FastA

Pearson's FastA finds similar sequences in a database given a query sequence. FastA is older than the popular BLAST, it is slower, but sometimes finds things that BLAST does not find, especially for non-coding DNA (e.g. promotor sequences). We will use the FastA server of the EMBL-EBI, which has a richer collection of databanks. Go to http://www.ebi.ac.uk, then select the menu item "Tools"/"Similarity & Homology"/"FASTA".
Error creating thumbnail: Unable to save thumbnail to destination

We will search with a fragment of the bacteriophage lambda genome containing the third right operator (OR3) and the right promoter (PR) to find similar sequences in the bacteriophage section of EMBL databank. (Note:running against the complete EMBL databank as compared to the bacteriophage section would take to much time for this exercise session - remember that fasta is rather slow).


By default the FASTA search page is set up for protein similarity search. Since we want to search nucleotide, we have to switch to the Nucleotide Databases' under 'Other types' in a box at the right side of the screen.

Error creating thumbnail: Unable to save thumbnail to destination

After that, we can select the database and our query sequence.

Error creating thumbnail: Unable to save thumbnail to destination


We will choose the default FASTA program, and click
Error creating thumbnail: Unable to save thumbnail to destination



Exploration of the output:

Error creating thumbnail: Unable to save thumbnail to destination


  • fasta program is run on the server of EBI with the parameters you provided. The "raw" output of the fasta program (text) has been postprocessed by the website into a nice layout. So now you are presented a list of sequences similar to your query sequence.
  • First you might check the parameters by going to Submission details. Check if the database should be 'em_rel_std_phg'. Then go back to the summary page.
  • In the actual fasta output you find the list of "hits", ordered by E-value. You can group them on the different columns by clicking on the header.

Error creating thumbnail: Unable to save thumbnail to destination


The first value you should check, is the E(xpect)-value, which is number of similar hits happen purely by chance (similar = similar length in a similar database). So 5.5E-13 means that once in 5.5*10^13 searches with a similar but random query and a similar but random database, such a hit will be found (by chance). So 5.5E-13 is a good E-value. If the databank sequence is longer, the E() values are better (lower) and the scores will be higher. This is because the chance to obtain a similar score for an unrelated sequence of same length is reduced.

The column called 'Length', 'identities' and 'positives' should explain themselves: the length of the hit found, the % of identical residues, and similar residues. In case we are dealing with nucleotides (as now), similar has been put equal to identical.


  • FASTA searches not only the query sequence but also its complement against the databank; this makes sense because we do not know in advance the orientation of eventual homologues in the databank. You can see this from the coordinates in the alignment for the first two shortest hits.


  • To have nice graphical view of the results, click on Visual Output
Error creating thumbnail: Unable to save thumbnail to destination


Summary

  • You might use Fasta when searching non-coding nucleotide sequences (UTR, promotor regions,...)
  • Unfortunately, Fasta webinterface at EBI accepts only one sequence per submission


The NCBI VecScreen tool

Databanks sometimes contain errors. In this exercise we will prove that a human sequence from EMBL still contains some piece of cloning vector. We will use the VecScreen tool of the NCBI, which performs a BLAST search against the UniVec databank, which contains vector sequences from Genbank and some never published sequences from commercial sources, selected in such a way as to provide a nonredundant set.

You can start by taking a look at the file M34511 in the Nucleotide database of NCBI. Then, go back to the NCBI BLAST home page and follow the link "vector contamination" in the "Specialized BLAST" list.

Error creating thumbnail: Unable to save thumbnail to destination

Copy-and-paste the accession number and "Run VecScreen". In the result page you will have to click "View report". The interpretation is rather obvious : the sequence is contaminated, although the reported exon falls outside the vector remnant. Note by the way that if you used a similar tool at the EBI, which runs a BLAST against EmVec, you would not have seen the contamination, because the vector sequence is not in EMBL.

BLAST: Mass spec peptide


Pairwise comparisons

Nice graphical comparison of two sequences, using SIM6 and LALNVIEW

The SIM6 program of Huang and Miller calculates the best local alignment, and also a series of non-overlapping suboptimal alignments. Laurent Duret made a modified version of SIM that writes output in .lav format, suitable to be visualized with the graphical viewer LALNVIEW.
Go to the web site http://www.expasy.ch/tools/sim-prot.html. Although SIM can handle nucleic acids as well as proteins, the Web page at the SIB is configured only for proteins. Note: it is possible to download and compile the source code of the modified SIM and use it locally, also for nucleic acids.

We have already mentioned the "kringle" domains. While the tissue plasminogen activator contains two "kringles", plasminogen contains five and the urokinase-type plasminogen activator contains only 1. Type plmn_human in the "AC or ID" box for "SEQUENCE 1" and urok_human in the "AC or ID" box for "SEQUENCE 2", then click "Submit". You will get a page with the classic SIM output.


Error creating thumbnail: Unable to save thumbnail to destination


Click on the "here" from "Click here to view these alignments..." and save the file 'view-laln-file.pl' (or whatever you wish to rename it). Start LALNVIEW on your PC: you can find it under "All programs". Select "File"-->"Open" the file.

Lalnview.png

You see 2 alignments with a score of more than 120 (BLOSUM62) represented by coloured bars. You can click on any part of the coloured bars to adjust the alignment of the bars. Only similar portions with a score higher than 120 are displayed. You can lower this score by replacing 120.0 in the "Similarity Score Threshold" box by 50.0 and press <enter>.

Error creating thumbnail: Unable to save thumbnail to destination


You will then see 5 alignments. To get more information on a portion of the sequence, note the additional bars above and below the upper and down sequence bar. Clicking on those will reveal the name of the domain and motifs. Note however that this information has simply been extracted from the annotation that accompanies the sequences in the SwissProt databank, it has not been computed.

Error creating thumbnail: Unable to save thumbnail to destination


We can see that the first 4 correspond to an alignment of a "kringle" of plasminogen with the unique "kringle" of urokinase, the 5th alignment encompasses the last "kringle" of plasminogen as well as all the motifs that follow. If you click on a coloured bar, the alignment will appear in a second window. If you set the threshold to 0 you will see even more alignments, up to 20 ; the reason is that SIM cannot evaluate significance, it continues to compute non-overlapping alignments with ever decreasing scores until no alignment with score higher than 0 can be found or until a limit (here set to 20) is reached.

There exists also a SIM2 program that computes local and suboptimal alignments using the fastA algorithm (and is in fact identical to the lfasta program from the fastA suite); SIM3 and the improved version SIM4 [1] are designed to align a gene with a mRNA, with a provision for finding splice sites.


The graphical way of detecting similarity: dotplots

We will use Dotlet to investigate the usefulness of dotplots. This is a JAVA program, accessible on http://myhits.isb-sib.ch/cgi-bin/dotlet. A dotplot program such as this creates a visual representation of the sequence similarity of two sequences. In the folder C:\data-bioinfo, you will found the sequences X62302 and the M24173. One is a gene sequence, the other cDNA. You can search more details from these sequences by using the accession number.

Let's put those two sequences into DotLet.

Dotlet1.png

Using the input button, you can import two sequences: this must be done by copy-paste. For every sequence, you have to click this button.Select from the second and third dropdown box respectively the two sequences. Click compute with the default settings.

Error creating thumbnail: Unable to save thumbnail to destination



The left graph shows you the dotplot graph, the right graph shows you the score distribution. Above the right graph, the parameters are summarized.

On the dotplot graph, one sequence is placed on the vertical axis, another on the horizontal axis. If the residue on the Y-axis is identical to the X-axis, a white dot is drawn. If the residue are different, a black dot is drawn. The complete graph looks as a random collection of black and white dots. But where the sequences are similar, white diagonal lines are visible. To zoom out, change the last button at the top, from 1:1 to 1:7. The button before that button, which is set by default on 15, you can set this to 43. This is the window size: to draw a white dot, if set to 15, the 7 residues before and the 7 residues after that position needs to be identical.

Click compute with the new settings, and you will see a more meaningful plot.
The longer sequence (gene) is on the X-axis, the shorter (cDNA) on the Y-axis. In the dotplot window there are crossed blue lines, you can drag the crosspoint with the mouse, causing the text in the sequence window at the bottom to follow, according to the position of the cross. You can move one base at the time by using the arrow keys. You can estimate the positions of the three exons (and by consequence of the two introns) by projecting the diagonals of similarity on the horizontal axis. The blue lines are a great help for doing this with precision.

Error creating thumbnail: Unable to save thumbnail to destination




Dotplot of protein sequences

The human tissue plasminogen activator precursor (in UniProt/SwissProt with ID TPA_HUMAN and AC P00750) contains two "kringle" domains. "kringles" are protein domains (protein sequence parts that are shared between different proteins - see next module) that have a typical structure and are involved in the interaction of certain blood proteins with each other and with cell membranes. We will see protein domains in the next part: sequence analysis. But for now: to know more about this protein domain, you may want to check this site.

You can find the file TPA_HUMAN.fasta in the C:\data-bioinfo folder, to solve the next exercise.


The sequences that will save your scientific life

An experiment of yours concludes that two proteins are similarly regulated. So you get the sequences of those proteins. Funny though, at first sight, you cannot find a similarity between the proteins. Is this a mistake?

>Similarly_regulated_protein1
KGMYYKNEHFTDKCSGNPFHGQSCAFHNIWIPRLFGPQEKVEPASCTHCGDGLYSAPQAL
DLEEQTAWTSGPWTCRKYVPVCNKEWLNFKDIEQCGCGGHKLPYYVMSSMMWTPGTTHVK


>Similarly_regulated_protein2
GEHFHPTNDDMCGCGGHKLWFQEQHGIFALKSPEDYMDFGQMQSTNTIDWIALNVCWEYI
ECFTLPTCTAWWKKGSLCLFDHFCIKCWMDHIWLNHIMMDNCRNNYRSHSMGINLMVFLW


Alignment editors: GeneDoc

Start GeneDoc and open the file ADHs.msf using the "File"/"Open" menu. You will see that GeneDoc has displayed the alignment using different shades of grey to indicate the degree of conservation. GeneDoc can present the alignment in a variety of ways and is very suited to prepare alignments for publication. To edit your alignment, put Genedoc in "Grab and Drag" mode: do this by selecting with the mouse the "Arrange"/"Grab and Drag Sequences" menu option, but also by clicking on the most left icon
Error creating thumbnail: Unable to save thumbnail to destination
in the toolbar or by typing <Ctrl-A>. Now you can use the mouse to slide amino acids as beads on a wire. Do some editing of the alignment. Hover with your mouse over the buttons on the toolbar to discover their function. Print your results out as pdf.



Alignment editors: BioEdit

Another good free editor (but not well maintained) I want you to know is BioEdit. It has alignment algorithms build in. It can read a variety of sequence and alignment formats, can color the sequence on different criteria and is very convenient for editing. In addition, you can ask statistics of the alignment, such as consensus sequence, similarity and identity matrices etc.


Use BioEdit to align the 8 plant Histone H1 sequences in fasta format you have retrieved during previous exercises. First open the fasta file, and next, choose 'accessory application' -> 'ClustalW multiple alignment'

Error creating thumbnail: Unable to save thumbnail to destination


Once the alignment is completed, you get the colourful results which can edited it further.

With
Error creating thumbnail: Unable to save thumbnail to destination

the above selector you can switch between 'viewing' and 'editing' mode: in the Select/slide mode, the mouse can be used to select sequence parts. Once selected, sequences can be copy-pasted. In the edit mode, the mouse acts as a cursor and sequences can edited. In the Grab/drag mode, the mouse can be used to drag a sequence forward in the alignment. However, BioEdit is not being properly maintained anymore, so be careful for crashes and save your work regularly.
The alignment can be coloured according to the different similarity matrices. Go to 'Alignment' -> 'Similarity Matrix (for aligment and shading)'. Default, the Identity matrix is selected. You can select other, for example, PAM40.

Note that at the C-terminal end a lot of prolines appear (embedded in basic residues as R and K). We will demonstrate the following principle in Bioedit.


Go to 'File' -> 'Graphic View' for a graphical view of the alignment. You can set numerous shading options. To export the file, export it as 'rich text format' (rtf). This can be opened in Word and be included in an article.

Summary

Pairwise similarity detection can be done with graphical methods (dotplots), optimal methods (water and needle), which take a long time to compute, and heuristic methods (BLAST), which are fast, but do not guarantee you the optimal result.


Your gene of interest

Search the position in the genome using BLAT and extract the sequence

For your gene of interest, use the genomic nucleotide sequence from RefSeq to do a BLAT search to the genome, using the BLAT search portal. BLAT can align a nucleotide sequence to genome if the sequence similarity is high enough (>80%).

Error creating thumbnail: Unable to save thumbnail to destination



The results-page lists portions of the genome with high similarity. Don't pick the 100% similarity, but look for a hit with lower percentage similarity. How similar is this portion of the genome? How long is the hit? Can you extract the nucleotide sequence of the genome of that hit in fasta format?

Hint: click on the second entry in the left panel

Error creating thumbnail: Unable to save thumbnail to destination



Align the genomic sequence

Align the genomic sequence with all other sequences that you have retrieved in Module 1 exercises. Use any clustal omega (e.g. from ebi http://www.ebi.ac.uk/).

  • How divergent are the sequences from that genomic sequence part?
  • Can you generate a logo from the most similar part?
Go to parent Basic bioinformatics concepts, databases and tools#Exercises_during_the_training



Go to parent Basic bioinformatics concepts, databases and tools#Exercises_during_the_training