Ex 4 BLAST, fastA, siRNA design

From BITS wiki
Jump to: navigation, search
Go to parent Introduction to Bioinformatics#Exercises_during_the_training
Contributed by Guy Bottu

fastA at the EMBL-EBI

We want to search a sequence against a set of sequences in order to pick out sequences that show a significant degree of similarity with our query sequence. One tool that can do this is Pearson's fastA. fastA is older than BLAST, it is slower but sometimes finds things that BLAST does not find, especially for noncoding DNA.

The U. of Virginia, where fastA is developed does provide a fastA server, but we will rather use the fastA server of the EMBL-EBI, which has a richer collection of databanks. Go to http://www.ebi.ac.uk, then select the menu item "Tools"/"Similarity & Homology"/"FASTA".
Error creating thumbnail: Unable to save thumbnail to destination

We will search a fragment of the bacteriophage lambda genome containing the third right operator (OR3) and the right promoter (PR) against the bacteriophage section of EMBL to see if there exist homologous regulatory elements (running against the complete EMBL databank would take to much time for this exercise session).

Click "FASTA-Nucleotide" to access the program page, then perform the following operations:

  • Under "DATABASES" there is a selection window where you can select (by putting in reverse video) one or more items using the mouse, eventually in combination with the <Ctrl> or <Shift> keys. Here just select "EMBL Standard Phage".
  • Set the "RESULTS" selector to "interactive" (we are confident that the search will not take much time, but if you are afraid to bump on a time-out, it can be a good idea to choose the option of receiving by Email a hyperlink that points to the result as it is temporarily stored on the server)
  • Upload M25165.fasta
  • Finally, "Run FASTA3"

observations:

  • The "raw" output of the fasta3 program has been postprocessed by the interface. You will get a page with on top an overview of the parameters with which fasta3 has been run. Note that "Database" should be em_rel_std_phg. Below you get a list of hits. Note the buttons that allow you to obtain different representations of the result. Click on "VisualFasta", this will lead to a page with a graphical overview of the hits, followed by the "raw" fasta3 output (with just some hyperlinks added).
  • In the actual fasta3 output you find first the list of "hits" and then the alignments between the query sequence and the databank sequences. There are clickable icons that allow you to jump directly from a hit line to the alignment and there are hyperlinks pointing from the accession number of a found sequence to the databank entry in the SRS server of the EBI.
  • Note the different scores (initn, init1, opt, Z-score, bits, E()). Note that fastA searches not only the query sequence but also its complement against the databank; this makes sense because we do not know in advance the orientation of eventual homologues in the databank. You can see whether the orientation of the query sequence is forward [f] or reverse [r].
  • We find a number of perfect matches (opt=470), because the sequence is already in the databanks. We find also a number of times an alignment of the operator OR3 with its own complement on the negative strand (opt=126), which is not surprising since the operator is a palindrome. We find furthermore a few hits with the left operator OL1 of bacteriophage  (opt=113) ; these "hits" involve partial genomes, so that the query cannot be preferentially aligned with OR3, and are already in the "twilight zone" (0.1<E()<10). We also find a number of "hits" with other phages that are close or (in the "twilight zone") not so close relatives of lambda.
  • Note that when the found alignments are the same, the init1, initn and opt scores are always the same, but the Z-score, bits score and E() value) can be different. If the databank sequence is longer, the Z-score (and the E() value) is smaller, because the estimated score for an unrelated sequence of same length is subtracted.

BLAST at the NCBI

We will now use the BLAST server of the NCBI. For this exercise, we will work with simulated data : suppose a peptide of 27 amino acids long has been recovered from the supernatant of an E. coli culture suspected to be infected with an unknown lysogenic bacteriophage. We will search this peptide against the SwissProt databank (which contains only well annotated proteins) and see if the "best hits" give us a clue about what the peptide could be. Note that if SwissProt or RefSeq yields nothing we can always search "nr", a nonredundant composite databank based on the Protein table of Entrez.

Go to http://www.ncbi.nlm.nih.gov and click on "BLAST" in the "Popular Resources" box. Then click "protein blast" to access the program page and perform the following operations:

  • In the "Enter accession number, gi, or FASTA sequence" box, do type :
        >supernatant peptide
        AASVAVDIAYEGDWKTDGFMIGVGYKF
  • You can, if you like, type in a "Job Title", although it is possible that the interface automatically puts in the fastA title line.
  • Set the "Database" selector to "Swissprot protein sequences (swissprot)"
  • Finally, click on "BLAST".

observations:

  • Depending on how many searches are being performed right now the result might appear nearly immediately or you might have to wait a long time. The NCBI uses a queuing system that allows users to retrieve the results at their convenience and which avoids the "time-out" of the Web. If the final page with the BLAST output did not appear too quickly, you will notice that that you obtain a "Request ID". Should it happen that you are waiting for a too long time, just write down this RID (or do a copy-and-paste in Notepad). You can later from the BLAST home page go to the "Recent Results" leaflet.
  • Note that the NCBI server will also on-the-fly have performed a search of your sequence against the Entrez Conserved Domains databank (see later theory and exercises about "motifs").
  • The "raw" BLAST output has been postprocessed and contains a graphical overview representing the matches between your query sequence and sequences from the databank. If you mouse over a coloured line a description line + scores appears in the box above, if you click on the coloured line you jump to the corresponding alignment (provided it is available ; a BLAST output usually reports more "hits" than alignments). Note also the many hyperlinks pointing to the Entrez Server of the NCBI ; the identifier is a hyperlink to the sequence, the coloured squares are hyperlinks to various sequence-related data (in this case you will probably only have 'G' pointing to "Gene").
  • Note the different scores : the score in bits, the "raw score" between parentheses, the Expect/E value.
  • The best hit, the protein with SwissProt ID VLOM_LAMBD (and A P03701), is coded by a gene of bacteriophage lambda and turns out to be an outer membrane protein involved in virulence. The query peptide matches the COOH-terminal part. You can see this by comparing the length of the databank sequence with the numbering on the alignment. So, we may conclude with a good degree of confidence that our peptide is a fragment of a protein coded by the unknown bacteriophage.

The NCBI VecScreen tool

Databanks sometimes contain errors. In this exercise we will prove that a human sequence from EMBL still contains some piece of cloning vector. We will use the VecScreen tool of the NCBI, which performs a BLAST search against the UniVec databank, which contains vector sequences from Genbank and some never published sequences from commercial sources, selected in such a way as to provide a nonredundant set.

You can start by taking a look at the file M34511.embl. Then, go back to the NCBI BLAST home page and follow the link "vector contamination" in the "Specialized BLAST" list.

Error creating thumbnail: Unable to save thumbnail to destination

Copy-and-paste the content of the file M34511.fasta and "Run VecScreen". In the result page you will have to click "View report". The interpretation is rather obvious : the sequence is contaminated, although the reported exon falls outside the vector remnant. Note by the way that if you used a similar tool at the EBI, which runs a BLAST against EmVec, you would not have seen the contamination, because the vector sequence is not in EMBL/GenBank/DDBJ.

selecting siRNA for "silencing" a gene

The sarcolemmal membrane-associated protein of the mouse is known to be expressed in myoblasts and is believed to play a role in the fusion of myoblasts into myotubes. Now suppose you want to test experimentally the role of this gene by performing a "silencing" experiment. More particularly, you consider using a plasmid with an siRNA gene expressed from a DNA polymerase III promoter. There exists various stand-alone software tools and Web sites that help you to design siRNAs. We will use the Web site from Ambion Applied Biosystems (Austin TX, USA), which implements the Tuschl & Elbashir criteria, to select a suitable target in SLMAP mRNA, so that there is a good chance that the siRNA really "silences" the gene.

Go to http://www.ambion.com/techlib/misc/siRNA_tools.html and follow the link "siRNA Target Finder". Paste the content of NM_032008.txt (the server only accepts "raw" sequence) in the sequence window and notch "4 or more A's or T's in a row...", then "Submit". You will get a result page that allows you to resubmit a search with modified parameters, as well as to further analyze the proposed siRNAs.

It is preferable not to select an siRNA target in the 5'-UTR or in the first 50 or 100 bases of the CDS because regulatory proteins binding to the 5'-end of the mRNA might interfere with siRNA operation. In the results page from the Ambion server you can read for each proposed siRNA the "Position in gene sequence:" and you can open the file NM_032008.refseq with the WordPad to find where the CDS is.

The next thing to do is to check that the target is unique, so that we are confident that the siRNA will "silence" only the gene of interest. Since we are only interested in mRNA ("hits" against noncoding DNA are not relevant), we need to search a complete list of all mouse mRNAs. A suitable source of information is the RefSeq databank, which contains sets of (predicted) transcripts, generated from the "raw" data in EMBL/GenBank/DDBJ.

Note that if you click on a "Blast search" link in the Ambion result page you are sent to the NCBI BLAST server, but unfortunately not where you need to be. Therefore open a new Web browser window, go to the NCBI BLAST home page, follow the link to "Mouse" in the list of "BLAST Assembled Genomes" and perform the following configurations :

  • Copy-and-paste one of the "target sequences" in the box. It should be a sequence of 21 bases long. Note that the NCBI server accepts as well sequences in fastA format as just the "raw" sequence.
  • Select "Database" "RefSeq RNA".
  • Select "Program" "BLASTN" (megaBLAST is not sensitive enough).
  • Set "Expect" to 10. Short alignments might have an E() value of more than 0.01 and here we really want to see all "hits", whether they represent a biological homology/analogy or not.

You can repeat the operation with a few siRNA targets. You will of course each time find back the SLMAP mRNA itself (RefSeq entry NM_032008); a good target should make no contiguous hit longer than 15 bases with a mRNA different from the SLMAP mRNA.