Gene Prediction

From BITS wiki

EMBOSS: translation of nucleic acid into protein

The majority of protein sequences are derived from annotated nucleic acid sequences using a translation tool.

  • Go to the EMBOSS home page
  • Search the page for translation (using Ctrl + F)
  • In the NUCLEIC TRANSLATION section in the left menu, click transeq


Of course, direct translation does not work when you start from the genomic sequence of a gene that contains introns: translation tools do not take the presence of introns into account and simply translate the intron bases along with the exons.
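
To make the intron problem concrete, here is a minimal Python sketch of what a translation tool does. It is not the EMBOSS Transeq code; the function name translate and both toy sequences are invented for this illustration. Only the codon table (the standard genetic code) is real.

```python
from itertools import product

# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate dna starting at offset frame (0, 1 or 2).

    '*' marks a stop codon, 'X' a codon with an unknown base.
    Hypothetical helper for illustration only.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Toy sequences (invented): two short exons separated by a 9-base intron.
exon1, intron, exon2 = "ATGGCC", "GTAAGTCAG", "AAATGA"
spliced = exon1 + exon2            # what the mRNA would contain
genomic = exon1 + intron + exon2   # what the genomic record contains

print(translate(spliced))   # MAK*
print(translate(genomic))   # MAVSQK* (intron bases are translated too)
```

The translator happily returns a protein for both inputs; nothing in its output tells you that the second one contains amino acids that come from an intron.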

A good alternative to the EMBOSS Transeq tool is the Expasy Translate tool.

GENSCAN to predict eukaryotic genes

Translating DNA (see previous exercise) and gene prediction are two very different things.
Translation software will translate any DNA sequence that you feed it, whether or not that sequence really is a CDS. Gene prediction software, on the other hand, looks for sound evidence (base composition, similarity to known CDSs, presence of motifs, ...) to predict the location of CDSs in a DNA sequence.
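
As a rough illustration of what "looking for evidence" means, the sketch below scans the forward strand for the simplest evidence imaginable: an ATG followed in the same frame by a stop codon. This is a toy, not what GENSCAN does. It ignores introns, the reverse strand and all of the statistical signals listed above, and the function name find_orfs and the min_codons parameter are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop ORFs.

    Coordinates are 0-based with an exclusive end; only the forward
    strand is scanned. min_codons counts the start and stop codons.
    Hypothetical helper for illustration only.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs


# Invented toy sequence: one ORF in frame 2.
print(find_orfs("CCATGGCCAAATGACC"))
```

Unlike a translator, this function can return an empty list: when it sees no evidence, it predicts nothing.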

The GenBank entry with accession number X02419 contains the sequence of the gene encoding the urokinase-type plasminogen activator. We will run gene prediction software on this sequence and see whether the software manages to find the CDS correctly. GENSCAN is one of the most popular gene-finding tools. It is used by the genome annotation pipeline at EBI, while a slightly modified version (Gnomon) is used at NCBI.
GENSCAN models eukaryotic genes with HMMs (hidden Markov models) for the following elements:

  • coding parts of eukaryotic exons
  • TATA-box
  • CAP-site
  • polyA-site

The software offers these models for human (usable for all vertebrates), Arabidopsis thaliana and maize.
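
If you have never seen an HMM at work, the following toy example may help. It decodes a sequence with a two-state HMM (an AT-rich "noncoding" state and a GC-rich "coding" state) using the Viterbi algorithm. Everything here is an assumption made for illustration: the two states, all the probabilities and the function name viterbi are invented, and GENSCAN's real model is far richer than this.

```python
import math

STATES = ("noncoding", "coding")

# All probabilities are invented for illustration; they are NOT GENSCAN parameters.
START = {"noncoding": 0.5, "coding": 0.5}
TRANSITION = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMISSION = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # AT-rich
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},     # GC-rich
}


def viterbi(seq):
    """Return the most probable state path for seq (A/C/G/T only)."""
    seq = seq.upper()
    if not seq:
        return []
    # score[s]: best log-probability of any path that ends in state s
    score = {s: math.log(START[s] * EMISSION[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANSITION[p][s]))
            new_score[s] = (score[prev] + math.log(TRANSITION[prev][s])
                            + math.log(EMISSION[s][base]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


toy = "ATATATATAT" + "GCGCGCGCGC" + "ATATATATAT"
for base, state in zip(toy, viterbi(toy)):
    print(base, state)
```

With these toy numbers a long GC-rich stretch is labelled "coding", while a GC-rich stretch of only a few bases is not, because switching state costs more than the emissions gain. That trade-off between signal strength and model structure is the basic idea behind HMM-based gene finders.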

GENSCAN is available via the Mobyle portal. Go to the Mobyle home page.

  • In the left menu expand sequence
  • Expand nucleic
  • Expand gene_finding
  • Click genscan


In the Input section:


The Mobyle server will ask for your email address; provide it and click Ok.
The Mobyle server will then ask you to validate your submission; do so and click Ok.
You can compare the result with the CDS annotation in the GenBank file of the sequence (accession X02419).
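
If you prefer to do this comparison programmatically, a sketch along the following lines could work. It parses a GenBank-style location string and checks each predicted exon against it. The helper names parse_location and compare_exons are hypothetical, and the coordinates are invented toy values, not the real X02419 annotation or real GENSCAN output. The parser only handles plain start..end spans on the forward strand.

```python
import re


def parse_location(location):
    """Parse e.g. 'join(5..20,41..70)' into 1-based inclusive (start, end) pairs.

    Simplification: partial-end markers such as '>' and complement()
    are not handled. Hypothetical helper for illustration only.
    """
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def compare_exons(predicted, annotated):
    """Label each predicted exon as 'exact', 'overlap' or 'no match'."""
    report = []
    for start, end in predicted:
        if (start, end) in annotated:
            label = "exact"
        elif any(start <= a_end and a_start <= end for a_start, a_end in annotated):
            label = "overlap"
        else:
            label = "no match"
        report.append(((start, end), label))
    return report


# Invented toy coordinates, NOT taken from X02419.
annotated = parse_location("join(5..20,41..70,90..120)")
predicted = [(5, 20), (45, 70), (200, 230)]

for exon, label in compare_exons(predicted, annotated):
    print(exon, label)
```

To use it on the exercise, you would copy the CDS location from the GenBank file and type in the exon coordinates reported by GENSCAN.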

GENSCAN also found a polyA-site in the correct location. Note that GENSCAN does not find non-coding exons, such as the first exon in the GenBank file.