Ensembl
The Ensembl Genome browser
Ensembl is a joint project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute that annotates chordate genomes. Gene sets from model organisms such as yeast and fly are also imported by the Ensembl ‘compara’ team for comparative analysis. Most annotation is updated every two months, and each update is published as a new, incrementally numbered Ensembl version.
Go to Ensembl.
Exercise 1: human F9
Search for the human F9 (Coagulation factor IX Precursor) gene.
- Select the Human genome to search in
- Search for F9
- Click Go
Click the F9 (Human gene) link to go to the gene page of F9.
On which chromosome and which strand of the genome is this gene located ? |
---|
In the Location section you can see that the gene is located on the forward strand of chromosome X.
|
How many transcripts (splice variants) does this gene have ? |
---|
In the About this gene section above the transcript table you can see that there are four transcripts for this gene. Only two of them are protein coding. |
Show the transcript table if it is not already displayed.
How many CCDS are annotated for this gene ? |
---|
CCDS stands for consensus coding sequence. The coding sequences in this set have been agreed upon by Ensembl, UCSC and NCBI. In the Transcript table you see that the first two transcripts, F9-201 and F9-202, each have a CCDS. |
What is the name of the longest transcript ? |
---|
The first transcript, F9-201, is the longest. |
How long is the protein it encodes ? |
---|
Transcript F9-201 encodes a protein of 461 amino acids. |
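
The same question can also be answered from a script. The sketch below assumes a gene record shaped like the JSON returned by the Ensembl REST API lookup endpoint (`lookup/symbol/homo_sapiens/F9` with `expand=1`): a `Transcript` list in which every entry has a `biotype` and an `Exon` list with `start` and `end` coordinates. The helper names and the toy record are hypothetical, and the coordinates are made up for illustration; they are not real F9 values.

```python
def spliced_length(transcript):
    """Sum the exon lengths (coordinates are 1-based and inclusive)."""
    return sum(exon["end"] - exon["start"] + 1 for exon in transcript["Exon"])


def longest_transcript(gene, biotype=None):
    """Return the longest transcript, optionally restricted to one biotype."""
    candidates = [
        t for t in gene["Transcript"]
        if biotype is None or t.get("biotype") == biotype
    ]
    return max(candidates, key=spliced_length)


# Toy record with made-up coordinates (NOT real F9 data).
toy_gene = {
    "display_name": "GENE1",
    "Transcript": [
        {"display_name": "GENE1-201", "biotype": "protein_coding",
         "Exon": [{"start": 100, "end": 199}, {"start": 300, "end": 449}]},
        {"display_name": "GENE1-202", "biotype": "protein_coding",
         "Exon": [{"start": 100, "end": 199}]},
        {"display_name": "GENE1-203", "biotype": "retained_intron",
         "Exon": [{"start": 100, "end": 449}]},
    ],
}

best = longest_transcript(toy_gene, biotype="protein_coding")
print(best["display_name"], spliced_length(best))
```

Restricting the search to `protein_coding` mirrors the exercise, where only the protein-coding transcripts are of interest; without the filter a non-coding transcript could be reported as the longest.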
Compare the sequence of the longest and the shortest protein-coding transcript. |
---|
In the left menu select Transcript comparison in the Summary section.
Scroll down and click the Select transcripts button in the left menu.
Select the transcripts you want to view, e.g. the protein-coding ones. Close the window (your selections are saved automatically).
You then see the alignment of the two splice variants. They have the same ATG, but transcript 201 has a 5'UTR while the other transcript does not.
Ensembl shows the entire sequence, introns included. If you want to hide the introns, click the Configure this page button in the left menu, select Show exons only and close the window.
Now the differences between the sequences are easier to see, e.g. around position 10000 F9-201 has an exon that is absent in F9-202. |
To download sequences choose Export data via the menu on the left.
Download the genomic, all transcript, all CDS and all protein sequences of F9 in FASTA format |
---|
Go to the left menu and click Export data (see slides).
Click Next and click HTML to view the sequences in your browser (click Text if you want to store them in a text file on your computer).
|
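
The same sequences can be requested without the browser. The sketch below only builds the request URLs; it assumes the Ensembl REST API `sequence/id` endpoint with its `type` and `multiple_sequences` parameters and FASTA output selected via `content-type`. The function name and the `ENSG_PLACEHOLDER` identifier are hypothetical; substitute the stable ID shown on the F9 gene page.

```python
from urllib.parse import urlencode

REST_SERVER = "https://rest.ensembl.org"
SEQUENCE_TYPES = {"genomic", "cdna", "cds", "protein"}


def sequence_url(stable_id, seq_type="genomic", all_transcripts=False):
    """Build a FASTA request URL for one stable ID (no network access here)."""
    if seq_type not in SEQUENCE_TYPES:
        raise ValueError(f"unknown sequence type: {seq_type}")
    params = {"type": seq_type, "content-type": "text/x-fasta"}
    if all_transcripts:
        # Ask for one sequence per transcript/protein of the gene.
        params["multiple_sequences"] = 1
    return f"{REST_SERVER}/sequence/id/{stable_id}?{urlencode(params)}"


gene_id = "ENSG_PLACEHOLDER"  # hypothetical; use the real gene ID
print(sequence_url(gene_id, "genomic"))
for seq_type in ("cdna", "cds", "protein"):
    print(sequence_url(gene_id, seq_type, all_transcripts=True))
```

Each printed URL can be opened in a browser or fetched with any HTTP client; keeping URL construction separate from the download makes the request easy to inspect before it is sent.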
Click on the Transcript ID of transcript F9-201 to go to the Transcript page.
How many exons does this transcript have ? |
---|
In the Transcript summary section you can see that the transcript has 8 exons, all of them coding.
|
The transcript page also has a menu on the left that looks similar but has different content.
Display the sequences of the exons |
---|
In the left menu under Sequence click Exons.
This opens the color-coded sequence of all the components of the F9 gene:
|
You can use Configure this page to change the graphical properties of the information that is shown.
Display the full sequences of the introns if this is not yet the case. |
---|
Click Configure this page and check Show full intronic sequence. |
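
Introns are simply the gaps between consecutive exons, so a transcript with 8 exons has 7 introns. A minimal sketch of that relationship, using a hypothetical helper and made-up exon coordinates (1-based, inclusive) rather than the real F9 exons:

```python
def introns_from_exons(exons):
    """Return the (start, end) intervals lying between consecutive exons."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only report a gap when the exons are not directly adjacent.
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Made-up exon coordinates for illustration only.
toy_exons = [(300, 449), (100, 199), (600, 650)]
print(introns_from_exons(toy_exons))  # [(200, 299), (450, 599)]
```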
The Ensembl genome browser strives to display many layers of genome annotation in a simplified view. Click the Location tab to go to the genome browser page.
Which type of gene is located directly upstream of F9 ? |
---|
Go to the Location tab. On the Location page you can see various views of the gene. In the middle view you can see the gene in its genomic neighbourhood. The gene directly upstream of F9 (located at its left) is called SRD5A1P1 and its grey colour reveals that it is a pseudogene (gene that has lost its coding function during evolution). It looks like a gene but it is no longer expressed because mutations have introduced premature stop codons. |
The location page has three views of the gene and its genomic surroundings, the one at the bottom being the most detailed view.
On which cytogenetic band and on which contig is the F9 gene located ? |
---|
In the most detailed view you can see that the F9 gene is located on cytogenetic band q27.1 in the Chromosome bands track and on contig AL033403.1 in the Contigs track. |
Each type of information on the figures is displayed in a track. In the Genes track in the bottom view you can see the structure of the three F9 transcripts. The one coloured in gold is a golden transcript, which means that both Ensembl and HAVANA have predicted this transcript. In the transcripts you can see the location of:
- coding exons: filled boxes
- introns: lines
- UTRs: empty boxes
In the most detailed view you can click all components to show extra info and links.
Is the blue transcript predicted by Ensembl or by HAVANA ? |
---|
Clicking the blue transcript opens a box where you can see (in the Analysis section) that this is a HAVANA transcript. |
The CCDS set track shows the CCDS for this gene. Zoom in on the sixth exon:
- draw a red box around the exon with your mouse
- select Jump to region
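
Instead of drawing a box, you can also type a region of the form `chromosome:start-end` in the location field. The sketch below builds such a string with some flanking sequence on both sides of an exon. The function name is hypothetical and the coordinates are made up; they are not the real position of the sixth F9 exon.

```python
def region_around(chrom, start, end, flank=0):
    """Format 'chrom:start-end', widened by `flank` bases on each side."""
    if start > end:
        raise ValueError("start must not be larger than end")
    padded_start = max(1, start - flank)  # coordinates start at 1
    return f"{chrom}:{padded_start}-{end + flank}"


# Made-up exon coordinates for illustration only.
print(region_around("X", 1500, 1628, flank=200))  # X:1300-1828
```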
You can add tracks to the figure by clicking Configure this page in the left menu.
Activate the Marker track to find primers to uniquely amplify the region containing the sixth exon |
---|
STS markers can be used to uniquely amplify and represent regions in a genome. To activate the Marker track, click Configure this page in the left menu and switch on the Marker track.
Now the Marker track is added to the figure at the bottom. You can see that the exon is located at the site of a marker. Click the ID of the marker and click Marker info.
You are redirected to the Ensembl marker page where you see the primer info of this marker. |
At the bottom of the left panel of the Configure this page window you can save and load track configurations. If you want to go back to the default track settings you can click Reset configuration. You can delete tracks by clicking in the figure on the name of the track and then clicking the X.
Go back to the Gene page. By default, a gene summary is shown on the Gene page but there is a lot more info available via the left menu.
Find a link to WikiGenes, an external database, to see if F9 can be used for prenatal diagnosis of hemophilia |
---|
Click External references in the left menu. |
You also see links to UniProt, NCBI's Gene (EntrezGene) and UniGene databases...
Ensembl Genomes
The Ensembl Genomes database contains the sequences and annotation of organisms that are not covered by Ensembl, such as bacteria, plants, fungi and more. When you go to the Ensembl Genomes website you can select the taxon that you're interested in.
Exercise 2: Ensembl Plants
Click the Plants taxon link to the EnsemblPlants database. The EnsemblPlants home page contains an overview of all supported plants with links to their corresponding genome pages. The genome pages show general info on the genome, e.g. ploidy, length of the sequence and assembly details, and let you download the sequence and annotation.
Who provided the Rice annotation? |
---|
On the EnsemblPlants home page you can find rice (red) in the category popular genomes. If you cannot find the genome you need here, you can click View full list of all Ensembl Plants species (green) to get a complete overview of all represented plant species.
Click the rice link (red) to the home page of the rice genome. Here you can see that the Provider of the annotation is RAP-DB (red).
Note that you can download the genome in FASTA format (blue). This can be helpful when you're analyzing NGS data and you need to map your reads ! |
We now go to the Arabidopsis homepage.
Recently, our colleagues from PSB unraveled a molecular switch that controls stress response in Arabidopsis, opening the door for breeding plants with greater tolerance for stress. They found that transcription factor MYB29 is a negative regulator of mitochondrial stress response. Deletion of the gene leads to increased sensitivity to light and drought stress.
The user interface of EnsemblPlants is identical to the user interface in Ensembl so it should be easy to search the gene in EnsemblPlants.
What is the size of the MYB29 transcript ? |
---|
Type the name of the gene in the search box and click Go. You are redirected to the results summary page. Click the TAIR Gene ID (the ID that starts with AT, red) to open the gene page of MYB29; the Location and Transcript tabs are opened as well. In the transcript table on the gene page you can see that the size of the transcript is 1731 bp. |
Ensembl Plants Gene pages have a left menu that is identical to the one in Ensembl.
What is the molecular function of MYB29 ? |
---|
In the Ontologies section of the left menu of the gene page you can click GO: Molecular function.
When you click this link the Gene Ontology annotation is loaded in the main window. You can see that the MYB29 protein is a transcription factor that binds DNA. |
Of course, this knowledge is only useful when we can extrapolate it to crops. Arabidopsis is just a weed that we don't eat, so improving stress tolerance in Arabidopsis is not what we want in the long run.
Use the links in the left menu to search for crops that contain a homolog of MYB29. |
---|
In the left menu of the gene page you can select Gene tree in the category Plant compara.
You can see that close homologs are found in Brassica species (oilseed rape, cabbage, broccoli...) but there are also homologs in soybean (Glycine max) and grape (Vitis vinifera)... |
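
The list of species with a homolog can also be extracted from a script. The sketch below assumes a response shaped like the JSON of the Ensembl REST API homology endpoints: a `data` list whose entries hold a `homologies` list, each with a `type` and a `target` that names a `species`. That shape is an assumption to verify against the REST documentation; the function name is hypothetical, and the toy response uses placeholder species and IDs, not real MYB29 results.

```python
def homolog_species(response, homology_types=None):
    """Collect the target species, optionally keeping only some homology types."""
    species = set()
    for entry in response.get("data", []):
        for homology in entry.get("homologies", []):
            if homology_types and homology.get("type") not in homology_types:
                continue
            species.add(homology["target"]["species"])
    return sorted(species)


# Toy response with placeholder species and IDs (NOT real results).
toy_response = {
    "data": [{
        "id": "TOY_QUERY_GENE",
        "homologies": [
            {"type": "ortholog_one2one",
             "target": {"species": "species_a", "id": "TOY_GENE_1"}},
            {"type": "ortholog_one2many",
             "target": {"species": "species_b", "id": "TOY_GENE_2"}},
            {"type": "within_species_paralog",
             "target": {"species": "species_query", "id": "TOY_GENE_3"}},
        ],
    }]
}

orthologs_only = {"ortholog_one2one", "ortholog_one2many"}
print(homolog_species(toy_response, orthologs_only))
```

Filtering on the ortholog types drops paralogs within the query species, which is usually what you want when looking for the corresponding gene in a crop.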
If you want to visualize the actual alignments you can click Genomic alignments in the section Plant compara.
Open the genomic alignment of Arabidopsis - Brassica rapa. |
---|
The genomic alignments consist of a set of pairwise alignments. Select the alignment with Brassica rapa and click Go to view the alignment. The MYB29 sequence aligns to two different sections in the Brassica genome, probably due to a duplication in the Brassica genome. A part of 800 bp of MYB29 maps to chromosome A03 in Brassica and an overlapping part of 2000 bp maps to chromosome A10. Click Block 1 to visualize the alignment. The red nucleotides represent exons, the black ones introns or intergenic sequence.
|
Export is done in exactly the same way as in Ensembl.
Download the CDS of MYB29 in FASTA format. |
---|
On the gene page you can select Export data under the Gene-based displays box.
This opens the export window. Keep all default settings except those in the Options for Fasta sequence category:
In the next Export data window click Text to open the sequence in FASTA format in your browser. |
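
Once the FASTA text is saved, it is easy to process further. A minimal sketch of a FASTA reader, with a hypothetical function name and a made-up toy sequence rather than the real MYB29 CDS:

```python
def parse_fasta(text):
    """Return a dict mapping each record name to its joined sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after '>'.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


# Made-up toy record for illustration only.
toy_fasta = ">toy_cds example record\nATGGCT\nTGA\n"
sequences = parse_fasta(toy_fasta)
for record_name, sequence in sequences.items():
    # A complete CDS has a length that is a multiple of three.
    print(record_name, len(sequence), len(sequence) % 3 == 0)
```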
Visualizations on the location page are also identical to those in Ensembl.
Visualize the sequence variation in this gene. |
---|
For this you need to go to the location tab. Click Configure this page under the Location-based displays box.
You have to wait a few moments so that the image can be adjusted.
When you zoom in a lot you can see the individual nucleotides. You can get more information on a variant by simply clicking it (red):
|
If the plant that you work on is not represented in Ensembl Genomes, you might consider taking a look in these alternative plant sequence databases: