DNA sequence analysis

From BITS wiki
Jump to: navigation, search
Go to parent Basic bioinformatics concepts, databases and tools#Exercises_during_the_training

Gene regulation via Ensembl

At lot of sequence analysis has already been done for you in specialized sequence databases such as Ensembl. Go to the region from bp 52,000,000 to 52,200,000 on human chromosome 4.


Ensembl14b.png

  • Zoom in on the SGCB transcript, including a bit of flanking sequence on both sides.
  • In the middle graphic select the Drag/Select Ensembl31.png button and draw with your mouse a box around the SGCB transcript. If you don't see the button, it means that Drag/Select has already been activated.
  • Click Jump to region in the pop-up menu


Ensembl14c.png

CpG islands are genomic regions that contain a high frequency of CG dinucleotides and are often located near the promoters of mammalian genes. There are tools that predict potential CpG islands based on nucleotide composition. These predictions can be visualized in Ensembl.

Possible transcription start sites can be predicted using the Eponine tool. Eponine predictions are one of the tracks available in Ensembl.

An important mechanism of gene regulation is the accessibility of the DNA. For human and mouse Ensembl provides several tracks for visualizing the status of the chromatin, based on data from the ENCODE project.

This project aims to experimentally characterize all functional elements in the human and mouse genome: TFBS, TSS, open chromatin regions, non-protein-coding transcripts... The project obtains this information among others from large scale ChIP-Seq or DNA-Seq experiments like the sequencing of DNA regions that were cut by DNase to identify closed and open chromatin regions in the DNA.

Similarly you can also define TSS as the regions that bind to the RNA polymerase. Encode performed ChIP-Seq experiments on various cell type samples to identify DNA regions that bind to RNA polymerase.

Finally, DNA accessibility can also be regulated by modification of histone proteins. DNA winds itself around these proteins and as such they regulate the accessibility. If the DNA is tightly wound aroound them, it is not accessible. If the DNA loosens from the histones it becomes accessible. how well the DNA can bind to the histones is determined by the structure of the histones: i.e. which modifications are on the surface of the histone proteins ? Methyl groups, acyl groups or other modifications all influence the binding of the DNA .

ENCODE performed ChIP-Seq experiments with antibodies that target modified histone proteins to identify the DNA regions that are bound to them.

Finally, you can also search for binding sites of specific TFs in Ensembl, again based on ENCODE data.


RNA processing via miRWalk

MiRWalk is a database that contains information on miRNAs from Human, Mouse and Rat. For each miRNA, it contains predicted as well as experimentally validated binding sites in their target genes.

In a mouse study, you find that the expression of TIMP3 influences the development of atherosclerosis. A decreased expression of TIMP3 stimulates inflammation of the arterial endothelium and triggers atherosclerosis.

You have searched for TFBS for TFs that might be the cause of the TIMP3 repression but to no avail. Maybe the expression of TIMP3 is regulated via other mechanisms, like miRNA induced degradation.

For decades, scientists have assumed that miRNAs bind to sequences in the 3'UTR of target genes.

You spend a lot of time checking if these miRNA are the cause of the decreased TIMP3 expression in atherosclerosis but you find no link at all. Quite frustrated you decide to perform a final analysis with an open mind. Although everybody states that miRNAs target 3'UTRs, you decide to scan the CDS for potential targets.

You go back to the lab and find that atherosclerosis activates mir-172, which in its turn decreases the expression of TIMP3, thereby worsening the atherosclerosis.

Handicon.png It is important to search miRNA binding sites within the complete sequence (promoter, 5'-UTR, CDS and 3'-UTR) of a gene !


Find regulatory motifs using MotifLab

We have identified a number of genes that respond to DNA damage and stress, initiating cell-cycle arrest or triggering apoptosis. Download a list with the Ensembl IDs of the potential targets. The first thing you need to do is creating a new protocol (=analysis). I want to know which TFs might be responsible for the observed upregulation of these genes in response to stress.

If you want to repeat your analysis later, you can document everything you do now via the record function.

The first thing you have to do is to import the sequences of the promoters you want to analyze.

Next you need one or multiple motifs. Jaspar and the public part of Transfac are by default in MotifLab but you can add your own or other public motif collections.

DNA sequences contain repeat regions that might interfere with the motif searches. It is recommended to remove repeat regions from your sequences. MotifLab allows you to do so.

Motif searches always generate an enormous amount of false positives (regions in the DNA sequence that are similar to the motif but do not actually bind to the TF). You can reduce the number of false positives by focussing on regions in the promoter that are conserved between species. This is called phylogenetic footprinting. It is assumed that regions that are functionally important (like TFBS) do not change during evolution because if they change, their function would be lost.

MotifLab allows you to add feature annotation tracks to display and use annotation of the sequences, like the presence of conserved regions, repeats, open chromatin regions...

We will use the following tracks: Conservation and RepeatMasker

Now you can use the annotation to remove motifs that are located in repeat regions or in regions with low conservation.

You can do the same with other annotations e.g. remove motifs in closed chromatin regions...

You should, however, when you do this for real, use a better algorithm than the Simple Scanner. You can for instance download and add Clover.

Finding enriched TF binding motifs in a list of genes via Pscan

We performed additional experiments to find additional genes that show the same expression pattern as the ones in the previous exercise, so we now have an extended list of coexpressed genes.

The p-value should be read as: "If we take as many random genes from the same organism as in the input set, and we select their corresponding promoter regions, what is the probability of having the same score as obtained in the input set?"

However, Pscan checks 130 matrices (or 280 if you select Transfac) so 130 independent statistical tests are performed. This means you have to correct for this if you do not want to end up with a long list of false positives (the p-value implies that the TFBS is enriched while in fact it is not).

Pscan uses a Bonferroni correction, meaning that the typical significance threshold (0.05 or 0.01) is divided by the number of tests (in our case: 130). This means that a significance threshold of 0.05 becomes 0.0003 after Bonferroni correction. When you use this value as significance threshold, the score for HFN1a is not significant. However, it should be noted that if you do so many t-tests the Bonferroni correction is not the ideal choice: it's too stringent in these cases.


In the exercises above we searched for known TF binding motifs but often you do not know the motif that regulates the expression of your genes of interest and you have to search for unknown motifs.
There are several tools to solve the complex problem of de novo identification of motifs that are enriched in a list of sequences. The wikipedia page has a very nice overview on these tools (see [1]).

What you need as an input is a list of promoters of which you are more or less certain that they all bind to the same TF, you just don't know which TF. If you have such promoter sequences you can search for common motifs that occur in all promoters.

Besides the very popular MEME suite, you can also use one of the RSAT tools for this.

You have to do the following steps to do the analysis:

  • Get promoter sequences using the RSAT tool retrieve Ensembl seq or retrieve sequence
  • Analyse them for common motifs using one of the RSAT tools under Motif discovery


EMBOSS: translation of nucleic acid into protein

The majority of protein sequences are derived from annotated nucleic acid sequences using a translation tool.

  • Go to the EMBOSS home page
  • Search the page for translation (using Ctrl + F)
  • In the NUCLEIC TRANSLATION section in the left menu, click transeq


EMBOSS1.png

Of course, this doesn't work when you start from a genomic sequence of a gene that contains introns. Translation tools do not take into account the presence of introns.

A good alternative for the EMBOSS Transeq tool is the Expasy Translate tool.

GENSCAN to predict eukaryotic genes

Translating DNA (see previous exercise) and gene prediction are two very different things.
The translation software will translate any DNA sequence that you feed it regardless of whether the DNA sequence really is a CDS or not. Gene prediction software on the other hand looks for sound evidence (base composition, similarity to known CDS, presence of motifs...) to predict the location of CDS in a DNA sequence.

The Genbank entry with accession number X02419 contains the sequence of the gene encoding the urokinase-type plasminogen activator. We will run gene prediction software on the sequence and see if the software manages to correctly find the CDS. GENSCAN is one of the most popular tools for gene searching. It is used by the genome annotation pipeline at EBI, while a slightly modified version (Gnomon) is used at NCBI.
GENSCAN models eukaryotic genes with HMMs (hidden markov models) for the:

  • coding parts of eukaryotic exons
  • TATA-box
  • CAP-site
  • polyA-site

The software offers these models for human (usable for all vertebrates), Arabidopsis thaliana and maize.

GENSCAN is available via the Mobyle portal. Go to the Mobyle home page.

  • In the left menu expand sequence
  • Expand nucleic
  • Expand gene_finding
  • Click genscan


Mobyle1.png

In the Input section:


Mobyle2a.png

The Mobyle server will ask for your email address, provide it and click Ok.
The Mobyle server will ask you to validate your submission, do so and click Ok.
You can compare the result with the annotation in the Genbank file of the sequence: it has Genbank accession X02419.

GENSCAN also found a polyA-site in the correct location. Note that GENSCAN does not find non-coding exons like the first exon in the Genbank file.

Selecting primers for PCR

Designing regular PCR primers using Primer3

The RefSeq entry NM_079400 contains the sequence of the D. melanogaster mRNA coding for tap, the target of Poxn. Tap encodes a bHLH protein expressed in larval chemosensory organs and involved in the response to sugar and salt. We wish to amplify by PCR the region coding for the HLH motif.

The Primer3 program can help to select suitable primers.

Go to Primer3 [1] or use the nicer interface at Primer3Plus [2] (in the solution below Primer3Plus was used).

Handicon.png Note the many configurable parameters of Primer3!

In the results page you can see the following:
The program has selected 5 primer pairs, with the best at the top.
For each primer you can find:

  • the location on the sequence where the first base of the primer binds (for the forward/left primer this is on the complementary strand)
  • its length
  • its melting temperature
  • its %GC
  • the score of an alignment of the primer with the other primer and with itself

The program first considers all possible primers and retains those that satisfy a series of criteria, then considers all possible pairs of these primers and retains those who satisfy a series of criteria. At the bottom of the page you can see information about how many primers were considered and rejected/retained.

Look at the 8 highest scoring hits for the forward primer.

So it's always a better idea to Blast against Refseq sequences unless they are not available yet for the organism you work with or you have reasons to believe that they are not complete (i.e. they do not represent the full genome). However, for model organisms, full chromosome sequences are available so you can use Refseq for BLAST.

Go to the results of the reverse primer.


BLAST34.png

When evaluating the specificity of the primers, it is especially important to check that the primers are specific at the 3' end because that's the site where the polymerase will attach nucleotides.
So generally, it is recommended to not use primers that contain long identical stretches (>15 nt for primers of 20 nt long) to other regions in the genome apart from the one you want to amplify/measure, and certainly not if these stretches comprise the last nucleotide at the 3' end of the primer.
When you look at the alignment of the first hit on chromosome X of the reverse primer you see that this is a borderline case: an identical stretch of 15nt including the nucleotide at the 3' end !


BLAST35.png

Designing regular PCR primers using PrimerBLAST

It's a lot of work to design primers using Primer3/CLC Main Workbench especially if you discover during the BLAST that the suggested primers are not specific and you have to start all over again.

Therefore, we will now design primers using the PrimerBLAST tool [3] of NCBI. PrimerBLAST uses the Primer3 algorithm to find primers allowing you to set all the parameters related to effciency of the primers (e.g. secondary structure and dimer formation, Tm, difference in Tm...). On top of this, PrimerBLAST uses the BLAST algorithm to find specific primers, that will only bind to the intended target. It is the only tool that incorporates the BLAST in the primer search.

For some applications, it is vital to have specific primers. This is for instance the case when you use PCR for quantification (qPCR) or when you want to amplify a commonly occurring sequence from the genome.

As an example, we will design primers to amplify a region of RefSeq entry NM_079400, which represents the D. melanogaster mRNA coding for tap, the target of Poxn. Tap encodes a bHLH protein expressed in larval chemosensory organs and involved in the response to sugar and salt. We wish to amplify the region encoding the Helix-loop-helix domain by PCR. In the nucleotide sequence of the RefSeq record, the domain is located between position +577 and +745.

So, let's go to the PrimerBLAST page.

Remember that PrimerBLAST also allows you to set the parameters related to specificity of the primers.

Only primers that fulfill the specificity criteria that you have set will be shown.

Click the Get Primers button (purple) to start the search. After a few seconds the search returns 5 specific high quality primer pairs:


PrimerBLAST3.png

Knowing that PrimerBLAST uses the same algorithm to pick primers as Primer3 [4], PrimerBLAST will save you a lot of work and time when you have to design primers.

Designing primers for qPCR

Exercises on qPCR primer design can be found on our primer design wiki page.


References:
  1. http://frodo.wi.mit.edu/
  2. http://primer3plus.com/cgi-bin/dev/primer3plus.cgi
  3. http://www.ncbi.nlm.nih.gov/tools/primer-blast/index.cgi?LINK_LOC=BlastHome
  4. http://www.ncbi.nlm.nih.gov/tools/primer-blast/primerinfo.html