Gene Regulation


*Gene regulation via Ensembl

A lot of sequence analysis has already been done for you in specialized sequence databases such as Ensembl. Ensembl provides many tracks with information on gene expression and its regulation, especially for human and mouse (based on data from the ENCODE project).

These tracks can be found in the Regulation section and consist of genomic regions that could be involved in gene transcriptional regulation, inferred from publicly available experimental data sets, mainly DNase-seq and ChIP-seq experiments (see slides). Ensembl has a pipeline to analyze these data sets across different cell types.

DNA methylation

CpG islands are genomic regions that contain a high frequency of CG dinucleotides and are often located near the promoters of mammalian genes. There are tools that predict potential CpG islands based on nucleotide composition. These predictions can be visualized in Ensembl. However, Ensembl also provides tracks on hyper- and hypo-methylated CpGs from the ENCODE project based on experimental Bisulfite Sequencing data of 47 cell lines.

The CpG islands track contains computational predictions of CpG islands.
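
Predictions of this kind typically rest on simple sequence statistics. The following is a minimal sketch, assuming the commonly cited thresholds of a length of at least 200 bp, a GC fraction of at least 0.5 and an observed/expected CpG ratio of at least 0.6. The function names are illustrative, and the predictor behind the Ensembl track may use different settings.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # number of CG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CG count if C and G were distributed independently
    expected = (c * g) / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed thresholds to one candidate region."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return (len(seq) >= min_len
            and gc_fraction >= min_gc
            and obs_exp >= min_obs_exp)
```

A real predictor slides a window along the genome and merges overlapping windows that pass the thresholds; the sketch only evaluates a single region.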

The DNA methylation track in the Regulation section of tracks visualizes the experimentally determined methylation status of CpGs.

DNA accessibility

An important mechanism of gene regulation is the accessibility of the DNA. Ensembl provides several tracks for visualizing the status of the chromatin, based on ENCODE data from large scale sequencing of DNA regions that were cut by DNase. DNase preferentially cuts DNA in open chromatin regions, so the sequenced fragments mark accessible DNA.

Histone modifications

DNA accessibility can also be regulated by modification of histone proteins. DNA winds itself around these proteins and as such they regulate the accessibility. If the DNA is tightly wound around them, it is not accessible. If the DNA loosens from the histones, it becomes accessible. How well the DNA can bind to the histones is determined by the structure of the histones, i.e. by the modifications that are present on their surface. Methyl groups, acetyl groups and other modifications all influence the binding of the DNA.

ENCODE performed ChIP-Seq experiments with antibodies that target modified histone proteins to identify the DNA regions that are bound to them.

Methylation of the fourth amino acid residue (K) from the N-terminus of histone H3 is one of the most studied histone modifications, and with good reason: it is tightly associated with the promoters of active genes. Like all lysine residues, H3K4 can be mono-, di- or trimethylated. Of these, H3K4me3 is the mark most strongly associated with active expression.

Transcription start sites

Possible transcription start sites can be predicted using the Eponine tool. The Eponine predictions track is available in Ensembl.

Similarly, you can define TSSs as the regions that are bound by RNA polymerase. ENCODE performed ChIP-Seq experiments on samples of various cell types to identify DNA regions that bind to RNA polymerase.


Finally, you can also search for binding sites of specific TFs in Ensembl, again based on ENCODE data.

*RNA processing via miRWalk

miRWalk is a database that contains information on miRNAs from human, mouse and rat. For each miRNA, it contains predicted as well as experimentally validated binding sites in its target genes.

In a mouse study, you find that the expression of TIMP3 influences the development of atherosclerosis. A decreased expression of TIMP3 stimulates inflammation of the arterial endothelium and triggers atherosclerosis.

You have searched for TFBS for TFs that might be the cause of the TIMP3 repression but to no avail. Maybe the expression of TIMP3 is regulated via other mechanisms, like miRNA induced degradation.

For decades, scientists have assumed that miRNAs bind to sequences in the 3'UTR of target genes.

The results page contains:

  • Gene information: a list with links for the gene to retrieve data including genomic location, synonyms, Refseq IDs, homology information, external links, information on protein families, functional annotation...
  • Putative miRNA-target interaction information: for each algorithm you selected a list with potential miRNA binding sites within the sequences of the regions you searched in. If you searched in a region, you can click its link: Promoter, 5'-UTR, CDS and 3'-UTR.

When you click the results for the 3'UTR, you indeed get a list of miRNAs that in theory could bind to the 3'UTR of TIMP3.

You spend a lot of time checking if these miRNA are the cause of the decreased TIMP3 expression in atherosclerosis but you find no link at all. Quite frustrated you decide to perform a final analysis with an open mind. Although everybody states that miRNAs target 3'UTRs, you decide to scan the CDS for potential targets.

miRWalk allows you to search in any component of a gene: 3'UTR, CDS, 5'UTR and promoter.
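
To make the idea of scanning every gene region concrete, here is a minimal sketch of a seed-match search. It assumes that a candidate site is a perfect reverse complement of miRNA nucleotides 2-8 (the seed); this is a common simplification and not necessarily what the algorithms behind miRWalk do. The miRNA and the region sequences are made-up toy data, not real mmu-mir-712 or TIMP3 sequences.

```python
def reverse_complement(rna):
    """Reverse complement of an RNA string (A, C, G, U only)."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(rna))


def seed_sites(mirna, target, seed_start=2, seed_end=8):
    """Return 0-based start positions in target that pair perfectly with the seed."""
    mirna = mirna.upper().replace("T", "U")
    target = target.upper().replace("T", "U")
    seed = mirna[seed_start - 1:seed_end]  # nucleotides 2-8 by default
    site = reverse_complement(seed)
    positions = []
    start = target.find(site)
    while start != -1:
        positions.append(start)
        start = target.find(site, start + 1)
    return positions


# Hypothetical toy data: one miRNA and the four regions of one gene
toy_mirna = "UACGGAUCCAGUAGCUAGCUAA"
toy_regions = {
    "promoter": "AAAAAAAAAAAA",
    "5'-UTR": "CCCCCCCCCCCC",
    "CDS": "AAGAUCCGUAAC",
    "3'-UTR": "UUUUUUUUUUUU",
}
hits_per_region = {name: seed_sites(toy_mirna, seq)
                   for name, seq in toy_regions.items()}
```

In this toy example only the CDS contains a seed match, which is exactly the kind of site that a search restricted to the 3'UTR would miss.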

You get a list of miRNAs that in theory could bind to the CDS of TIMP3. One of them is mmu-mir-712 (red).


  • The Gene column contains links to NCBI's Gene database.
  • The RefSeq ID column contains links to NCBI's RefSeq database.
  • The miRNA column contains links to miRBase.
  • Note the last column containing p-values!

You go back to the lab and find that atherosclerosis activates mir-712, which in its turn decreases the expression of TIMP3, thereby worsening the atherosclerosis.

It is important to search for miRNA binding sites within the complete sequence (promoter, 5'-UTR, CDS and 3'-UTR) of a gene!

Now we want to know if mmu-mir-712 has other targets. For this we need to do the inverse analysis: start from a miRNA and predict which genes it targets. Fortunately, miRWalk can perform both analyses.

You get a list of CDS that can bind to mmu-mir-712, one of them is Pla2g6 (red).


You search PubMed for a link between this gene and atherosclerosis and bingo: 10 publications mention both the name of this protein and atherosclerosis.

Finding PSMs of TFBS


TRANSFAC is a database on eukaryotic regulatory DNA elements and transcription factors. Most of its content is commercial (you have to pay for access), but there is a free public version. The public version is quite old, but TRANSFAC data is considered high quality, so it is always worthwhile to take a look. It consists of the following tables:

  • FACTOR: proteins that regulate transcription by interaction with motifs in the DNA and miRNAs that control translation of mRNA
  • GENE: genes in which TF-binding sites or ChIP fragments were found and/or genes encoding for TFs or miRNAs
  • SITE: individual TF binding sites in eukaryotic genes
  • MATRIX: position score matrices for the TF binding sites
  • ...
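
To illustrate what a position score matrix in the MATRIX table represents, here is a minimal sketch that compiles a log-odds matrix from a handful of aligned sites and uses it to score a sequence. The sites are made-up toy data, the uniform background of 0.25 and the pseudocount of 1 are assumptions, and TRANSFAC's own matrices and scoring schemes may be defined differently.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Compile a log-odds matrix (one dict per position) from aligned sites.

    Assumes all sites have the same length and contain only A, C, G, T.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(length):
        column = [site[i].upper() for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores for one window of matrix width."""
    return sum(column[base] for column, base in zip(pwm, window.upper()))


def best_hit(pwm, sequence):
    """Return (score, 0-based start) of the best-scoring window in sequence."""
    width = len(pwm)
    hits = [(score(pwm, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits)


# Hypothetical binding sites, as they might be collected from small scale experiments
toy_sites = ["TGACGTCA", "TGACGTAA", "TGACGCAA", "TTACGTCA"]
toy_pwm = build_pwm(toy_sites)
```

Positions where all sites agree contribute high scores, while variable positions contribute little, which is why a matrix describes a binding site better than a single consensus string.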

The matrices were obtained from genome-wide experiments or compiled from individual sites that were identified in small scale experiments. To get sufficient sites for the compilation, sites from related organisms (e.g. vertebrates, plants, insects...) are combined. Some matrices were constructed by computational methods. Each matrix identifier starts with a letter that indicates one of six groups of biological species:

  • V$: vertebrates
  • I$: insects
  • P$: plants
  • F$: fungi
  • N$: nematodes
  • B$: bacteria

The species identifier is followed by an acronym of the TF and

  • a number discriminating between matrices for the same TF e.g. V$OCT1_02 is the second matrix for vertebrates Oct-1 TF
  • _Q followed by a quality score for matrices generated by compilation of TRANSFAC SITE entries; the score is the lowest quality of the sites used to construct the matrix, e.g. V$CREB_Q2 is a matrix constructed from CREB binding sites of quality 2 or better (see below for an explanation of the quality scores)
  • _C for matrices generated by compilation of consensus sequences using ConsIndex (Frech et al., Nucleic Acids Res. 21:1655-1664, 1993)
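
The naming scheme can be decoded mechanically. The sketch below handles only the three suffix forms described in the list above; the function name and the returned field names are illustrative, and real TRANSFAC identifiers may follow other patterns that this sketch does not cover.

```python
import re

SPECIES_GROUPS = {
    "V": "vertebrates", "I": "insects", "P": "plants",
    "F": "fungi", "N": "nematodes", "B": "bacteria",
}

# group letter, "$", factor acronym, "_", then Q<digit>, C or a number
MATRIX_ID = re.compile(r"^([VIPFNB])\$(.+)_(Q\d|C|\d+)$")


def parse_matrix_id(identifier):
    """Split a matrix identifier into species group, factor and origin."""
    match = MATRIX_ID.match(identifier)
    if match is None:
        raise ValueError(f"unrecognised matrix identifier: {identifier}")
    group, factor, suffix = match.groups()
    if suffix.startswith("Q"):
        origin = "compiled from SITE entries"
        detail = int(suffix[1:])  # lowest quality of the sites used
    elif suffix == "C":
        origin = "compiled from consensus sequences"
        detail = None
    else:
        origin = "numbered matrix"
        detail = int(suffix)  # running number for this factor
    return {
        "group": SPECIES_GROUPS[group],
        "factor": factor,
        "origin": origin,
        "detail": detail,
    }
```

For example, parse_matrix_id("V$OCT1_02") reports a numbered vertebrate matrix for OCT1 with running number 2.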


For the matrices with _01... (the ones that are not generated by compilation) you should always check the corresponding PubMed record (you'll find the link in the RX field of the TRANSFAC record) to see where the matrix comes from. Entirely computational methods or SELEX experiments are not considered as trustworthy as ChIP experiments (although the quality of the ChIP strongly depends on the specificity of the antibody that was used).

The quality scores range from 1 to 6 and reflect the experimental reliability of a site. The scores have the following meaning:

  • 1: functionally confirmed factor binding site
  • 2: binding of pure protein (purified or recombinant)
  • 3: immunologically characterized binding activity of a cellular extract
  • 4: binding activity characterized via a known binding sequence
  • 5: binding of uncharacterized extract protein to a bona fide element
  • 6: no quality assigned

Searching the TRANSFAC public database 7.0 is relatively simple and consists of the following steps:

  • Create an account or if you already have one, log in
  • Go to the Databases page
  • Select the TRANSFAC database 7.0 - public database to search in
  • Select the table to search in
  • Provide a term to search for
  • Select the field (like we saw in Genbank records) to search in


Find regulatory motifs in promoters


For motif searches you can use RSAT, the Regulatory Sequences Analysis Tool developed and maintained at ULB.

Retrieving promoter sequences in RSAT

When the promoter sequences are retrieved you can copy and paste them to Notepad.

Searching for known motifs in RSAT

Phylogenetic footprinting with ConTra

Go to the ConTra website. ConTra predicts known motifs in a promoter based on phylogenetic footprinting. To make its predictions it uses not only the similarity to the PSM of the motif but also the degree of conservation at the site. To this end it uses precomputed alignments from UCSC (the American counterpart of Ensembl).

ConTra has two modes:

  • Visualisation: search for specific (conserved) transcription factor binding sites in your gene of interest
  • Exploration: used when you have no idea which transcription factors regulate your gene of interest. ConTra will show a list of all TFs ranked by their binding probability to the promoter. This binding probability is determined by a score that takes into account the number of predicted binding sites for that TF and the phylogenetic depth of each predicted site (roughly, the number of other species in the alignment that have a predicted binding site for the same TF in a window of 200% site length on each side of the site).

The results of ConTra are easy to interpret. The UCSC alignments are divided into alignment blocks. For every block a figure is shown. If TFBS were detected, these are highlighted. For every block there is a link to a separate result page.

Using biophysical properties of the DNA with PhysBinder

Go to the PhysBinder website. The algorithm uses both similarity with the PSM and biophysical properties such as the bendability of the DNA to identify transcription factor binding sites.

Using the tool consists of two steps:

  • Submit the sequence(s)
  • Choose from more than 60 different TF binding models

The results are shown in an intuitive visualization.


Protocols in Motiflab

The first thing you need to do is create a new protocol (= analysis). In this exercise, the question is which TFs might be responsible for the observed upregulation of a set of genes in response to stress.

If you want to repeat your analysis later, you can document everything you do now via the record function.

Import sequences to search in

Now you need to import the sequences of the promoters you want to analyze.

Define motifs to search for

Next you need one or more motifs. JASPAR and the public part of TRANSFAC are included in MotifLab by default, but you can add your own or other public motif collections.

Scan for motifs in the sequences

Mask repeat regions

DNA sequences contain repeat regions that might interfere with the motif searches. It is recommended to remove repeat regions from your sequences. MotifLab allows you to do so.

Motif searches always generate an enormous number of false positives (regions in the DNA sequence that are similar to the motif but do not actually bind to the TF). You can reduce the number of false positives by focussing on regions in the promoter that are conserved between species. This is called phylogenetic footprinting. It is assumed that regions that are functionally important (like TFBS) change little during evolution, because changes would disrupt their function.

MotifLab allows you to add feature annotation tracks to display and use annotation of the sequences, like the presence of conserved regions, repeats, open chromatin regions...

We will use the following tracks: Conservation and RepeatMasker

Phylogenetic footprinting

Now you can use the annotation to remove motifs that are located in repeat regions or in regions with low conservation.
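
The filtering logic can be sketched as follows. This is not MotifLab's implementation: the data structures, the 0-based half-open coordinates, the per-base conservation scores between 0 and 1 and the threshold of 0.5 are all assumptions made for the illustration.

```python
def overlaps(start, end, regions):
    """True if the half-open interval [start, end) overlaps any region."""
    return any(start < r_end and r_start < end for r_start, r_end in regions)


def mean_conservation(start, end, scores):
    """Average per-base conservation score over [start, end)."""
    window = scores[start:end]
    return sum(window) / len(window) if window else 0.0


def filter_hits(hits, repeats, conservation, min_conservation=0.5):
    """Keep motif hits outside repeats and inside conserved sequence.

    hits: list of (motif name, start, end)
    repeats: list of (start, end) repeat regions
    conservation: list with one score per base of the sequence
    """
    kept = []
    for motif, start, end in hits:
        if overlaps(start, end, repeats):
            continue  # hit lies in a repeat region
        if mean_conservation(start, end, conservation) < min_conservation:
            continue  # hit lies in poorly conserved sequence
        kept.append((motif, start, end))
    return kept


# Hypothetical 60 bp promoter: conserved, then poorly conserved, then conserved
toy_conservation = [0.9] * 20 + [0.1] * 20 + [0.9] * 20
toy_repeats = [(45, 60)]
toy_hits = [("M1", 5, 13), ("M2", 25, 33), ("M3", 48, 56)]
toy_kept = filter_hits(toy_hits, toy_repeats, toy_conservation)
```

Of the three toy hits, M2 is dropped for low conservation and M3 because it overlaps a repeat, so only M1 remains.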

You can do the same with other annotations e.g. remove motifs in closed chromatin regions...

Add tools to MotifLab

For a real analysis, however, you should use a better algorithm than the Simple Scanner. You can for instance download and add Clover.

Now that Clover is installed, you can try it out on the data!

Finding enriched TF binding motifs in a list of genes via Pscan

Use Pscan to identify enriched TFBS in a list of promoters.

The output shows the ranking of the TFs selected according to the enrichment p-values of their PWM. By clicking on a matrix name, you can open a dedicated page showing detailed results: matrix, logo, and links to the database entry as well as to the ID (PMID) of the PubMed entry describing the generation of the PWM...


You also get a simple graphic representation showing the average matching value of the matrix in the sequences in your list of targets (top) compared to the average matching value and standard deviation (in green) on a set of all promoters (same size of regions with respect to the TSS as selected) of the same organism (bottom).


Sample statistics shows the statistics of the enrichment analysis: the p-value, the Bonferroni corrected p-value, mean and standard deviation of the matching score on the list of targets, and number of sequences in the list.

The p-value should be read as: "If we take as many random promoters from the same organism as in the input set, what is the probability of obtaining a score at least as high as the one obtained in the input set?"
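
One way to turn the sample statistics into such a p-value is a one-sided z-test that compares the mean score of the input set with the mean and standard deviation over all promoters. The sketch below assumes this test and a Bonferroni correction by the number of matrices tested; whether Pscan computes its values in exactly this way is an assumption, and the function names are illustrative.

```python
import math


def enrichment_p_value(sample_mean, n_sequences, background_mean, background_sd):
    """One-sided z-test: is the sample mean higher than expected by chance?"""
    standard_error = background_sd / math.sqrt(n_sequences)
    z = (sample_mean - background_mean) / standard_error
    # Upper-tail probability of the standard normal distribution
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p


def bonferroni(p, n_tests):
    """Bonferroni correction: multiply by the number of tests, cap at 1."""
    return min(1.0, p * n_tests)
```

With a sample mean of 0.52 in 100 promoters against a background mean of 0.5 and standard deviation of 0.1, the z-score is 2 and the p-value is roughly 0.023. After correction for, say, 100 tested matrices this is no longer significant, which is why the corrected p-value is the one to look at.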

Finding novel motifs

In the exercises we searched for known TF binding motifs, but often you do not know the motif that regulates the expression of your genes of interest and you have to search for unknown motifs. There are several tools to solve the complex problem of de novo identification of motifs that are enriched in a list of sequences. The Wikipedia page has a very nice overview of these tools (see [1]).

What you need as an input is a list of promoters of which you are more or less certain that they all bind to the same TF, you just don't know which TF. If you have such promoter sequences you can search for common motifs that occur in all promoters.

Besides the very popular MEME suite, you can also use Info-Gibbs, one of the RSAT tools, for this, although MEME often generates superior results. MEME is a suite of tools for various applications: finding novel motifs, finding (enriched) known motifs, comparing motifs...

Finding novel motifs with MEME

MEME is the default tool for finding novel motifs in a set of sequences. The tool looks for short sequences that are enriched in the list of input sequences. Enriched means they occur more than you would expect by chance. The suite contains variations on the MEME tool for specific applications: finding short or gapped motifs, finding motifs in large data sets...
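
The notion of "more than you would expect by chance" can be illustrated with a simple k-mer count. The sketch below is not MEME's algorithm (MEME fits a probabilistic motif model rather than counting exact words); it only compares the observed count of each k-mer with the count expected from the base composition of the input. The sequences are made-up toy data.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer over all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enrichment(sequences, k):
    """Rank k-mers by observed count / count expected from base composition.

    Returns a list of (ratio, k-mer, observed count), highest ratio first.
    """
    sequences = [s.upper() for s in sequences]
    total_bases = sum(len(s) for s in sequences)
    base_freq = {b: sum(s.count(b) for s in sequences) / total_bases
                 for b in "ACGT"}
    positions = sum(max(len(s) - k + 1, 0) for s in sequences)
    ranked = []
    for kmer, observed in kmer_counts(sequences, k).items():
        probability = 1.0
        for base in kmer:
            probability *= base_freq.get(base, 0.0)
        expected = probability * positions
        if expected > 0:
            ranked.append((observed / expected, kmer, observed))
    return sorted(ranked, reverse=True)


# Hypothetical promoters that share one planted 8-mer
toy_promoters = [
    "AAAATGACGTCAAAAA",
    "CCCCTGACGTCACCCC",
    "GGTGACGTCAGGGGGG",
    "TTTTTTTGACGTCATT",
]
top_ratio, top_kmer, top_observed = enrichment(toy_promoters, 8)[0]
```

In the toy data the shared word TGACGTCA occurs in all four sequences and comes out on top. Real binding sites are degenerate, which is why tools such as MEME model motifs as matrices instead of exact words.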

ChIP-Seq analysis

On the ChIP-Seq analysis page you can find a tutorial on the first steps in the ChIP-Seq analysis workflow:

  • Obtaining public ChIP-Seq data
  • Quality control
  • Mapping

The ChIP-Seq analysis workflow was created by Morgane Thomas-Chollier.