- 1 *Gene regulation via Ensembl
- 2 *RNA processing via miRWalk
- 3 Finding PSMs of TFBS
- 4 Find regulatory motifs in promoters
- 4.1 RSAT
- 4.2 Phylogenetic footprinting with Contra
- 4.3 Using biophysical properties of the DNA using Physbinder
- 4.4 MotifLab
- 4.5 Finding enriched TF binding motifs in a list of genes via Pscan
- 4.6 Finding novel motifs
- 5 ChIP-Seq analysis
*Gene regulation via Ensembl
A lot of sequence analysis has already been done for you in specialized sequence databases such as Ensembl. Ensembl provides many tracks with information on gene expression and its regulation, especially for human and mouse (based on data from the ENCODE project).
These tracks can be found in the Regulation section and consist of genomic regions that could be involved in gene transcriptional regulation, inferred from publicly available experimental data sets, mainly DNase-seq and ChIP-seq experiments (see slides). Ensembl has a pipeline to analyze these data sets across different cell types.
|Go to region from bp 52,000,000 to 52,200,000 on human chromosome 4.|
|Locate the 5’ end of the SGCB transcript ?|
|First of all, locate the 5' end of the SGCB transcript: in the bottom graphic you see that SGCB is located under the dark blue line representing the genome sequence (contig). That means that the gene is located on the reverse strand.|
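The same region can also be inspected programmatically. The sketch below builds a query for the public Ensembl REST server (the `overlap/region` endpoint with `feature=regulatory`). The helper names are our own, and the coordinates are interpreted on whatever assembly the server currently uses, which may differ from the one shown in the browser exercise.

```python
import json
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"  # public Ensembl REST server


def regulatory_overlap_url(chrom, start, end, species="human"):
    """Build the URL of the 'overlap/region' endpoint for regulatory features."""
    region = f"{chrom}:{start}-{end}"
    return f"{ENSEMBL_REST}/overlap/region/{species}/{region}?feature=regulatory"


def fetch_regulatory_features(chrom, start, end, species="human"):
    """Download the regulatory features of a region as parsed JSON (needs network access)."""
    request = urllib.request.Request(
        regulatory_overlap_url(chrom, start, end, species),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Region used in the exercise: chromosome 4, bp 52,000,000 to 52,200,000.
url = regulatory_overlap_url(4, 52_000_000, 52_200_000)
print(url)
```

Calling `fetch_regulatory_features(4, 52_000_000, 52_200_000)` performs the actual request; it is not called here so that the example also runs offline.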
CpG islands are genomic regions that contain a high frequency of CG dinucleotides and are often located near the promoters of mammalian genes. There are tools that predict potential CpG islands based on nucleotide composition. These predictions can be visualized in Ensembl. However, Ensembl also provides tracks on hyper- and hypo-methylated CpGs from the ENCODE project based on experimental Bisulfite Sequencing data of 47 cell lines.
The CpG islands track contains computational predictions of CpG islands.
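To give an idea of what a prediction "based on nucleotide composition" looks like, here is a minimal sketch using the commonly cited Gardiner-Garden and Frommer criteria (length of at least 200 bp, GC content above 50%, observed/expected CpG ratio above 0.6). This is an assumption for illustration: the parameters behind the Ensembl track may differ.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) of a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")                      # observed CG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0         # expected CG count from base composition
    obs_exp = cpg / expected if expected else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the three thresholds to a candidate region."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_obs_exp


print(looks_like_cpg_island("CG" * 100))   # CG-rich toy sequence
print(looks_like_cpg_island("AT" * 100))   # AT-only toy sequence
```

Real predictors slide such a window along the genome and merge neighbouring windows that pass the thresholds.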
|Is there a CpG island predicted at the 5’ end of the SGCB transcript ?|
When you scroll down at the most detailed display, you see that there is indeed a CpG island located at the 5’ end of the SGCB transcript.
The DNA methylation track in the Regulation section of tracks visualizes the experimentally determined methylation status of CpGs.
|Is the predicted CpG island backed up by experimental data in H1ESC cells ?|
When you scroll down at the most detailed display, you see that the bisulphite sequencing identifies a methylated region that overlaps with the predicted CpG island.
An important mechanism of gene regulation is the accessibility of the DNA. Ensembl provides several tracks for visualizing the status of the chromatin, based on ENCODE data from large scale sequencing of DNA regions that were cut by DNase. DNase can only cut open chromatin regions.
|Is the chromatin in the vicinity of the SGCB gene open or closed in H1ESC cells ?|
When you scroll down at the most detailed display, you see that the DNase method identifies an open chromatin region that overlaps with the CpG island.
DNA accessibility can also be regulated by modification of histone proteins. DNA winds itself around these proteins and as such they regulate the accessibility. If the DNA is tightly wound around them, it is not accessible. If the DNA loosens from the histones, it becomes accessible. How well the DNA can bind to the histones is determined by the structure of the histones, i.e. by the modifications on their surface. Methyl groups, acetyl groups and other modifications all influence the binding of the DNA.
ENCODE performed ChIP-Seq experiments with antibodies that target modified histone proteins to identify the DNA regions that are bound to them.
|Is the regulatory region at the 5' end of the SGCB transcript bound to methylated H3K4 ?|
When you scroll down at the most detailed display, you see that the region is bound by di- and trimethylated H3K4 in embryonic stem cells.
Methylation of the fourth amino acid residue (K) from the N-terminus of histone H3 is one of the most studied histone modifications, and with good reason: it’s tightly associated with the promoters of active genes. Like all lysine residues, H3K4 can be mono-, di-, or trimethylated. H3K4me3 is the strongest activator of expression.
Transcription start sites
Possible transcription start sites can be predicted using the Eponine tool. The Eponine predictions track is available in Ensembl.
|Is there a transcription start site predicted by Eponine annotated for the SGCB transcript?|
When you scroll down at the most detailed display, you see that Eponine indeed predicts a transcription start site almost exactly at the same location where the Ensembl SGCB transcript starts.
Similarly you can also define TSS as the regions that bind to the RNA polymerase. ENCODE performed ChIP-Seq experiments on various cell type samples to identify DNA regions that bind to RNA polymerase.
|Does the experimentally determined TSS correspond to the one that was predicted by Eponine ?|
When you scroll down at the most detailed display, you see that the fragment that binds to PolII according to the ChIP-Seq experiment indeed overlaps with the predicted TSS.
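Several answers in this section boil down to checking whether two annotated regions overlap (a peak and a CpG island, a Pol II fragment and a predicted TSS). A minimal sketch of that check, assuming 1-based inclusive coordinates on the same chromosome; the coordinates in the example are made up.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of bases shared by two closed intervals (1-based, inclusive)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


# Hypothetical coordinates of a ChIP-Seq fragment and a predicted TSS region.
chip_fragment = (100, 200)
predicted_tss = (150, 300)
print(overlap_length(*chip_fragment, *predicted_tss))
```

A result of zero means that the two features do not overlap.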
Finally, you can also search for binding sites of specific TFs in Ensembl, again based on ENCODE data.
|Which TFs have a binding site in the CpG island of the SGCB gene in H1ESC cells ?|
This track contains information from ENCODE ChIP-Seq experiments. As such it specifies the location of TFBS in the genome. You see a matrix with TFs in the rows and cell types in the columns.
Hover your mouse over H1ESC. A box will appear in which you can choose to display all TFs for which data are available on the track.
Close the configure page
You can see that Yy1 has a binding site in the CpG island. Yy1 represses transcription by directing histone modifying proteins to promoter regions.
*RNA processing via miRWalk
MiRWalk is a database that contains information on miRNAs from Human, Mouse and Rat. For each miRNA, it contains predicted as well as experimentally validated binding sites in their target genes.
In a mouse study, you find that the expression of TIMP3 influences the development of atherosclerosis. A decreased expression of TIMP3 stimulates inflammation of the arterial endothelium and triggers atherosclerosis.
You have searched for TFBS for TFs that might be the cause of the TIMP3 repression but to no avail. Maybe the expression of TIMP3 is regulated via other mechanisms, like miRNA induced degradation.
For decades, scientists have assumed that miRNAs bind to sequences in the 3'UTR of target genes.
|Is mouse TIMP3 a target of a known miRNA in its 3'UTR ?|
The results page contains
- Gene information: a list with links for the gene to retrieve data including genomic location, synonyms, Refseq IDs, homology information, external links, information on protein families, functional annotation...
- Putative miRNA-target interaction information: for each algorithm you selected a list with potential miRNA binding sites within the sequences of the regions you searched in. If you searched in a region, you can click its link: Promoter, 5'-UTR, CDS and 3'-UTR.
When you click the results for the 3'UTR, you indeed get a list of miRNAs that in theory could bind to the 3'UTR of TIMP3 .
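To illustrate what "could in theory bind" means, the sketch below performs plain seed matching: it takes nucleotides 2 to 8 of the miRNA (the seed) and looks for the reverse complement in the target sequence. This is a simplification and not the scoring used by miRWalk or the algorithms it offers. Both sequences are made-up toy examples, not real miRNA or TIMP3 sequences.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement_rna(seq):
    """Reverse complement of an RNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def seed_sites(mirna, target, seed_start=2, seed_end=8):
    """Return the target motif matching the seed and its 1-based positions in the target."""
    mirna = mirna.upper().replace("T", "U")
    target = target.upper().replace("T", "U")
    seed = mirna[seed_start - 1:seed_end]      # nucleotides 2-8 by default
    site = reverse_complement_rna(seed)        # what the seed pairs with on the mRNA
    positions = []
    pos = target.find(site)
    while pos != -1:
        positions.append(pos + 1)
        pos = target.find(site, pos + 1)
    return site, positions


toy_mirna = "AACGGUCAUGCAUUGCAAGCUA"     # hypothetical miRNA
toy_target = "GGAUGACCGUAACCUGACCGUCC"   # hypothetical stretch of a 3'UTR
site, positions = seed_sites(toy_mirna, toy_target)
print(site, positions)
```

The same function can be applied to the 5'UTR, CDS or promoter sequence of a gene, which is the idea behind searching regions other than the 3'UTR.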
You spend a lot of time checking if these miRNAs are the cause of the decreased TIMP3 expression in atherosclerosis but you find no link at all. Quite frustrated, you decide to perform a final analysis with an open mind. Although everybody states that miRNAs target 3'UTRs, you decide to scan the CDS for potential targets.
|Do you find miRNA targets in the CDS of TIMP3 ?|
Fill in the form as follows and click SEARCH:
miRWalk allows you to search in any component of a gene: 3'UTR, CDS, 5'UTR and promoter.
You get a list of miRNAs that in theory could bind to the CDS of TIMP3, one of them is mmu-mir-712 (red) .
- The Gene column contains links to NCBI's Gene database.
- The RefSeq ID column contains links to NCBI's RefSeq database.
- The miRNA column contains links to miRBase.
- Note the last column containing p-values !
You go back to the lab and find that atherosclerosis activates mir-712, which in its turn decreases the expression of TIMP3, thereby worsening the atherosclerosis.
Now we want to know if mmu-mir-712 has other targets. For this we need to do the inverse analysis: start from a miRNA and predict which genes it targets. Fortunately, miRWalk can perform both analyses.
|Does mmu-mir-712 target other CDS ?|
The miRNA-target search page is organized similarly to the Gene-target search page. You select a species, provide an ID of the miRNA, select the regions you want to search in, set the options, select the algorithms you want to use for the predictions and click SEARCH.
You get a list of CDS that can bind to mmu-mir-712, one of them is Pla2g6 (red).
You go and search for a link between this gene and atherosclerosis in Pubmed and bingo: 10 publications containing the name of this protein and atherosclerosis.
Finding PSMs of TFBS
TRANSFAC is a database on eukaryotic regulatory DNA elements and transcription factors. Most of its content is commercial (you have to pay for access), but there is a free public version. It is quite old but Transfac data is considered high quality so it's always worthwhile to take a look. It consists of the following tables:
- FACTOR: proteins that regulate transcription by interaction with motifs in the DNA and miRNAs that control translation of mRNA
- GENE: genes in which TF-binding sites or ChIP fragments were found and/or genes encoding for TFs or miRNAs
- SITE: individual TF binding sites in eukaryotic genes
- MATRIX: position score matrices for the TF binding sites
The matrices have been obtained by genome-wide experiments or compiled from individual sites that were obtained by small-scale experiments. To get sufficient sites for the compilation, sites from related organisms e.g. vertebrates, plants, insects... are combined. Some matrices were constructed by computational methods. Matrix identifiers start with a letter that indicates one of six groups of biological species:
- V$: vertebrates
- I$: insects
- P$: plants
- F$: fungi
- N$: nematodes
- B$: bacteria
The species identifier is followed by an acronym of the TF and
- a number discriminating between matrices for the same TF e.g. V$OCT1_02 is the second matrix for vertebrates Oct-1 TF
- matrices generated by compilation of TRANSFAC SITE entries end with _Q followed by the lowest quality of the sites used to construct the matrix e.g. V$CREB_Q2 is a matrix constructed of CREB binding sites of quality 2 or better (see below for an explanation of the quality scores).
- matrices with an identifier that ends with _C are generated by compilation of consensus sequences using ConsIndex (Frech et al., Nucleic Acids Res. 21:1655-1664, 1993)
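The naming convention described above can be decoded with a small parser. This is a sketch based only on the rules listed here; the function name and the returned fields are our own, and identifiers that deviate from the simple pattern (for example acronyms containing underscores) are not handled.

```python
import re

SPECIES_GROUPS = {
    "V": "vertebrates", "I": "insects", "P": "plants",
    "F": "fungi", "N": "nematodes", "B": "bacteria",
}
ID_PATTERN = re.compile(r"^([VIPFNB])\$([A-Za-z0-9]+)_(\w+)$")


def parse_matrix_id(identifier):
    """Split a matrix identifier into species group, TF acronym and suffix."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unexpected matrix identifier: {identifier!r}")
    group, factor, suffix = match.groups()
    min_quality = None
    if suffix.startswith("Q"):
        origin = "compiled from SITE entries"
        if len(suffix) > 1 and suffix[1].isdigit():
            min_quality = int(suffix[1])       # lowest site quality that was used
    elif suffix.startswith("C"):
        origin = "compiled from consensus sequences"
    else:
        origin = "numbered matrix"
    return {"group": SPECIES_GROUPS[group], "factor": factor,
            "suffix": suffix, "origin": origin, "min_quality": min_quality}


print(parse_matrix_id("V$OCT1_02"))
print(parse_matrix_id("V$CREB_Q2"))
```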
For the matrices with _01... (the ones that are not generated by compilation) you should always check the corresponding Pubmed record (you'll find the link in the RX field of the TRANSFAC record) to see where it comes from. Entirely computational methods or SELEX experiments are not considered as trustworthy as ChIP experiments (although the quality of the ChIP strongly depends on the specificity of the antibody that was used).
The quality scores range from 1 to 6 and reflect the experimental reliability of a site. The scores have the following meaning:
- 1: functionally confirmed factor binding site
- 2: binding of pure protein (purified or recombinant)
- 3: immunologically characterized binding activity of a cellular extract
- 4: binding activity characterized via a known binding sequence
- 5: binding of uncharacterized extract protein to a bona fide element
- 6: no quality assigned
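To make the idea of compiling a matrix from sites of a minimum quality concrete, the sketch below keeps only sites of quality 2 or better, counts the bases per position and converts the counts into log-odds scores. The sites and their quality scores are invented for illustration, and the pseudocount and uniform background are assumptions, not TRANSFAC's actual procedure.

```python
import math
from collections import Counter


def count_matrix(sites):
    """Count the bases at each position of equally long, aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(column) for column in zip(*sites)]


def log_odds_matrix(counts, n_sites, pseudocount=1.0, background=0.25):
    """Convert counts to log2(frequency / background) with a pseudocount."""
    matrix = []
    for column in counts:
        scores = {}
        for base in "ACGT":
            freq = (column[base] + pseudocount * background) / (n_sites + pseudocount)
            scores[base] = math.log2(freq / background)
        matrix.append(scores)
    return matrix


# Hypothetical (site, quality) pairs; quality 1 is best, as in the list above.
annotated_sites = [("TGACGTCA", 1), ("TGACGTAA", 2), ("TGATGTCA", 4), ("TTACGTCA", 2)]
good_sites = [site for site, quality in annotated_sites if quality <= 2]

counts = count_matrix(good_sites)
pwm = log_odds_matrix(counts, len(good_sites))
print(counts[0], round(pwm[0]["T"], 2))
```

Lowering the quality cutoff adds more sites to the compilation at the price of including less reliable evidence.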
Searching the TRANSFAC public database 7.0 is relatively simple and consists of the following steps:
- Create an account or if you already have one, log in
- Go to the Databases page
- Select the TRANSFAC database 7.0 - public database to search in
- Select the table to search in
- Provide a term to search for
- Select the field (like we saw in Genbank records) to search in
|How to retrieve the PWM of a motif in JASPAR ?|
Find regulatory motifs in promoters
For motif searches you can use RSAT, the Regulatory Sequences Analysis Tool developed and maintained at ULB.
Retrieving promoter sequences in RSAT
|How to retrieve the promoter sequences of genes based on EnsemblIDs.|
In the left menu select the Retrieve EnsEMBL seq tool (under Sequence tools).
RSAT has a manual and tutorial for each tool: at the bottom of the page, next to the Go and the Demo buttons.
Next you need to set the parameters of the search:
When you have set the parameters click the GO button.
When the promoter sequences are retrieved you can copy and paste them to Notepad.
Searching for known motifs in RSAT
|How to do a quick matrix scan in RSAT ?|
In the left menu select the matrix-scan (quick) tool under Pattern matching
You need to specify the following parameters:
Note that the results table starts with an overview of the input sequences. This part of the table does not contain any motif results.
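As a rough illustration of what a matrix scan computes, the sketch below slides a small log-odds matrix along a sequence and reports every window that scores at or above a threshold. The matrix, the sequence and the threshold are toy values. RSAT's matrix-scan additionally uses background models and P-values, which are not reproduced here.

```python
def score_window(window, matrix):
    """Sum the matrix scores of the bases in one window."""
    return sum(column[base] for column, base in zip(matrix, window))


def scan(sequence, matrix, threshold):
    """Return (1-based start, matched sequence, score) for windows scoring >= threshold."""
    sequence = sequence.upper()
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if set(window) <= set("ACGT"):         # skip windows with N or other symbols
            score = score_window(window, matrix)
            if score >= threshold:
                hits.append((start + 1, window, score))
    return hits


# Toy log-odds matrix for the 4 bp motif TGAC (hypothetical values).
toy_matrix = [
    {"A": -2.0, "C": -2.0, "G": -2.0, "T": 2.0},
    {"A": -2.0, "C": -2.0, "G": 2.0, "T": -2.0},
    {"A": 2.0, "C": -2.0, "G": -2.0, "T": -2.0},
    {"A": -2.0, "C": 2.0, "G": -2.0, "T": -2.0},
]
hits = scan("AATGACGGTGATCC", toy_matrix, threshold=4.0)
print(hits)
```

Only the forward strand is scanned in this sketch; scanning the reverse complement of the sequence as well covers both strands.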
In the actual results of the motif search, you can see:
Phylogenetic footprinting with Contra
Go to the Contra website. Contra predicts known motifs in a promoter based on phylogenetic footprinting. It not only uses similarity to the PSM of the motif but also the amount of conservation at the site to make its predictions. To this end it uses precomputed alignments from UCSC (the American counterpart of Ensembl).
Contra has two modes:
- Visualisation: search for specific (conserved) transcription factor binding sites in your gene of interest
- Exploration is used when you have no idea which transcription factors regulate your gene of interest. ConTra will show a list of all TFs ranked by their binding probability to the promoter. This binding probability is determined by a score that takes into account the number of predicted binding sites for that TF, and the phylogenetic depth of each predicted site (~ defined as the number of other species in the alignment that have a predicted binding site for the same TF in a window of 200% site length on each side of the site).
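The phylogenetic depth defined above can be sketched as follows: for one predicted site in the reference species, count the other species that have a predicted site for the same TF within a window of 200% of the site length on each side. Coordinates are assumed to be alignment positions, and the species and sites are hypothetical. How ConTra combines this depth with the number of sites into its final score is not described here and is not attempted.

```python
def phylogenetic_depth(site_start, site_end, other_species_sites, window_factor=2.0):
    """Count species with a predicted site overlapping the window around the reference site."""
    site_length = site_end - site_start + 1
    flank = int(window_factor * site_length)   # 200% of the site length on each side
    window_start, window_end = site_start - flank, site_end + flank
    depth = 0
    for sites in other_species_sites.values():
        if any(start <= window_end and end >= window_start for start, end in sites):
            depth += 1
    return depth


# Hypothetical predicted sites for the same TF in other species (alignment coordinates).
predictions = {
    "mouse": [(95, 104)],
    "dog": [(125, 134)],
    "chicken": [(200, 209)],
    "frog": [],
}
print(phylogenetic_depth(100, 109, predictions))
```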
|How to search for a specific TFBS in a promoter with Contra ?|
Using Contra is quite straightforward:
The results of Contra are easy to interpret. The UCSC alignments are divided in alignment blocks. For every block a figure is shown. If TFBS were detected these are highlighted. For every block there is a link to a separate result page.
Using biophysical properties of the DNA using Physbinder
Go to the Physbinder website. The algorithm uses both similarity with the PSM and biophysical properties such as the bendability of the DNA to identify transcription factor binding sites.
Using the tool consists of two steps:
- Submit the sequence(s)
- Choose from more than 60 different TF binding models
Protocols in MotifLab
The first thing you need to do is to create a new protocol (= analysis). I want to know which TFs might be responsible for the observed upregulation of these genes in response to stress.
|How to create a new protocol ?|
|Click the New Protocol button in the top menu
If you want to repeat your analysis later, you can document everything you do now via the record function.
|How to record everything you do ?|
|Click the Record button in the top menu
This will record your actions so it becomes easy to repeat the analysis on the same or other data.
Import sequences to search in
Now you need to import the sequences of the promoters you want to analyze.
|How to import sequences ?|
You can import promoter sequences by just providing Gene IDs or Ensembl IDs. MotifLab is coupled to Ensembl and will fetch the appropriate promoter sequences automatically.
Click the Add Sequences button in the top menu
Paste the Ensembl IDs or import the file containing the IDs (red). Note that you can choose the exact sequence of the promoters relative to the transcription start or end site (green).
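Choosing the promoter relative to the transcription start site depends on the strand of the gene: for a gene on the reverse strand, "upstream" means higher coordinates. A minimal sketch of that calculation; the default of 2000 bp upstream and 200 bp downstream is an arbitrary assumption, not a MotifLab setting.

```python
def promoter_region(tss, strand, upstream=2000, downstream=200):
    """Return (start, end) genomic coordinates of a promoter around a TSS.

    strand is 1 for the forward strand and -1 for the reverse strand.
    """
    if strand == 1:
        start, end = tss - upstream, tss + downstream
    elif strand == -1:
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be 1 or -1")
    return max(1, start), end                  # coordinates cannot go below 1


print(promoter_region(10_000, 1))
print(promoter_region(10_000, -1))
```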
Define motifs to search for
Next you need one or multiple motifs. Jaspar and the public part of Transfac are by default in MotifLab but you can add your own or other public motif collections.
|How to import motifs ?|
|In the Motifs section of the left menu click the +sign and select MotifCollection.
As stated previously MotifLab has a number of predefined motif collections. Choose Jaspar Core and click OK.
MotifLab has now loaded all Jaspar core motifs as you see in the Motifs section of the left menu. We only want to use motifs from human. To do this
MotifLab creates a new motif collection called MotifCollection2 with only the 79 human motifs. Deselect Jaspar Core, deselect MotifCollection2 and select it again. Check if all motifs from MotifCollection2 are selected.
Scan for motifs in the sequences
|How to scan the sequences for potential matches to the motifs ?|
Mask repeat regions
DNA sequences contain repeat regions that might interfere with the motif searches. It is recommended to remove repeat regions from your sequences. MotifLab allows you to do so.
Motif searches always generate an enormous amount of false positives (regions in the DNA sequence that are similar to the motif but do not actually bind to the TF). You can reduce the number of false positives by focussing on regions in the promoter that are conserved between species. This is called phylogenetic footprinting. It is assumed that regions that are functionally important (like TFBS) do not change during evolution because if they change, their function would be lost.
MotifLab allows you to add feature annotation tracks to display and use annotation of the sequences, like the presence of conserved regions, repeats, open chromatin regions...
We will use the following tracks: Conservation and RepeatMasker
|How to add these feature annotation tracks?|
|Click the Add Feature Data Track button in the top menu
This opens the Datatracks dialog. Here you can select the tracks you want to visualize. Tracks with red buttons are not available for the organism you are working on.
The annotation is now visualized.
Now you can use the annotation to remove motifs that are located in repeat regions or in regions with low conservation.
|How to filter motifs in repeat regions or regions with low conservation?|
|Right click the binding sites track and select Perform Operation -> Transform -> Filter
This opens the Filter dialog. Set the condition used for the filtering as follows
This will remove motifs in regions with low conservation.
Set the second condition as follows
Right-click the AND and select Change Operator to OR to remove motifs that are in low conserved regions OR that are located in repeats.
You see that this greatly reduces the number of motifs that are shown in the track. The only motifs that are retained are those in conserved regions and outside repeat regions.
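The logic of this filter can be written down in a few lines: a predicted site is removed when its conservation is low OR when it lies in a repeat, so only conserved, non-repeat sites remain. The `Site` class, the example sites and the conservation cutoff of 0.5 are hypothetical and only mimic what the MotifLab dialog does.

```python
from dataclasses import dataclass


@dataclass
class Site:
    motif: str
    start: int
    end: int
    conservation: float   # mean conservation score over the site, between 0 and 1
    in_repeat: bool       # True if the site overlaps a RepeatMasker region


def keep_site(site, min_conservation=0.5):
    """Keep a site unless it is poorly conserved OR located in a repeat."""
    remove = site.conservation < min_conservation or site.in_repeat
    return not remove


sites = [
    Site("YY1", 10, 21, 0.9, False),
    Site("SP1", 40, 49, 0.2, False),    # low conservation
    Site("CREB", 70, 77, 0.8, True),    # inside a repeat
]
kept = [site.motif for site in sites if keep_site(site)]
print(kept)
```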
You can do the same with other annotations e.g. remove motifs in closed chromatin regions...
Add tools to MotifLab
For a real analysis, however, you should use a better algorithm than the Simple Scanner. You can for instance download and add Clover.
|Add Clover to MotifLab|
|Expand Configure in the top menu and select External programs
Click Add program from Repository
Select Clover and click Install.
Select Download and install and select Windows
Now that Clover is installed, you can try it out on the data!
Finding enriched TF binding motifs in a list of genes via Pscan
Use Pscan to identify enriched TFBS in a list of promoters.
|How to perform a Pscan search on a list of promoters ?|
Paste the IDs of your target genes in the Insert Gene/Sequence ID list box. Pscan accepts the following gene or transcript identifiers: RefSeq (e.g. NM_000546) for human, mouse, and Drosophila; TAIR (e.g. AT1G08810) for Arabidopsis; SGD (e.g. YPL248C) for yeast.
Specify the source organism and the region you want to analyze (relative to the annotated TSS).
Click Run! to start the search.
The output shows the ranking of the TFs selected according to the enrichment p-values of their PWM. By clicking on a matrix name, you can open a dedicated page showing detailed results: matrix, logo, and links to the database entry as well as to the ID (PMID) of the PubMed entry describing the generation of the PWM...
You also get a simple graphic representation showing the average matching value of the matrix in the sequences in your list of targets (top) compared to the average matching value and standard deviation (in green) on a set of all promoters (same size of regions with respect to the TSS as selected) of the same organism (bottom).
Sample statistics shows the statistics of the enrichment analysis: the p-value, the Bonferroni corrected p-value, mean and standard deviation of the matching score on the list of targets, and number of sequences in the list.
The p-value should be read as: "If we take as many random promoters from the same organism as in the input set, what is the probability of obtaining a mean score at least as high as the one obtained in the input set?"
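The statistics listed above can be reproduced in a minimal sketch, assuming a one-sided z-test that compares the mean matching score of the input set with the mean and standard deviation of all promoters, followed by a Bonferroni correction for the number of matrices tested. The numbers are invented and the function is not Pscan's own code.

```python
import math


def enrichment_z_test(sample_mean, background_mean, background_sd,
                      n_sequences, n_matrices=1):
    """Return (z-score, one-sided p-value, Bonferroni-corrected p-value)."""
    standard_error = background_sd / math.sqrt(n_sequences)
    z = (sample_mean - background_mean) / standard_error
    p_value = 0.5 * math.erfc(z / math.sqrt(2))     # P(Z >= z) for a standard normal
    bonferroni = min(1.0, p_value * n_matrices)     # correct for testing many matrices
    return z, p_value, bonferroni


# Hypothetical values: 25 input promoters, 100 matrices tested.
z, p, p_corrected = enrichment_z_test(0.80, 0.75, 0.10, 25, n_matrices=100)
print(round(z, 2), round(p, 4), round(p_corrected, 3))
```

The correction shows why a matrix that looks significant on its own may no longer be so once a whole motif collection has been tested.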
Finding novel motifs
In the exercises we searched for known TF binding motifs but often you do not know the motif that regulates the expression of your genes of interest and you have to search for unknown motifs. There are several tools to solve the complex problem of de novo identification of motifs that are enriched in a list of sequences. The Wikipedia page has a very nice overview of these tools.
What you need as an input is a list of promoters of which you are more or less certain that they all bind to the same TF, you just don't know which TF. If you have such promoter sequences you can search for common motifs that occur in all promoters.
Besides the very popular MEME suite, you can also use Info-Gibbs, one of the RSAT tools for this, although MEME often generates superior results. MEME is a suite of tools for various applications: finding novel motifs, finding (enriched) known motifs, comparing motifs...
Finding novel motifs with MEME
MEME is the default tool for finding novel motifs in a set of sequences. The tool looks for short sequences that are enriched in the list of input sequences. Enriched means they occur more than you would expect by chance. The suite contains variations on the MEME tool for specific applications: finding short or gapped motifs, finding motifs in large data sets...
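To illustrate what "more than you would expect by chance" means, the sketch below counts all k-mers in a set of sequences and compares each count with the number expected from the base composition alone. This is a deliberately simple stand-in: MEME itself fits motif models by expectation maximization rather than counting exact words, and the sequences are toy examples.

```python
from collections import Counter


def kmer_enrichment(sequences, k=6):
    """Return {kmer: (observed, expected, observed/expected)} for all k-mers found."""
    observed = Counter()
    base_counts = Counter()
    n_windows = 0
    for seq in sequences:
        seq = seq.upper()
        base_counts.update(base for base in seq if base in "ACGT")
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                observed[kmer] += 1
                n_windows += 1
    total = sum(base_counts.values())
    freqs = {base: base_counts[base] / total for base in "ACGT"}
    result = {}
    for kmer, count in observed.items():
        probability = 1.0
        for base in kmer:                      # independent-bases background model
            probability *= freqs[base]
        expected = probability * n_windows
        result[kmer] = (count, expected, count / expected)
    return result


toy_promoters = ["ACGTACGT", "TTACGTTT"]       # hypothetical sequences
enrichment = kmer_enrichment(toy_promoters, k=4)
print(enrichment["ACGT"])
```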
|How to identify novel motifs with MEME ?|
ChIP-Seq analysis
You can find a tutorial on the first steps in the ChIP-Seq analysis workflow:
- Obtaining public ChIP-Seq data
- Quality control