Gene Regulation


*Gene regulation via Ensembl

A lot of sequence analysis has already been done for you in specialized sequence databases such as Ensembl. Ensembl provides many tracks with information on gene expression and its regulation, especially for human and mouse (based on data from the ENCODE project).

These tracks can be found in the Regulation section and consist of genomic regions that could be involved in gene transcriptional regulation, inferred from publicly available experimental data sets, mainly DNase-seq and ChIP-seq experiments (see slides). Ensembl has a pipeline to analyze these data sets across different cell types.

Go to region from bp 52,000,000 to 52,200,000 on human chromosome 4.

Zoom in on the SGCB transcript, include a bit of flanking sequence on both sides.
In the middle graphic, select the Drag/Select button and draw a box around the SGCB transcript with your mouse. If you don't see the button, it means that Drag/Select has already been activated. Click Jump to region in the pop-up menu.
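
If you prefer to address the same region from a script, the sketch below builds the region string and a query URL for the public Ensembl REST server. It only constructs the URL and does not contact the server. The helper names region_string and overlap_url are made up for this example, and the choice of the overlap/region endpoint with feature=gene is an assumption about what you want to list.

```python
from urllib.parse import urlencode

ENSEMBL_REST = "https://rest.ensembl.org"  # public Ensembl REST server


def region_string(chromosome, start, end):
    """Format a region the way Ensembl writes it, e.g. 4:52000000-52200000."""
    if start > end:
        raise ValueError("start must not be larger than end")
    return f"{chromosome}:{start}-{end}"


def overlap_url(species, chromosome, start, end, feature="gene"):
    """Build (but do not fetch) a URL listing features that overlap a region."""
    region = region_string(chromosome, start, end)
    query = urlencode({"feature": feature, "content-type": "application/json"})
    return f"{ENSEMBL_REST}/overlap/region/{species}/{region}?{query}"


if __name__ == "__main__":
    # The region used in this exercise: human chromosome 4, 52.0 to 52.2 Mb.
    print(overlap_url("human", 4, 52_000_000, 52_200_000))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the features in the region as JSON.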

Locate the 5’ end of the SGCB transcript ?
First of all, locate the 5' end of the SGCB transcript: in the bottom graphic you see that SGCB is located under the dark blue line representing the genome sequence (contig). That means that the gene is located on the reverse strand.
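
The strand logic can be written down in a few lines. This is a minimal sketch: the function names and the coordinates in the usage example are hypothetical and are not the real SGCB coordinates. It assumes the Ensembl convention that start is never larger than end and that the strand is given as +1 or -1.

```python
def five_prime_end(start, end, strand):
    """Return the genomic coordinate of a transcript's 5' end.

    start and end are genomic coordinates with start <= end;
    strand is +1 (forward) or -1 (reverse).
    """
    if start > end:
        raise ValueError("expected start <= end")
    if strand == 1:
        return start          # forward strand: transcription starts at the lowest coordinate
    if strand == -1:
        return end            # reverse strand: transcription starts at the highest coordinate
    raise ValueError("strand must be +1 or -1")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


if __name__ == "__main__":
    # Hypothetical transcript on the reverse strand.
    print(five_prime_end(1000, 5000, -1))
    print(reverse_complement("ATGC"))
```

For a gene on the reverse strand, the 5' end is therefore at the right-hand side of the transcript in the Ensembl display.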

DNA methylation

CpG islands are genomic regions that contain a high frequency of CG dinucleotides and are often located near the promoters of mammalian genes. There are tools that predict potential CpG islands based on nucleotide composition. These predictions can be visualized in Ensembl. However, Ensembl also provides tracks on hyper- and hypo-methylated CpGs from the ENCODE project based on experimental Bisulfite Sequencing data of 47 cell lines.
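
To get a feeling for what a prediction "based on nucleotide composition" means, here is a minimal sketch of the widely used Gardiner-Garden and Frommer criteria: a length of at least 200 bp, a GC content above 50% and an observed/expected CpG ratio above 0.6. Whether the Ensembl track uses exactly these cut-offs is not verified here, so treat the default values as assumptions. The function names are made up for this example.

```python
def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed number of CG dinucleotides divided by the number expected from the C and G counts."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    observed = seq.count("CG")
    expected = c * g / len(seq)
    return observed / expected


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the three composition criteria to one candidate region."""
    return (len(seq) >= min_len
            and gc_content(seq) > min_gc
            and cpg_obs_exp(seq) > min_ratio)


if __name__ == "__main__":
    print(looks_like_cpg_island("CG" * 100))   # artificial CpG-rich sequence
    print(looks_like_cpg_island("AT" * 100))   # artificial AT-rich sequence
```

Real predictors slide such a test along the genome in windows; the sketch only evaluates a single candidate sequence.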

The CpG islands track contains computational predictions of CpG islands.

Is there a CpG island predicted at the 5’ end of the SGCB transcript ?
Click Configure this page in the side menu.
Type cpg in the Find a track text box.
Select CpG islands.
Close the configure page.
Scroll down to the most detailed display: you see that there is indeed a CpG island located at the 5’ end of the SGCB transcript.

The DNA methylation track in the Regulation section of tracks visualizes the experimentally determined methylation status of CpGs.

Is the predicted CpG island backed up by experimental data in H1ESC cells ?
Click Configure this page in the side menu.
Scroll to the Regulation section in the left menu and click DNA methylation. This opens a list of bisulphite sequencing experiments in different cell types.
Select the second-to-last experiment: WGBS in H1ESC cells. These cells are embryonic stem cells (see overview of cell types used in ENCODE).
Close the configure page.
When you scroll down to the most detailed display, you see that the bisulphite sequencing identifies a methylated region that overlaps with the predicted CpG island.

DNA accessibility

An important mechanism of gene regulation is the accessibility of the DNA. Ensembl provides several tracks for visualizing the status of the chromatin, based on ENCODE data from large scale sequencing of DNA regions that were cut by DNase. DNase can only cut open chromatin regions.

Is the chromatin in the vicinity of the SGCB gene open or closed in H1ESC cells ?
Click Configure this page in the side menu.
Scroll to the Regulation section in the left menu and click Open chromatin & TFBS. This opens a matrix with proteins (TFs and DNase) in the rows and cell types in the columns.
Select DNase in H1ESC cells. These cells are embryonic stem cells (see overview of cell types used in ENCODE).
Close the configure page.
When you scroll down to the most detailed display, you see that the DNase method identifies an open chromatin region that overlaps with the CpG island.

Histone modifications

DNA accessibility can also be regulated by modification of histone proteins. DNA winds itself around these proteins and as such they regulate the accessibility. If the DNA is tightly wound around them, it is not accessible. If the DNA loosens from the histones, it becomes accessible. How well the DNA can bind to the histones is determined by the structure of the histones, i.e. by the modifications on their surface: methyl groups, acetyl groups or other modifications all influence the binding of the DNA.

ENCODE performed ChIP-Seq experiments with antibodies that target modified histone proteins to identify the DNA regions that are bound to them.

Is the regulatory region at the 5' end of the SGCB transcript bound to methylated H3K4 ?
Click Configure this page in the side menu.
Scroll to the Regulation section in the left menu and click Histones and polymerases. This opens the matrix with proteins (polymerases and modified histone proteins) in the rows and cell types in the columns.
Select H3K4me1 (single methylated), H3K4me2 (double methylated) and H3K4me3 (triple methylated) in H1ESC cells.
Close the configure page.
When you scroll down to the most detailed display, you see that the region is bound by double and triple methylated H3 in embryonic stem cells.

Methylation of the fourth amino acid residue (K) from the N-terminus of histone H3 is one of the most studied histone modifications, and with good reason: it’s tightly associated with the promoters of active genes. Like all lysine residues, H3K4 can be mono, di, or tri methylated. H3K4me3 is the strongest activator of expression.

Transcription start sites

Possible transcription start sites can be predicted using the Eponine tool. The Eponine predictions track is available in Ensembl.

Is there a transcription start site predicted by Eponine annotated for the SGCB transcript?
Click Configure this page in the side menu.
Type eponine in the Find a track text box.
Select TSS (Eponine).
Close the configure page.
When you scroll down to the most detailed display, you see that Eponine indeed predicts a transcription start site at almost exactly the same location where the Ensembl SGCB transcript starts.

Similarly you can also define TSS as the regions that bind to the RNA polymerase. ENCODE performed ChIP-Seq experiments on various cell type samples to identify DNA regions that bind to RNA polymerase.

Does the experimentally determined TSS correspond to the one that was predicted by Eponine ?
Click Configure this page in the side menu.
Scroll to the Regulation section in the left menu and click Histones and polymerases. This opens again a matrix with proteins (polymerases and modified histone proteins) in the rows and cell types in the columns.
You can choose between Pol II and Pol III. In eukaryotes, Pol III transcribes DNA to synthesize ribosomal 5S rRNA, tRNA and other small RNAs, while Pol II catalyzes the transcription of DNA into mRNA, most snRNA and microRNA. We are interested in the generation of mRNA, so we select PolII as a protein and H1ESC as cell type.
Close the configure page.
When you scroll down to the most detailed display, you see that the fragment that binds to PolII according to the ChIP-Seq experiment indeed overlaps with the predicted TSS.

TFBS

Finally, you can also search for binding sites of specific TFs in Ensembl, again based on ENCODE data.

Which TFs have a binding site in the CpG island of the SGCB gene in H1ESC cells ?
Click Configure this page in the side menu.
In the Regulation section click Open chromatin and TFBS.
Set Filter on Transcription Factor. This track contains information from ENCODE ChIP-Seq experiments; as such it specifies the location of TFBS in the genome. You see a matrix with TFs in the rows and cell types in the columns.
Hover your mouse over H1ESC. A box appears in which you can choose to visualize on the track all TFs for which data are available.
Close the configure page.
You can see that Yy1 has a binding site in the CpG island. Yy1 represses transcription by directing histone modifying proteins to promoter regions.

*RNA processing via miRWalk

MiRWalk is a database that contains information on miRNAs from Human, Mouse and Rat. For each miRNA, it contains predicted as well as experimentally validated binding sites in their target genes.

In a mouse study, you find that the expression of TIMP3 influences the development of atherosclerosis. A decreased expression of TIMP3 stimulates inflammation of the arterial endothelium and triggers atherosclerosis.

You have searched for TFBS for TFs that might be the cause of the TIMP3 repression but to no avail. Maybe the expression of TIMP3 is regulated via other mechanisms, like miRNA induced degradation.

For decades, scientists have assumed that miRNAs bind to sequences in the 3'UTR of target genes.

Is mouse TIMP3 a target of a known miRNA in its 3'UTR ?
Go to the MiRWalk website.
Select Predicted Target Module in the top menu.
Select Gene-miRNA targets in the drop down list.
In the Step1 section, change Human to Mouse.
In the Step1 section, change Select database to Gene.
In the Step1 section, change synonym to Official symbol.
In the Step1 section, add TIMP3 to the Paste Identifiers box.
In the Input parameters box of the Step3 section, select the region of the gene to search in; by default this is set to 3'UTR.
In the same location, you can select the minimum length of the seeds used for starting the alignment (similar to what happens in BLAST) and the confidence level you want to work at (default p-value = 0.05: you allow 5% false positive predictions).
In the Other databases box of the Step3 section, choose at least two algorithms to compare their predictions.
Click the SEARCH button.
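
To illustrate what a seed is, the sketch below searches a target sequence for perfect complementarity to the seed of a miRNA, taken here as the nucleotides starting at position 2 of the miRNA. This is a deliberately simplified illustration and not the algorithm that miRWalk or the other prediction tools implement. The function name and both sequences in the usage example are arbitrary and serve only as an illustration.

```python
def seed_sites(mirna, target, seed_len=7):
    """Return 0-based start positions in target that are complementary to the miRNA seed.

    The seed is taken as seed_len nucleotides starting at position 2 of the miRNA.
    DNA input is accepted: T is converted to U first.
    """
    mirna = mirna.upper().replace("T", "U")
    target = target.upper().replace("T", "U")
    seed = mirna[1:1 + seed_len]
    complement = str.maketrans("ACGU", "UGCA")
    site = seed.translate(complement)[::-1]   # what the target must contain, read 5' to 3'
    hits = []
    pos = target.find(site)
    while pos != -1:
        hits.append(pos)
        pos = target.find(site, pos + 1)      # continue after the previous hit, allowing overlaps
    return hits


if __name__ == "__main__":
    mirna = "UAGCAGCACGUAAAUAUUGGCG"            # arbitrary 22-nt example
    target = "AAAAUGCUGCUCCCC"                  # arbitrary target fragment
    print(seed_sites(mirna, target))
```

A longer minimum seed length gives fewer but more specific hits, which is the trade-off behind the seed length setting in the search form.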

The results page contains:

Gene information: a list with links for the gene to retrieve data including genomic location, synonyms, Refseq IDs, homology information, external links, information on protein families, functional annotation...
Putative miRNA-target interaction information: for each algorithm you selected a list with potential miRNA binding sites within the sequences of the regions you searched in. If you searched in a region, you can click its link: Promoter, 5'-UTR, CDS and 3'-UTR.

When you click the results for the 3'UTR, you indeed get a list of miRNAs that could in theory bind to the 3'UTR of TIMP3.

You spend a lot of time checking if these miRNA are the cause of the decreased TIMP3 expression in atherosclerosis but you find no link at all. Quite frustrated you decide to perform a final analysis with an open mind. Although everybody states that miRNAs target 3'UTRs, you decide to scan the CDS for potential targets.

Do you find miRNA targets in the CDS of TIMP3 ?
Go to the MiRWalk website. Select Predicted Target Module in the top menu. Select Gene-miRNA targets in the drop down list. Fill in the form as follows and click SEARCH:

miRWalk allows you to search in any component of a gene: 3'UTR, CDS, 5'UTR and promoter.

You get a list of miRNAs that could in theory bind to the CDS of TIMP3; one of them is mmu-mir-712 (red).

The Gene column contains links to NCBI's Gene database.
The RefSeq ID column contains links to NCBI's RefSeq database.
The miRNA column contains links to miRBase.
Note the last column containing p-values !

You go back to the lab and find that atherosclerosis activates mir-712, which in its turn decreases the expression of TIMP3, thereby worsening the atherosclerosis.

It is important to search miRNA binding sites within the complete sequence (promoter, 5'-UTR, CDS and 3'-UTR) of a gene !

Now we want to know if mmu-mir-712 has other targets. For this we need to do the inverse analysis: start from a miRNA and predict which genes it targets. Fortunately, miRWalk can perform both analyses.

Does mmu-mir-712 target other CDS ?
Go to the MiRWalk website.
Select Predicted Target Module in the top menu.
Select MicroRNA-gene targets in the drop down list.
The miRNA-target search page is organized similarly to the Gene-target search page. You select a species, provide an ID of the miRNA, select the regions you want to search in, set the options, select the algorithms you want to use for the predictions and click SEARCH.

You get a list of CDS that can bind to mmu-mir-712, one of them is Pla2g6 (red).

You search PubMed for a link between this gene and atherosclerosis and bingo: 10 publications contain both the name of this protein and atherosclerosis.

Finding PSMs of TFBS

Transfac

TRANSFAC is a database on eukaryotic regulatory DNA elements and transcription factors. Most of its content is commercial (you have to pay for access), but there is a free public version. It is quite old but Transfac data is considered high quality so it's always worthwhile to take a look. It consists of the following tables:

FACTOR: proteins that regulate transcription by interaction with motifs in the DNA and miRNAs that control translation of mRNA
GENE: genes in which TF-binding sites or ChIP fragments were found and/or genes encoding for TFs or miRNAs
SITE: individual TF binding sites in eukaryotic genes
MATRIX: position score matrices for the TF binding sites
...

The matrices have been obtained by genome wide experiments or they are compiled from individual sites that were obtained by small scale experiments. To get sufficient sites for the compilation, sites from related organisms e.g. vertebrates, plants, insects... are combined. Some matrices were constructed by computational methods. The matrix identifier starts with a letter that indicates one of six groups of biological species:

V$: vertebrates
I$: insects
P$: plants
F$: fungi
N$: nematodes
B$: bacteria

The species identifier is followed by an acronym of the TF and

a number discriminating between matrices for the same TF e.g. V$OCT1_02 is the second matrix for the vertebrate Oct-1 TF
the identifiers of matrices generated by compilation of TRANSFAC SITE entries end with _Q followed by the lowest quality of the sites used to construct the matrix e.g. V$CREB_Q2 is a matrix constructed from CREB binding sites of quality 2 or better (see below for an explanation of the quality scores)
matrices with an identifier that ends with _C are generated by compilation of consensus sequences using ConsIndex (Frech et al., Nucleic Acids Res. 21:1655-1664, 1993)
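
The naming scheme can be decoded with a small helper. This is only a sketch of the three forms described above: the function name parse_matrix_id and the returned dictionary layout are made up for this example, and identifiers that follow other patterns are reported with an origin of "unknown" rather than guessed.

```python
import re

# Species groups as listed in the text.
GROUPS = {"V": "vertebrates", "I": "insects", "P": "plants",
          "F": "fungi", "N": "nematodes", "B": "bacteria"}


def parse_matrix_id(identifier):
    """Split a TRANSFAC matrix identifier into group, factor acronym and suffix."""
    match = re.match(r"^([VIPFNB])\$(.+)_([^_]+)$", identifier)
    if match is None:
        raise ValueError(f"not a recognised matrix identifier: {identifier}")
    group, factor, suffix = match.groups()
    if suffix.isdigit():
        origin = "numbered matrix, check the publication it comes from"
    elif re.fullmatch(r"Q[1-6]", suffix):
        origin = f"compiled from SITE entries of quality {suffix[1]} or better"
    elif suffix == "C":
        origin = "compiled from consensus sequences"
    else:
        origin = "unknown"
    return {"group": GROUPS[group], "factor": factor,
            "suffix": suffix, "origin": origin}


if __name__ == "__main__":
    for matrix_id in ("V$OCT1_02", "V$CREB_Q2"):
        print(matrix_id, parse_matrix_id(matrix_id))
```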

For the matrices with _01... (the ones that are not generated by compilation) you should always check the corresponding PubMed record (you'll find the link in the RX field of the TRANSFAC record) to see where it comes from. Entirely computational methods or SELEX experiments are not considered as trustworthy as ChIP experiments (although the quality of the ChIP strongly depends on the specificity of the antibody that was used).

The quality scores range from 1 to 6 and reflect the experimental reliability of a site. The scores have the following meaning:

1: functionally confirmed factor binding site
2: binding of pure protein (purified or recombinant)
3: immunologically characterized binding activity of a cellular extract
4: binding activity characterized via a known binding sequence
5: binding of uncharacterized extract protein to a bona fide element
6: no quality assigned

Searching the TRANSFAC public database 7.0 is relatively simple and consists of the following steps:

Create an account or if you already have one, log in
Go to the Databases page
Select the TRANSFAC database 7.0 - public database to search in
Select the table to search in
Provide a term to search for
Select the field (like we saw in Genbank records) to search in

JASPAR

How to retrieve the PWM of a motif in JASPAR ?
Go to JASPAR.
Click the database you want to search in, e.g. JASPAR CORE Vertebrata for human.
Search for the TF.
Click the logo of the TF.
Copy and paste the matrix in Notepad.
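
The matrix you copy from JASPAR is a table of counts: how often each base was observed at each position of the binding site. Scanning tools turn such counts into log-odds scores. The sketch below shows one common way to do this, with a pseudocount and a uniform background. The count matrix is a toy example and not a real JASPAR profile, the function names are made up, and real tools may use other pseudocounts or backgrounds.

```python
import math

BASES = "ACGT"


def counts_to_pwm(counts, background=None, pseudocount=1.0):
    """Convert per-position base counts into log2-odds scores.

    counts maps each base to a list with one count per motif position.
    The pseudocount is distributed over the bases according to the background.
    """
    background = background or {b: 0.25 for b in BASES}
    length = len(counts["A"])
    pwm = []
    for i in range(length):
        total = sum(counts[b][i] for b in BASES) + pseudocount
        column = {}
        for b in BASES:
            freq = (counts[b][i] + pseudocount * background[b]) / total
            column[b] = math.log2(freq / background[b])
        pwm.append(column)
    return pwm


def score(pwm, site):
    """Sum the log-odds of the bases of a site that has the same length as the motif."""
    return sum(column[base] for column, base in zip(pwm, site.upper()))


if __name__ == "__main__":
    # Toy counts for a 3-bp motif that strongly prefers "TAT".
    toy_counts = {"A": [1, 18, 0], "C": [0, 1, 1], "G": [1, 0, 0], "T": [18, 1, 19]}
    pwm = counts_to_pwm(toy_counts)
    print(round(score(pwm, "TAT"), 2), round(score(pwm, "GCG"), 2))
```

A site that resembles the motif gets a positive score, a site that does not gets a negative one.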

Find regulatory motifs in promoters

RSAT

For motif searches you can use RSAT, the Regulatory Sequences Analysis Tool developed and maintained at ULB.

Retrieving promoter sequences in RSAT

How to retrieve the promoter sequences of genes based on EnsemblIDs.
In the left menu select the Retrieve EnsEMBL seq tool (under Sequence tools). RSAT has a manual and a tutorial for each tool: you find them at the bottom of the page, next to the Go and the Demo buttons. Next you need to set the parameters of the search:
In the Query organism menu select the organism you want to search in.
In the text area of the Gene, transcript or protein IDs section paste the Ensembl ID(s), or upload a file containing the Ensembl IDs of the genes whose promoters you want to retrieve.
Ensembl genes often have many alternative transcripts, so you will get multiple promoter sequences (one for each transcript) when you use Ensembl Gene IDs as input. If the promoters of the transcripts partly overlap, a portion of the retrieved sequences is redundant, which is not suitable for some analyses, like motif discovery. If you want to avoid redundant sequences, check Avoid redundant sequences due to alternative transcripts in the Type of sequence to retrieve settings.
The Sequence position box of the Options for upstream or downstream sequence settings allows you to choose an upstream or a downstream sequence. In most cases motifs are searched in upstream sequences, but motifs have also been known to occur in downstream intergenic regions.
The From box defines the length of the promoter.
The Relative to feature options define the end position of the promoter (position 0): when you select CDS you will include the UTR in the promoter sequence, when you select mRNA you will not.
The Prevent overlap with neighbouring gene option will cut the promoter at the start of the next gene if the distance to this neighbouring gene is smaller than the length specified in the From box. If you don't mind retrieving a sequence belonging to a neighbouring gene as long as it is not coding, use the option Prevent overlap with neighbouring ORF instead.
When you have set the parameters click the GO button.

When the promoter sequences are retrieved you can copy and paste them to Notepad.
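
The effect of the From box, the strand and the Prevent overlap option on the retrieved coordinates can be sketched as follows. This is an illustration of the idea only: the function name, the 1-based inclusive coordinates and the numbers in the usage example are assumptions of this sketch, and RSAT's own coordinate conventions may differ.

```python
def upstream_interval(origin, strand, length, neighbour_boundary=None):
    """Genomic interval (start, end) of the `length` bases upstream of `origin`.

    origin: position 0 of the promoter (the TSS when the sequence is taken
            relative to the mRNA, the start codon when relative to the CDS).
    strand: +1 or -1.
    neighbour_boundary: closest coordinate of the upstream neighbouring gene,
            or None to allow overlap with it.
    """
    if strand == 1:
        start, end = origin - length, origin - 1
        if neighbour_boundary is not None:
            start = max(start, neighbour_boundary + 1)   # cut at the neighbour
    elif strand == -1:
        start, end = origin + 1, origin + length         # upstream is to the right
        if neighbour_boundary is not None:
            end = min(end, neighbour_boundary - 1)
    else:
        raise ValueError("strand must be +1 or -1")
    return start, end


if __name__ == "__main__":
    print(upstream_interval(1000, 1, 500))                          # forward strand
    print(upstream_interval(1000, -1, 500))                         # reverse strand
    print(upstream_interval(1000, 1, 500, neighbour_boundary=800))  # truncated promoter
```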

Searching for known motifs in RSAT

How to do a quick matrix scan in RSAT ?
In the left menu select the matrix-scan (quick) tool under Pattern matching. You need to specify the following parameters:
In the Sequences section paste the promoter sequence (in fasta format) or upload a fasta file containing promoter sequences.
In the Matrix section paste the PSM in the Matrix or matrices box. Note that you also have to specify the Format of the PSM (which database does it come from?).
In the Background section specify the background. Each organism has its own preferences for nucleotide usage: in some organisms the motif TATAAT might be rare because the genome is GC rich or because the combination TAT does not occur often, in other organisms TATAAT occurs everywhere. To take this into account you define the background: the tool calculates which nucleotide combinations occur frequently and which are rare in the genome that you're studying, and uses this as extra information when it calculates scores for the motif occurrences. RSAT uses Markov models as background models. The Markov order specifies how detailed the background is: if you set it to 0 the tool simply calculates the frequency of each nucleotide and uses this to adjust the scores. If you set it to 1 the tool calculates the frequencies of all possible dimers (AA, AT, AC...) and uses these to adjust the scores. If you set it to 2 the tool calculates the frequencies of all possible trimers... There are more trimers than dimers, so the higher you set the order, the slower the scan becomes, but the more accurate it is. The tool can calculate these frequencies from your input sequences, but it also contains pre-calibrated background models for all supported organisms. If you only have one or a few input sequences go for the latter and select the organism you work in.
In the Scanning options section specify the output. You should always ask to return p-values, since this allows you to set a threshold on the p-value, which is more intuitive than setting a threshold on a weight score. Set a stringent threshold like in the example you see below to reduce the number of false positives.
The Sequence origin allows you to define the start position of the analysis (position 0). If you choose end, the A in the ATG is considered position 0 (if you chose promoters upstream of the ATG) and all motif positions will be expressed relative to this A (so all negative numbers). If you choose start, the first base of the promoter is considered position 0 and all motif positions will be expressed as positive numbers relative to the start of the promoter. The default is the former option, but if you want to visualize the results in IGV you have to go for the latter option since IGV cannot interpret negative positions.
Note that the results table starts with an overview of the input sequences. This part of the table does not contain any motif results. In the actual results of the motif search, you can see:
seq_id: the ID of the promoter sequence in which a motif was found
ft_type: the results table starts with an overview of the input sequences (limit) followed by a list of occurrences of the motif (site)
ft_name: name of the motif that was found
strand: the matrix scan tool searches both strands of the input sequences
start: start position of the motif in the promoter sequences relative to the sequence origin. The default origin is the end of the sequence, hence the negative numbers here.
end: end position of the motif in the promoter sequences relative to the sequence origin.
sequence: sequence of the motif that was found in the promoter
p-value: expresses the similarity of the found motif to the motif matrix taking into account the background model
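
Two of the settings above are easy to make concrete. The sketch below first estimates a background of a given Markov order from a set of sequences by counting words of length order + 1 (single nucleotides for order 0, dimers for order 1, and so on), and then converts a position between the two sequence origins. It is an illustration of the idea and not RSAT's implementation: the function names are made up, and the 1-based numbering from the start of the sequence is an assumption of this sketch.

```python
from collections import Counter


def kmer_frequencies(sequences, order):
    """Relative frequencies of all words of length order + 1 found in the sequences."""
    k = order + 1
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def relative_to_end(position, seq_length):
    """Convert a 1-based position counted from the start into a negative
    position relative to the first base after the sequence (origin = end)."""
    return position - seq_length - 1


def relative_to_start(position, seq_length):
    """Inverse conversion: from an end-relative position back to a 1-based position."""
    return position + seq_length + 1


if __name__ == "__main__":
    promoters = ["AACGTTAT", "TATAATGC"]        # toy sequences
    print(kmer_frequencies(promoters, 0))       # order 0: single nucleotides
    print(kmer_frequencies(promoters, 1))       # order 1: dimers
    print(relative_to_end(1, 500), relative_to_end(500, 500))
```

With only a few short input sequences the dimer and trimer frequencies are estimated from very few observations, which is why the pre-calibrated organism-wide background is the safer choice in that situation.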

Phylogenetic footprinting with Contra

Go to the Contra website. Contra predicts known motifs in a promoter based on phylogenetic footprinting. It not only uses similarity to the PSM of the motif but also the amount of conservation at the site to make its predictions. To this end it uses precomputed alignments from UCSC (the American counterpart of Ensembl).

Contra has two modes:

Visualisation searches for specific (conserved) transcription factor binding sites in your gene of interest
Exploration is used when you have no idea which transcription factors regulate your gene of interest. ConTra will show a list of all TFs ranked by their binding probability to the promoter. This binding probability is determined by a score that takes into account the number of predicted binding sites for that TF, and the phylogenetic depth of each predicted site (~ defined as the number of other species in the alignment that have a predicted binding site for the same TF in a window of 200% site length on each side of the site).

How to search for a specific TFBS in a promoter with Contra ?
Using Contra is quite straightforward:
Specify the organism you work in: Contra will consult the corresponding alignments from UCSC.
Specify the gene you're interested in.
If the gene encodes several transcripts, select the transcript to analyze. For every transcript the position of the transcription start site (TSS), the number of introns and the identifier (RefSeq or Ensembl) are shown.
Specify the region you want to analyze. In Contra promoters are defined upstream of the TSS. You can also choose to analyze the 5'UTR, 3'UTR or any intron. The promoter size upstream of the TSS can be specified. For UTRs and introns the entire region is used.
Select the stringency (red box on figure): the different stringencies balance sensitivity and accuracy. With a core match = 0.85 and a matrix match = 0.70 the detection will be highly sensitive but less accurate, and some false positives may be included. A core match = 1.00 and a matrix match = 0.95 will be very accurate but less sensitive, and may not show some true binding sites.
Select PSMs: ConTra uses PSMs from JASPAR, phyloFACTS, TRANSFAC and a Protein Binding Microarray (PBM) derived collection of homeodomain TF PSMs. First search for the PSM using a keyword (green box on figure). Then select the PSMs you want to use by switching the Add to analyze button (blue box on figure).
Click Run to start the analysis.

The results of Contra are easy to interpret. The UCSC alignments are divided in alignment blocks. For every block a figure is shown. If TFBS were detected these are highlighted. For every block there is a link to a separate result page.

Using biophysical properties of the DNA with Physbinder

Go to the Physbinder website. The algorithm uses both similarity with the PSM and biophysical properties such as the bendability of the DNA to identify transcription factor binding sites.

Using the tool consists of two steps:

Submit the sequence(s)
Choose from more than 60 different TF binding models

The results are shown in an intuitive visualization.

MotifLab

Protocols in Motiflab

The first thing you need to do is create a new protocol (= analysis). I want to know which TFs might be responsible for the observed upregulation of these genes in response to stress.

How to create a new protocol ?
Click the New Protocol button in the top menu

If you want to repeat your analysis later, you can document everything you do now via the record function.

How to record everything you do ?
Click the Record button in the top menu. This will record your actions so it becomes easy to repeat the analysis on the same or other data.

Import sequences to search in

Now you need to import the sequences of the promoters you want to analyze.

How to import sequences ?
You can import promoter sequences by just providing Gene IDs or Ensembl IDs. MotifLab is coupled to Ensembl and will fetch the appropriate promoter sequences automatically.
Click the Add Sequences button in the top menu.
Paste the Ensembl IDs or import the file containing the IDs (red). Note that you can choose the exact sequence of the promoters relative to the transcription start or end site (green).
Click OK.
In the Features section of the left menu you now see an icon representing your set of promoter sequences (red). Clicking the Visualization tab (green) will visualize the sequences.

Define motifs to search for

Next you need one or multiple motifs. Jaspar and the public part of Transfac are by default in MotifLab but you can add your own or other public motif collections.

How to import motifs ?
In the Motifs section of the left menu click the + sign and select MotifCollection. As stated previously, MotifLab has a number of predefined motif collections. Choose Jaspar Core and click OK. MotifLab has now loaded all Jaspar Core motifs, as you see in the Motifs section of the left menu. We only want to use motifs from human. To do this:
Deselect Jaspar Core in the left menu (red).
Add a new Motif Collection.
Open the Manual Selection tab (green).
In the Filter section set Organisms = human and press Enter (blue).
Select all filtered motifs (purple).
MotifLab creates a new motif collection called MotifCollection2 with only the 79 human motifs. Deselect Jaspar Core, deselect MotifCollection2 and select it again. Check that all motifs from MotifCollection2 are selected.

Scan for motifs in the sequences

How to scan the sequences for potential matches to the motifs ?
Right click the DNA icon in the Features section.
Select Perform Operation, Motif, motifScanning. A window appears in which you can set the parameters of the search:
The Source is the collection of sequences you want to search in: in our case DNA.
You have to provide the name of the track you want to store the results in.
The Method specifies the algorithm you use for the searches. MotifLab has a number of built-in algorithms but you can also add tools via Configure -> External tools. We will use the SimpleScanner.
The MotifCollection defines the set of motifs we want to search for, in our case MotifCollection2, the set of human motifs from Jaspar (red).
The Percentage defines the minimum percentage similarity between the motif and the sequence. Set it to 90 (green).
MotifLab scans the sequences for motifs and displays them as a separate track under the sequences. When you hover your mouse over them you get extra information.
If you want to arrange the overlapping motifs neatly under each other, right click the binding sites track and select Expand track.

Mask repeat regions

DNA sequences contain repeat regions that might interfere with the motif searches. It is recommended to remove repeat regions from your sequences. MotifLab allows you to do so.

Motif searches always generate an enormous amount of false positives (regions in the DNA sequence that are similar to the motif but do not actually bind to the TF). You can reduce the number of false positives by focussing on regions in the promoter that are conserved between species. This is called phylogenetic footprinting. It is assumed that regions that are functionally important (like TFBS) do not change during evolution because if they change, their function would be lost.

MotifLab allows you to add feature annotation tracks to display and use annotation of the sequences, like the presence of conserved regions, repeats, open chromatin regions...

We will use the following tracks: Conservation and RepeatMasker

How to add these feature annotation tracks?
Click the Add Feature Data Track button in the top menu. This opens the Datatracks dialog. Here you can select the tracks you want to visualize. Tracks with red buttons are not available for the organism you are working on.
Select the Conservation and RepeatMasker tracks by clicking them. You can select both of them simultaneously by holding the Ctrl key.
The annotation is now visualized.

Phylogenetic footprinting

Now you can use the annotation to remove motifs that are located in repeat regions or in regions with low conservation.

How to filter motifs in repeat regions or regions with low conservation?
Right click the binding sites track and select Perform Operation -> Transform -> Filter. This opens the Filter dialog.
Set the condition used for the filtering as follows. This will remove motifs in regions with low conservation.
Click the + sign to add a second condition.
Right click the AND and select Add New Condition to Group.
Set the second condition as follows.
Right click the AND and select Change Operator to OR to remove motifs that are in low conserved regions OR that are located in repeats.
You see that this greatly reduces the number of motifs that are shown in the track. The only motifs that are retained are those in conserved regions and outside repeat regions.
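
The logic of this filter (remove a site when it lies in a poorly conserved region OR overlaps a repeat) can be sketched in a few lines. This is not MotifLab code: the function names, the 0-based inclusive coordinates, the use of the mean conservation over the site and the cut-off of 0.5 are all assumptions made for illustration.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def keep_site(site, conservation, repeats, min_conservation=0.5):
    """Decide whether a predicted binding site survives the filter.

    site: (start, end), 0-based inclusive positions in the promoter.
    conservation: list with one conservation score per base of the promoter.
    repeats: list of (start, end) repeat intervals.
    """
    start, end = site
    window = conservation[start:end + 1]
    low_conservation = sum(window) / len(window) < min_conservation
    in_repeat = any(overlaps(start, end, r_start, r_end)
                    for r_start, r_end in repeats)
    # Remove when low conservation OR in a repeat, keep otherwise.
    return not (low_conservation or in_repeat)


if __name__ == "__main__":
    conservation = [0.9] * 10 + [0.1] * 10      # toy conservation profile
    repeats = [(0, 3)]                          # toy repeat annotation
    sites = [(1, 4), (5, 8), (12, 15)]
    print([s for s in sites if keep_site(s, conservation, repeats)])
```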

You can do the same with other annotations e.g. remove motifs in closed chromatin regions...

Add tools to MotifLab

For a real analysis, however, you should use a better algorithm than the SimpleScanner. You can for instance download and add Clover.

Add Clover to MotifLab
Expand Configure in the top menu and select External programs.
Click Add program from Repository.
Select Clover and click Install.
Select Download and install and select Windows.

Now that Clover is installed, you can try it out on the data !

Finding enriched TF binding motifs in a list of genes via Pscan

Use Pscan to identify enriched TFBS in a list of promoters.

How to perform a Pscan search on a list of promoters ?
Paste the IDs of your target genes in the Insert Gene/Sequence ID list box. Pscan accepts the following gene or transcript identifiers: RefSeq (e.g. NM_000546) for human, mouse and Drosophila; TAIR (e.g. AT1G08810) for Arabidopsis; SGD (e.g. YPL248C) for yeast.
Specify the source organism and the region you want to analyze (relative to the annotated TSS).
Click Run! to start the search.

The output shows the ranking of the TFs selected according to the enrichment p-values of their PWM. By clicking on a matrix name, you can open a dedicated page showing detailed results: matrix, logo, and links to the database entry as well as to the ID (PMID) of the PubMed entry describing the generation of the PWM...

You also get a simple graphic representation showing the average matching value of the matrix in the sequences in your list of targets (top) compared to the average matching value and standard deviation (in green) on a set of all promoters (same size of regions with respect to the TSS as selected) of the same organism (bottom).

Sample statistics shows the statistics of the enrichment analysis: the p-value, the Bonferroni corrected p-value, mean and standard deviation of the matching score on the list of targets, and number of sequences in the list.

The p-value should be read as: "If we take as many random promoters from the same organism as in the input set, what is the probability of obtaining a score at least as high as the one obtained in the input set?"
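
The statistics listed above (mean score in your list, mean and standard deviation over all promoters, number of sequences) are exactly what a z-test needs. The sketch below assumes that the enrichment p-value is a one-sided z-test and that the Bonferroni correction multiplies the p-value by the number of matrices tested. Both are assumptions of this sketch, the function names are made up, and the numbers in the usage example are invented.

```python
import math


def enrichment_p_value(sample_mean, background_mean, background_sd, n_sequences):
    """One-sided z-test: is the mean score in the sample higher than expected?"""
    standard_error = background_sd / math.sqrt(n_sequences)
    z = (sample_mean - background_mean) / standard_error
    p = 0.5 * math.erfc(z / math.sqrt(2))     # upper tail of the standard normal
    return z, p


def bonferroni(p_value, n_tests):
    """Bonferroni correction: multiply by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)


if __name__ == "__main__":
    # Invented example: 100 promoters, 200 matrices tested.
    z, p = enrichment_p_value(0.85, 0.80, 0.10, 100)
    print(z, p, bonferroni(p, 200))
```

Because the standard error shrinks with the number of sequences, the same difference in mean score is more significant in a long gene list than in a short one.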

Finding novel motifs

In the exercises we searched for known TF binding motifs but often you do not know the motif that regulates the expression of your genes of interest and you have to search for unknown motifs. There are several tools to solve the complex problem of de novo identification of motifs that are enriched in a list of sequences. The wikipedia page has a very nice overview on these tools (see [1]).

What you need as an input is a list of promoters of which you are more or less certain that they all bind to the same TF, you just don't know which TF. If you have such promoter sequences you can search for common motifs that occur in all promoters.

Besides the very popular MEME suite, you can also use Info-Gibbs, one of the RSAT tools, for this, although MEME often generates superior results. MEME is a suite of tools for various applications: finding novel motifs, finding (enriched) known motifs, comparing motifs...

Finding novel motifs with MEME

MEME is the default tool for finding novel motifs in a set of sequences. The tool looks for short sequences that are enriched in the list of input sequences. Enriched means they occur more than you would expect by chance. The suite contains variations on the MEME tool for specific applications: finding short or gapped motifs, finding motifs in large data sets...

How to identify novel motifs with MEME ?
Choose the mode of action: choose normal to discover motifs enriched in one data set (you upload one set of sequences) and choose discriminative to discover motifs enriched in one data set relative to a second (control) data set (you upload two data sets).
Specify whether you have biological sequences (DNA, RNA or protein) or something else.
Upload the set of input sequences (called primary sequences in MEME).
Specify the expected frequency of the motifs: do you expect the motifs to occur just once or less in each sequence (e.g. you search for a specific signal: splice signal, polyA signal...) or multiple times (e.g. most TFBS)?
Specify how many motifs MEME should return: always take more than 1.

ChIP-Seq analysis

You can find a tutorial of the first steps in the ChIP-Seq analysis workflow:

Obtaining public ChIP-Seq data
Quality control
Mapping

on the ChIP-Seq analysis page. The ChIP-Seq analysis workflow was created by Morgane Thomas-Chollier.