RNA-Seq analysis for differential expression
- 1 Experimental design
- 2 Investigating and preprocessing raw RNA-seq data
- 2.1 Galaxy
- 2.2 Store your data
- 2.3 Check the quality of the data and process it
- 2.4 The steps to perform
- 2.4.1 Get your data in a Galaxy history
- 2.4.2 Run Groomer on your data to check the format of the fastq files
- 2.4.3 Run FastQC on your data and examine the results
- 2.4.4 Fastq Quality trimmer to remove low quality regions
- 2.4.5 Fastqmcf to remove adaptors
- 2.4.6 Get sequences of possible contaminating sources
- 2.4.7 Filter out reads that match to contaminating sequences
- 2.4.8 Summarize the matched contaminating sequences
- 2.4.9 Subtract the sequences matching to contaminant databases with 'Filter sequences by ID'
- 2.4.10 Run FastQC on the cleaned reads
- 2.4.11 Create a workflow of all preprocessing steps
- 2.4.12 Gather the cleaned reads in one history
- 2.5 Questions
- 3 Mapping processed data
- 4 Extracting counts and investigating experimental factors
- 5 Detecting differential expression from a count table
- 6 Exploring the biology behind observed changes
As we have seen during the first part of the lectures (see the presentation slides and background on ), experimental design is key in RNA-seq for differential expression (DE) analysis.
However, designing an experiment is somewhat a circular problem. To design an experiment, you need already to have data, which is extrapolated in order to estimate how good your experimental design will perform. You can generate pilot data for this, or use data from similar experiments (same species, same tissues, same protocol,...). Many RNA-seq experiments run with the minimum number of samples required, without doing experimental design. However, it is always instructive to send your final data to Scotty, to be able to value your experimental design.
There are 2 good online resources to help you with designing experiments:
- Explanation on experimental design of University of Oregon: http://rnaseq.uoregon.edu/exp_design.html
- Scotty, an online tool to provide you a decision matrix for your experimental design.
We will explore the use of Scotty. We have pilot data for an experiment in Drosophila, with 2 conditions (control and treatment). We have already sequenced each condition twice, to an average of 16M reads.
First, check the contents of the file, by opening it on your computer.
The first column is the gene identifier. Next we have 4 columns, 'Control_1' and 'Test_1', and 'Control_2' and 'Test_2'. Scotty appears to be a strict on the format. So if you upload your own data, you might reformat the column names and their order to let Scotty accept the data.
Scotty can be accessed at http://euler.bc.edu/marthlab/scotty/scotty.php. First upload your data, as shown in the screenshot below. Click choose file and upload your data.
Scotty includes also cost in their analysis. We assume that for lab work, we consume €300 per sample, irrespective of control or treatment. We are going for 50 bp reads single-end. You can find cost estimation for sequencing on slide 63 in the presentation Part 1 of the training. Recalculate from these numbers how much euros 1M of reads cost.
Next, find out what Scotty means by alignment rate. For the alignment rate number, you can base your estimate on on slide 67 in the presentation Part 1 of the training.
For the remaining options:
- we want to detect at least 80% of expressed genes that are differentially expressed by a 2 fold change at p < 0.01
- we have a budget of €12000.
- Limit measurement bias by measuring at least 80% of genes with at least 50% of maximum power
- leave the other options default.
Some explanation on 'measurement bias' (i.e. the very last option on the page). Due to the count nature, genes with high counts can be detected with greater power than genes with low counts. If we will analyse for biology later on, based on lists of significantly increased gene activity, we need to make sure that we don't find groups of genes enriched (example, genes involved in RNA processing), who just happen to have very high counts. This can be because the statistical test will more easily tell whether these high numbers differ significantly than do lower counts. Please refer to the Scotty documentation if you want to find out more about this bias and maximum power.
Another way to think about it: if we are able (with some kind of future absolute precise quantification method) to measure all mRNA coming from one gene in a cell (imagine, 245601 transcripts), and compare it to the number of all RNA coming from the same gene of another cell (say, 201548 transcripts), we don't need a statistical test! We can just see which one of the 2 cells have most RNA for that gene. Whether this difference is biological, is another matter.
So, in Scotty, we can set a condition (very last option in the interface) that at least 50% of the genes need to have counts, such that the power to detect differences is more than 50% of the maximum achievable power.
Don't worry if you don't understand: Scotty will produce sensible outcome anyhow. If you want to understand, grab a statistics handbook, and think about power, and what it means to test for a difference.
Click to run Scotty. Be patient, it takes about 5 minutes for the results to be calculated.
Interpreting Scotty output
Please use your own Scotty results to answer following questions. (If you do not have results from your own, you might download these results (pdf) to work on).
|Look at the summary of findings. What is the estimated dispersion? What does these numbers mean?|
|Control samples replicate dispersion: 0.086449
Test samples replicate dispersion: 0.084936
|The dispersion is a measure for the additional variance observed, typically between biological replicates in RNA-seq count data.|
Look at the graph under 'Excluded experimental configurations'. This matrix shows the summary of Scotty's analysis. Together with the following graph, this provides rich information.
|If you want to spend all your money, which design would you pick?|
|8 replicates per condition, sequenced each to 43M reads deep. Note: the graph says 30M 'reads aligned': we expect 70% alignment, so 30/0.75 gives 43M.|
|Is there a way to reach our specifications, with only sequencing 3 replicates for each condition.|
|Yes. For this you should have the replicates sequenced to 143M reads.|
|If you want all genes (100%) to be above the 50% of the maximum power, which experimental design would you pick? How much would it cost?|
|Check the 'measurement biases by experimental configuration', and match it with 'cost by experimental configuration'. Nearly all genes are above 50% of the maximal power at 7 replicates sequenced to 70M deep.|
|This will cost between 20 to 25000 euros.|
2 more count tables are available on Galaxy as input for Scotty
You can find more count tables in the Shared Data of Galaxy. Log in and go to 'shared data' --> 'Galaxy RNA-seq for DE workshop 2014' --> '1. Experimental design'.
It might be that the count tables have to be reformatted to be accepted by Scotty (mainly the header line). The reformatting can be done in Galaxy, or after downloading and editing by hand.
Investigating and preprocessing raw RNA-seq data
The exercises that follow need to be performed in the BITS Galaxy. Galaxy is an online tool to run bioinformatics analyses, and store and share them in convenient ways. You should have received an account; if not, please contact the instructor. The BITS Galaxy has been modified to include tools you won't find on other Galaxies, so please use this Galaxy.
If you want to repeat the analysis that we will do in this training on your own data, you can import the tools from the BITS Galaxy into your own Galaxy.
Store your data
The first step of any RNA-seq experiment is to store the data you have received in a sensible way. Galaxy can help you structure your data: put it in a data library. Do not forget to add enough information, called meta-data in the statistical jargon.
A typical RNA-seq experiment generates between 10 and 30 files containing reads in .fastq format, each reaching 1 to 20 GB in size. The names of the files you receive contain information about the lane used on the machine, the multiplex index used and the whether the files are part of a paired-end run or not. See below for an example.
For you to exercise, I have stored 5 real life datasets in the Galaxy RNA-seq for DE workshop 2014 data library , under '2. Investigating raw RNA-seq data'.
This data is for you to use! Be aware: please, do NOT use complete datasets during the training. The server will crash. Use the sample sets available for each experiment. Feel free to look at the complete data at home, after the training.
We will during the training mainly focus on the Candida - SE - 4 conditions experiment.
Check the quality of the data and process it
Examine the datasets, preprocess them, and check again the improvement in quality at the end. The preprocessed reads will be used to be mapped in the next section.
The steps to perform
All tools for this section are available under NGS:QC and manipulation. For single end data sets (such as the one from Candida), follow these steps below.
For paired end datasets, you have to make sure that the preprocessing does not interrupt the link between 2 reads from one fragment. This can happen when one read is trimmed and removed because it is too short, while the other read is not trimmed, and kept in the library. Some QC tools are PE aware, others are very basic and do not care.
Get your data in a Galaxy history
Select the sample dataset of your choice from the data library and choose at the bottom 'Import to current history'. Click Go.
Run Groomer on your data to check the format of the fastq files
New data needs to be checked consistency: the tool Groomer takes care of this. Warning: can take a long time! For this session, all data has already been put through Groomer. Only use groomer on sample datasets during this session.
Run FastQC on your data and examine the results
After grooming, have a look at some QC metrics with FastQC.
It might be a good idea to create a workflow to let FastQC run in parallel on your data. See Creating a workflow in Galaxy.
Examen the results carefully! In this step you should detect strange deviations. Depending on what you observe, more or less removal or trimming of sequences is needed.
Fastq Quality trimmer to remove low quality regions
If your reads contain a lot of low quality regions, this tool can help you remove them.
Be aware that your sequences will not be uniform in length anymore. Some people do not like this, and want to chop off a fixed portion of each read in order to maintain the uniformity in length.
Fastqmcf to remove adaptors
In case you have detected adaptors in your data set, fetch the sequences of the adaptors in fasta format. You can search them online, or even assemble them from the FastQC report.
In case you want to align to a reference genome sequences (as in our case), and you have detected a clear presence of polyA - or polyT! - tails, those homopolymer stretches need to be filtered out. One way to do so (although not supported by the author of Fastq-mcf tool) is to provide some polyA and polyT stretches as 'pseudo-adapters'.
We have provided some adapter sequences in the data library.
Alternative tool for adaptor removal: clip
Get sequences of possible contaminating sources
Fetch the fasta files of the:
- non-coding RNA sequences
- mtDNA sequences
- phiX genome sequences
You can fetch these sequences by using Get Data and BioMart Central Server. At the end of the BioMart query, the results are automatically uploaded to Galaxy.
You can find how to fetch data from biomart into Galaxy on this page in the wiki: .
Optionally: these contaminating sequences are available already in the data library. However, we encourage you to get these sequences through Biomart into Galaxy yourself.
Filter out reads that match to contaminating sequences
Use Bowtie2 to match reads to contaminating sequences. Do not forget to indicate 'Write unaligned reads to separate file(s)'! We will continue with the reads that do not match the contaminating sequences.
PRO TIP It is a good idea to let Bowtie2 run in parallel, to match to rRNA, mtDNA and PhiX, and not serial (e.g. first rRNA, then mtDNA, then PhiX). Serial working leads to too much loss of information. Instead, we can summarize in a next step in Galaxy, for each read whether is has matched one of the three libraries. Next, we continue by extracting the reads that did not match any of the three libraries. I hope this makes sense.
Summarize the matched contaminating sequences
If the complete read set is matched to each of the contaminating sequence set, I want you to summarize the Bowtie2 results as follows:
HWI-ST571:278:C1AK2ACXX:5:1101:7020:36938 . Ca19-mtDNA CaalfMr16 HWI-ST571:278:C1AK2ACXX:5:1101:7027:14017 . . ITS2 HWI-ST571:278:C1AK2ACXX:5:1101:7034:70380 . . RDN25 HWI-ST571:278:C1AK2ACXX:5:1101:7050:64366 phix . .
This is easily done! First, we will convert the output of Bowtie2 from BAM to SAM for all results.
Next, filter all SAM files with the tool Filter SAM and select the option 'the read is unmapped':'no' to filter out the SAM lines of mapped reads.
Next, perform a 'Column join' on the filtered SAM files, using the first column (c1) as a hinge, and outputting the third column (c3).
We end with the required output. Easy, no?!
Subtract the sequences matching to contaminant databases with 'Filter sequences by ID'
Now, we have a list of contaminating reads. We filter out those reads from our initial complete list of reads. Use 'Filter sequences by ID' .
Run FastQC on the cleaned reads
The cleaning has been performed. Check your work!
Create a workflow of all preprocessing steps
Now, this was a lot of clicking and fizzling for just one reads file! How to do this for 20 of such files?
Make a workflow of the steps you performed on this sample fastq dataset. Next, you can run this workflow on all read sets. so they all get treated in the same way.
Gather the cleaned reads in one history
Use the function 'Copy datasets' to copy the cleaned reads to one history.
|From the Candida experiment, take the read sample set, and process it. How many reads that match mtDNA did you filter out ?|
|After removal of polyA and polyT tails, map the resulting read set to mtDNA with Bowtie2.|
|You get a BAM file as output.|
|Convert the BAM to SAM with the bam-to-sam tool. Now the output is in text.|
|Next, we filter the SAM to include only the mapped reads. SAM contains usually all reads, including those not mapped. However, this is dependent on the mapper used. Use the tool Filter SAM and set the options as shown below.|
|From the Filter SAM output, we see that 18 reads map to the mtDNA.|
|How many reads of the sample set match non-coding RNA?|
|Following the same approach of above, we find 6,183 reads matching to non-coding RNA.|
|Check whether filtering improved the quality?|
|After aligning the reads to non-coding RNA, mtDNA and PhiX DNA sequence, we get 3 SAM files.|
|We filter them for all mapped reads|
|Use Column join to get all mapped reads in one file.|
|Use the first column of this resulting column join with the tool Filter sequences by ID to filter out those reads from the original read file.|
|Now we can run FastQC on the cleaned reads.|
|To easily compare different FastQC outputs, use the scratch functionality of Galaxy. Press the raster button at the top|
|For this good quality data set, it is not clear from FastQC reports that filtering improved results. We see however that the reads are not anymore uniform in length.|
- Take a subsample of one of the samples the other experiments and run QC. Rank the experiment by quality? Which is the best? Which has the worst?
- ADVANCED: The paired-end reads of the SpeciesX data library are 'interlaced' into one file. Can you de-interlace them?
Mapping processed data
After cleaning the reads to a reasonable level, the reads need to be mapped with a gap-aware aligner, such as Tophat2 or Rna-STAR.
On Galaxy  we have provided some preprocessed read datasets, which you can use for experimenting with mapping. Alternatively, you can continue with your own cleaned reads from the previous section. In the data library, you can find additional required files, such as annotation files. Go check them out!
Confirmed datasets: Candida, Mouse
We will align to a reference genome. We need the genome for this, the cleaned reads, and a genome annotation file.
In case the reference genome of your interest is not available as a built in Galaxy, you can put a reference genome in your history (fasta) and align against that reference genome (just as we have done with Bowtie2).
Tophat2: mapping single-end reads
1. Create a new history
2. Get into this new history a sample of the cleaned reads (available in data library)
3. Get into this new history a genome annotation file (data library)
The first mapping is an always an educated guess: go over the options outlined below and apply them.
1. Select a reference genome: this should be clear to you.
2. Library Type is unstranded (by default)
3. Use own junctions
4. Use gene annotation model
5. Select your genome annotation file
6. Only look for supplied junctions
When the run is done, perform quality control, as described below. Visualize your results!
Note: In case you don't have a genome annotation file, you can let Tophat2 generate one for you by searching for possible splice junctions. Tophat2 will run much longer, but will attempt to find splice junctions, known and novel ones. However, since our goal is DE, and not novel isoform detection, it is advised to run Tophat2 always with genome annotation file.
Tophat2: Paired-end reads
With paired end reads, you have to specify to mean inner distance between the paired-end reads. It typically ranges from 80 to 300 bp for RNA-seq. You have to provide the standard deviation of this mean distance, for which 2/3 of this value will do as a rule of thumb.
After mapping, visualize the mapping and do QC!
Mapping with RNA-STAR
RNA-STAR is a new aligner, and is extremely promising. First, it aligns much faster than Tophat2. Second, it provides you a nice summary of the most relevant mapping results in a log file.
You can't provide a genome annotation file to RNA-STAR: RNA-STAR has only built-in genomes available. This is because the genome annotation information was used during indexing, and is contained in the indexed reference genome. We have 4 species available for the moment.
Solve following question to get a feeling of RNA-STAR.
|How many reads are mapped to multiple positions in the genome. Get a sample data set from Candida albicans in a new history. Map the reads with RNA-STAR. (It should take about 4 minutes).|
|Align the reads with the parameters below|
|Wait for the mapping to complete.|
|Check the mapping log.|
|% of reads mapped to multiple loci 4.31%|
Visualizing the mapping in Galaxy
The visualization tool in Galaxy is called 'Trackster'.
After you have generated a BAM or SAM file visualize it by by clicking on the data set in the history. If the data set is clicked open, there is a small visualization icon. Click on it and choose Trackster.
It will ask to open a new visualisation or an existing (saved) one. If this is the first time you viewed a dataset in this experiment, you can choose new. Give the visualisation a name and view it.
The visualization will open on a certain chromosome. With Candida on our Galaxy, this happens to be the mitochrondrial genome, for which we have filtered out the reads! At the top you can select another location.
If you want to view the genes on this genome, you can simply add the genome annotation file from our history.
Now you can browse around the genome, and look at how the mapping is performed. Especially have a look on the splice site detections (none present in Candida), and the position of the reads on the genome. The reads are divided in 2 colours, one for alignment to the forward and one for alignment to the reverse strand.
|We have to redo the aligning with TopHat2 in the case above case! Can you see why?|
|We clearly see that the reads are stranded: to one gene maps only reads with a certain directionality. Those reads have the same colour in Trackster.|
|We need to have set the 'Library Type' .|
|The right option we can derive from the visualisation: reads from the forward strand map to genes on the reverse strand. Hence: 'FR Second strand' is the option.|
Quality control tools for mapping
You can find all quality control tools for mapping under NGS: Mapping QC on our Galaxy. We have implemented Qualimap and RSeQC. Note: it might be that not all QC tools can be performed for all mapping and genome annotation files.
The goal of the mapping QC is to identify aberrations in the mapping process, and to check whether the assumptions about library preparations (e.g. paired-end, strandedness,...) hold, and how saturated the reads described our transcriptome.
To appreciate all tools, we encourage you to run these QC tools on different experiments in the Data Library.
Sequence alignment format (SAM) is an extended format to describe alignment results from, for example, Tophat. It has field available for mapping quality and for indication of unmapped reads.
However, it is entirely up to the mapper to fill in this SAM correctly. For example, Tophat only includes mapped reads into the SAM file. Thus when you perform QC on the SAM, it will report that 100% of the reads are mapped! Be aware when you see such numbers.
Input: BAM or SAM file.
The output is an HTML report with statistics of the alignment. One note: BamQC is mainly for DNA-seq, not RNA-seq. Some of the graphs (e.g. coverage over the complete reference sequence) is of no use for use, since we will have only coverage where active genes are located.
Nevertheless, it is a nice tool to run on your BAM/SAM especially in the paired-end case.
Input: BAM/SAM and corresponding GTF. Note: some of these tools give empty outputs in some cases, depending on the alignment program used and the genome annotation file provided. We apologize for this.
The overview of the tools can be found in the presentation of this section.
|What portion of the reads are mapped? What portion is unmapped?|
|Usually, this information can be fetched from the alignment log, which is one of the outputs. What if there's not such a log?|
|If it's not available, you check whether the SAM file contains all mapped and unmapped reads. Use 'filter SAM' to check this. What if the unmapped reads are not in the SAM file?|
|If the unmapped reads are not there, we can count how many entries are in the initially fastq file, and subtract the number of entries from the SAM.|
END OF DAY 1
Extracting counts and investigating experimental factors
We will make use of HTSeq-count, which is available on our Galaxy. To extract the counts we need an alignment result (usually BAM), and a genome annotation file. From the GTF file we will fetch the coordinates of the features to count.
Check the tool on a sample set
- Create an empty history with the name 'test count table extraction'.
- Go to the Data Library 'Galaxy RNA-seq for DE workshop 2014' and import from under '4. Extracting counts from a mapping result' one of the Candida sample alignment files.
- Import to the history the corresponding Candida GTF file from under '3. Mapping preprocessed data' (do not pick the GFF file!)
- Run HTSeq-count, with the appropriate options: Union mode of counting, stranded is 'reverse' (see the mapping visualization and QC to check), and 20 as minimum mapping quality. The features we want to count are 'exons', which need to be summarized by 'gene_id'. The naming of these features you can check in the GTF file. Execute!
- The tool on the sample Candida data set will take about 1 minute and 15 seconds.
- You have 2 outputs: one log file, and one count table for this sample.
|How many reads did HTSeq-count ignore because their alignment was not unique (i.e. multiple times mapped on the genome)?|
|We have a look at the log of HTSeq-count:|
Warning: HTSeq-count can only be as good as the BAM file provided. If the mapper you use do not provide enough alignment details, such as mapping quality, or rude flag fields, it might be that HTSeq-count needs to ignore more reads because of the lack of information. Try to use established alignment software. Definitely check the numbers of HTSeq-count!
|Which is the gene with the highest number of counts?|
|You can use the 'sort' tool in Galaxy.|
|It is the gene CAL0005721, with in my case 2844 counts (16 times higher than the gene with the third highest count!). To satisfy our curiosity: what does this gene do? Check it on candidagenome.org.|
|Indeed: 'unknown protein product'. We will leave it for now, and first construct the count table for all samples.|
Apply the same parameters on all samples
To do so, we will create a workflow and apply the workflow on the samples simultaneously.
- The easiest way to create a workflow: do your analyses, hereby creating a history.
- Convert the history to a workflow
- The workflow is available from your workflows menu at the top. Here you can edit your workflow (change parameters of the tools, add tools,...).
- Let's run the workflow. Create a new history named Candida count table extraction.
- Import all Sample alignment sets from Candida.
- Run the workflow on all samples in the history.
- After 1 minute 15 seconds all count tables should be calculated! Nice parallel work!
One thing you want to do, is to rename the generated count tables (with the pencil icon) to a sensible name.
Merging the count tables of the separate samples together
We have a count table per sample. For DE analysis, we need to merge the counts in one table.
| Check out the tool 'Column join' to create the experiment wide count table.|
|Use the help described below the tool to understand how it works.|
|You want to make sure that empty fields are filled with '0'.|
| These are the parameters|
|Add an appropriate header line 'gene_id sample1 sample2 sample3 sample4'|
|This needs 2 steps. First, use the 'Upload file' tool, and type the line in the box. Check 'convert spaces to tabs'.|
|Next, use the tool 'concatenate datasets' to paste the header line to the experiment count table.|
We have our count table! The numbers are fairly low below, because we have used a subset of the reads. For the next section, we will continue with the complete count table of the Candida experiment.
gene_id sample1 sample2 sample3 sample4 CAF0006876 5 3 4 4 CAF0006885 28 19 23 31 CAF0006887 0 0 0 0 CAF0006888 1 0 0 0 CAF0006889 0 0 0 0 CAF0006890 0 0 0 0 CAF0006891 0 1 0 0 CAF0006892 0 0 1 2 CAF0006893 15 24 27 19 CAF0006894 1 0 1 0 CAF0006895 2 0 0 0 CAF0006896 0 0 0 0 CAF0006897 0 2 0 0 CAF0006898 99 83 139 109 ...
Quality control on the count table
You can find a complete count table in the Data Library, under '4. ...' -> 'Candida SE 4 conditions' -> 'Count table of experiment'.
Quality control on the sample levels
The count table allows us to check how related the samples are, and if the experiment was conducted in a decent way. We will use R scripts for this: for easy of use, I have them integrated in Galaxy. Each tool has the option to display the R code, which you can reproduce in your own R environment if you want to.
We need the count table and the sample descriptions table (data library). The sample descriptions table lists properties of the samples: to which condition they belong, whether they are replicates. Actually, the sample descriptions table can never be too big! Included all experimental details which make the samples differ from each other (e.g. which day they were processed, who processed them, which centrifuge was used).
|Create a new history, and import the Candida experimental count table, and the sample descriptions from the data library '4.Extracting Counts from a mapping result.' in the Candida subfolder.|
The output is html, with optionally the R-code shown. Check the output, and answer the question below.
|Which condition seems to have aberrant samples? What could be the cause|
|The WT condition samples behave differently on the different days of extraction, although day should not be an experimental factor.|
|Sample 9 and 11 do not cluster confidently together, and seem to be intermixed with the WT AG samples.|
|From the PCA plot we observe this mixing perhaps more clearly.|
It is clear that the batch factor day has a confounding influence on the counts in the WT samples. The UPC samples and the treatment of those samples are clearly separated, increase confidence in the analyses performed on these samples.
The QC must detect confounding factors. If you detect any, these factors need to be included in the GLM used by the DE algorithm.
Quality control on the gene levels
The counts can also be depended on gene characteristics within a sample: this is so called within-lane variability, or intra-sample variability. To check this, we need 'gene' or 'feature' meta data. This is again a simple text file, with per row the gene id, and metrics such as the GC-content and the gene length. I have provided the 'feature data file' in the data library under the Candida files of section 4.
The tool to use is 'Detect confounding factors in RNA-seq count table with EDASeq' (specific to this Galaxy). This tool will check within-lane variability effects and output also mean-variance plots.
- Create a new history
- Import the experimental count table, the sample descriptions, and the feature metadata file.
- Run the tool 'Detect confounding factors in RNA-seq' with the data in the history
- You can observe that no within-lane variability exists: all samples respond in the same way to GC and length of the genes.
|Can you think of other gene features other than GC and length, which might be useful to examen?|
|The mappability of the gene is an interesting gene feature to test. It might be that some samples responded differently during the process at the mappability of the gene.|
|Check this source about mappability metrics on genes. http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0053822|
|(Not during this training:) To check for this effect, you need to provide a mappability score for each gene, and merge it with the feature metadata (column join tool).|
Detecting differential expression from a count table
From the previous section, we have learned that besides the obvious 'strain' and 'treatment' conditions, we also need to provide the 'day' conditions as a batch factor in the model.
We have implemented DESeq2 R package in Galaxy. It needs the sample descriptions file and the count table. Depending on which conditions you want to compare, you have to remove lines from the sample descriptions file. E.g. if you want to look at the difference between G and AG treatment, you do not remove any line; if you want to look at the difference in strain WT between the G and AG treatment, you filter out all lines where strain is UPC.
Filter the table to remove unnecessary tests
First we determine which fraction of features we will remove because they contain to low counts. Use the tool independent filtering for this. We need to provide again the GLM used.
Depending on your preferred p-value cut-off, you can decide to remove a certain fraction of the features, to avoid unnecessary testing influencing the multiple testing correction. This value needs to be provided in the 'Detect differential expression using DESeq2'.
Question: which fraction would you filter out, when you want a p-value cut-off 0.01? Take this value to the next step: DE detection!
Detect differential expression
The tool is "Detect differential gene expression using DESeq2". We will detect differences between the G and AG treatment.
- Create a new history and give it an appropriate name.
- Import the count table and the sample descriptions file.
- Run the DESeq2 tool with appropriate parameters
Four outputs are generated:
- the first is the report with the graphs.
- next we have three tabular files: one with only DE genes, one with the complete DESeq2 results with filtering applied, one with the completed DESeq2 results without filtering applied.
In the next section, we will need the complete DESeq2 result with filtering applied.
Question: what is the increase in number of significantly expressed genes if we apply the filtering? (hint: the report output of DESeq2 tool provides this information).
Exploring the biology behind observed changes
We need to map the feature ids to several gene sets. I have provided for you several of such mapping files in the data library for Candida:
We will use Piano, an R package which I have integrated in Galaxy as 'Gene set analysis with Piano on feature statistics.'. You can optionally display the used R code in the output.
- the gene set mappings ('Sets file')
- p-values ('Feature statistics file')
- LogFC files ('Log fold change file', provides piano the directionality of the change).
You need to create the 2 first files from the DESeq results file, by cutting out the right columns.
If you select p-value as 'Type of gene level statistics', a list of different algorithms to use appear. Hold ctrl down to select multiple tests, but please do not pick Wilcoxon as this algorithm takes a lot of time to complete.
Tip: if Piano takes to much time to complete on the training session, you can check out some sample output I have provided in the Data Library, while your tool is running.
|In the above output, which set is most upregulated? Which is most downregulated?|
|Upregulated sets have coloring at the right side of the heatmap, downregulated sets at the left side of the heatmap.|
|Most upregulated set is GO:0042254. It has rank 1 in both 'Mixed directionality' and 'Distinct directionality' (coloured in red). The lower the number, the higher the rank, hence the more upregulated|
|The most downregulated set is GO 0006810, since it as rank 1 in both 'Mixed directionality (down)' and 'Distinct directionality (down)'|
|What do these GO sets represent? You can use the tool 'Explain GO ids' in Galaxy.|
|The tool can be found under 'Gene set analysis'|
|You can provide the text results of piano (either p-values or ranking) as input, and indicate that the first column contain the GO ids.|
|You get a nice output with details|
END OF DAY 2
I hope that by following this training session, you have gathered a deep understanding of what RNA-seq for differential expression analysis is about. To do a complete analysis by your own, you can count about 12 days, depending on how experienced you are, how readily available the genome annotations are, as well as feature meta data and gene sets.
It was a pleasure to have you in this class! Have fun and amaze us!