RNA-Seq analysis for differential expression

From BITS wiki
Jump to: navigation, search


Experimental design

As we have seen during the first part of the lectures (see the presentation slides and background on [1]), experimental design is key in RNA-seq for differential expression (DE) analysis.

However, designing an experiment is somewhat a circular problem. To design an experiment, you need already to have data, which is extrapolated in order to estimate how good your experimental design will perform. You can generate pilot data for this, or use data from similar experiments (same species, same tissues, same protocol,...). Many RNA-seq experiments run with the minimum number of samples required, without doing experimental design. However, it is always instructive to send your final data to Scotty, to be able to value your experimental design.

There are 2 good online resources to help you with designing experiments:

  1. Explanation on experimental design of University of Oregon: http://rnaseq.uoregon.edu/exp_design.html
  2. Scotty, an online tool to provide you a decision matrix for your experimental design.

We will explore the use of Scotty. We have pilot data for an experiment in Drosophila, with 2 conditions (control and treatment). We have already sequenced each condition twice, to an average of 16M reads.


Download the data here and save it.

First, check the contents of the file, by opening it on your computer.

The first column is the gene identifier. Next we have 4 columns, 'Control_1' and 'Test_1', and 'Control_2' and 'Test_2'. Scotty appears to be a strict on the format. So if you upload your own data, you might reformat the column names and their order to let Scotty accept the data.

Scotty exercise

Scotty can be accessed at http://euler.bc.edu/marthlab/scotty/scotty.php. First upload your data, as shown in the screenshot below. Click choose file and upload your data.


Scotty includes also cost in their analysis. We assume that for lab work, we consume €300 per sample, irrespective of control or treatment. We are going for 50 bp reads single-end. You can find cost estimation for sequencing on slide 63 in the presentation Part 1 of the training. Recalculate from these numbers how much euros 1M of reads cost.

Next, find out what Scotty means by alignment rate. For the alignment rate number, you can base your estimate on on slide 67 in the presentation Part 1 of the training.


For the remaining options:

  • we want to detect at least 80% of expressed genes that are differentially expressed by a 2 fold change at p < 0.01
  • we have a budget of €12000.
  • Limit measurement bias by measuring at least 80% of genes with at least 50% of maximum power
  • leave the other options default.

Some explanation on 'measurement bias' (i.e. the very last option on the page). Due to the count nature, genes with high counts can be detected with greater power than genes with low counts. If we will analyse for biology later on, based on lists of significantly increased gene activity, we need to make sure that we don't find groups of genes enriched (example, genes involved in RNA processing), who just happen to have very high counts. This can be because the statistical test will more easily tell whether these high numbers differ significantly than do lower counts. Please refer to the Scotty documentation if you want to find out more about this bias and maximum power.

Another way to think about it: if we are able (with some kind of future absolute precise quantification method) to measure all mRNA coming from one gene in a cell (imagine, 245601 transcripts), and compare it to the number of all RNA coming from the same gene of another cell (say, 201548 transcripts), we don't need a statistical test! We can just see which one of the 2 cells have most RNA for that gene. Whether this difference is biological, is another matter.

So, in Scotty, we can set a condition (very last option in the interface) that at least 50% of the genes need to have counts, such that the power to detect differences is more than 50% of the maximum achievable power.

Don't worry if you don't understand: Scotty will produce sensible outcome anyhow. If you want to understand, grab a statistics handbook, and think about power, and what it means to test for a difference.


Click to run Scotty. Be patient, it takes about 5 minutes for the results to be calculated.

Interpreting Scotty output

Please use your own Scotty results to answer following questions. (If you do not have results from your own, you might download these results (pdf) to work on).

Look at the graph under 'Excluded experimental configurations'. This matrix shows the summary of Scotty's analysis. Together with the following graph, this provides rich information.

2 more count tables are available on Galaxy as input for Scotty

You can find more count tables in the Shared Data of Galaxy. Log in and go to 'shared data' --> 'Galaxy RNA-seq for DE workshop 2014' --> '1. Experimental design'.

It might be that the count tables have to be reformatted to be accepted by Scotty (mainly the header line). The reformatting can be done in Galaxy, or after downloading and editing by hand.

Investigating and preprocessing raw RNA-seq data


The exercises that follow need to be performed in the BITS Galaxy. Galaxy is an online tool to run bioinformatics analyses, and store and share them in convenient ways. You should have received an account; if not, please contact the instructor. The BITS Galaxy has been modified to include tools you won't find on other Galaxies, so please use this Galaxy.

If you want to repeat the analysis that we will do in this training on your own data, you can import the tools from the BITS Galaxy into your own Galaxy.

Store your data

The first step of any RNA-seq experiment is to store the data you have received in a sensible way. Galaxy can help you structure your data: put it in a data library. Do not forget to add enough information, called meta-data in the statistical jargon.

A typical RNA-seq experiment generates between 10 and 30 files containing reads in .fastq format, each reaching 1 to 20 GB in size. The names of the files you receive contain information about the lane used on the machine, the multiplex index used and the whether the files are part of a paired-end run or not. See below for an example.

Dernaseq 2 files1.png

For you to exercise, I have stored 5 real life datasets in the Galaxy RNA-seq for DE workshop 2014 data library [2], under '2. Investigating raw RNA-seq data'.

Dernaseq 2 datalibrary.png

This data is for you to use! Be aware: please, do NOT use complete datasets during the training. The server will crash. Use the sample sets available for each experiment. Feel free to look at the complete data at home, after the training.

We will during the training mainly focus on the Candida - SE - 4 conditions experiment.

Check the quality of the data and process it

Examine the datasets, preprocess them, and check again the improvement in quality at the end. The preprocessed reads will be used to be mapped in the next section.

The steps to perform

All tools for this section are available under NGS:QC and manipulation. For single end data sets (such as the one from Candida), follow these steps below.

For paired end datasets, you have to make sure that the preprocessing does not interrupt the link between 2 reads from one fragment. This can happen when one read is trimmed and removed because it is too short, while the other read is not trimmed, and kept in the library. Some QC tools are PE aware, others are very basic and do not care.

Get your data in a Galaxy history

Select the sample dataset of your choice from the data library and choose at the bottom 'Import to current history'. Click Go.

Dernaseq2 select.png

Dernaseq2 select2.png

Run Groomer on your data to check the format of the fastq files

New data needs to be checked consistency: the tool Groomer takes care of this. Warning: can take a long time! For this session, all data has already been put through Groomer. Only use groomer on sample datasets during this session.

Groomer tool.png

Run FastQC on your data and examine the results

After grooming, have a look at some QC metrics with FastQC.

Galaxy fastqc.png

It might be a good idea to create a workflow to let FastQC run in parallel on your data. See Creating a workflow in Galaxy.

Examen the results carefully! In this step you should detect strange deviations. Depending on what you observe, more or less removal or trimming of sequences is needed.

Fastq Quality trimmer to remove low quality regions

If your reads contain a lot of low quality regions, this tool can help you remove them.


Be aware that your sequences will not be uniform in length anymore. Some people do not like this, and want to chop off a fixed portion of each read in order to maintain the uniformity in length.

Fastqmcf to remove adaptors


In case you have detected adaptors in your data set, fetch the sequences of the adaptors in fasta format. You can search them online, or even assemble them from the FastQC report.

In case you want to align to a reference genome sequences (as in our case), and you have detected a clear presence of polyA - or polyT! - tails, those homopolymer stretches need to be filtered out. One way to do so (although not supported by the author of Fastq-mcf tool) is to provide some polyA and polyT stretches as 'pseudo-adapters'.

We have provided some adapter sequences in the data library.

Adaptors data library.png

Alternative tool for adaptor removal: clip

Get sequences of possible contaminating sources

Fetch the fasta files of the:

  • non-coding RNA sequences
  • mtDNA sequences
  • phiX genome sequences

You can fetch these sequences by using Get Data and BioMart Central Server. At the end of the BioMart query, the results are automatically uploaded to Galaxy.

Galaxy uploadbiomart.png

You can find how to fetch data from biomart into Galaxy on this page in the wiki: [3].

Optionally: these contaminating sequences are available already in the data library. However, we encourage you to get these sequences through Biomart into Galaxy yourself.

Candida contaminating sequences.png

Filter out reads that match to contaminating sequences

Use Bowtie2 to match reads to contaminating sequences. Do not forget to indicate 'Write unaligned reads to separate file(s)'! We will continue with the reads that do not match the contaminating sequences.

Galaxy bowtie2.png

Galaxy unalignedbowtieoption.png

PRO TIP It is a good idea to let Bowtie2 run in parallel, to match to rRNA, mtDNA and PhiX, and not serial (e.g. first rRNA, then mtDNA, then PhiX). Serial working leads to too much loss of information. Instead, we can summarize in a next step in Galaxy, for each read whether is has matched one of the three libraries. Next, we continue by extracting the reads that did not match any of the three libraries. I hope this makes sense.

Summarize the matched contaminating sequences

If the complete read set is matched to each of the contaminating sequence set, I want you to summarize the Bowtie2 results as follows:

HWI-ST571:278:C1AK2ACXX:5:1101:7020:36938	.	Ca19-mtDNA	CaalfMr16
HWI-ST571:278:C1AK2ACXX:5:1101:7027:14017	.	.	ITS2
HWI-ST571:278:C1AK2ACXX:5:1101:7034:70380	.	.	RDN25
HWI-ST571:278:C1AK2ACXX:5:1101:7050:64366	phix	.	.

This is easily done! First, we will convert the output of Bowtie2 from BAM to SAM for all results.


Next, filter all SAM files with the tool Filter SAM and select the option 'the read is unmapped':'no' to filter out the SAM lines of mapped reads.


Next, perform a 'Column join' on the filtered SAM files, using the first column (c1) as a hinge, and outputting the third column (c3).

Column join filteredSam.png

We end with the required output. Easy, no?!

Subtract the sequences matching to contaminant databases with 'Filter sequences by ID'

Now, we have a list of contaminating reads. We filter out those reads from our initial complete list of reads. Use 'Filter sequences by ID' .

Galaxy filtersequencesbyid.png

Run FastQC on the cleaned reads

The cleaning has been performed. Check your work!

Galaxy fastqc.png

Create a workflow of all preprocessing steps

Now, this was a lot of clicking and fizzling for just one reads file! How to do this for 20 of such files?

Make a workflow of the steps you performed on this sample fastq dataset. Next, you can run this workflow on all read sets. so they all get treated in the same way.

Gather the cleaned reads in one history

Use the function 'Copy datasets' to copy the cleaned reads to one history.

Galaxy copydatasets.png


  • Take a subsample of one of the samples the other experiments and run QC. Rank the experiment by quality? Which is the best? Which has the worst?
  • ADVANCED: The paired-end reads of the SpeciesX data library are 'interlaced' into one file. Can you de-interlace them?

Mapping processed data

After cleaning the reads to a reasonable level, the reads need to be mapped with a gap-aware aligner, such as Tophat2 or Rna-STAR.

On Galaxy [4] we have provided some preprocessed read datasets, which you can use for experimenting with mapping. Alternatively, you can continue with your own cleaned reads from the previous section. In the data library, you can find additional required files, such as annotation files. Go check them out!

Preproc reads datalib.png


Confirmed datasets: Candida, Mouse

We will align to a reference genome. We need the genome for this, the cleaned reads, and a genome annotation file.

In case the reference genome of your interest is not available as a built in Galaxy, you can put a reference genome in your history (fasta) and align against that reference genome (just as we have done with Bowtie2).

Tophat2: mapping single-end reads

1. Create a new history

2. Get into this new history a sample of the cleaned reads (available in data library)

3. Get into this new history a genome annotation file (data library)

Mapping example1 upload.png

The first mapping is an always an educated guess: go over the options outlined below and apply them.

1. Select a reference genome: this should be clear to you.


2. Library Type is unstranded (by default)


3. Use own junctions


4. Use gene annotation model


5. Select your genome annotation file


6. Only look for supplied junctions


Execute Tophat2.

When the run is done, perform quality control, as described below. Visualize your results!

Note: In case you don't have a genome annotation file, you can let Tophat2 generate one for you by searching for possible splice junctions. Tophat2 will run much longer, but will attempt to find splice junctions, known and novel ones. However, since our goal is DE, and not novel isoform detection, it is advised to run Tophat2 always with genome annotation file.

Tophat2: Paired-end reads

With paired end reads, you have to specify to mean inner distance between the paired-end reads. It typically ranges from 80 to 300 bp for RNA-seq. You have to provide the standard deviation of this mean distance, for which 2/3 of this value will do as a rule of thumb.

After mapping, visualize the mapping and do QC!

Mapping with RNA-STAR

RNA-STAR is a new aligner, and is extremely promising. First, it aligns much faster than Tophat2. Second, it provides you a nice summary of the most relevant mapping results in a log file.

You can't provide a genome annotation file to RNA-STAR: RNA-STAR has only built-in genomes available. This is because the genome annotation information was used during indexing, and is contained in the indexed reference genome. We have 4 species available for the moment.


Solve following question to get a feeling of RNA-STAR.

Visualizing the mapping in Galaxy

The visualization tool in Galaxy is called 'Trackster'.

After you have generated a BAM or SAM file visualize it by by clicking on the data set in the history. If the data set is clicked open, there is a small visualization icon. Click on it and choose Trackster.


It will ask to open a new visualisation or an existing (saved) one. If this is the first time you viewed a dataset in this experiment, you can choose new. Give the visualisation a name and view it.



The visualization will open on a certain chromosome. With Candida on our Galaxy, this happens to be the mitochrondrial genome, for which we have filtered out the reads! At the top you can select another location.

If you want to view the genes on this genome, you can simply add the genome annotation file from our history.



Now you can browse around the genome, and look at how the mapping is performed. Especially have a look on the splice site detections (none present in Candida), and the position of the reads on the genome. The reads are divided in 2 colours, one for alignment to the forward and one for alignment to the reverse strand.


Quality control tools for mapping

You can find all quality control tools for mapping under NGS: Mapping QC on our Galaxy. We have implemented Qualimap and RSeQC. Note: it might be that not all QC tools can be performed for all mapping and genome annotation files.

Galaxy mappingqc tools.png

The goal of the mapping QC is to identify aberrations in the mapping process, and to check whether the assumptions about library preparations (e.g. paired-end, strandedness,...) hold, and how saturated the reads described our transcriptome.

To appreciate all tools, we encourage you to run these QC tools on different experiments in the Data Library.


Sequence alignment format (SAM) is an extended format to describe alignment results from, for example, Tophat. It has field available for mapping quality and for indication of unmapped reads.

However, it is entirely up to the mapper to fill in this SAM correctly. For example, Tophat only includes mapped reads into the SAM file. Thus when you perform QC on the SAM, it will report that 100% of the reads are mapped! Be aware when you see such numbers.


Input: BAM or SAM file.

The output is an HTML report with statistics of the alignment. One note: BamQC is mainly for DNA-seq, not RNA-seq. Some of the graphs (e.g. coverage over the complete reference sequence) is of no use for use, since we will have only coverage where active genes are located.

Nevertheless, it is a nice tool to run on your BAM/SAM especially in the paired-end case.


Input: BAM/SAM and corresponding GTF. Note: some of these tools give empty outputs in some cases, depending on the alignment program used and the genome annotation file provided. We apologize for this.

The overview of the tools can be found in the presentation of this section.



Extracting counts and investigating experimental factors

We will make use of HTSeq-count, which is available on our Galaxy. To extract the counts we need an alignment result (usually BAM), and a genome annotation file. From the GTF file we will fetch the coordinates of the features to count.

Check the tool on a sample set

  • Create an empty history with the name 'test count table extraction'.
  • Go to the Data Library 'Galaxy RNA-seq for DE workshop 2014' and import from under '4. Extracting counts from a mapping result' one of the Candida sample alignment files.
  • Import to the history the corresponding Candida GTF file from under '3. Mapping preprocessed data' (do not pick the GFF file!)
  • Run HTSeq-count, with the appropriate options: Union mode of counting, stranded is 'reverse' (see the mapping visualization and QC to check), and 20 as minimum mapping quality. The features we want to count are 'exons', which need to be summarized by 'gene_id'. The naming of these features you can check in the GTF file. Execute!
  • The tool on the sample Candida data set will take about 1 minute and 15 seconds.
  • You have 2 outputs: one log file, and one count table for this sample.

Warning: HTSeq-count can only be as good as the BAM file provided. If the mapper you use do not provide enough alignment details, such as mapping quality, or rude flag fields, it might be that HTSeq-count needs to ignore more reads because of the lack of information. Try to use established alignment software. Definitely check the numbers of HTSeq-count!

Apply the same parameters on all samples

To do so, we will create a workflow and apply the workflow on the samples simultaneously.

  • The easiest way to create a workflow: do your analyses, hereby creating a history.
  • Convert the history to a workflow

Workflow create from history.png

  • The workflow is available from your workflows menu at the top. Here you can edit your workflow (change parameters of the tools, add tools,...).
    Edit workflow galaxy.png.
  • Let's run the workflow. Create a new history named Candida count table extraction.
  • Import all Sample alignment sets from Candida.
  • Run the workflow on all samples in the history.
  • After 1 minute 15 seconds all count tables should be calculated! Nice parallel work!
    Galaxy ct worfklow.png

One thing you want to do, is to rename the generated count tables (with the pencil icon) to a sensible name.

Merging the count tables of the separate samples together

We have a count table per sample. For DE analysis, we need to merge the counts in one table.

We have our count table! The numbers are fairly low below, because we have used a subset of the reads. For the next section, we will continue with the complete count table of the Candida experiment.

gene_id	sample1	sample2	sample3	sample4
CAF0006876	5	3	4	4
CAF0006885	28	19	23	31
CAF0006887	0	0	0	0
CAF0006888	1	0	0	0
CAF0006889	0	0	0	0
CAF0006890	0	0	0	0
CAF0006891	0	1	0	0
CAF0006892	0	0	1	2
CAF0006893	15	24	27	19
CAF0006894	1	0	1	0
CAF0006895	2	0	0	0
CAF0006896	0	0	0	0
CAF0006897	0	2	0	0
CAF0006898	99	83	139	109

Quality control on the count table

You can find a complete count table in the Data Library, under '4. ...' -> 'Candida SE 4 conditions' -> 'Count table of experiment'.

Quality control on the sample levels

The count table allows us to check how related the samples are, and if the experiment was conducted in a decent way. We will use R scripts for this: for easy of use, I have them integrated in Galaxy. Each tool has the option to display the R code, which you can reproduce in your own R environment if you want to.

We need the count table and the sample descriptions table (data library). The sample descriptions table lists properties of the samples: to which condition they belong, whether they are replicates. Actually, the sample descriptions table can never be too big! Included all experimental details which make the samples differ from each other (e.g. which day they were processed, who processed them, which centrifuge was used).

We will run the 'Quality control of a count table' (a tool specific to this Galaxy).
Qc ct tool.png.

The output is html, with optionally the R-code shown. Check the output, and answer the question below.

It is clear that the batch factor day has a confounding influence on the counts in the WT samples. The UPC samples and the treatment of those samples are clearly separated, increase confidence in the analyses performed on these samples.

The QC must detect confounding factors. If you detect any, these factors need to be included in the GLM used by the DE algorithm.

Quality control on the gene levels

The counts can also be depended on gene characteristics within a sample: this is so called within-lane variability, or intra-sample variability. To check this, we need 'gene' or 'feature' meta data. This is again a simple text file, with per row the gene id, and metrics such as the GC-content and the gene length. I have provided the 'feature data file' in the data library under the Candida files of section 4.

The tool to use is 'Detect confounding factors in RNA-seq count table with EDASeq' (specific to this Galaxy). This tool will check within-lane variability effects and output also mean-variance plots.

  • Create a new history
  • Import the experimental count table, the sample descriptions, and the feature metadata file.
  • Run the tool 'Detect confounding factors in RNA-seq' with the data in the history
  • You can observe that no within-lane variability exists: all samples respond in the same way to GC and length of the genes.

Edaseq parameters.png

Detecting differential expression from a count table

From the previous section, we have learned that besides the obvious 'strain' and 'treatment' conditions, we also need to provide the 'day' conditions as a batch factor in the model.

We have implemented DESeq2 R package in Galaxy. It needs the sample descriptions file and the count table. Depending on which conditions you want to compare, you have to remove lines from the sample descriptions file. E.g. if you want to look at the difference between G and AG treatment, you do not remove any line; if you want to look at the difference in strain WT between the G and AG treatment, you filter out all lines where strain is UPC.

Filter the table to remove unnecessary tests

First we determine which fraction of features we will remove because they contain to low counts. Use the tool independent filtering for this. We need to provide again the GLM used.

Indepfilt par.png

Depending on your preferred p-value cut-off, you can decide to remove a certain fraction of the features, to avoid unnecessary testing influencing the multiple testing correction. This value needs to be provided in the 'Detect differential expression using DESeq2'.

Output indepfilter par.png

Question: which fraction would you filter out, when you want a p-value cut-off 0.01? Take this value to the next step: DE detection!

Detect differential expression

The tool is "Detect differential gene expression using DESeq2". We will detect differences between the G and AG treatment.

  • Create a new history and give it an appropriate name.
  • Import the count table and the sample descriptions file.
  • Run the DESeq2 tool with appropriate parameters

Deseq2 gvsag.png

Four outputs are generated:

  • the first is the report with the graphs.
  • next we have three tabular files: one with only DE genes, one with the complete DESeq2 results with filtering applied, one with the completed DESeq2 results without filtering applied.

Deseq2 output.png

In the next section, we will need the complete DESeq2 result with filtering applied.

Question: what is the increase in number of significantly expressed genes if we apply the filtering? (hint: the report output of DESeq2 tool provides this information).

Exploring the biology behind observed changes

We need to map the feature ids to several gene sets. I have provided for you several of such mapping files in the data library for Candida:

Genesets candida.png

We will use Piano, an R package which I have integrated in Galaxy as 'Gene set analysis with Piano on feature statistics.'. You can optionally display the used R code in the output.

Piano needs:

  • the gene set mappings ('Sets file')
  • p-values ('Feature statistics file')
  • LogFC files ('Log fold change file', provides piano the directionality of the change).

You need to create the 2 first files from the DESeq results file, by cutting out the right columns.

If you select p-value as 'Type of gene level statistics', a list of different algorithms to use appear. Hold ctrl down to select multiple tests, but please do not pick Wilcoxon as this algorithm takes a lot of time to complete.

Tip: if Piano takes to much time to complete on the training session, you can check out some sample output I have provided in the Data Library, while your tool is running.

Piano results1.png


I hope that by following this training session, you have gathered a deep understanding of what RNA-seq for differential expression analysis is about. To do a complete analysis by your own, you can count about 12 days, depending on how experienced you are, how readily available the genome annotations are, as well as feature meta data and gene sets.

It was a pleasure to have you in this class! Have fun and amaze us!