Introduction to RNA-Seq analysis

From BITS wiki
Jump to: navigation, search

[ Main_Page | NGS data analysis ]

Download the slides for this training session.

The dataset comes from a 2014 publication on Human Airway Smooth Muscle Transcriptome Changes in Response to Asthma Medications ([1]).


The goal of the analysis is to find DE genes (differentially expressed: genes with different expression levels in one group of samples compared to other groups of samples). Typically the groups of samples represent different treatments: one consisting of biological replicates that have received a control treatment, others consisting of replicates that received a specific biological treatment.

In this experiment the data consists of four groups (treatment):

  • The dex group: samples from 4 cell lines after treatment with the glucocorticoid dexamethasone (dex), used as astma medication
  • The alb group: samples from the same cell lines after treatment with albuterol (alb), another astma medication
  • The alb_dex group: samples from the same cell lines after treatment with both astma medications
  • The untreated group: samples from the same untreated cell lines cultured in parallel.

So all samples come from the same 4 cell lines (cells).

#   run_accession  read_count  samples            cells    treatment
1   SRR1039508     22935521    CL1_untreated   CL1   untreated
2   SRR1039509     21155707    CL1_Dex         CL1   Dex
3   SRR1039510     22852619    CL1_Alb         CL1   Alb
4   SRR1039511     21938637    CL1_Alb_Dex     CL1   Alb_Dex
5   SRR1039512     28136282    CL2_untreated   CL2  untreated

The data comes from a paired-end sequencing experiment so we have two files for each sample.

For simplicity we will do the analysis on a single sample, SRR1039509, obtained from dexamethasone treated cell line 1.

Quality checks

Before you analyze the data, it is crucial to check the quality of the data.
We use the standard tool for checking the quality of NGS data generated on the Illumina platform: FASTQC [2]

Correct interpretation of the FASTQC report is very important.
If the quality of your data is good, you can proceed with the analysis.
!! If the quality of your data is very bad, don't immediately throw the data in the recycle bin but contact an expert and ask for his/her opinion. !!

Double click the FASTQC icon on the Desktop and open the fastq file (it's in the summer folder of your home folder).

FASTQC consists of multiple modules each checking a specific aspect of the quality of the data. On the first page you can select the module you wish to view.

The names of the modules are preceded by an icon that reflects the quality of the data. The icon indicates whether the results of the module seem normal (green tick), slightly abnormal (orange triangle) or very unusual (red cross).

However, these evaluations must be interpreted in the context of what you expect from your library. A 'normal' sample as far as FastQC is concerned is random and diverse. Some experiments may be expected to produce libraries which are biased. You should treat the icons as pointers to where you should concentrate your attention on and understand why your library may not look normal.

General information on the reads

Checking the quality scores of the reads

Phred scores represent base call quality. The higher the score the more reliable the base call. Often the quality of reads degrades over the length of the read. Therefore, it is common practice to determine the average quality of the first, second, third,...nth base by plotting the distribution of the Phred scores on each position of the reads using box plots.

Remark: In new Illumina kits the sequence quality goes up a bit first before it steadily declines.

Instead of showing the quality of each position separately, you can calculate the average Phred score of each read and show a cumulative plot of the average qualities of all the reads.

Illumina flow cells are divided into tiles. To see if there is a loss in quality associated with specific parts of the flow cell, FASTQC calculates average quality scores for each tile across all positions in the reads.

Reasons for seeing warnings or failures on this plot could be transient problems such as bubbles going through the flow cell, or they could be more permanent problems such as smudges or debris on/in the flow cell or a very high density of clusters in a tile. The most common cause of warnings in this module is the flow cell being overloaded.

It is recommended to ignore warnings/failures which mildly affect a small number of tiles for only a few cycles, and to only pursue larger effects which show high deviation in scores, or which persist for a high number of cycles.

Checking duplicates

In a diverse library generated by shearing genomic DNA, most fragments will occur only once. A low level of duplication may indicate a very high level of coverage of some target sequences, but a high level of duplication indicates a bias (eg PCR overamplification, contamination of the library with adapter dimers...).

The Sequence duplication levels module counts the degree of duplication for every read and creates a plot showing the relative number of reads with different degrees of duplication.

Since the reads are random fragments from the genome sequence, the contribution of A, C, G and T should be identical on each position.

Duplicates often arise because libraries are contaminated with adapter sequences. You can check for contaminating sequences using the Overrepresented sequences module: it lists all sequences which make up more than 0.1% of the total. For each sequence in the list the module will look for matches in a database of common contaminants and will report the best hit it finds.

The Overrepresented sequences module shows contamination with full adapter sequences (= reads that completely correspond to adapters), but often the library also contains reads that have remnants of adapter sequences at their 3' ends. These reads are not detected by the Overrepresented sequences module.

This was the quality check of one file from one of the 16 samples. We do not have the time to do all quality checks. But in the real world, you would have to do this for each of the 32 files of this experiment.

Improving the quality of the data

There are many possible steps to improve the quality of the data. Due to time constraints, we are going to focus on

  • removing adapter sequences, both filtering full adapter sequences and trimming remnants of adapters from the 3' ends of the reads
  • filter reads of low quality

There are many tools to remove adapters from reads, but we chose cutadapt because it works on paired-end reads and it can do the two steps at the same time (removing adapters and filtering reads of poor quality).

Cutadapt has already been installed on the laptops, a detailed description of the installation procedure is available.

To make it feasible to go through the complete RNA-Seq workflow during the training, we have limited the data set to reads that map to chromosome 22. The data come from a paired-end experiment so we have two files with reads. You can download these limited data sets: data_set_1 (first reads of a pair that map to chromosome 22) and data_set_2 (second reads of a pair that map to chromosome 22). On the bits laptops, the files are already present in the /home/bits/NGS/RNASeq/ folder.

Remember that the Overrepresented sequence module of the FASTQC report showed contamination with the following TruSeq adapter in the first file of sample SRR1039509:

We will remove this adapter from the file containing the reads that map to chromosome 22.

Open the terminal.

We will use a few other options:

  • Discard trimmed reads that are shorter than 20 bases after trimming using the -m option
  • Trim low-quality bases at the 3'ends from reads before adapter removal if their Phred score is less than 10 using the -q option

Remember that we are working with paired-end reads !

Since we have to specify the location in the file system of two input and two output files, we are going to create a variable called folder for holding the path.

In the cutadapt stats you see we only trimmed one file (containing sequences from one end of the fragments).


That is because the sequences from the other end of the fragments contain another adapter:


Now you see in the cutadapt stats that you have trimmed adapters from (both files) both ends of the fragments.


Check the quality of the cleaned reads

I have done this also for the complete files and rechecked the trimmed reads in FASTQC. You can download the report for the complete trimmed reads from sample SRR1039509. On the Bits laptops the files are in the /home/bits/NGS/RNASeq/ folder.

FastQC confirmed the removal of the two adapters by cutadapt.


Obtaining the reference genome

Before we can do any mapping we need a reference sequence first. We will map the reads against the hg19 human genome build. Mapping requires a specially formatted file (hash database). This hash database can be derived from the reference genome using the bowtie2 tools. However, for some organisms like human the hash table can be obtained 'ready-to-use' from the bowtie2 website. If you also need a fasta copy of the hg19 genome, you can obtain it from the hash table using bowtie2.

We can download the hash table from the bowtie2 website using the wget command. It takes about 90 minutes to download so we are not going to download it during the training, it is already present in the /home/bits/NGS/RNASeq/reference/ folder.

Go to this folder and look at its contents. As you can see the file is a compressed .zip file

To transform the hash table into a fasta sequence we use bowtie2. From the bowtie2 documentation we learn that we should use bowtie2-inspect without options to generate the fasta file.

Installing TopHat

Mapping RNA-Seq reads is done using the TopHat tool. So we need to install the TopHat tool.

TopHat is downloaded as a .tar.gz file

Go into the tophat folder and type:


If this opens the help of tophat, it means the software has been installed correctly. It does not mean that you can use the software now. Well you can but you will always have to type the commands from inside the tophat folder or provide the full path to the tophat folder. To avoid this we can create a symbolic link for tophat2.

Now go to a different folder and type tophat2. If you see the help file, the link works.

Installing samtools

When you navigate to the tophat folder in /usr/bin/NGS/ you see that samtools is automatically installed when TopHat was installed:


If you see the samtools help page when you type


it means that samtools is indeed installed


If you want to use samtools from anywhere in the file system you can create a soft link.

Mapping the reads

We are not going to do the actual mapping since it takes almost 25 minutes even with the chromosome22-limited datasets. If we were to map the reads we would use the following command:

tophat2 --no-coverage-search ${folder}reference/hg19 ${folder}chr22_SRR1039509_1.fastq.gz ${folder}chr22_SRR1039509_2.fastq.gz
  • --no-coverage-search: is related to how TopHat finds splice junctions. I'm not going to go into detail here but in the TopHat manual the developers of TopHat say: "We only suggest users use the --coverage-search option for short reads (< 45bp) and with a small number of reads (<= 10 million)." Since we have the double amount of longer reads (63bp) we have to go for the --no-coverage-search option.
  • the first argument is the location of the hash table of the reference genome
  • the second argument is the (cleaned) fastq file containing the reads from one end of the fragments. As you can see TopHat can work directly on the compressed file.
  • the third argument is the (cleaned) fastq file containing the reads from the other end of the fragments

Other useful options for Tophat:

  • -p: the number of processors (cpu) that TopHat can use for the mapping. The default is 1. This is ok for a laptop since laptops do not contain manu cpu but of course the more cpu you give TopHat the faster the mapping. So it's better to do the mapping on a strong computer with many cpu
  • -o: if you want to store the results of the mapping in another folder

The mapping generates a new folder tophat_out containing 3 .bed files and 2 .bam files containing the resulting alignments:

  • accepted_hits.bam: a list of read alignments.
  • unmapped.bam: a list of reads that could not be mapped. As you can see the size of this file is quite small compared to the accepted_hits.
  • junctions.bed: a list of splice junctions between exons in UCSC BED format[3] (that can be opened as a track in the UCSC genome browser).
    Tophat can find novel - not yet annotated - splice junctions based on the alignment of the reads to a reference genome. This is what Tophat is specifically good at, compared to mappers like bwa and bowtie which will only find annotated splice junctions. This is why we use Tophat for mapping RNA-Seq data.
  • insertions.bed: a list of insertions.
  • deletions.bed: a list of deletions.


Since we haven't actually done the mapping, we do not have this folder. However, you can find the bam file with the read alignments in the /home/bits/NGS/RNASeq folder.

Quality control of the mapping

It is vital to check the quality of mapping before proceeding with the RNASeq workflow. The mapping to a reference genome has sorted the reads and it is now possible to identify

  • the regions of the genome the reads originate from
  • duplicate reads
  • RNA degradation...

Several program exist to perform quality control of bam files; e.g. RSeQC[4], QualiMap[5][6], samtools, deeptools [7][8], Picard[9] which is part of the very popular GATK platform[10]...

We are going to use samtools here.

The samtools flagstat command displays an overview of the alignment results on your screen. You just see that 100% of the reads were mapped. This is extremely high but it is of course because we reversed engineered our chromosome 22 limited data set. From the complete fastq files we took the reads that mapped to chromosome 22 so it's normal that we get an almost perfect mapping.


This overview deserves some explanation:

  • nan means Not A Number (e.g: divided by 0 )
  • paired in sequencing means reads that belong to a pair regardless of the fact that they are really mapped as a pair
  • read1 means forward reads
  • read2 means reverse reads
  • properly paired means that both mates of a read pair map to the same chromosome, oriented towards each other, and with a sensible insert size
  • with itself and mate mapped means that both reads of a pair map to the genome but they are not necessarily properly paired, they just map somewhere on the genome
  • singletons means that one of the reads of a pair is unmapped while its mate is mapped
  • with mate mapped to a different chr means reads with a mate mapped on a different chromosome
  • with mate mapped to a different chr (mapQ >= 5) means reads with a mate mapped on a different chromosome having a mapping quality greater than 5

Tools like Qualimap, RSeqQC and Picard will give much more detailed information on the quality of the mapping. Unfortunately we do not have time to use them.

Calculating a count table

In order to compute differential expression between groups of samples, we need to convert mapping results to read counts for each gene in each sample. The counting can also be done in R using various packages but will be slower as compared to command-line tools.

We will use the popular HTSeq-count tool [11] to compute gene counts.

Prepare the alignment file

We need to sort the bam file since we have paired-end reads. HTSeq assumes the file is sorted so that reads belonging to the same pair are in adjacent lines. If you don't sort the bam file by read name, HTSeq will think there are lot of reads with missing mates.

In the samtools manual we can look up which command we need to do the sorting.

According to the HTSeq manual the input file for HTSeq contains the aligned reads in SAM format. In our case the mapping generated a .bam file. Fortunately samtools contains scripts to convert BAM format to SAM.

In the samtools manual we can look up which command we need to do the transformation.

Obtaining a reference annotation file

To calculate read counts we need a gtf file containing the annotation of all exons. You can obtain such files from genome annotation databases such as NCBI, Ensembl, and UCSC. The problem is that there are small differences between the formats of annotation files coming from different databases. These differences have implications for counting the reads.

For instance, we used pre-built index files from the Bowtie website for the mapping. These files have UCSC format. So it seems obvious to use UCSC annotation files for the counting. However, HTSeq prefers Ensembl gtf files. As stated in the HTSeq documentation using gtf file generated by UCSC will result in very low counts. In the UCSC files, the gene_id incorrectly contains the same value as the transcript_id. Hence, if a read maps to an exon shared by several transcripts of the same gene, this will appear to htseq-count as an overlap between different genes since the different transcripts have different gene_ids. The read will be considered ambiguous and not counted. Therefore, the counts will be incorrect.

As a solution, HTSeq recommends to use a gtf file from Ensembl. You can find Ensembl gtf files on the Ensembl ftp server. The version that we need is called grch37 (this corresponds to UCSC genome build hg19). So you can download the gtf file from this web site. To save time, the gtf is already present on the bits laptops in the /home/bits/NGS/RNASeq/reference/ folder.

Navigate to the /home/bits/NGS/RNASeq/reference/ folder:

As you can see the first column of the file contains chromosome numbers. Ensembl uses 1, 2, 3... as chromosome IDs.

As you can see the third column of the file contains chromosome IDs but they have UCSC format: chr22 (remember that all reads come from chromosome 22 in our example). So we need to:

  • Filter the annotation for chromosome 22 from the gtf file to limit processing time.
  • Transform Ensembl chromosome IDs into UCSC format.

First of all we'll give the gtf file a simple name to simplify processing.

Now, we still need to transform the Ensembl chromosome IDs into UCSC format, meaning that we simply need to add the prefix chr to each line in the filtered gtf file. You'll need the sed command for this. Look at the sed documentation before you try to do the substitution. To add the word chr to the start of each line, you essentially need to replace the start of a line by chr.

Installing HTSeq

HTSeq is a Python script. Python scripts can be installed using the pip install command. Remember you need administrator privileges for installing tools.

Looking up the error in Google leads to this web page, where you can find the solution to the problem: some dependencies are missing.

Looking up the error in Google leads to this web page, where you can find the solution to the problem: the C compiler is missing.

Calculating the count table

HTSeq counts reads in different modes:


Handicon.png We will use the union method

In the HTSeq manual we get an overview of the options we can use.

We'll go for the default mininimum alignment score of 10 (90% confidence).

For a .gtf file exon is the default. It means HTSeq will count the number of reads that align to each exon and then combine the counts for all exons of a transcript variant.

For a .gtf file gene_id is the default: it means that the output of HTSeq will be a list of gene_ids and for each gene_id you'll see the number of reads that align to all its exons.

  1. Blanca E Himes, Xiaofeng Jiang, Peter Wagner, Ruoxi Hu, Qiyu Wang, Barbara Klanderman, Reid M Whitaker, Qingling Duan, Jessica Lasky-Su, Christina Nikolos, William Jester, Martin Johnson, Reynold A Panettieri, Kelan G Tantisira, Scott T Weiss, Quan Lu
    RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells.
    PLoS ONE: 2014, 9(6);e99625
    [PubMed:24926665] ##WORLDCAT## [DOI] (I e)

  4. Liguo Wang, Shengqin Wang, Wei Li
    RSeQC: quality control of RNA-seq experiments.
    Bioinformatics: 2012, 28(16);2184-5
    [PubMed:22743226] ##WORLDCAT## [DOI] (I p)

  6. Fernando García-Alcalde, Konstantin Okonechnikov, José Carbonell, Luis M Cruz, Stefan Götz, Sonia Tarazona, Joaquín Dopazo, Thomas F Meyer, Ana Conesa
    Qualimap: evaluating next-generation sequencing alignment data.
    Bioinformatics: 2012, 28(20);2678-9
    [PubMed:22914218] ##WORLDCAT## [DOI] (I p)

  7. Fidel Ramírez, Friederike Dündar, Sarah Diehl, Björn A Grüning, Thomas Manke
    deepTools: a flexible platform for exploring deep-sequencing data.
    Nucleic Acids Res.: 2014, 42(Web Server issue);W187-91
    [PubMed:24799436] ##WORLDCAT## [DOI] (I p)

  10. Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel, Mark Daly, Mark A DePristo
    The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.
    Genome Res.: 2010, 20(9);1297-303
    [PubMed:20644199] ##WORLDCAT## [DOI] (I p)