NGS RNASeq DE Exercise.1

From BITS wiki
Jump to: navigation, search

Download Illumina data from SRA and perform read QC


[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.2 ]


The choice of the data used for this training was dictated by the DESeq2 bioconductor rnaseqGene tutorial used in a late exercise. The data referred to by this package was fetched from the SRA in order to perform read cleanup and mapping steps not part of the DESeq2 document. Trainnes will run altered versions of the scripts used for the full dataset in order to speed up the runs while getting an idea of the real process. At each step of the session, data resulting from the full analysis is provided in order to allow each exercise be practiced independently even when some software dependencies errors lead to errors. For reference to this training, please do not forget to cite the original document fetched from the bioconductor site ([1]).


Dataset used in this hands-on session

The dataset used today refers to a 2014 publication about Human Airway Smooth Muscle Transcriptome Changes in Response to Asthma Medications ([2]).



Obtaining GEO information and links for the 'GSE52778' dataset

The processed data for this study is available at GEO under the identifier GSE52778 [3]

Before analyzing any NGS data it is of good habit to check how the data was generated and which platform was used.



Obtain the data from SRA

Raw data can be accessed from SRA under the reference SRP033351 [4] by clicking the SRA ID in the Relations section of the GEO record.


In order to obtain data directly from the SRA database, users will need to install the SRAtoolkit [5].

In order to download reads using the fastera system, the aspera connect software will need to be installed from aspera[6]

The 'SRP033351' data is also available as a standalone R-data library ("airway"). This will facilitate participants to pursue self-training after this session and allow direct entry in the R-bioconductor analysis exercises.

Technical.png Note that while we use [R-Bioconductor] several times in this session, this training is not exclusive to the use of [R] and is more generally aimed at learning the most popular and current command line applications used RNASeq workflows on average Unix operating system computers


Obtain data and metadata from ENA using R

Rationale: Asthma is a chronic inflammatory airway disease. The most common medications used for its treatment are ß2-agonists and glucocorticosteroids, and one of the primary tissues that these drugs target in the treatment of asthma is the airway smooth muscle. We used RNA-Seq to characterize the human airway smooth muscle (HASM) transcriptome at baseline and under three asthma treatment conditions.
Methods: The Illumina TruSeq assay was used to prepare 75bp paired-end libraries for HASM cells from four white male donors under four treatment conditions: 1) no treatment; 2) treatment with a ß2-agonist (i.e. Albuterol, 1µM for 18h); 3) treatment with a glucocorticosteroid (i.e. Dexamethasone (Dex), 1µM for 18h); 4) simultaneous treatment with a ß2-agonist and glucocorticoid, and the libraries were sequenced with an Illumina Hi-Seq 2000 instrument. The Tuxedo Suite Tools were used to align reads to the hg19 reference genome, assemble transcripts, and perform differential expression analysis using the protocol described in
Overall Design: mRNA profiles obtained via RNA-Seq for four primary human airway smooth muscle cell lines that were treated with dexamethasone, albuterol, dexamethasone+albuterol or were left untreated.

As in any high throughput data analysis, metadata is required to relate SRA sample IDs to grouping information. To compare the groups of samples - which is the ultimate goal of the analysis - you need to know which samples belong to which groups.

The ENA[7] offers a nice search utility page to obtain such information that was used extract selected columns. You can get data annotation for any SRA/ENA project using the url as shown below.

A small [R] script was made in RStudio to download the data of SRP033351 from the ENA page into a 'data.frame'. Only the fields that are relevant to download the reads are extracted here but other fields are available on the server for other purpose, including sample descriptions.

PID <- "SRP033351"
ena.url <- paste("",
metadata <- read.table(url(ena.url), header=TRUE, sep="\t")
# get first line and transpose

The first row reports the following information:

run_accession SRR1039508
library_name NA
read_count 22935521

A second piece of code was prepared to extract metadata. The grouping was absent in the database, so we copied it from the GEO page and stored it in the 'samples' vector.

PID <- "SRP033351"
ena.url <- paste("",
# assemble into a data.frame
metadata <-, header=TRUE, sep="\t"))
# add missing information about cells & treatment
metadata$samples <- samples
# sample the first line
# add new columns
metadata$cells <- str_extract(metadata$sample, "\\b[A-Za-z0-9]+")
metadata$treatment <- mapply(sub,paste(metadata$cells,"_",sep=""), "", metadata$sample)
# save to file for reuse
write.table(metadata, file="GSE52778_metadata.txt", row.names=F, sep="\t", quote=FALSE)
# view table
   run_accession read_count           samples   cells treatment
1     SRR1039508   22935521  N61311_untreated  N61311 untreated
2     SRR1039509   21155707        N61311_Dex  N61311       Dex
3     SRR1039510   22852619        N61311_Alb  N61311       Alb
4     SRR1039511   21938637    N61311_Alb_Dex  N61311   Alb_Dex
5     SRR1039512   28136282 N052611_untreated N052611 untreated
6     SRR1039513   43356464       N052611_Dex N052611       Dex
7     SRR1039514   40021727       N052611_Alb N052611       Alb
8     SRR1039515   29454748   N052611_Alb_Dex N052611   Alb_Dex
9     SRR1039516   30043024 N080611_untreated N080611 untreated
10    SRR1039517   34298260       N080611_Dex N080611       Dex
11    SRR1039518   30441200       N080611_Alb N080611       Alb
12    SRR1039519   31533740   N080611_Alb_Dex N080611   Alb_Dex
13    SRR1039520   34575286 N061011_untreated N061011 untreated
14    SRR1039521   41152075       N061011_Dex N061011       Dex
15    SRR1039522   28406812       N061011_Alb N061011       Alb
16    SRR1039523   31099538   N061011_Alb_Dex N061011   Alb_Dex

As seen in the results, ftp, Aspera, and galaxy links are returned which can alternately be used to download the reads or integrate them in your favorite Galaxy server.


Download SRA-formatted data and convert it to fastQ using the SRA toolbox

The SRA data is not in FastQ format required for mapping but in a NCBI proprietary format which is handier for file exchange. We therefore need to download the data and convert it. This can be done manually or using a script to ease the process and run in background while you do other work at the bench or overnight.

Typical ftp and aspera links for the first 'SRP033351' sample 'SRR1039508' are:

# ftp link
# aspera NCBI link
# corresponding aspera link (from the European archive)

The easiest way to download SRA data is to proceed manually, file by file, from the browser. However, this can prove quit lengthy when you need 16 files as we now do. One alternate method involves creating a batch download script that uses the ftp list or the similar list of aspera links.

Handicon.png Aspera is a transfer protocol that includes error correction and is much faster that the conventional FTP. Please refer to Aspera Transfer Guide - SRA Handbook - NCBI Bookshelf [8]

We prepared a batch script '' to get all 16 SRA files and regenerate FASTQ paired read files from each. The code is reproduced below and can be adapted for other datasets with few changes. This script relies on the presence of the Aspera program and the required certificate at given locations defined in the code. If your Aspera executable and certificate are located elsewhere, please edit the corresponding lines. script

# required: SRAtoolkit installed and in PATH
# required: a functional aspera connect installation
## Aspera main parameters
# –Q (for adaptive flow control) – needed for disk throttling!
# –T to disable encryption
# –k1 enable resume of failed transfers
# –l (maximum bandwidth of request, try 200M and go up from there)
# –r recursive copy
# –i <private key file>
## set the following two variables according to your own Aspera location
# this should match sra-connect installed for a single user under unix
# /home/<user>/.aspera/connect
# create a LIST of files to download and proceed with aspera
LIST="SRR1039508 SRR1039509 SRR1039510 SRR1039511 SRR1039512 SRR1039513 SRR1039514 SRR1039515 SRR1039516 SRR1039517 SRR1039518 SRR1039519 SRR1039520 SRR1039521 SRR1039522 SRR1039523"
# adapt the following line if you need other reads
# it should end with the first letters of your SRA files
# create container for data and move into it
mkdir -p ${PRJ}_fastq/sra_downloads && cd ${PRJ}_fastq
# loop in the LIST and download one file at a time
for sra in $LIST; do
	# convert ID to URL using SRAtoolkit 'srapath'
	uri=$(srapath ${sra} | sed 's||baseurl|')
	## download SRA data
	cmd="${exefile} \
		-i ${sshcert} \
		-k 1 \
		-QTr \
		-l 10000m \
		${uri} \
	echo "# $cmd"
	eval $cmd
	# test if transfer succeeded or die
	if [ $RESULT -eq 0 ]; then
		echo "# Aspera transfer succeeded for ${sra}"
		echo "# Aspera transfer failed for ${sra}, aborting!"
		exit 1
	## convert SRA to fastq
	# archive data in gzip format to save space
	fastq-dump --split-3 --gzip sra_downloads/${sra}.sra
	# test if conversion succeeded or die
	if [ $RESULT -eq 0 ]; then
		echo "## SRA to FASTQ conversion succeeded for ${sra}"
		echo "## SRA to FASTQ conversion failed for ${sra}, aborting!"
		exit 1
# return where you came from
cd -
# test for happy ending
if [ $? = 0 ]; then
echo "### all steps succeeded"
echo "### something ended wrong!, please check!"

fastq extraction results

# note that some but not each sample have additional single-end reads resulting from pre-processing done by the provider

Read 22935521 spots for sra_download/SRR1039508.sra
Written 22935521 spots for sra_download/SRR1039508.sra

Read 21155707 spots for sra_download/SRR1039509.sra
Written 21155707 spots for sra_download/SRR1039509.sra

Read 22852619 spots for sra_download/SRR1039510.sra
Written 22852619 spots for sra_download/SRR1039510.sra
Rejected 5327734 READS because of filtering out non-biological READS

Read 21938637 spots for sra_download/SRR1039511.sra
Written 21938637 spots for sra_download/SRR1039511.sra

Read 28136282 spots for sra_download/SRR1039512.sra
Written 28136282 spots for sra_download/SRR1039512.sra
Rejected 13299398 READS because of filtering out non-biological READS

Read 43356464 spots for sra_download/SRR1039513.sra
Written 43356464 spots for sra_download/SRR1039513.sra
Rejected 17664215 READS because of filtering out non-biological READS

Read 40021727 spots for sra_download/SRR1039514.sra
Written 40021727 spots for sra_download/SRR1039514.sra

Read 29454748 spots for sra_download/SRR1039515.sra
Written 29454748 spots for sra_download/SRR1039515.sra
Rejected 9413805 READS because of filtering out non-biological READS

Read 30043024 spots for sra_download/SRR1039516.sra
Written 30043024 spots for sra_download/SRR1039516.sra

Read 34298260 spots for sra_download/SRR1039517.sra
Written 34298260 spots for sra_download/SRR1039517.sra

Read 30441200 spots for sra_download/SRR1039518.sra
Written 30441200 spots for sra_download/SRR1039518.sra

Read 31533740 spots for sra_download/SRR1039519.sra
Written 31533740 spots for sra_download/SRR1039519.sra
Rejected 273313 READS because of filtering out non-biological READS

Read 34575286 spots for sra_download/SRR1039520.sra
Written 34575286 spots for sra_download/SRR1039520.sra
Rejected 2744054 READS because of filtering out non-biological READS

Read 41152075 spots for sra_download/SRR1039521.sra
Written 41152075 spots for sra_download/SRR1039521.sra

Read 28406812 spots for sra_download/SRR1039522.sra
Written 28406812 spots for sra_download/SRR1039522.sra

Read 31099538 spots for sra_download/SRR1039523.sra
Written 31099538 spots for sra_download/SRR1039523.sra
Rejected 26533376 READS because of filtering out non-biological READS


Identifying FastQ quality issues

The examples provided in the wikiBooks page refer to single-end reads and use trimmomatic for data filtering while we have paired-end reads. When applying filtering to paired-end reads separately with programs like trimmomatic, some pairs will loose one end due to complete sequence removal and consequently cause a sync problem when mapping the read pairs. For this reason, we replaced the trimmomatic tool by cutadapt which allows paired processing (in two pass) of the two read files and that will eliminate orphan reads resulting from full removal of one end.

Reads with over-represented adaptors were identifed using fastQC and the identified adaptor sequences stored. Subsequently, cutadapt was applied to remove adaptor sequences and filter out reads with below-threshold base calls. Remaining read-pairs were stored to new paired fastQ files for each dataset.

We provide here results obtained from the second read pair N61311_Dex (SRR1039509_1.fastq.gz and SRR1039509_2.fastq.gz). Similar results were obtained for all other pairs. All data files necessary to reproduce the commands presented here were added to the web-server (see bottom link).

Technical.png Note that several samples also contain unpaired reads that are stored in a additional fastq files without suffix. We choose not to use this extra data in this tutorial but Tophat allows adding unpaired reads during mapping, please see manpage for more info


Initial QC with fastQC

FastQC is a JAVA standalone GUI but also a CLI tool. For the ease of training, feel free to use the graphical version by typing 'fastqc' in your terminal.

Alternatively, use the following script to generate a full QC on a given FastQ file. The program crates a GZip-compressed data archive that will need decompression to access the text details. script

# takes input FastQ from $1
# add to existing folder
mkdir -p ${outfolder}
# perform a full QC control on the reads (-q for quiet)
# we here use 4 threads to speed it up
fastqc -f fastq -t 4 -o ${outfolder} -q $1
# convert results to a nice PDF file
# requires htmldoc and dependencies installed on your machine
htmldoc --webpage \
    --browserwidth 800 \
    --fontsize 7 \
    -f ${outfolder}/fastqc_report.pdf \
FastQC plots for SRR1039509_1 before trimming PDF
FastQC plots for SRR1039509_2 before trimming PDF

Noticeably, both read files contain a low amount overrepresented adaptors identified by FastQC (0,52% and 0,12% of the reads respectively)

# extract the next two lines in the text data file after the line containing '>>Overrepresented sequences' and reformat the result
cat SRP033351_read_qc-Results/SRR1039509_1_fastqc/fastqc_data.txt \
    | grep -A 2 ">>Overrepresented se
quences" | tail -2 | transpose -t | column -t -s $'\t'
Count            111922
Percentage       0.5290392800391875
Possible Source  TruSeq Adapter, Index 4 (100% over 51bp)
# alternatively using 'sed' and returning all overrepresented sequences
sed -e '1,/>>Overrepresented sequences/d' -e '/>>END_MODULE/,$d' ${infile}
cat SRP033351_read_qc-Results/SRR1039509_2_fastqc/fastqc_data.txt | \
    grep -A 2 ">>Overrepresented se
quences" | tail -2 | transpose -t | column -t -s $'\t'
Count            25479
Percentage       0.12043558742801647
Possible Source  Illumina Single End PCR Primer 1 (100% over 45bp)
# alternatively using 'sed' and returning all overrepresented sequences
sed -e '1,/>>Overrepresented sequences/d' -e '/>>END_MODULE/,$d' ${infile}

Technical.png note the $'\t' in the code above!, this allows having a real TAB-character pasted to the terminal instead of several spaces when the "\t" is interpreted by the system

Several tools exist that can remove adaptors from reads, we chose cutadapt below that has the unique advantage of being applicable to paired reads and which is also able to filter reads of poor call quality.


Filtering poor quality reads and trimming over-represented adapters

The following script was developed to filter all read pairs and remove two adaptors identified with FastQC (code derived from [9]). Applying this script to the data was eased by the use of the GNU parallel program that maximized the use of the available cpus and RAM and processed up to 5 pairs concurrently instead of sequentially. The full process took a couple of hours on the BITS server.

The options applied in the command are minimal to avoid introducing uncontrolled bias and read as follows:

  • Discard trimmed reads that are shorter than 20bases after trimming (-m 20, --minimum-length=20)
  • Trim low-quality ends from reads before adapter removal if avgQ is less than 10 (-q 10) script

# apply to SRR1039509_1/2
# takes a fastQ file from $1 and performs 'cutadapt' cleaning
if [ $# -ge 1 ]
echo "Usage: ${0##*/} <fastQ file_1> (second fastQ is deduced)"
## run using gnu-parallel with 5 concurrent jobs like
## find SRP012167_fastq/*_1.fastq.gz \
##	| parallel --no-notice \
##		--workdir . --tmpdir ./tmp \
##		-j 5 scripts/ {}
mkdir -p ${outfolder}
# save tmp files elsewhere
mkdir -p ${tmpdir}
# process the FastQ pair
name1=$(basename ${fq1} .fastq.gz)
name2=$(basename ${fq2} .fastq.gz)
echo "# decompressing input files"
gzip -dc -f ${fq1} > ${tmpdir}/${name1}.fastq
gzip -dc -f ${fq2} > ${tmpdir}/${name2}.fastq
echo "# removing adaptors from read pair: ${name1} and ${name2}"
# TruSeq Adapter, Index 1 (100% over 50bp)
# TruSeq Adapter, Index 4 (100% over 51bp) - SRR1039509_1.fastq.gz
cutadapt -q 10 \
	-a ${ADAPTER_FWD} \
	--minimum-length 20 \
	--paired-output ${tmpdir}/${name2}.tmp.fastq \
	-o ${tmpdir}/${name1}.tmp.fastq \
	${tmpdir}/${name1}.fastq \
	${tmpdir}/${name2}.fastq \
	2>&1 | tee -a ${outfolder}/cutadapt-${name1%%_1}_stats.txt
echo "# done for ${name1} and forward adaptor"
# Illumina Single End PCR Primer 1 (100% over 50bp)
# Illumina Single End PCR Primer 1 (100% over 45bp) - SRR1039509_2.fastq.gz
cutadapt -q 15 \
	-a ${ADAPTER_REV} \
	--minimum-length 20 \
	--paired-output ${tmpdir}/${name1}.trimmed.fastq \
	-o ${tmpdir}/${name2}.trimmed.fastq \
	${tmpdir}/${name2}.tmp.fastq \
	${tmpdir}/${name1}.tmp.fastq \
	2>&1 | tee -a ${outfolder}/cutadapt-${name1%%_1}_stats.txt
echo "# done for ${name2} and reverse adaptor"
echo "# compressing results and removing tmp files"
bgzip -c ${tmpdir}/${name1}.trimmed.fastq > \
	${outfolder}/${name1}.fastq.gz \
	&& rm ${tmpdir}/${name1}*
bgzip -c ${tmpdir}/${name2}.trimmed.fastq > \
	${outfolder}/${name2}.fastq.gz \
	&& rm ${tmpdir}/${name2}*
echo "# done all"

cutadapt stats extracted from the log file

# forward trimming summary;
ACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAA --minimum-length 20 --paired-output ./tmp/SRR1039509_2.tmp.fastq -o ./tmp/SRR1039509_1.tmp.fastq ./tmp/SRR1039509_1.fastq ./tmp/SRR1039509_2.fastq
Maximum error rate: 10.00%
   No. of adapters: 1
   Processed reads:     21155707
   Processed bases:   1332809541 bp (1332.8 Mbp)
     Trimmed reads:       765030 (3.6%)
   Quality-trimmed:     12389232 bp (12.4 Mbp) (0.93% of total)
     Trimmed bases:     18216792 bp (18.2 Mbp) (1.37% of total)
   Too short reads:       308753 (1.5% of processed reads)
    Too long reads:            0 (0.0% of processed reads)
        Total time:    645.05 s
     Time per read:      0.030 ms

# reverse trimming summary
GTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAAAAAA --minimum-length 20 --paired-output ./tmp/SRR1039509_1.trimmed.fastq -o ./tmp/SRR1039509_2.trimmed.fastq ./tmp/SRR1039509_2.tmp.fastq ./tmp/SRR1039509_1.tmp.fastq
Maximum error rate: 10.00%
   No. of adapters: 1
   Processed reads:     20846954
   Processed bases:   1313358102 bp (1313.4 Mbp)
     Trimmed reads:       341413 (1.6%)
   Quality-trimmed:     17888827 bp (17.9 Mbp) (1.36% of total)
     Trimmed bases:      2668254 bp (2.7 Mbp) (0.20% of total)
   Too short reads:       161971 (0.8% of processed reads)
    Too long reads:            0 (0.0% of processed reads)
        Total time:    615.04 s
     Time per read:      0.030 ms


Post-cleaning QC with fastQC

We use the same script as above but this time on the trimmed data.

FastQC plots for SRR479052_1 after trimming PDF
FastQC plots for SRR479052_2 after trimming PDF

FastQC confirmed the full removal of both side adapters by cutadapt. The effect of trimming adaptors in a significant fraction of the reads but the clear artifact seen in the tile picture was not removed by cutadapt (processing read ends). Most other samples did not show quality drop nor left-over adaptors and it was decided here to not process them with cutadapt. The original reads were used for the mapping to the human reference detailed in NGS_RNASeq_DE Exercise.2.


Identifying trimming effect on mapping

In order to evaluate the effect of trimming, reads (unprocessed or trimmed) from the 'SRR1039509' sample were mapped to the full reference genome and mapping results compared. The results of this analysis are stored in a separate page NGS RNASeq DE Exercise.1c for convenience.


Download the bowtie2 reference genome index for hg19 and related files

  • In the follow-up page NGS_RNASeq_DE_Exercise.1a, we will download the reference genome files and deploy them on the BITS laptop that will be used in other parts of this session.
  • In the next exercise (NGS_RNASeq_DE_Exercise.1b) we will then map all reads from one sample to the human hg19 reference and extract reads mapping to chr22 as a smaller dataset for use during the remaining of this session.


download exercise files

Download exercise files here.

Use the right application to open the files present in ex1-files

  2. Blanca E Himes, Xiaofeng Jiang, Peter Wagner, Ruoxi Hu, Qiyu Wang, Barbara Klanderman, Reid M Whitaker, Qingling Duan, Jessica Lasky-Su, Christina Nikolos, William Jester, Martin Johnson, Reynold A Panettieri, Kelan G Tantisira, Scott T Weiss, Quan Lu
    RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells.
    PLoS ONE: 2014, 9(6);e99625
    [PubMed:24926665] ##WORLDCAT## [DOI] (I e)


[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.2 ]