RNASeq analysis of drosophila reads

From BITS wiki
Jump to: navigation, search

Command-line version of the Galaxy hands-on session held on September 20th 2012 in Gasthuisberg (Leuven)

[ Main_Page ]

rnaseq2012.png

References

This tutorial recapitulate a Galaxy hands-on session co-organised by BITS and the research group of Prof. S. Aerts. The main aim of the present document is to demystify the use of command line interface (CLI) to perform NGS analysis and to show that it can be replicated with a minimal computer knowledge when a Galaxy is not available or CLI preferred to the popular Web platform.

Tutorial workflow

  • Get the necessary raw data and annotations from the internet
  • Remove adapters from the raw reads with 'FASTX-Toolkit'
  • Perform quality control on the reads with 'FastQC'
  • Map the reads to the reference genome with 'TopHat2'
  • Summarize mapping results and perform mapping QC with 'RSeQC'
  • Mark read duplicates (optional step)
  • Compute differential expression between sample pairs with 'cuffdiff'

Technical.png Biological and functional analysis steps presented during the workshop and including gene-set enrichment, gene ontology, or pathway analyses are not replicated here

Get the necessary raw data and annotations from the internet

A bash script (*get_data.sh*) was designed to download the necessary data from the 'Laboratory of Computational Biology' page. This tutorial is based on *Drosophila melanogaster* sequencing data but can easily be adapted to work with human, mouse or any other species for which genome sequence and gene annotations are available. Requirements for this scripts should be present on all decent Unix computer (bash, wget).

Technical.png Nothing being eternal!, when links point to error pages, please contact BITS to obtain a backup copy of the data.

raw_files.png

Technical.png As seen above, the FastQ reads files are small and only contain reads identified in a former full-maping as aligning on the chrX region. The BAM files, instead, result from the mapping of all reads to the full drosophila reference genome

#!/bin/bash
 
# scriptname: get_data.sh
 
# download raw read files and annotations required for the tutorial 
 
echo "## This script get the required data from the Internet"
 
# create a folder for the data
dest="data"
mkdir -p ${dest} \
	|| (echo "# you need to run this script from a writable folder!"; exit 0)
 
# urls
lcburl="http://ngsworkshop.aertslab.org/"
flyurl="ftp://ftp.flybase.net/genomes/Drosophila_melanogaster"
 
# list of files
LIST=$(cat <<END_HEREDOC
${lcburl}/NGS-workshop-2012.pdf
${lcburl}/Tophat_for_Illumina_on_Dmel1_accepted_hits.bam
${lcburl}/Tophat_for_Illumina_on_Dmel1_accepted_hits.bam.bai
${lcburl}/Tophat_for_Illumina_on_Dmel2_accepted_hits.bam
${lcburl}/Tophat_for_Illumina_on_Dmel2_accepted_hits.bam.bai
${lcburl}/Tophat_for_Illumina_on_mut-r2_accepted_hits.bam
${lcburl}/Tophat_for_Illumina_on_mut-r2_accepted_hits.bam.bai
${lcburl}/Tophat_for_Illumina_on_mut-r3_accepted_hits.bam
${lcburl}/Tophat_for_Illumina_on_mut-r3_accepted_hits.bam.bai
${lcburl}/chrX_9000000_9400000_Dmel1.fastq
${lcburl}/chrX_9000000_9400000_Dmel2.fastq
${lcburl}/chrX_9000000_9400000_glass-mutant_r2.fastq
${lcburl}/chrX_9000000_9400000_glass-mutant_r3.fastq
${lcburl}/Flybase2006.gtf
${flyurl}/dmel_r5.45_FB2012_03/gff/dmel-all-r5.45.gff.gz
${flyurl}/dmel_r5.45_FB2012_03/fasta/dmel-all-chromosome-r5.45.fasta.gz
END_HEREDOC
)
 
cd ${dest}
for file in ${LIST}; do
 name=${file##*/}
 if [ ! -e "${name}" ]; then
 echo ${file}
 wget -np -nv --timestamping ${file} && echo "# OK!"
else
 echo "# "${name}": already downloaded!"
fi
done
 
echo "# all downloads done!"

Remove adapters from the raw reads with 'FASTX-Toolkit'

FastX_toolkit [1] allows quality control and basic manipulations on fastQ data including trimming, clipping, and barcode splitting of multiplex runs.

Modern mappers (aligners) should handle adaptors and ignore them as they will likely not resemble the reference sequence. In some instances, the presence of adaptors could slow down mapping and introduce bias in the analysis. More importantly, QC results when analyzing adaptor-contaminated reads will not look good and interfere with result interpretation. For this reason alone, we advise to remove cloning and amplification sequences before performing QC and mapping of short reads.

Four Fastq files represent the raw Data

They correspond to 2 replicates for each conditions containing reads for the region chrX:9000000-9400000:

  • wildtype eye/antennal discs (Dmel1 and Dmel2)
    • chrX_9000000_9400000_Dmel1.fastq
    • chrX_9000000_9400000_Dmel2.fastq
  • glass mutant eye/antennal discs (glass-mutant_r2 and glass-mutant_r3)
    • chrX_9000000_9400000_glass-mutant_r2.fastq
    • chrX_9000000_9400000_glass-mutant_r3.fastq

Each has a common adaptor #1 and a specific adaptor #2 to be removed (see code below)

The following script (*clip_adaptors.sh*) will take care of removing adaptors and obtain cleaned reads suitable for mapping. Save the code and execute it as previously.

#!/bin/bash
 
# scriptname: clip_adaptors.sh
 
# adaptator #1 (same for all)
a1="CATTTGTTTGGCTATTAATTGAACAAATGAGATCGGAAGAGCACACGTCT"
# adaptator #2 (sample specific)
a2_Dmel1="GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG"
a2_Dmel2="GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG"
a2_r2="GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG"
a2_r3=${a2_r2}
 
for fq in $(ls data/*.fastq); do
 echo "# clipping: ${fq}"
 
 name=$(basename "${fq##*_}" .fastq)
 var="a2_${name}"
 echo "# a1 = ${a1}"
 # this one is a bit more tricky!, the name of the variable containing the
 # second adaptor is calculated from the name of the fastq file 
 # thanks to the use of the '!' symbol.
 echo "# var = ${!var}"
 # execute the command and write summary results in a 'log' file
 cmd="(fastx_clipper -Q33 -a ${a1} -M15 -n -v -l 20 -i ${fq} | \
  fastx_clipper -Q33 -a ${!var} -M15 -n -v -l 20 >data/clipped_${name}.fastq) \
  2> data/clipper-log_${name}.txt"
 echo "# ${cmd}"
 eval ${cmd}
 echo
done

Perform quality control on the reads with 'FastQC'

FastQC [2]can be run at command-line or from the java GUI to perform reliable QC of short reads. The following code will run fastqc from the CLI for each clipped fastq file adn produce a zip file with html results in the data folder. The 'fastqc' executable file should be known by your system (reachable from the $PATH environment variable). The results are viewable after opening the archive and pointing your web browser to the 'fastqc_report.html' file. The following code (*run_fastqc.sh*) will perform QC on each clipped file.

#!/bin/bash
 
# scriptname: run_fastqc.sh
 
for cfq in $(ls data/clipped_*.fastq); do
 echo "# clipping: ${cfq}"
 cmd="fastqc --noextract -f fastq ${cfq}"
 echo "# ${cmd}"
 eval ${cmd}
 echo
done

The top part of a common fastQC report shows diagnose results with detailed illustrations.

fastqc_example.png

Map the reads to the reference genome with 'TopHat2'

In RNASeq, single or paired reads should all align to transcript sequences but not necessarily to the genomic reference due to splice events. Tophat uses a hash alignment algorithm (Bowtie) and allows genome sequence rearrangements resulting from splicing present in RNASeq analysis. When mapping larger number of reads in gDNASeq experiments, other software will be preferred, that align the reads with the gDNA (eg BWA or the more widely used Bowtie.

As this is *single-end* RNASeq data, we will apply *Tophat* with the *tophat_map.sh* script and use the downloaded chromosome sequences (r5.45) and accompanying transcript GFF database *dmel-all-r5.45.gff* to map the reads to the known Drosophila melanogaster transcripts (build:FlyBase FB2012_03).

Tophat [3] can be applied in two different ways. The most straightforward mode is *independent from any known transcripts* database and will try to align the reads directly to the genome. All reads that do not align in their whole length are collected and used to search for consensus splice sites from which exon models are built. This mode is unbiased by former knowledge but obviously takes more time to compute both steps and may identify novel exons that are not documented by functional data. The second mode is *using prior transcript information* and mapping the reads to a transcript database (GFF file). The reads that do not map to known transcripts are used, (this can be turned off by option: -no-novel-juncs) in a second step, to build novel exon models.

Although it might be preferable to apply the faster transcript-alignment mode when the user does not whish to interpret novel transcript models; we will apply both methods here to generate data that can be compared using other tools. Only the data obtained with the full genome mapping will be further procecced in this tutorial.

In the next script (*run_tophat.sh*) a number of steps are performed in order to:

  • decompress and index the genome fasta file
  • build a bowtie2 index for the reference genome
  • create a corrected copy of the provided 'Flybase2006.gtf' where 'chr' is removed from the chromosome names

then for each clipped-read data file:

  • map the clipped reads in un-guided modes
  • map the clipped reads in guided modes with 'Flybase2006.gtf'
  • copy both BAM-formatted mapping results and index them for later use
#!/bin/bash
 
# scriptname: run_tophat.sh
 
data="data"
results="tophat_results"
mkdir -p ${results}
fasta=dmel-all-chromosome-r5.45.fasta
origtf=${data}/Flybase2006.gtf
gtffile=${results}/transcripts.gtf
nthr=4 # using n processors
 
# extract and index genome fasta file
gunzip -c ${data}/${fasta}.gz \
	> ${results}/${fasta} \
	&& samtools faidx ${results}/${fasta}
 
# create a corrected copy of ${origtf} 
# where 'chr' is removed from the chromosome names.
# This is necessary in order to match the chromosome names in the sequence file. 
# Also add a link with .gff extension for good measure.
#  you can check both files with either:
#   bowtie-inspect --names # for the index file
#  and
#   head ${gtffile} # for the annotation database
perl -p -e 's/^chr//' ${origtf} > ${gtffile}
 
# build index for the reference genome if not present
if [ ! -f "${results}/dm3idx.1.bt2" ]; then
 bowtie2-build ${results}/${fasta} ${results}/dm3idx
 # also create a named copy of the fasta file
 bowtie2-inspect ${results}/dm3idx \
 	> ${results}/dm3idx.fa \
 	&& samtools faidx ${results}/dm3idx.fa
fi
 
# map single ends in unguided and guided modes (not|using the gtf file)
for fq in $(ls data/clipped_*.fastq); do
 echo "# mapping: ${fq}"
 name=$(basename "${fq##*_}" .fastq)
 
 # execute the command and write summary results in a 'log' file
 cmd="tophat -p ${nthr} \
 	-o ${results}/${name}_all-mappings \
 	--no-coverage-search \
 	${results}/dm3idx \
 	${fq} \
 	2> ${results}/tophat-all-log_${name}.txt"
 echo "# ${cmd}"
 eval ${cmd}
 
 # copy, rename, and index mapping results
 if [ $? = 0 ]; then
  cp ${results}/${name}_all-mappings/accepted_hits.bam \
  	${results}/${name}_all-tophat.bam \
  	&& samtools index ${results}/${name}_all-tophat.bam  
 fi
 echo "# mapping: ${fq} with transcript models"
 
 # execute the command and write summary results in a 'log' file
 # first time -G ${gtffile} then --transcriptome-index ${results}/transcripts 
 cmd="tophat \
 	-p ${nthr} \
 	-o ${results}/${name}_gtf-mappings \
 	-G ${gtffile} \
 	--no-coverage-search ${results}/dm3idx ${fq} \
 	2> ${results}/tophat-gtf-log_${name}.txt"
 echo "# ${cmd}"
 eval ${cmd}
 
 # copy, rename, and index mapping results
 if [ $? = 0 ]; then
  cp ${results}/${name}_gtf-mappings/accepted_hits.bam \
  	${results}/${name}_gtf-tophat.bam \
  	&& samtools index ${results}/${name}_gtf-tophat.bam
 fi
 echo
 
done

Summarize mapping results and perform mapping QC with 'RSeQC'

Some biases are known and expected in each NGS mapping experiment. They include GC-bias and 5' or 3' mapping decrease associated with exonuclease degradation of the mRNA during library construction. It is very important to identify these biases before performing differential expression analysis.

A very handy RSeQC toolbox python package provides various useful methods to assess RNASeq distribution of different biases (RSeQC manual [4]).

As example, we will analyse here the 'Read coverage over gene body' with a script (*run_RSeQC.sh*) driving the command 'geneBody_coverage.py' which takes a bam file and gene model BED12 file to compute percentile coverage across all transcript models.

#!/bin/bash
 
# scriptname: run_rseqc.sh
 
data="data"
results="tophat_results"
qc_results="RSeQC_results"
mkdir -p ${qc_results}
 
# reference files from UCSC table browser
orirefgene=${data}/dm3-refgene.bed12
refgene=${data}/refgene.bed12
 
# sort and remove 'chr' from UCSC bed12 data
bedtools sort -i ${orirefgene} | perl -p -e 's/^chr//' > ${refgene}
 
# compute coverage for all transcript models in percentiles
for bam in $(ls ${results}/*-tophat.bam); do
 echo "# mapping: ${bam}"
 name=$(basename "${bam%%-tophat.bam}")
 echo ${name}
 res=${qc_results}/${name}
 
 cmd="geneBody_coverage.py -r ${refgene} -i ${bam} -o ${res}"
 echo ${cmd}
 eval ${cmd}
done

The analysis on all reads from the first sample (Dmel1) gives the expected bell-shape while the QC on the locus-selected reads leads to a somewhat lower resolution graph due to the limited number of transcripts summarised in that region.

bodyCvg.png

Other interesting QC analyses can be performed using 'RSeQC toolbox', including computing feature distribution (UTR vs introns, and exons) or mappability. Please refer to the manual for more information about 'RSeQC'.

Mark read duplicates (optional step)

This step is not required for the current tutorial but shown as example

Exact read duplicates (PCR-duplicates) and sequencer optical duplicates may interfere when measuring transcript abundance with RNASeq or computing allele frequencies in gDNASeq. Removing duplicates is recommended by some users and avoided by others. The results of the QC analysis can be used to motivate your choice. An interesting discussion covering this choice can be found on *SeqAnswers* [5].

Picard tools [6] include a specific command to mark read duplicates in BAM data, a test usage *run_picard_md.sh* is presented below to illustrate this tool.

#!/bin/bash
 
# scriptname: run_picard_md.sh
 
# Mark and remove duplicates with Picard tools
 
# change this to the path to your copy of Picard
picard_path=$BIOTOOLS/picard
 
data="data"
results=${data}/deduplicated_bam-files
mkdir -p ${results}
 
# prepare parameters for command below
ARGS="-Xmx4G"
RD="true" # true will remove them while false will only mark them as duplicates
AS="true" # sorted data is expected
VS="LENIENT" # be tolerant for minor SAM format issues
 
# very important variable to identify the info contained in the read names
# note the added back-slash+quotes around this complex string
RNR="\"[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*\""
 
for bam in $(ls ${data}/*_accepted_hits.bam); do
 
# define file names
infile=$(basename ${bam})
outfile=${result}/${infile/_accepted_hits.bam/-rmdup.bam}
sumfile=${result}/${infile/_accepted_hits.bam/-dedup_output.txt}
 
echo "# marking duplicates in: ${bam}"
 
# build command
cmd="java ${ARGS} \
	-jar ${picard_path}/MarkDuplicates.jar \
	I=${bam} \
	O=${outfile} \
	METRICS_FILE=${sumfile} \
	REMOVE_DUPLICATES=${RD} \
	ASSUME_SORTED=${AS} \
	VALIDATION_STRINGENCY=${VS} \
	READ_NAME_REGEX=${RNR}"
 
echo "# ${cmd}"
eval ${cmd}
echo
 
done

Compute differential expression between sample pairs with 'cuffdiff'

Find significant changes in transcript/gene expression starting from BAM files and a GTF file using FPKM. To test whether an observed difference in a gene's expression is significant, Cuffdiff compares the log ratio of gene's expression in two conditions against the log of one (*run_cuffdiff.sh*).

#!/bin/bash
 
# scriptname: run_cuffdiff.sh
 
data="data"
cuffdiff_results="cuffdiff_results"
mkdir -p ${cuffdiff_results}
 
# using n processors
nthr=4 
 
# locate reference files
transcripts=${data}/Flybase2006.gtf
labels="wild_types,mutants"
wt=$(ls -dm ${data}/Tophat_for_Illumina_on_Dmel?_accepted_hits.bam | tr -d '\ ' | tr -d '\n')
mut=$(ls -dm ${data}/Tophat_for_Illumina_on_mut-r?_accepted_hits.bam | tr -d '\ ' | tr -d '\n')
 
# perform cuffdiff analysis with target FDR of 5% (default)
cmd="cuffdiff --no-update-check -q -p ${nthr} \
 -c 10 \
 --FDR 0.05\
 --library-norm-method quartile \
 -L ${labels} \
 ${transcripts} ${wt} ${mut} \
 -o ${cuffdiff_results}"
echo "# ${cmd}"
eval ${cmd}

The results of this analysis at gene level are found in 'gene_exp.diff' and shown in **Figure 3**. Such results can be used to prioritise and perform downstream functional analyses with other software tools (GO, GSEA, IPA, ...).

Technical.png This run was done with reads mapping to a 400Mb locus of chrX and therefore not representative of the full genome. A similar anaysis performed on the full set of reads would be required to allow functional analysis

cuffdiff_table.png

Resources for Functional analysis

  • Gene Ontology enRIchment anaLysis and visuaLizAtion tool (GORILLA) [7]
  • DAVID Bioinformatics Resources 6.7 [8]
  • Gene Set Enrichment Analysis (GSEA) [9]
  • a number of R/Bioconductor packages can also be used here
  • and obviously, if you have access to it, Ingenuity Pathway Analysis (IPA) will be perfect here.

Discussion

This simplified tutorial proposes a parallel to the 2012 workshop without using Galaxy. It is clear from this report that Galaxy presents advantages by standardising the workflow and allowing to insert conversion or reformatting steps between tools. On the other hand, full command line processing allows more customisation and structuring of the output as well as generation of full workflow made from chained scripts that can be re-applied on new data with minor changes.

The most important take home message is that under any circumstances, NGS analysis requires constant review of every step results to identify pitfalls and biases and will likely never become a fully automated task until all reference data and tools will speak the same language. The end user will select the approach which most fits his needs and rationale.

More information can be obtained from many places including

Download Hands-on data and results

Download hands-on data from the BITS server

Use the right application to open the files present in hands-on related files

References:
  1. <http://hannonlab.cshl.edu/fastx_toolkit/>
  2. <http://www.bioinformaDcs.babraham.ac.uk/projects/fastqc/>
  3. http://tophat.cbcb.umd.edu
  4. http://dldcc-web.brc.bcm.edu/lilab/liguow/CGI/rseqc/_build/html/index.html
  5. http://seqanswers.com/forums/showthread.php?t=6854
  6. http://picard.sourceforge.net
  7. <http://cbl-gorilla.cs.technion.ac.il>
  8. <http://david.abcc.ncifcrf.gov>
  9. <http://www.broadinstitute.org/gsea/index.jsp>

[ Main_Page ]