RNASeq analysis of Drosophila reads
Command-line version of the Galaxy hands-on session held on September 20th 2012 in Gasthuisberg (Leuven)
Contents
- 1 References
- 2 Tutorial workflow
- 2.1 Get the necessary raw data and annotations from the internet
- 2.2 Remove adapters from the raw reads with 'FASTX-Toolkit'
- 2.3 Perform quality control on the reads with 'FastQC'
- 2.4 Map the reads to the reference genome with 'TopHat2'
- 2.5 Summarize mapping results and perform mapping QC with 'RSeQC'
- 2.6 Mark read duplicates (optional step)
- 2.7 Compute differential expression between sample pairs with 'cuffdiff'
- 3 Resources for Functional analysis
- 4 Discussion
- 5 Download Hands-on data and results
References
This tutorial recapitulates a Galaxy hands-on session co-organised by BITS and the research group of Prof. S. Aerts. The main aim of the present document is to demystify the use of the command-line interface (CLI) for NGS analysis and to show that the session can be replicated with minimal computer knowledge when a Galaxy server is not available or when the CLI is preferred to the popular web platform.
- Original authors: Delphine POTIER, Rekin's JANKY & Stein AERTS; LCB, KU Leuven.
- LabURL: http://gbiomed.kuleuven.be/english/research/50000622/lcb/
- Workshop: 'Next Generation Sequencing Workshop'; Leuven, Sept 20, 2012
- WS-Program: http://www.vib.be/en/training/research-training/courses/Documents/PROGRAM_NGS.pdf
- Resources: http://ngsworkshop.aertslab.org
- This document and embedded scripts: <http://www.bits.vib.be>, mailto:bits@vib.be
Tutorial workflow
- Get the necessary raw data and annotations from the internet
- Remove adapters from the raw reads with 'FASTX-Toolkit'
- Perform quality control on the reads with 'FastQC'
- Map the reads to the reference genome with 'TopHat2'
- Summarize mapping results and perform mapping QC with 'RSeQC'
- Mark read duplicates (optional step)
- Compute differential expression between sample pairs with 'cuffdiff'
The biological and functional analysis steps presented during the workshop, including gene-set enrichment, gene ontology, and pathway analyses, are not replicated here.
Get the necessary raw data and annotations from the internet
A bash script (*get_data.sh*) was designed to download the necessary data from the 'Laboratory of Computational Biology' page. This tutorial is based on *Drosophila melanogaster* sequencing data but can easily be adapted to work with human, mouse or any other species for which a genome sequence and gene annotations are available. The requirements for this script (bash, wget) should be present on any decent Unix computer.
Nothing lasts forever: should the links point to error pages, please contact BITS to obtain a backup copy of the data.
The FastQ read files are small and only contain reads identified in a previous full mapping as aligning to the chrX region. The BAM files, by contrast, result from mapping all reads to the full Drosophila reference genome.
#!/bin/bash
# scriptname: get_data.sh
# download raw read files and annotations required for the tutorial
echo "## This script get the required data from the Internet"

# create a folder for the data
dest="data"
mkdir -p ${dest} \
  || (echo "# you need to run this script from a writable folder!"; exit 0)

# urls
lcburl="http://ngsworkshop.aertslab.org/"
flyurl="ftp://ftp.flybase.net/genomes/Drosophila_melanogaster"

# list of files
LIST=$(cat <<END_HEREDOC
${lcburl}/NGS-workshop-2012.pdf
${lcburl}/Tophat_for_Illumina_on_Dmel1_accepted_hits.bam
${lcburl}/Tophat_for_Illumina_on_Dmel1_accepted_hits.bam.bai
${lcburl}/Tophat_for_Illumina_on_Dmel2_accepted_hits.bam
${lcburl}/Tophat_for_Illumina_on_Dmel2_accepted_hits.bam.bai
${lcburl}/Tophat_for_Illumina_on_mut-r2_accepted_hits.bam
${lcburl}/Tophat_for_Illumina_on_mut-r2_accepted_hits.bam.bai
${lcburl}/Tophat_for_Illumina_on_mut-r3_accepted_hits.bam
${lcburl}/Tophat_for_Illumina_on_mut-r3_accepted_hits.bam.bai
${lcburl}/chrX_9000000_9400000_Dmel1.fastq
${lcburl}/chrX_9000000_9400000_Dmel2.fastq
${lcburl}/chrX_9000000_9400000_glass-mutant_r2.fastq
${lcburl}/chrX_9000000_9400000_glass-mutant_r3.fastq
${lcburl}/Flybase2006.gtf
${flyurl}/dmel_r5.45_FB2012_03/gff/dmel-all-r5.45.gff.gz
${flyurl}/dmel_r5.45_FB2012_03/fasta/dmel-all-chromosome-r5.45.fasta.gz
END_HEREDOC
)

cd ${dest}

for file in ${LIST}; do
  name=${file##*/}
  if [ ! -e "${name}" ]; then
    echo ${file}
    wget -np -nv --timestamping ${file} && echo "# OK!"
  else
    echo "# "${name}": already downloaded!"
  fi
done

echo "# all downloads done!"
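Once the downloads have completed, a quick sanity check can confirm that the locus-restricted FastQ files are indeed small while the pre-computed BAM files cover the whole genome. The snippet below is a minimal sketch and not part of the original hands-on material; it assumes samtools is installed and that the files sit in the 'data' folder.

#!/bin/bash
# check_data.sh (hypothetical helper, not part of the workshop scripts)
# count reads in each locus-restricted FastQ file (4 lines per read)
for fq in data/chrX_9000000_9400000_*.fastq; do
  echo "${fq}: $(( $(wc -l < ${fq}) / 4 )) reads"
done
# summarise the pre-computed full-genome BAM files
for bam in data/Tophat_for_Illumina_on_*_accepted_hits.bam; do
  echo "## ${bam}"
  samtools flagstat ${bam}
done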
Remove adapters from the raw reads with 'FASTX-Toolkit'
The FASTX-Toolkit [1] allows quality control and basic manipulation of FastQ data, including trimming, clipping, and barcode splitting of multiplexed runs.
Modern mappers (aligners) should handle adaptors and ignore them, as they will likely not resemble the reference sequence. In some instances, however, the presence of adaptors can slow down mapping and introduce bias into the analysis. More importantly, QC results obtained from adaptor-contaminated reads will not look good and will interfere with result interpretation. For this reason alone, we advise removing cloning and amplification sequences before performing QC and mapping of short reads.
Four FastQ files represent the raw data.
They correspond to two replicates for each condition, containing reads from the region chrX:9,000,000-9,400,000:
- wildtype eye/antennal discs (Dmel1 and Dmel2)
- chrX_9000000_9400000_Dmel1.fastq
- chrX_9000000_9400000_Dmel2.fastq
- glass mutant eye/antennal discs (glass-mutant_r2 and glass-mutant_r3)
- chrX_9000000_9400000_glass-mutant_r2.fastq
- chrX_9000000_9400000_glass-mutant_r3.fastq
Each file contains a common adaptor #1 and a sample-specific adaptor #2 to be removed (see the code below).
The following script (*clip_adaptors.sh*) removes the adaptors and produces cleaned reads suitable for mapping. Save the code and execute it as before.
#!/bin/bash
# scriptname: clip_adaptors.sh

# adaptor #1 (same for all samples)
a1="CATTTGTTTGGCTATTAATTGAACAAATGAGATCGGAAGAGCACACGTCT"

# adaptor #2 (sample specific)
a2_Dmel1="GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG"
a2_Dmel2="GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG"
a2_r2="GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG"
a2_r3=${a2_r2}

for fq in $(ls data/*.fastq); do
  echo "# clipping: ${fq}"
  name=$(basename "${fq##*_}" .fastq)
  var="a2_${name}"
  echo "# a1 = ${a1}"
  # this one is a bit more tricky: the name of the variable containing the
  # second adaptor is derived from the name of the fastq file
  # thanks to the '!' (indirect expansion) symbol.
  echo "# var = ${!var}"
  # execute the command and write summary results in a 'log' file
  cmd="(fastx_clipper -Q33 -a ${a1} -M15 -n -v -l 20 -i ${fq} | \
    fastx_clipper -Q33 -a ${!var} -M15 -n -v -l 20 > data/clipped_${name}.fastq) \
    2> data/clipper-log_${name}.txt"
  echo "# ${cmd}"
  eval ${cmd}
  echo
done
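To verify the effect of clipping, read counts before and after can be compared, in addition to the summaries written by fastx_clipper to the 'clipper-log' files. The snippet below is a minimal sketch, assuming the clipped files were written to 'data/' as in the script above.

#!/bin/bash
# compare read counts before and after adaptor clipping (illustrative check)
for fq in data/chrX_9000000_9400000_*.fastq; do
  name=$(basename "${fq##*_}" .fastq)
  before=$(( $(wc -l < ${fq}) / 4 ))
  after=$(( $(wc -l < data/clipped_${name}.fastq) / 4 ))
  echo "${name}: ${before} reads before clipping, ${after} after"
done
# the fastx_clipper summaries are also kept in data/clipper-log_*.txt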
Perform quality control on the reads with 'FastQC'
FastQC [2] can be run at the command line or from its Java GUI to perform reliable QC of short reads. The following code runs fastqc from the CLI on each clipped fastq file and produces a zip file with HTML results in the data folder. The 'fastqc' executable should be known to your system (reachable through the $PATH environment variable). The results can be viewed by opening the archive and pointing your web browser to the 'fastqc_report.html' file. The following code (*run_fastqc.sh*) performs QC on each clipped file.
#!/bin/bash
# scriptname: run_fastqc.sh

for cfq in $(ls data/clipped_*.fastq); do
  echo "# running QC on: ${cfq}"
  cmd="fastqc --noextract -f fastq ${cfq}"
  echo "# ${cmd}"
  eval ${cmd}
  echo
done
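Because '--noextract' keeps each report zipped, the archives can be unpacked in one go before viewing. This small sketch is an assumption on my part; archive names may differ slightly between FastQC versions.

#!/bin/bash
# unpack the FastQC archives so the HTML reports can be opened in a browser
for zip in data/clipped_*_fastqc.zip; do
  unzip -o -q ${zip} -d data/
done
# then point your browser to e.g. data/clipped_Dmel1_fastqc/fastqc_report.html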
The top part of a typical FastQC report shows diagnostic results with detailed illustrations.
Map the reads to the reference genome with 'TopHat2'
In RNASeq, single or paired reads should all align to transcript sequences but not necessarily contiguously to the genomic reference, because of splicing. TopHat relies on the Bowtie aligner and accounts for the split alignments that splicing introduces in RNASeq data. When mapping larger numbers of reads in gDNA-Seq experiments, other software that aligns the reads directly to the gDNA (e.g. BWA or the widely used Bowtie) is preferred.
As this is *single-end* RNASeq data, we will apply *TopHat* with the *run_tophat.sh* script and use the downloaded chromosome sequences (r5.45) together with the 'Flybase2006.gtf' transcript annotation to map the reads to the known *Drosophila melanogaster* transcripts (build FlyBase FB2012_03).
TopHat [3] can be applied in two different ways. The most straightforward mode is *independent of any known transcript database* and tries to align the reads directly to the genome. All reads that do not align over their whole length are collected and used to search for consensus splice sites, from which exon models are built. This mode is unbiased by prior knowledge but obviously takes more time, as both steps must be computed, and it may identify novel exons that are not documented by functional data. The second mode *uses prior transcript information* and maps the reads to a transcript database (GFF/GTF file). The reads that do not map to known transcripts are then used, in a second step, to build novel exon models (this can be turned off with the option --no-novel-juncs).
Although it might be preferable to apply the faster transcript-alignment mode when the user does not wish to interpret novel transcript models, we will apply both methods here to generate data that can be compared using other tools. Only the data obtained with the full-genome mapping will be further processed in this tutorial.
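For clarity, the two modes boil down to the following minimal invocations; the index, read, and output names here are placeholders, and the full commands are given in the script further below.

# unguided mode: reads are aligned directly to the genome index,
# unmapped read segments are used to discover splice junctions de novo
tophat -p 4 --no-coverage-search -o unguided_out dm3idx clipped_reads.fastq

# guided mode: reads are first mapped to the known transcripts in the GTF file,
# remaining reads are then mapped to the genome to find novel junctions
tophat -p 4 --no-coverage-search -G transcripts.gtf -o guided_out dm3idx clipped_reads.fastq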
In the next script (*run_tophat.sh*) a number of steps are performed in order to:
- decompress and index the genome fasta file
- build a bowtie2 index for the reference genome
- create a corrected copy of the provided 'Flybase2006.gtf' where 'chr' is removed from the chromosome names
then for each clipped-read data file:
- map the clipped reads in un-guided mode
- map the clipped reads in guided mode with 'Flybase2006.gtf'
- copy both BAM-formatted mapping results and index them for later use
#!/bin/bash
# scriptname: run_tophat.sh

data="data"
results="tophat_results"
mkdir -p ${results}

fasta=dmel-all-chromosome-r5.45.fasta
origtf=${data}/Flybase2006.gtf
gtffile=${results}/transcripts.gtf
nthr=4 # number of processors to use

# extract and index genome fasta file
gunzip -c ${data}/${fasta}.gz \
  > ${results}/${fasta} \
  && samtools faidx ${results}/${fasta}

# create a corrected copy of ${origtf}
# where 'chr' is removed from the chromosome names.
# This is necessary in order to match the chromosome names in the sequence file.
# you can check both files with:
#   bowtie-inspect --names   # for the index file
#   head ${gtffile}          # for the annotation database
perl -p -e 's/^chr//' ${origtf} > ${gtffile}

# build index for the reference genome if not present
if [ ! -f "${results}/dm3idx.1.bt2" ]; then
  bowtie2-build ${results}/${fasta} ${results}/dm3idx
  # also create a named copy of the fasta file
  bowtie2-inspect ${results}/dm3idx \
    > ${results}/dm3idx.fa \
    && samtools faidx ${results}/dm3idx.fa
fi

# map single ends in unguided and guided modes (without | with the gtf file)
for fq in $(ls data/clipped_*.fastq); do
  echo "# mapping: ${fq}"
  name=$(basename "${fq##*_}" .fastq)

  # execute the command and write summary results in a 'log' file
  cmd="tophat -p ${nthr} \
    -o ${results}/${name}_all-mappings \
    --no-coverage-search \
    ${results}/dm3idx \
    ${fq} \
    2> ${results}/tophat-all-log_${name}.txt"
  echo "# ${cmd}"
  eval ${cmd}

  # copy, rename, and index mapping results
  if [ $? = 0 ]; then
    cp ${results}/${name}_all-mappings/accepted_hits.bam \
      ${results}/${name}_all-tophat.bam \
      && samtools index ${results}/${name}_all-tophat.bam
  fi

  echo "# mapping: ${fq} with transcript models"
  # execute the command and write summary results in a 'log' file
  # first time -G ${gtffile} then --transcriptome-index ${results}/transcripts
  cmd="tophat \
    -p ${nthr} \
    -o ${results}/${name}_gtf-mappings \
    -G ${gtffile} \
    --no-coverage-search ${results}/dm3idx ${fq} \
    2> ${results}/tophat-gtf-log_${name}.txt"
  echo "# ${cmd}"
  eval ${cmd}

  # copy, rename, and index mapping results
  if [ $? = 0 ]; then
    cp ${results}/${name}_gtf-mappings/accepted_hits.bam \
      ${results}/${name}_gtf-tophat.bam \
      && samtools index ${results}/${name}_gtf-tophat.bam
  fi
  echo
done
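Before moving on to RSeQC, basic per-sample mapping statistics can be printed with samtools. This is a minimal sketch, assuming the renamed BAM copies created by the script above.

#!/bin/bash
# print basic mapping statistics for each renamed TopHat BAM file
for bam in tophat_results/*-tophat.bam; do
  echo "## ${bam}"
  samtools flagstat ${bam}
done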
Summarize mapping results and perform mapping QC with 'RSeQC'
Some biases are known and expected in each NGS mapping experiment. They include GC-bias and 5' or 3' mapping decrease associated with exonuclease degradation of the mRNA during library construction. It is very important to identify these biases before performing differential expression analysis.
The very handy RSeQC Python package provides various useful methods to assess different biases in RNASeq data (see the RSeQC manual [4]).
As an example, we will analyse the read coverage over the gene body with a script (*run_rseqc.sh*) driving the command 'geneBody_coverage.py', which takes a BAM file and a gene model BED12 file to compute percentile coverage across all transcript models. Note that the gene model file ('dm3-refgene.bed12') is not among the downloads above; it has to be exported separately, e.g. from the UCSC Table Browser.
#!/bin/bash
# scriptname: run_rseqc.sh

data="data"
results="tophat_results"
qc_results="RSeQC_results"
mkdir -p ${qc_results}

# reference files from UCSC table browser
orirefgene=${data}/dm3-refgene.bed12
refgene=${data}/refgene.bed12

# sort and remove 'chr' from UCSC bed12 data
bedtools sort -i ${orirefgene} | perl -p -e 's/^chr//' > ${refgene}

# compute coverage for all transcript models in percentiles
for bam in $(ls ${results}/*-tophat.bam); do
  echo "# mapping: ${bam}"
  name=$(basename "${bam%%-tophat.bam}")
  echo ${name}
  res=${qc_results}/${name}
  cmd="geneBody_coverage.py -r ${refgene} -i ${bam} -o ${res}"
  echo ${cmd}
  eval ${cmd}
done
The analysis on all reads from the first sample (Dmel1) gives the expected bell shape, while the QC on the locus-selected reads yields a somewhat lower-resolution graph due to the limited number of transcripts summarised in that region.
Other interesting QC analyses can be performed with the 'RSeQC toolbox', including computing the distribution of reads over features (UTRs, introns, and exons) or mappability. Please refer to the manual for more information about 'RSeQC'.
Mark read duplicates (optional step)
This step is not required for the current tutorial but is shown here as an example.
Picard tools [6] include a specific command to mark read duplicates in BAM data; an example script, *run_picard_md.sh*, is presented below to illustrate this tool.
#!/bin/bash
# scriptname: run_picard_md.sh
# Mark and remove duplicates with Picard tools

# change this to the path to your copy of Picard
picard_path=$BIOTOOLS/picard

data="data"
results=${data}/deduplicated_bam-files
mkdir -p ${results}

# prepare parameters for command below
ARGS="-Xmx4G"
RD="true"     # true will remove duplicates while false will only mark them
AS="true"     # sorted data is expected
VS="LENIENT"  # be tolerant for minor SAM format issues

# very important variable to identify the info contained in the read names
# note the added back-slash+quotes around this complex string
RNR="\"[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*\""

for bam in $(ls ${data}/*_accepted_hits.bam); do
  # define file names
  infile=$(basename ${bam})
  outfile=${results}/${infile/_accepted_hits.bam/-rmdup.bam}
  sumfile=${results}/${infile/_accepted_hits.bam/-dedup_output.txt}
  echo "# marking duplicates in: ${bam}"

  # build command
  cmd="java ${ARGS} \
    -jar ${picard_path}/MarkDuplicates.jar \
    I=${bam} \
    O=${outfile} \
    METRICS_FILE=${sumfile} \
    REMOVE_DUPLICATES=${RD} \
    ASSUME_SORTED=${AS} \
    VALIDATION_STRINGENCY=${VS} \
    READ_NAME_REGEX=${RNR}"
  echo "# ${cmd}"
  eval ${cmd}
  echo
done
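The metrics files written by MarkDuplicates summarise, among other things, the fraction of duplicate reads per library. They can be skimmed with a small sketch like the one below, which assumes the output folder and file names used in the script above.

#!/bin/bash
# print the metrics header and value line from each Picard duplicate report
for txt in data/deduplicated_bam-files/*-dedup_output.txt; do
  echo "## ${txt}"
  grep -A 1 "^LIBRARY" ${txt}
done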
Compute differential expression between sample pairs with 'cuffdiff'
Cuffdiff finds significant changes in transcript/gene expression starting from BAM files and a GTF file, using FPKM values. To test whether an observed difference in a gene's expression is significant, Cuffdiff compares the log ratio of the gene's expression in the two conditions against the log of one, i.e. zero (*run_cuffdiff.sh*).
#!/bin/bash
# scriptname: run_cuffdiff.sh

data="data"
cuffdiff_results="cuffdiff_results"
mkdir -p ${cuffdiff_results}

# using n processors
nthr=4

# locate reference files
transcripts=${data}/Flybase2006.gtf

labels="wild_types,mutants"
wt=$(ls -dm ${data}/Tophat_for_Illumina_on_Dmel?_accepted_hits.bam | tr -d '\ ' | tr -d '\n')
mut=$(ls -dm ${data}/Tophat_for_Illumina_on_mut-r?_accepted_hits.bam | tr -d '\ ' | tr -d '\n')

# perform cuffdiff analysis with target FDR of 5% (default)
cmd="cuffdiff --no-update-check -q -p ${nthr} \
  -c 10 \
  --FDR 0.05 \
  --library-norm-method quartile \
  -L ${labels} \
  ${transcripts} ${wt} ${mut} \
  -o ${cuffdiff_results}"
echo "# ${cmd}"
eval ${cmd}
The results of this analysis at gene level are found in 'gene_exp.diff' and shown in **Figure 3**. Such results can be used to prioritise and perform downstream functional analyses with other software tools (GO, GSEA, IPA, ...).
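The 'gene_exp.diff' table can also be filtered on the command line to shortlist significant genes. The snippet below is a minimal sketch relying on the last column ('significant') and the q-value column of the standard cuffdiff gene-level output.

#!/bin/bash
# keep the header, then rows flagged as significant, sorted by q-value (column 13)
head -1 cuffdiff_results/gene_exp.diff > cuffdiff_results/significant_genes.txt
awk -F'\t' '$NF=="yes"' cuffdiff_results/gene_exp.diff \
  | sort -t$'\t' -k13,13g \
  >> cuffdiff_results/significant_genes.txt
wc -l cuffdiff_results/significant_genes.txt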
This run was done with reads mapping to a 400 kb locus of chrX and is therefore not representative of the full genome. A similar analysis performed on the full set of reads would be required to allow functional analysis.
Resources for Functional analysis
- Gene Ontology enRIchment anaLysis and visuaLizAtion tool (GORILLA) [7]
- DAVID Bioinformatics Resources 6.7 [8]
- Gene Set Enrichment Analysis (GSEA) [9]
- a number of R/Bioconductor packages can also be used here
- and obviously, if you have access to it, Ingenuity Pathway Analysis (IPA) will be perfect here.
Discussion
This simplified tutorial proposes a parallel to the 2012 workshop without using Galaxy. It is clear from this report that Galaxy presents advantages by standardising the workflow and allowing the user to insert conversion or reformatting steps between tools. On the other hand, full command-line processing allows more customisation and structuring of the output, as well as the generation of complete workflows made of chained scripts that can be re-applied to new data with minor changes.
The most important take-home message is that, under any circumstances, NGS analysis requires constant review of the results of every step to identify pitfalls and biases, and will likely never become a fully automated task until all reference data and tools speak the same language. The end user will select the approach that best fits their needs and rationale.
More information can be obtained from many places, including:
- http://wiki.bits.vib.be/index.php/RNA-seq
- http://cbsu.tc.cornell.edu/lab/doc/biohpc_lab2_exercise.pdf
Download Hands-on data and results
Download hands-on data from the BITS server
References:
- [1] http://hannonlab.cshl.edu/fastx_toolkit/
- [2] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- [3] http://tophat.cbcb.umd.edu
- [4] http://dldcc-web.brc.bcm.edu/lilab/liguow/CGI/rseqc/_build/html/index.html
- [5] http://seqanswers.com/forums/showthread.php?t=6854
- [6] http://picard.sourceforge.net
- [7] http://cbl-gorilla.cs.technion.ac.il
- [8] http://david.abcc.ncifcrf.gov
- [9] http://www.broadinstitute.org/gsea/index.jsp