NGS RNASeq DE Exercise.2

From BITS wiki
Jump to: navigation, search

Map paired-end chr22 reads to the human hg19 reference genome using Tophat2

[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.1 | NGS_RNASeq_DE_Exercise.3 ]


Introduction

Mapping cDNA reads to a reference genome can be done against the full reference genome (unbiased) or by selecting a transcriptome model to map the reads that correspond to known transcripts (faster but biased). The alignment is done using Tophat that makes use of the Bowtie2 aligner while allowing splice-broken alignment across exon junctions. One additional feature of Tophat is the optional discovery of new splice events. The full experiment alignment was run on a high end server and the code is provided to you, the mapping results are not fully analyzed in this general protocol but rather used to extract a one-chromosome full read set that is processed during the session.

Analyze paired-end fastq read data with fastQC

As done on the full data, we perform a QC on the freshly extracted reads

run_chr22-fastqc.sh script

#!/bin/bash
# run_fastqc.sh
 
# full data
#export RAWDATA=$(pwd)/SRP033351_fastq
#export READQC=$(pwd)/SRP033351_read_qc-Results
 
# trimmed data
#export RAWDATA=$(pwd)/trimmed-SRP033351_fastq
#export READQC=$(pwd)/trimmed-SRP033351_read_qc-Results
 
# chr22 data
export RAWDATA=$(pwd)/SRR1039509-chr22_fastq
export READQC=$(pwd)/SRR1039509-chr22_read_qc-Results
 
mkdir -p $READQC
 
# create an empty error log
cat /dev/null > $READQC/error.log
 
for fq in $RAWDATA/*.fastq.gz; do
 
	name=${fq##*/}
	prefix=${name%%.fastq.gz}
 
	# perform a full QC control on the reads (-q for quiet, 4 threads)
	# then convert results to PDF
	echo "# run fastqc on ${fq}"
	cmd="fastqc -f fastq --noextract -t 4 -o $READQC -q ${fq}"
 
	echo "# ${cmd}"
	eval ${cmd}
 
	## add conversion to PDF
	# htmldoc --webpage -f ${prefix}_fastqc.pdf ${prefix}_fastqc.html
 
	# generate statistics for plots
	# secret option '-Q 33' was added to fit with the phred scale
	echo "# checking: ${name}"
	zcat ${fq} | fastx_quality_stats -Q 33 -o $READQC/${name}_stats.txt \
		2>> $READQC/error.log
 
	# plot from the text summary file
	fastq_quality_boxplot_graph.sh \
		-i $READQC/${name}_stats.txt \
		-o $READQC/${prefix}_boxplot.png
 
	fastx_nucleotide_distribution_graph.sh \
		-i $READQC/${name}_stats.txt \
		-o $READQC/${prefix}_nuclplot.png
 
	# also plot normalized base frequency using R (script code in appendix)
	scripts/avgQdist2linePlot.R $READQC/${name}_stats.txt $READQC
 
done
 
# convert results to a nice PDF file
# requires htmldoc and dependencies installed on your machine
# htmldoc --webpage \
#     --browserwidth 800 \
#     --fontsize 7 \
#     -f ${outfolder}/fastqc_report.pdf \
#     ${outfolder}/fastqc_report.html
FastQC plots for SRR1039509_1 reads PDF
per_base_quality.png
per_sequence_quality.png
per_base_sequence_content.png
adapter_content.png
per_sequence_gc_content.png
per_base_n_content.png
sequence_length_distribution.png
duplication_levels.png
kmer_profiles.png
per_tile_quality.png
FastQC plots for SRR1039509_2 reads PDF
per_base_quality.png
per_sequence_quality.png
per_base_sequence_content.png
adapter_content.png
per_sequence_gc_content.png
per_base_n_content.png
sequence_length_distribution.png
duplication_levels.png
kmer_profiles.png
per_tile_quality.png
FastQC plots for chr22-SRR1039509_1 reads PDF
per_base_quality.png
per_sequence_quality.png
per_base_sequence_content.png
adapter_content.png
per_sequence_gc_content.png
per_base_n_content.png
sequence_length_distribution.png
duplication_levels.png
kmer_profiles.png
FastQC plots for chr22-SRR1039509_2 reads PDF
per_base_quality.png
per_sequence_quality.png
per_base_sequence_content.png
adapter_content.png
per_sequence_gc_content.png
per_base_n_content.png
sequence_length_distribution.png
duplication_levels.png
kmer_profiles.png

Map the paired FastQ reads to the human genome & transcriptome

The raw FastQ files obtained from the SRA data were mapped directly using Tophat2 and the associated Bowtie2 program without read trimming and cleaning. The effect of trimming on read mapping was analyzed on one sample and found to be relative as presented in NGS_RNASeq_DE_Exercise.1c.

Technical.png We used here the latest versions of Tophat2 and bowtie2 as opposed to what is often done using Galaxy or reported in relatively outdated pages, our results will therefore slightly differ from those published originally by the authors

The mapping was done both against the entire hg19 genome sequence (run#1) or against the defined transcriptome as present in the Ensembl GTF model (run#2). Both methods are equally valid when users ultimately consider known genes and their documented transcripts (which is what we will do in this session) but the first method may present additional advantages when the user wish to discover new splice variants (not covered in this training) or exon junctions between separate genes as a result of gene fusions (not covered in this training).

Technical.png The following scripts take each pair of FastQ files and performs mapping either against the full genome ('all') or against the transcript models ('gtf') and using the maximal capacity of the BITS laptops (1 thread and 4GB RAM). Note that if you use this code on a higher-end machine, you will need to edit the parameters controlling for memory usage and number of threads accordingly

tophat_map-all.sh script

#!/bin/bash
# tophat_map-all.sh
 
# the script should be called from the base folder
# the base folder includes the 'reads' folder named below
 
## full data
# reads="SRP033351_fastq"
 
## one sample
#reads="SRR1039509_fastq"
 
## one sample chr22 subset
reads="SRR1039509-chr22_fastq"
 
results="tophat2.0.13_results"
mkdir -p ${results}
 
# folder and file options
fasta=$BOWTIE2_INDEXES/hg19.fa
index=$BOWTIE2_INDEXES/hg19
gtffile=$BOWTIE2_INDEXES/hg19_ensGene.gtf
gtfindex=$BOWTIE2_INDEXES/hg19_ensGene
 
# how hard can your computer work? (max cpu# that may be used)
nthr=1 # using 'nthr' processors
maxram="4G"
 
# on a stronger, this will run faster
# nthr=24 # using 'nthr' processors
# maxram="64G"
 
# map single ends in (un-guided|guided) modes (not|using) the gtf file
for fq1 in $(ls ${reads}/*_1.fastq.gz); do
 
 # get the base-name without end_suffix
 name=$(basename "${fq1}" "_1.fastq.gz")
 echo
 echo "# mapping the ${name} pair against the whole genome"
 
 # deduce second read file name
 fq2=$(echo "$fq1" | sed -r 's/_1.fastq.gz/_2.fastq.gz/')
 
 # unguided mode
 # execute the command and write summary results in a 'log' file
 cmd="tophat -p ${nthr} \
 	-o ${results}/${name}_all-mappings \
 	--no-coverage-search \
 	${index} \
 	${fq1} ${fq2} \
 	2>&1 | tee -a ${results}/tophat-all-log_${name}.txt"
 
 echo "# ${cmd}"
 eval ${cmd}
 
 # if tophat was a success, sort, rename, and index mapping results
 # run in background with good resource while starting the next loop on 1 thread
 if [ $? = 0 ]; then
  # use 4 threads and up to 4GB RAM (default are 1 thread and 768M)
  samtools sort -@ ${nthr} -m ${maxram} \
  	${results}/${name}_all-mappings/accepted_hits.bam \
  	${results}/${name}_all-tophat \
	&& samtools index ${results}/${name}_all-tophat.bam
 
  # also perform basic BAM QC on mappings
  # the code is provided in a separate script 'mapping_metrics.sh'
  scripts/mapping_metrics.sh ${results}/${name}_all-tophat.bam
 fi
 
 echo
 
# print separator
A=$(printf "%50s\n")
echo ${A// /#}
echo
 
# abort after first loop debug-mode
# exit 0
 
done

tophat_map-gtf.sh script

#!/bin/bash
# tophat_map-all.sh
 
# the script should be called from the base folder
# the base folder includes the 'reads' folder named below
 
## one sample chr22 subset
reads="SRR1039509-chr22_fastq"
 
results="tophat2.0.13_results_gtf"
mkdir -p ${results}
 
# folder and file options
fasta=$BOWTIE2_INDEXES/hg19.fa
index=$BOWTIE2_INDEXES/hg19
gtffile=$BOWTIE2_INDEXES/hg19_ensGene.gtf
gtfindex=$BOWTIE2_INDEXES/hg19_ensGene
 
# how hard can your computer work? (max cpu# that may be used)
nthr=1 # using 'nthr' processors
maxram="4G"
 
# on a stronger, this will run faster
# nthr=24 # using 'nthr' processors
# maxram="64G"
 
# map single ends in (un-guided|guided) modes (not|using) the gtf file
for fq1 in $(ls ${reads}/*_1.fastq.gz); do
 
 # get the base-name without end_suffix
 name=$(basename "${fq1}" "_1.fastq.gz")
 echo
 echo "# mapping the ${name} pair against the whole genome"
 
 # deduce second read file name
 fq2=$(echo "$fq1" | sed -r 's/_1.fastq.gz/_2.fastq.gz/')
 
 # guided mode with 'ensGene' as model
 # execute the command and write summary results in a 'log' file
 # first time -G ${gtffile} then --transcriptome-index ${gtfindex}
 cmd="tophat \
 	-p ${nthr} \
 	-o ${results}/${name}_gtf-mappings \
 	--transcriptome-index ${gtfindex} \
 	--no-coverage-search \
 	${index} \
 	${fq1} ${fq2} \
 	2>&1 | tee -a ${results}/tophat-gtf-log_${name}.txt"
 
 echo "# ${cmd}"
 eval ${cmd}
 
 # if tophat was a success, sort, rename, and index mapping results
 # run in background with good resource while starting the next loop on 1 thread
 if [ $? = 0 ]; then
  # use 4 threads and up to 4GB RAM (default are 1 thread and 768M)
  samtools sort -@ ${nthr} -m ${maxram} \
  	${results}/${name}_gtf-mappings/accepted_hits.bam \
  	${results}/${name}_gtf-tophat \
	&& samtools index ${results}/${name}_gtf-tophat.bam
 
  # also perform basic BAM QC on mappings
  # the code is provided in a separate script 'mapping_metrics.sh'
  scripts/mapping_metrics.sh ${results}/${name}_gtf-tophat.bam
 fi
 
 echo
 
 # print separator
 A=$(printf "%50s\n")
 echo ${A// /#}
 echo
 
done

Quick evaluation of the SRR1039509 mapping results for both reference models

Several rapid QC steps were added in the mapping loop to count mapped and mated read obtained with chr22 data from the N61311_Dex (SRR1039509) sample.

mapping_metrics.sh script

#! /usr/bin/env bash
## script: 'mapping_metrics.sh'
## ©SP-BITS, 2015-01-05
 
# required:
# samtools 1.2
# Picard Version (broad build): 1.129
 
maxmem=4G
 
infile=$1
filename=$(basename ${infile})
outbase="${infile:0:${#infile} - ${#filename}}"
 
## samtools flagstats ##
cmd="samtools flagstat \
	${infile} \
	>${infile%%.bam}_flagstat.txt"
 
echo
echo "# ${cmd}"
eval ${cmd}
 
## samtools stats ##
cmd="samtools stats \
	${infile} \
	>${infile%%.bam}_stats.txt"
 
echo
echo "# ${cmd}"
eval ${cmd}
 
## samtools plot-bamstats ##
cmd="plot-bamstats \
	${infile%%.bam}_stats.txt \
	-p ${infile%%.bam}_plots/"
 
echo
echo "# ${cmd}"
eval ${cmd}
 
## plot mapping quality score distribution
cmd="java -Xmx${maxmem} -jar $PICARD/picard.jar \
	QualityScoreDistribution \
	CHART=${infile%%.bam}-QS_distribution.pdf \
	I=${infile} \
	O=${infile%%.bam}-mappings_QS.txt"
 
echo
echo "# ${cmd}"
eval ${cmd}
samtools plot-bamstats plots for N61311_Dex (SRR1039509) and full genome mapping
insert-size.png
gc-content.png
acgt-cycles.png
coverage.png
quals.png
quals2.png
quals3.png
quals-hm.png
indel-cycles.png
indel-dist.png
gc-depth.png
samtools plot-bamstats plots for N61311_Dex (SRR1039509) and ensGene mapping
insert-size.png
gc-content.png
acgt-cycles.png
coverage.png
quals.png
quals2.png
quals3.png
quals-hm.png
indel-cycles.png
indel-dist.png
gc-depth.png

Other tools are available to analyze and compare BAM data and are covered in the next exercise. Furthermore, a better evaluation will be possible after converting mapping data (BAM) to gene counts as detailed in a later exercise.

download exercise files

Download exercise files here.

Use the right application to open the files present in ex2-files

References:

[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.1 | NGS_RNASeq_DE_Exercise.3 ]