NGS RNASeq DE Exercise.3

From BITS wiki
Jump to: navigation, search

Control mappings with RSeQC or Qualimap and mark Duplicates with Picard

[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.2 | NGS_RNASeq_DE_Exercise.4 ]


Introduction

As for FastQ reads, it is vital to control the quality of mapping results before proceeding with the RNASeq workflow. The mapping to a reference genome has now sorted out the reads and it becomes possible to identify issues related to extreme or absent coverage in some regions of the genome as well as identify duplicate reads that may originate from library preparation biases (PCR duplicates) or from scanning artifacts during Illumina sequencing (optical duplicates).

QC of mapping results

Mapping will reveal artifacts like RNA degradation or unexpected sequence content more precisely than the analysis of raw FastQ data (fastQC). Several program exist to perform QC of mappings (BAM) data; we selected here RSeQC and QualiMap that return partly overlapping results and can both be used under at command line under unix OS or with GUI on several operating systems (Java).

Comparing the 16 chr22-only gtf-BAM files with deeptools

We should first check for the presence of a batch effect dividing sample into groups by computing pairwise correlations between BAM files with deeptools bamCorrelate[1][2]. The expected result in the absence of batch effect should be a full correlation between all samples as the global transcript expression profile should not vary significantly (>80% of genes are not DE).

deeptool_bamCorrelate-all.sh script

#!/bin/bash
# deeptool_bamCorrelate_chr22-all.sh
 
# global correlation
# we now compare all BAM files at once to look for clustering between replicates
# we extract only chr22 mappings and compare to ensGene loci
 
# required:
# deeptool bamCorrelate all 16 files (chr22 only and from 'all' mappings)
# https://github.com/fidelram/deepTools/wiki/QC#bamCorrelate
# https://github.com/fidelram/deepTools/wiki/All-command-line-options
 
# speedup with --numberOfProcessors nthr matching your cpu#
nthr=24
 
bamfolder=SRR1039509-chr22_bam
results=bamCorrelate_chr22-all
mkdir -p ${results}
 
### DOWNLOAD and expand reference transcript models ###
# get reference files from the RSeQC server if absent locally
mkdir -p ref
# refseq models
if [[ ! -f ref/hg19_RefSeq.bed.gz ]]; then
 wget -P ref \
	http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/hg19_RefSeq.bed.gz \
	&& zgrep "^chr22" ref/hg19_RefSeq.bed.gz > ref/chr22-hg19_RefSeq.bed
fi
# ensembl models
if [[ ! -f ref/hg19_Ensembl.bed.gz ]]; then
 wget -P ref \
	http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/hg19_Ensembl.bed.gz \
	&& zgrep "^chr22" ref/hg19_Ensembl.bed.gz > ref/chr22-hg19_Ensembl.bed
fi
 
# we choose the Ensembl standard
bedfile="ref/chr22-hg19_Ensembl.bed"
 
# create bam file list
bamfiles=$(ls ${bamfolder}/*.bam)
# deduce labels from files
l1=${bamfiles//${bamfolder}\/chr22_/}
labels=${l1//_all.bam/}
 
# build the deeptool command
corM=spearman # spearman or pearson
 
 
cmd="bamCorrelate bins \
	--bamfiles ${bamfiles} \
	--labels ${labels} \
	--fragmentLength 200 \
	--numberOfProcessors ${nthr} \
	--outFileCorMatrix ${results}/correlation_matrix.txt \
	--binSize 1000 \
	--corMethod ${corM} \
	-o ${results}/correlation_${corM}.pdf \
	2>&1 | tee -a ${results}/bamCorrelate_log.txt"
 
echo "# ${cmd}"
eval ${cmd}
bamCorrelate results
correlation_spearman.png

RSeQC analysis of the BAM data

This toolbox is written in python and C and can be installed more or less easily on any system. It provides very useful tools that can tell a lot about biases and issues with the RNASeq data[3]. As always, we strongly advise to carefully check QC of each step of the workflow that will impact on the follow-up steps. The RSeQC toolbox documentation can be found here[4].

The RSeQC script for one BAM file is detailed below to successively apply several RSeQC commands.

RSeQC_chr22-bam.sh script

#!/bin/bash
# RSeQC_chr22-bam.sh
 
# takes a BAM file from $1 and performs several RSeQC analyses
if [ $# -lt 1 ]; then
	echo "Usage: ${0##*/} <BAM file>"
	exit
fi
 
# process only one BAM file
# batch can be done under bash with a for loop
bam=$1
 
# help: batch apply this script to all bams with
## for b in markdup_bams/*.bam; do \
##   echo "# analyzing ${b}" \
##   scripts/RSeQC_bam.sh $b\
##   done
 
qc_results="RSeQC_chr22-results"
mkdir -p ${qc_results}
 
echo "# analysing: ${bam}"
name=$(basename "${bam%%_mdup.bam}")
 
### DOWNLOAD and expand reference transcript models ###
# get reference files from the RSeQC server if absent locally
mkdir -p ref
 
# refseq models
if [[ ! -f ref/hg19_RefSeq.bed.gz ]]; then
	wget -P ref \
	http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/hg19_RefSeq.bed.gz
fi
 
# ensembl models
if [[ ! -f ref/hg19_Ensembl.bed.gz ]]; then
	wget -P ref \
	http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/hg19_Ensembl.bed.gz
fi
 
# we choose the EnsEMBL reference model
orirefgene="hg19_Ensembl.bed.gz"
refgene="hg19_ensGene.bed"
 
# expand ensGene and sort chromosome records in ascending order
# and create chr22 subset
gzip -cd ref/${orirefgene} | \
	sort -k 1V,1 -k2n,2 -k 3n,3 > ref/${refgene} &&
	grep "^chr22" ref/${refgene} > ref/chr22-${refgene}
 
##############################
# perform selected RSeQC tests
# more tools exist, please refer to the online documentation.
 
# select which reference transcriptome to use
# or chr22 subset
selref=ref/chr22-${refgene}
#selref=ref/chr22-${refgene}
 
# prefix to save results
res=${qc_results}/${name}
 
# build the 'infer_experiment' command
# speculate how RNA-seq sequencing were configured
cmd="infer_experiment.py \
	-r ${selref} \
	-i ${bam} \
	-s 200000 \
	> ${res}_inferred.txt \
	2>&1 | tee -a ${qc_results}/RSeQC-${name}_log.txt"
 
echo "# ${cmd}"
eval ${cmd}
 
# build the 'inner_distance.py' command
# calculate the inner distance (or insert size) between two paired RNA reads
# The inner_distance might be a negative value if paired reads do overlap.
cmd="inner_distance.py \
	-r ${selref} \
	-i ${bam} \
	-o ${res}_inner_distance \
	-k 1000000 \
	-l -250 \
	-u 250 \
	-s 5 \
	-q 30 \
	2>&1 | tee -a ${qc_results}/RSeQC-${name}_log.txt"
 
echo "# ${cmd}"
eval ${cmd}
 
# build the 'junction_saturation' command
# check if sequencing depth is deep enough for alternative splicing analyses.
cmd="junction_saturation.py \
	-r ${selref} \
	-i ${bam} \
	-o ${res} \
	-l 5 \
	-u 100 \
	-m 50 \
	-v 1"
 
echo "# ${cmd}"
eval ${cmd}
 
# build the 'geneBody_coverage' command
# compute coverage for all transcript models in percentiles
cmd="geneBody_coverage.py \
	-r ${selref} \
	-i ${bam} \
	-o ${res} \
	2>&1 | tee -a ${qc_results}/RSeQC-${name}_log.txt"
 
echo "# ${cmd}"
eval ${cmd}
 
# determine reads duplication rate with 'read_duplication.py'
cmd="read_duplication.py \
	-i ${bam} \
	-o ${res} \
	-u 500 \
	2>&1 | tee -a ${qc_results}/RSeQC-${name}_log.txt"
 
echo "# ${cmd}"
eval ${cmd}
 
echo "# end of RseQC analysis"
 
# rem: the following 'ImageMagick' command converts all PDF outputs to PNG
# and replaces internal '.' that would disturb some packages by '_'
# not run:
# for i in *.pdf; do name=${i/./_}; convert "$i" "${name/.pdf}".png; done

The processing of each BAM file takes some time and we have a reasonably large number of files to test. When the computer allows it becomes in such case quite interesting to use parallel processing. The first piece of code below starts parallel jobs for each BAM file fond in the mapping folder. Results are collected and stored by the script shown next.

Using GNU parallel to run 10 concurrent jobs

mappings=tophat2.0.13_results/markdup_bams
find ${mappings}/*_mdup.bam | \
    parallel --no-notice --tmpdir /tmp -j 10 'scripts/RSeQC_bam.sh {}'

The results for SRR1039509_all_mdup.bam are reported as illustration

  • infer_experiment.py (read more here)
This is PairEnd Data
Fraction of reads failed to determine: 0.1137
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4405
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4458
  • junction_saturation.py: It’s very important to check if current sequencing depth is deep enough to perform alternative splicing analyses. For a well annotated organism, the number of expressed genes in particular tissue is almost fixed so the number of splice junctions is also fixed. The fixed splice junctions can be predetermined from reference gene model. All (annotated) splice junctions should be rediscovered from a saturated RNA-seq data, otherwise, downstream alternative splicing analysis is problematic because low abundance splice junctions are missing. This module checks for saturation by resampling 5%, 10%, 15%, ..., 95% of total alignments from BAM or SAM file, and then detects splice junctions from each subset and compares them to reference gene model. (read more here)
  • geneBody_coverage.py: Read coverage over gene body. This module is used to check if reads coverage is uniform and if there is any 5’ or 3’ bias. (read more here)
  • read_duplication.py: Two strategies were used to determine reads duplication rate: * Sequence based: reads with identical sequence are regarded as duplicated reads. * Mapping based: reads mapped to the exactly same genomic location are regarded as duplicated reads. For splice reads, reads mapped to the same starting position and splice the same way are regarded as duplicated reads.
  • inner_distance.py: This module is used to calculate the inner distance (or insert size) between two paired RNA reads. The distance is the mRNA length between two paired fragments. We first determine the genomic (DNA) size between two paired reads: D_size = read2_start - read1_end, then ... (read more here)
RSeQC plots for SRR1039509 mappings
SRR1039509_all.junctionSaturation_plot.png
SRR1039509_all.geneBodyCoverage.curves.png
SRR1039509_all.DupRate_plot.png
SRR1039509_all_inner_distance.inner_distance_plot.png
RSeQC plots for chr22_SRR1039509 mappings
chr22_SRR1039509_all.bam.junctionSaturation_plot.png
chr22_SRR1039509_all.bam.geneBodyCoverage.curves.png
chr22_SRR1039509_all.bam.DupRate_plot.png
chr22_SRR1039509_all.bam_inner_distance.inner_distance_plot.png

Alternatively you can process all BAM files together with geneBody_coverage.py to obtain a single plot and identify faulty libraries.

bamdata=SRP033351-chr22_bam
 
# create folder for results
results=geneBody_coverage-chr22
mkdir -p ${results}
 
# create a list with all BAM files, one per line
ls ${bamdata}/*.bam | paste -s -d"\n" > ${results}/list.txt
 
# apply geneBody_coverage.py in batch-mode
cmd="geneBody_coverage.py \
    -r ref/chr22-hg19_Ensembl.bed \
    -i ${results}/list.txt  \
    -o ${results}/batch-geneBody_all"
 
echo "# ${cmd}"
eval ${cmd}
batch geneBody_coverage.py plots
batch-geneBody_all.geneBodyCoverage.curves.png
batch-geneBody_all.geneBodyCoverage.heatMap.png

 

Qualimap for the QC of BAM data

This recent Java toolbox [5] is another interesting alternative to obtain QC data on BAM files [6] presents a very attractive GUI, very similar to that of fastQC and complemented with the CLI version of each command to allow scripting and implementations at command line. We processed below one BAM file to illustrate the capabilities of Qualimap. Please refer to the documentation if you an to use it with more customization.

Qualimap_chr22-bam.sh script

#!/bin/bash
# Qualimap_chr22-bam.sh
# bam needs to be sorted by coordinate
 
# takes a BAM file from $1 and performs several qualimap analyses
if [ $# -lt 1 ]; then
	echo "Usage: ${0##*/} <BAM file>"
	exit
fi
 
# locate your qualimap installation folder
QUALIMAP=$BIOTOOLS/qualimap
 
# process only one BAM file
# batch can be done under bash with a for loop
bam=$1
echo "# analysing: ${bam}"
name=$(basename "${bam%%_all-tophat.bam}")
 
# create folders
qc_reports="Qualimap_results/chr22-reports/${name}"
mkdir -p ${qc_reports}
 
qc_counts="Qualimap_results/chr22-counts"
mkdir -p ${qc_counts}
 
# we choose the EnsEMBL reference model
#gtffile=$BOWTIE2_INDEXES/hg19_ensGene.gtf
 
# but we also have a chr22 subset file
gtffile=$BOWTIE2_INDEXES/chr22-hg19_ensGene.gtf
 
##############################
# perform selected Qualimap tests
# more tools exist, please refer to the online documentation.
 
# default names
sbam=${bam}
qbam=${bam}
 
# test if bam is query or coordinate-sorted
order="$(samtools view -H ${bam} | head -1 | awk '{split($3, ord, ":"); print ord[2]}')"
 
if [ "$order" = "coordinate" ]; then
 
	echo "# sorting in queryname order"
	# redefine qbam
	qbam=${bam//.bam/-q.bam}
	cmd="java -jar $PICARD/picard.jar SortSam \
		I=${bam} \
		O=${qbam} \
		SO=queryname"
	eval ${cmd}
 
else
 
	echo "# sorting in coordinate order"
	# redefine sbam
	sbam=${bam//.bam/-s.bam}
	cmd="java -jar $PICARD/picard.jar SortSam \
		I=${bam} \
		O=${sbam} \
		SO=coordinate"
	eval ${cmd}
 
fi
 
#echo "# ${cmd}"
#eval ${cmd}
 
# run the BAMQC tool on coordinate-sorted data
cmd="$QUALIMAP/qualimap bamqc \
 	-p non-strand-specific \
 	-bam ${sbam} \
 	-c \
 	-gd HUMAN \
 	-gff ${gtffile} \
 	-hm 3 \
 	-nr 1000 \
 	-nt 4 \
 	-nw 400 \
 	-outdir ${qc_reports} \
 	-outformat PDF \
 	-outfile ${qc_reports}/${name}-Qualimap-bamqc-report.pdf \
	2>&1 | tee -a ${qc_reports}/qualimap-bamqc_log.txt"
 
echo "# ${cmd}"
eval ${cmd}
 
# print separator
A=$(printf "%50s\n")
echo ${A// /#}
echo
 
# run the RNASEQ tool on queryname-sorted data
cmd="$QUALIMAP/qualimap rnaseq \
 	-p non-strand-specific \
 	-pe \
 	-bam ${qbam} \
 	-a uniquely-mapped-reads \
 	-gtf ${gtffile} \
  	-outdir ${qc_reports} \
 	-oc ${qc_counts}/${name}_counts.txt \
 	-outformat PDF \
 	-outfile ${qc_reports}/${name}-Qualimap-rnaseq-report.pdf \
	2>&1 | tee -a ${qc_reports}/qualimap-rnaseq_log.txt"
 
echo "# ${cmd}"
eval ${cmd}
 
# deleted leftover sorted BAM
if [ -f ${bam//.bam/-s.bam} ]; then
	rm ${bam//.bam/-s.bam}
fi
 
if [ -f ${bam//.bam/-s.bam} ]; then
	rm ${bam//.bam/-q.bam}
fi
 
echo "# end of Qualimap analysis"

Example Qualimap analyses results for chr22_CON_24_1_all-tophat.bam (considering only chr22 EnsGene regions): chr22_SRR1039509-Qualimap-bamqc-report.pdf; chr22_SRR1039509-Qualimap-rnaseq-report.pdf

Other tools for QC of mapping data

  • A series of new tools have been recently added to the Picard toolbox[7], please consider evaluating these tools when you will have your own data. Picard tools constitute the mainframe of the very popular GATK platform [8] and are really hot [9].

 

Removing or Marking read duplicates

Do we need it?

The two classes of duplicates will be identified by Picard MarkDuplicates based on their sequence (PCR duplicates) and on their address on the Illumina slide (optical duplicates). In RNASeq analysis, it is not advisable to physically remove duplicate reads as they might in part be the result of cDNA synthesis of distinct mRNA molecules. We therefore limit the duplicate analysis to marking the reads so that next steps in the workflow may take their status into account.

Marking duplicates with Picard MarkDuplicates

bam_markdups.sh script

#!/bin/bash
# bam_markdups.sh
 
# mark optical and PCR duplicates from 'SO:coordinate'-sorted BAM file(s)
# re-sort output by read name ('SO:queryname')
# requires picard
 
# allocate as much RAM as your computer can
maxram=4G
 
# mark optical and PCR duplicates from position-sorted BAM file(s)
infolder="tophat2.0.13_results"
outfolder=${infolder}/markdup_bams
mkdir -p ${outfolder}
 
for b in ${infolder}/*_all-tophat.bam; do
 
	echo "# processing ${b}"
	# get prefix
	name=$(basename ${b} "-tophat.bam")
 
	# mark duplicates and index result
	cmd="java -Xmx${maxram} -jar $PICARD/picard.jar MarkDuplicates \
		I=${b} \
		O=${outfolder}/${name}_mdup.bam \
		M=${outfolder}/${name}_mdup.txt \
		AS=TRUE \
		2>&1 | tee -a ${outfolder}/MarkDuplicates_${name}_log.txt"
 
	echo "# ${cmd}"
	eval ${cmd}
 
	# re-sort by querynames to comply with HTSeq-count
	# if last command did exit without error ($? -eq 0)
	if [ $? -eq 0 ]; then
		cmd="java -Xmx${maxram} -jar $PICARD/picard.jar SortSam \
			I=${outfolder}/${name}_mdup.bam \
			O=${outfolder}/s${name}_mdup.bam \
			SO=queryname \
			CREATE_INDEX=TRUE \
			VALIDATION_STRINGENCY=LENIENT \
			2>&1 | tee -a ${outfolder}/MarkDuplicates_${name}_log.txt"
 
		echo "# ${cmd}"
		# delete original file if picard exists successfully
		eval ${cmd} && mv ${outfolder}/s${name}_mdup.bam \
		${outfolder}/${name}_mdup.bam
 
	else
		echo "# something went wrong!"
		exit 1
	fi
 
	# output results to screen from text file
	echo "# MarkDuplicates results for ${name}"
	head ${outfolder}/${name}_mdup.txt | grep -v "^#" | transpose -t | column -t
	echo
 
	# exit after one for debug
	#exit 0
 
done

The summary counts are located in line 7 and 8 of current log files. To get this data shown as a nice two-column table, use any of the following codelets:

# in unix you often find many ways to do the same thing, pick your favorite!
head -8 SRR1039509_all_mdup.txt | tail -2 | transpose -t | column -t -s $'\t'
awk 'NR >= 7 && NR <= 8' SRR1039509_all_mdup.txt | transpose -t | column -t -s $'\t'
sed -n 7,8p SRR1039509_all_mdup.txt | transpose -t | column -t -s $'\t'
perl -ne 'print if 7..8' SRR1039509_all_mdup.txt | transpose -t | column -t -s $'\t'

Picard MarkDuplicate results for the SRR1039509_all sample

head tophat2.0.13_results/markdup_bams/chr22_SRR1039509_all_mdup.txt | grep -v "^#" | transpose -t | column -t
 
LIBRARY                       Unknown   Library
UNPAIRED_READS_EXAMINED       951
READ_PAIRS_EXAMINED           388897
UNMAPPED_READS                0
UNPAIRED_READ_DUPLICATES      256
READ_PAIR_DUPLICATES          70056
READ_PAIR_OPTICAL_DUPLICATES  0
PERCENT_DUPLICATION           0.180249
ESTIMATED_LIBRARY_SIZE        945478

 

download exercise files

Download exercise files here.

Use the right application to open the files present in ex3-files

References:
  1. Fidel Ramírez, Friederike Dündar, Sarah Diehl, Björn A Grüning, Thomas Manke
    deepTools: a flexible platform for exploring deep-sequencing data.
    Nucleic Acids Res.: 2014, 42(Web Server issue);W187-91
    [PubMed:24799436] ##WORLDCAT## [DOI] (I p)

  2. https://github.com/fidelram/deepTools/wiki/All-command-line-options
  3. Liguo Wang, Shengqin Wang, Wei Li
    RSeQC: quality control of RNA-seq experiments.
    Bioinformatics: 2012, 28(16);2184-5
    [PubMed:22743226] ##WORLDCAT## [DOI] (I p)

  4. http://rseqc.sourceforge.net/
  5. http://qualimap.bioinfo.cipf.es/
  6. Fernando García-Alcalde, Konstantin Okonechnikov, José Carbonell, Luis M Cruz, Stefan Götz, Sonia Tarazona, Joaquín Dopazo, Thomas F Meyer, Ana Conesa
    Qualimap: evaluating next-generation sequencing alignment data.
    Bioinformatics: 2012, 28(20);2678-9
    [PubMed:22914218] ##WORLDCAT## [DOI] (I p)

  7. http://broadinstitute.github.io/picard/
  8. Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel, Mark Daly, Mark A DePristo
    The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.
    Genome Res.: 2010, 20(9);1297-303
    [PubMed:20644199] ##WORLDCAT## [DOI] (I p)

  9. https://www.broadinstitute.org/gatk/about/cited-by

[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.2 | NGS_RNASeq_DE_Exercise.4 ]