NGS RNASeq DE Exercise.3
Control mappings with RSeQC or Qualimap and mark Duplicates with Picard
[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.2 | NGS_RNASeq_DE_Exercise.4 ]
Contents
Introduction
As for FastQ reads, it is vital to control the quality of mapping results before proceeding with the RNASeq workflow. The mapping to a reference genome has now sorted out the reads and it becomes possible to identify issues related to extreme or absent coverage in some regions of the genome as well as identify duplicate reads that may originate from library preparation biases (PCR duplicates) or from scanning artifacts during Illumina sequencing (optical duplicates).
QC of mapping results
Mapping will reveal artifacts like RNA degradation or unexpected sequence content more precisely than the analysis of raw FastQ data (fastQC). Several program exist to perform QC of mappings (BAM) data; we selected here RSeQC and QualiMap that return partly overlapping results and can both be used under at command line under unix OS or with GUI on several operating systems (Java).
Comparing the 16 chr22-only gtf-BAM files with deeptools
We should first check for the presence of a batch effect dividing sample into groups by computing pairwise correlations between BAM files with deeptools bamCorrelate[1][2]. The expected result in the absence of batch effect should be a full correlation between all samples as the global transcript expression profile should not vary significantly (>80% of genes are not DE).
deeptool_bamCorrelate-all.sh script
#!/bin/bash # deeptool_bamCorrelate_chr22-all.sh # global correlation # we now compare all BAM files at once to look for clustering between replicates # we extract only chr22 mappings and compare to ensGene loci # required: # deeptool bamCorrelate all 16 files (chr22 only and from 'all' mappings) # https://github.com/fidelram/deepTools/wiki/QC#bamCorrelate # https://github.com/fidelram/deepTools/wiki/All-command-line-options # speedup with --numberOfProcessors nthr matching your cpu# nthr=24 bamfolder=SRR1039509-chr22_bam results=bamCorrelate_chr22-all mkdir -p ${results} ### DOWNLOAD and expand reference transcript models ### # get reference files from the RSeQC server if absent locally mkdir -p ref # refseq models if [[ ! -f ref/hg19_RefSeq.bed.gz ]]; then wget -P ref \ http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/hg19_RefSeq.bed.gz \ && zgrep "^chr22" ref/hg19_RefSeq.bed.gz > ref/chr22-hg19_RefSeq.bed fi # ensembl models if [[ ! -f ref/hg19_Ensembl.bed.gz ]]; then wget -P ref \ http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/hg19_Ensembl.bed.gz \ && zgrep "^chr22" ref/hg19_Ensembl.bed.gz > ref/chr22-hg19_Ensembl.bed fi # we choose the Ensembl standard bedfile="ref/chr22-hg19_Ensembl.bed" # create bam file list bamfiles=$(ls ${bamfolder}/*.bam) # deduce labels from files l1=${bamfiles//${bamfolder}\/chr22_/} labels=${l1//_all.bam/} # build the deeptool command corM=spearman # spearman or pearson cmd="bamCorrelate bins \ --bamfiles ${bamfiles} \ --labels ${labels} \ --fragmentLength 200 \ --numberOfProcessors ${nthr} \ --outFileCorMatrix ${results}/correlation_matrix.txt \ --binSize 1000 \ --corMethod ${corM} \ -o ${results}/correlation_${corM}.pdf \ 2>&1 | tee -a ${results}/bamCorrelate_log.txt" echo "# ${cmd}" eval ${cmd}
bamCorrelate results | |
---|---|
RSeQC analysis of the BAM data
This toolbox is written in python and C and can be installed more or less easily on any system. It provides very useful tools that can tell a lot about biases and issues with the RNASeq data[3]. As always, we strongly advise to carefully check QC of each step of the workflow that will impact on the follow-up steps. The RSeQC toolbox documentation can be found here[4].
The RSeQC script for one BAM file is detailed below to successively apply several RSeQC commands.
RSeQC_chr22-bam.sh script
#!/bin/bash # RSeQC_chr22-bam.sh # takes a BAM file from $1 and performs several RSeQC analyses if [ $# -lt 1 ]; then echo "Usage: ${0##*/} <BAM file>" exit fi # process only one BAM file # batch can be done under bash with a for loop bam=$1 # help: batch apply this script to all bams with ## for b in markdup_bams/*.bam; do \ ## echo "# analyzing ${b}" \ ## scripts/RSeQC_bam.sh $b\ ## done qc_results="RSeQC_chr22-results" mkdir -p ${qc_results} echo "# analysing: ${bam}" name=$(basename "${bam%%_mdup.bam}") ### DOWNLOAD and expand reference transcript models ### # get reference files from the RSeQC server if absent locally mkdir -p ref # refseq models if [[ ! -f ref/hg19_RefSeq.bed.gz ]]; then wget -P ref \ http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/hg19_RefSeq.bed.gz fi # ensembl models if [[ ! -f ref/hg19_Ensembl.bed.gz ]]; then wget -P ref \ http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/hg19_Ensembl.bed.gz fi # we choose the EnsEMBL reference model orirefgene="hg19_Ensembl.bed.gz" refgene="hg19_ensGene.bed" # expand ensGene and sort chromosome records in ascending order # and create chr22 subset gzip -cd ref/${orirefgene} | \ sort -k 1V,1 -k2n,2 -k 3n,3 > ref/${refgene} && grep "^chr22" ref/${refgene} > ref/chr22-${refgene} ############################## # perform selected RSeQC tests # more tools exist, please refer to the online documentation. # select which reference transcriptome to use # or chr22 subset selref=ref/chr22-${refgene} #selref=ref/chr22-${refgene} # prefix to save results res=${qc_results}/${name} # build the 'infer_experiment' command # speculate how RNA-seq sequencing were configured cmd="infer_experiment.py \ -r ${selref} \ -i ${bam} \ -s 200000 \ > ${res}_inferred.txt \ 2>&1 | tee -a ${qc_results}/RSeQC-${name}_log.txt" echo "# ${cmd}" eval ${cmd} # build the 'inner_distance.py' command # calculate the inner distance (or insert size) between two paired RNA reads # The inner_distance might be a negative value if paired reads do overlap. cmd="inner_distance.py \ -r ${selref} \ -i ${bam} \ -o ${res}_inner_distance \ -k 1000000 \ -l -250 \ -u 250 \ -s 5 \ -q 30 \ 2>&1 | tee -a ${qc_results}/RSeQC-${name}_log.txt" echo "# ${cmd}" eval ${cmd} # build the 'junction_saturation' command # check if sequencing depth is deep enough for alternative splicing analyses. cmd="junction_saturation.py \ -r ${selref} \ -i ${bam} \ -o ${res} \ -l 5 \ -u 100 \ -m 50 \ -v 1" echo "# ${cmd}" eval ${cmd} # build the 'geneBody_coverage' command # compute coverage for all transcript models in percentiles cmd="geneBody_coverage.py \ -r ${selref} \ -i ${bam} \ -o ${res} \ 2>&1 | tee -a ${qc_results}/RSeQC-${name}_log.txt" echo "# ${cmd}" eval ${cmd} # determine reads duplication rate with 'read_duplication.py' cmd="read_duplication.py \ -i ${bam} \ -o ${res} \ -u 500 \ 2>&1 | tee -a ${qc_results}/RSeQC-${name}_log.txt" echo "# ${cmd}" eval ${cmd} echo "# end of RseQC analysis" # rem: the following 'ImageMagick' command converts all PDF outputs to PNG # and replaces internal '.' that would disturb some packages by '_' # not run: # for i in *.pdf; do name=${i/./_}; convert "$i" "${name/.pdf}".png; done
The processing of each BAM file takes some time and we have a reasonably large number of files to test. When the computer allows it becomes in such case quite interesting to use parallel processing. The first piece of code below starts parallel jobs for each BAM file fond in the mapping folder. Results are collected and stored by the script shown next.
Using GNU parallel to run 10 concurrent jobs
mappings=tophat2.0.13_results/markdup_bams find ${mappings}/*_mdup.bam | \ parallel --no-notice --tmpdir /tmp -j 10 'scripts/RSeQC_bam.sh {}'
The results for SRR1039509_all_mdup.bam are reported as illustration
- infer_experiment.py (read more here)
This is PairEnd Data Fraction of reads failed to determine: 0.1137 Fraction of reads explained by "1++,1--,2+-,2-+": 0.4405 Fraction of reads explained by "1+-,1-+,2++,2--": 0.4458
- junction_saturation.py: It’s very important to check if current sequencing depth is deep enough to perform alternative splicing analyses. For a well annotated organism, the number of expressed genes in particular tissue is almost fixed so the number of splice junctions is also fixed. The fixed splice junctions can be predetermined from reference gene model. All (annotated) splice junctions should be rediscovered from a saturated RNA-seq data, otherwise, downstream alternative splicing analysis is problematic because low abundance splice junctions are missing. This module checks for saturation by resampling 5%, 10%, 15%, ..., 95% of total alignments from BAM or SAM file, and then detects splice junctions from each subset and compares them to reference gene model. (read more here)
- geneBody_coverage.py: Read coverage over gene body. This module is used to check if reads coverage is uniform and if there is any 5’ or 3’ bias. (read more here)
- read_duplication.py: Two strategies were used to determine reads duplication rate: * Sequence based: reads with identical sequence are regarded as duplicated reads. * Mapping based: reads mapped to the exactly same genomic location are regarded as duplicated reads. For splice reads, reads mapped to the same starting position and splice the same way are regarded as duplicated reads.
- inner_distance.py: This module is used to calculate the inner distance (or insert size) between two paired RNA reads. The distance is the mRNA length between two paired fragments. We first determine the genomic (DNA) size between two paired reads: D_size = read2_start - read1_end, then ... (read more here)
RSeQC plots for SRR1039509 mappings | ||
---|---|---|
RSeQC plots for chr22_SRR1039509 mappings | ||
---|---|---|
Alternatively you can process all BAM files together with geneBody_coverage.py to obtain a single plot and identify faulty libraries.
bamdata=SRP033351-chr22_bam # create folder for results results=geneBody_coverage-chr22 mkdir -p ${results} # create a list with all BAM files, one per line ls ${bamdata}/*.bam | paste -s -d"\n" > ${results}/list.txt # apply geneBody_coverage.py in batch-mode cmd="geneBody_coverage.py \ -r ref/chr22-hg19_Ensembl.bed \ -i ${results}/list.txt \ -o ${results}/batch-geneBody_all" echo "# ${cmd}" eval ${cmd}
batch geneBody_coverage.py plots | ||
---|---|---|
Qualimap for the QC of BAM data
This recent Java toolbox [5] is another interesting alternative to obtain QC data on BAM files [6] presents a very attractive GUI, very similar to that of fastQC and complemented with the CLI version of each command to allow scripting and implementations at command line. We processed below one BAM file to illustrate the capabilities of Qualimap. Please refer to the documentation if you an to use it with more customization.
Qualimap_chr22-bam.sh script
#!/bin/bash # Qualimap_chr22-bam.sh # bam needs to be sorted by coordinate # takes a BAM file from $1 and performs several qualimap analyses if [ $# -lt 1 ]; then echo "Usage: ${0##*/} <BAM file>" exit fi # locate your qualimap installation folder QUALIMAP=$BIOTOOLS/qualimap # process only one BAM file # batch can be done under bash with a for loop bam=$1 echo "# analysing: ${bam}" name=$(basename "${bam%%_all-tophat.bam}") # create folders qc_reports="Qualimap_results/chr22-reports/${name}" mkdir -p ${qc_reports} qc_counts="Qualimap_results/chr22-counts" mkdir -p ${qc_counts} # we choose the EnsEMBL reference model #gtffile=$BOWTIE2_INDEXES/hg19_ensGene.gtf # but we also have a chr22 subset file gtffile=$BOWTIE2_INDEXES/chr22-hg19_ensGene.gtf ############################## # perform selected Qualimap tests # more tools exist, please refer to the online documentation. # default names sbam=${bam} qbam=${bam} # test if bam is query or coordinate-sorted order="$(samtools view -H ${bam} | head -1 | awk '{split($3, ord, ":"); print ord[2]}')" if [ "$order" = "coordinate" ]; then echo "# sorting in queryname order" # redefine qbam qbam=${bam//.bam/-q.bam} cmd="java -jar $PICARD/picard.jar SortSam \ I=${bam} \ O=${qbam} \ SO=queryname" eval ${cmd} else echo "# sorting in coordinate order" # redefine sbam sbam=${bam//.bam/-s.bam} cmd="java -jar $PICARD/picard.jar SortSam \ I=${bam} \ O=${sbam} \ SO=coordinate" eval ${cmd} fi #echo "# ${cmd}" #eval ${cmd} # run the BAMQC tool on coordinate-sorted data cmd="$QUALIMAP/qualimap bamqc \ -p non-strand-specific \ -bam ${sbam} \ -c \ -gd HUMAN \ -gff ${gtffile} \ -hm 3 \ -nr 1000 \ -nt 4 \ -nw 400 \ -outdir ${qc_reports} \ -outformat PDF \ -outfile ${qc_reports}/${name}-Qualimap-bamqc-report.pdf \ 2>&1 | tee -a ${qc_reports}/qualimap-bamqc_log.txt" echo "# ${cmd}" eval ${cmd} # print separator A=$(printf "%50s\n") echo ${A// /#} echo # run the RNASEQ tool on queryname-sorted data cmd="$QUALIMAP/qualimap rnaseq \ -p non-strand-specific \ -pe \ -bam ${qbam} \ -a uniquely-mapped-reads \ -gtf ${gtffile} \ -outdir ${qc_reports} \ -oc ${qc_counts}/${name}_counts.txt \ -outformat PDF \ -outfile ${qc_reports}/${name}-Qualimap-rnaseq-report.pdf \ 2>&1 | tee -a ${qc_reports}/qualimap-rnaseq_log.txt" echo "# ${cmd}" eval ${cmd} # deleted leftover sorted BAM if [ -f ${bam//.bam/-s.bam} ]; then rm ${bam//.bam/-s.bam} fi if [ -f ${bam//.bam/-s.bam} ]; then rm ${bam//.bam/-q.bam} fi echo "# end of Qualimap analysis"
Example Qualimap analyses results for chr22_CON_24_1_all-tophat.bam (considering only chr22 EnsGene regions): chr22_SRR1039509-Qualimap-bamqc-report.pdf; chr22_SRR1039509-Qualimap-rnaseq-report.pdf
Other tools for QC of mapping data
- A series of new tools have been recently added to the Picard toolbox[7], please consider evaluating these tools when you will have your own data. Picard tools constitute the mainframe of the very popular GATK platform [8] and are really hot [9].
Removing or Marking read duplicates
Do we need it?
The two classes of duplicates will be identified by Picard MarkDuplicates based on their sequence (PCR duplicates) and on their address on the Illumina slide (optical duplicates). In RNASeq analysis, it is not advisable to physically remove duplicate reads as they might in part be the result of cDNA synthesis of distinct mRNA molecules. We therefore limit the duplicate analysis to marking the reads so that next steps in the workflow may take their status into account.
Marking duplicates with Picard MarkDuplicates
bam_markdups.sh script
#!/bin/bash # bam_markdups.sh # mark optical and PCR duplicates from 'SO:coordinate'-sorted BAM file(s) # re-sort output by read name ('SO:queryname') # requires picard # allocate as much RAM as your computer can maxram=4G # mark optical and PCR duplicates from position-sorted BAM file(s) infolder="tophat2.0.13_results" outfolder=${infolder}/markdup_bams mkdir -p ${outfolder} for b in ${infolder}/*_all-tophat.bam; do echo "# processing ${b}" # get prefix name=$(basename ${b} "-tophat.bam") # mark duplicates and index result cmd="java -Xmx${maxram} -jar $PICARD/picard.jar MarkDuplicates \ I=${b} \ O=${outfolder}/${name}_mdup.bam \ M=${outfolder}/${name}_mdup.txt \ AS=TRUE \ 2>&1 | tee -a ${outfolder}/MarkDuplicates_${name}_log.txt" echo "# ${cmd}" eval ${cmd} # re-sort by querynames to comply with HTSeq-count # if last command did exit without error ($? -eq 0) if [ $? -eq 0 ]; then cmd="java -Xmx${maxram} -jar $PICARD/picard.jar SortSam \ I=${outfolder}/${name}_mdup.bam \ O=${outfolder}/s${name}_mdup.bam \ SO=queryname \ CREATE_INDEX=TRUE \ VALIDATION_STRINGENCY=LENIENT \ 2>&1 | tee -a ${outfolder}/MarkDuplicates_${name}_log.txt" echo "# ${cmd}" # delete original file if picard exists successfully eval ${cmd} && mv ${outfolder}/s${name}_mdup.bam \ ${outfolder}/${name}_mdup.bam else echo "# something went wrong!" exit 1 fi # output results to screen from text file echo "# MarkDuplicates results for ${name}" head ${outfolder}/${name}_mdup.txt | grep -v "^#" | transpose -t | column -t echo # exit after one for debug #exit 0 done
The summary counts are located in line 7 and 8 of current log files. To get this data shown as a nice two-column table, use any of the following codelets:
# in unix you often find many ways to do the same thing, pick your favorite! head -8 SRR1039509_all_mdup.txt | tail -2 | transpose -t | column -t -s $'\t' awk 'NR >= 7 && NR <= 8' SRR1039509_all_mdup.txt | transpose -t | column -t -s $'\t' sed -n 7,8p SRR1039509_all_mdup.txt | transpose -t | column -t -s $'\t' perl -ne 'print if 7..8' SRR1039509_all_mdup.txt | transpose -t | column -t -s $'\t'
Picard MarkDuplicate results for the SRR1039509_all sample
head tophat2.0.13_results/markdup_bams/chr22_SRR1039509_all_mdup.txt | grep -v "^#" | transpose -t | column -t LIBRARY Unknown Library UNPAIRED_READS_EXAMINED 951 READ_PAIRS_EXAMINED 388897 UNMAPPED_READS 0 UNPAIRED_READ_DUPLICATES 256 READ_PAIR_DUPLICATES 70056 READ_PAIR_OPTICAL_DUPLICATES 0 PERCENT_DUPLICATION 0.180249 ESTIMATED_LIBRARY_SIZE 945478
download exercise files
Download exercise files here.
References:
- ↑
Fidel Ramírez, Friederike Dündar, Sarah Diehl, Björn A Grüning, Thomas Manke
deepTools: a flexible platform for exploring deep-sequencing data.
Nucleic Acids Res: 2014, 42(Web Server issue);W187-91
[PubMed:24799436] ##WORLDCAT## [DOI] (I p) - ↑ https://github.com/fidelram/deepTools/wiki/All-command-line-options
- ↑
Liguo Wang, Shengqin Wang, Wei Li
RSeQC: quality control of RNA-seq experiments.
Bioinformatics: 2012, 28(16);2184-5
[PubMed:22743226] ##WORLDCAT## [DOI] (I p) - ↑ http://rseqc.sourceforge.net/
- ↑ http://qualimap.bioinfo.cipf.es/
- ↑
Fernando García-Alcalde, Konstantin Okonechnikov, José Carbonell, Luis M Cruz, Stefan Götz, Sonia Tarazona, Joaquín Dopazo, Thomas F Meyer, Ana Conesa
Qualimap: evaluating next-generation sequencing alignment data.
Bioinformatics: 2012, 28(20);2678-9
[PubMed:22914218] ##WORLDCAT## [DOI] (I p) - ↑ http://broadinstitute.github.io/picard/
- ↑
Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel, Mark Daly, Mark A DePristo
The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.
Genome Res: 2010, 20(9);1297-303
[PubMed:20644199] ##WORLDCAT## [DOI] (I p) - ↑ https://www.broadinstitute.org/gatk/about/cited-by
[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.2 | NGS_RNASeq_DE_Exercise.4 ]