NGS RNASeq DE Exercise.1c

From BITS wiki
Jump to: navigation, search

Effect of trimming reads on their ability to map to the reference

[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.1 ]


In this exercise we try to assess the beneficial effect of trimming lower-quality reads on their ability to align to the reference genome and to contribute to gene expression counts. This is a single experiment and does not aim at being an absolute proof but you are free to repeat this comparison with your own 'suboptimal' dataset. The idea is that reads with leftover adaptor sequences of long stretches of low quality bases might be excluded during mapping while the shorted version resulting from trimming could instead be mapped and add to gene counts. The effect of trimming could be beneficial to low-expressed genes for which a small difference in counts may have impact on the final expression value.

Map sample N61311_Dex (SRR1039509) to hg19

The first Dex-teated sample was previously shown to contain linkers as well as present a spot of lower quality reads and was processed using cutadaptor in a separate exercise. The mapping of the raw or trimmed reads was done against the full genome ('all') using tophat and code reproduced below.

tophat_map-all.sh script

#!/bin/bash
# tophat_map-all.sh
 
# the script should be called from the base folder
# the base folder includes the 'reads' folder named below
 
## full data
# reads="SRP033351_fastq"
 
## one sample
reads="SRR1039509_fastq"
 
## chr22 subset
# reads="chr22_reads"
 
results="tophat2.0.13_results"
mkdir -p ${results}
 
# folder and file options
BOWTIE2_INDEXES=$BIODATA/bowtie2_indexes
fasta=$BOWTIE2_INDEXES/hg19.fa
index=$BOWTIE2_INDEXES/hg19
gtffile=$BOWTIE2_INDEXES/hg19_ensGene.gtf
gtfindex=$BOWTIE2_INDEXES/hg19_ensGene
 
# how hard can your computer work? (max cpu# that may be used)
nthr=1 # using 'nthr' processors
maxram="2G"
 
# on a stronger, this will run faster
#nthr=24 # using 'nthr' processors
#maxram="64G"
 
# map single ends in (un-guided|guided) modes (not|using) the gtf file
for fq1 in $(ls ${reads}/*_1.fastq.gz); do
 
 # get the base-name without end_suffix
 name=$(basename "${fq1}" "_1.fastq.gz")
 echo
 echo "# mapping the ${name} pair against the whole genome"
 
 # deduce second read file name
 fq2=$(echo "$fq1" | sed -r 's/_1.fastq.gz/_2.fastq.gz/')
 
 # unguided mode
 # execute the command and write summary results in a 'log' file
 cmd="tophat -p ${nthr} \
 	-o ${results}/${name}_all-mappings \
 	--no-coverage-search \
 	${index} \
 	${fq1} ${fq2} \
 	2>&1 | tee -a ${results}/tophat-all-log_${name}.txt"
 
 echo "# ${cmd}"
 eval ${cmd}
 
 # if tophat was a success, sort, rename, and index mapping results
 # run in background with good resource while starting the next loop on 1 thread
 if [ $? = 0 ]; then
  # use 4 threads and up to 4GB RAM (default are 1 thread and 768M)
  samtools sort -@ ${nthr} -m ${maxram} \
  	${results}/${name}_all-mappings/accepted_hits.bam \
  	${results}/${name}_all-tophat \
	&& samtools index ${results}/${name}_all-tophat.bam
 
  # also perform basic BAM QC on mappings
  # the code is provided in a separate script 'mapping_metrics.sh'
  scripts/mapping_metrics.sh ${results}/${name}_all-tophat.bam
 fi
 
 echo
 
# print separator
A=$(printf "%50s\n")
echo ${A// /#}
echo
 
# abort after first loop debug-mode
# exit 0
 
done

tophat_map-all-trimmed.sh script

#!/bin/bash
# tophat_map-all-trimmed.sh
 
# the script should be called from the base folder
# the base folder includes the 'reads' folder named below
 
## one sample trimmed
reads="trimmed-SRP033351_fastq"
 
# trimmed results
results="tophat2.0.13-SRR1039509-trimmed_results"
 
mkdir -p ${results}
 
# folder and file options
BOWTIE2_INDEXES=$BIODATA/bowtie2_indexes
fasta=$BOWTIE2_INDEXES/hg19.fa
index=$BOWTIE2_INDEXES/hg19
gtffile=$BOWTIE2_INDEXES/hg19_ensGene.gtf
gtfindex=$BOWTIE2_INDEXES/hg19_ensGene
 
# how hard can your computer work? (max cpu# that may be used)
#nthr=1 # using 'nthr' processors
#maxram="4G"
 
# on a stronger, this will run faster
nthr=24 # using 'nthr' processors
maxram="64G"
 
# map single ends in (un-guided|guided) modes (not|using) the gtf file
for fq1 in $(ls ${reads}/*_1.fastq.gz); do
 
 # get the base-name without end_suffix
 name=$(basename "${fq1}" "_1.fastq.gz")
 echo
 echo "# mapping the ${name} pair against the whole genome"
 
 # deduce second read file name
 fq2=$(echo "$fq1" | sed -r 's/_1.fastq.gz/_2.fastq.gz/')
 
 # unguided mode
 # execute the command and write summary results in a 'log' file
 cmd="tophat -p ${nthr} \
 	-o ${results}/${name}_all-mappings \
 	--no-coverage-search \
 	${index} \
 	${fq1} ${fq2} \
 	2>&1 | tee -a ${results}/tophat-all-log_${name}.txt"
 
 echo "# ${cmd}"
 eval ${cmd}
 
 # if tophat was a success, sort, rename, and index mapping results
 # run in background with good resource while starting the next loop on 1 thread
 if [ $? = 0 ]; then
  # use 4 threads and up to 4GB RAM (default are 1 thread and 768M)
  samtools sort -@ ${nthr} -m ${maxram} \
  	${results}/${name}_all-mappings/accepted_hits.bam \
  	${results}/${name}_all-tophat \
	&& samtools index ${results}/${name}_all-tophat.bam
 
  # also perform basic BAM QC on mappings
  # the code is provided in a separate script 'mapping_metrics.sh'
  scripts/mapping_metrics.sh ${results}/${name}_all-tophat.bam
 fi
 
 echo
 
# print separator
A=$(printf "%50s\n")
echo ${A// /#}
echo
 
# abort after first loop debug-mode
# exit 0
 
done

 

Compare mapping data

Different packages allow comparing BAM files pairwise, including bedtools, picard, and qualimap. We applied each tool here to provide you code that can be adapted to your needs. Qualimap will be used in a later exercise to compare all-16 samples based on their chr22 data.

bedtools_bam-compare.sh script

#!/bin/bash
# bedtools_bam-compare.sh
 
# compare the results of 'all' tophat mapping
# between unprocessed and trimmed SRR1039509 reads
 
# parameters
orimap="tophat2.0.13-SRR1039509_results/SRR1039509_all-tophat.bam"
trimmedmap="tophat2.0.13-SRR1039509-trimmed_results/SRR1039509_all-tophat.bam"
 
results=trimming-effect_SRR1039509/bedtools_bam-compare
mkdir -p ${results}
 
# extract the coverage from each bam in bedgraph format file using bedtools genomecov
bedtools genomecov -bg \
    -ibam ${orimap} \
    | bgzip -c > ${results}/orimap.bedgraph.gz
 
bedtools genomecov -bg \
    -ibam ${trimmedmap} \
    | bgzip -c > ${results}/trimmedmap.bedgraph.gz
 
# process both coverage bedgraph files using bedtools unionbedg
hg19idx=$BOWTIE2_INDEXES/hg19.fa.fai
 
bedtools unionbedg -header \
    -i ${results}/orimap.bedgraph.gz ${results}/trimmedmap.bedgraph.gz \
    -names 'raw-reads' 'trimmed-reads' \
    -g ${hg19idx} \
    -empty \
    | bgzip -c > ${results}/bedtools-unionbedg.bg.gz

picard_bam-compare.sh script

#!/bin/bash
# picard_bam-compare.sh
 
# compair 'all' mappings pairwise
 
results=trimming-effect_SRR1039509/picard_bam-compare
mkdir -p ${results}
 
# parameters
orimap="tophat2.0.13-SRR1039509_results/SRR1039509_all-tophat.bam"
trimmedmap="tophat2.0.13-SRR1039509-trimmed_results/SRR1039509_all-tophat.bam"
 
java -Xmx4g -jar $PICARD/picard.jar CompareSAMs \
	${orimap} \
	${trimmedmap} \
	>${results}/picard_compare_${met}.txt \
	2>${results}/CompareSAMs_${met}.log

deeptool_bamCorrelate.sh script

#!/bin/bash
# deeptool_bamCorrelate.sh
 
# required:
# deeptool bamCorrelate
# https://github.com/fidelram/deepTools/wiki/QC#bamCorrelate
 
results=trimming-effect_SRR1039509/deeptool_bamCorrelate
mkdir -p ${results}
 
echo "# correlation for SRR1039509_all tophat results'
 
# parameters
orimap="tophat2.0.13-SRR1039509_results/SRR1039509_all-tophat.bam"
trimmedmap="tophat2.0.13-SRR1039509-trimmed_results/SRR1039509_all-tophat.bam"
 
corM=spearman # spearman or pearson
outfile=SRR1039509_all-mapping_${corM}-correlation-plot.png
 
bamCorrelate bins\
    --bamfiles ${orimap} ${trimmedmap} \
    --binSize 10000 \
    --labels 'raw-reads' 'trimmed-reads' \
    --plotFile ${results}/${outfile} \
    --corMethod ${corM}

 

download exercise files

Download exercise files here.

Use the right application to open the files present in ex1-files

References:

[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.1 | NGS_RNASeq_DE_Exercise.2 ]