NGS RNASeq DE Exercise.1c
Effect of trimming reads on their ability to map to the reference
[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.1 ]
In this exercise we try to assess the beneficial effect of trimming lower-quality reads on their ability to align to the reference genome and to contribute to gene expression counts. This is a single experiment and does not aim at being an absolute proof but you are free to repeat this comparison with your own 'suboptimal' dataset. The idea is that reads with leftover adaptor sequences of long stretches of low quality bases might be excluded during mapping while the shorted version resulting from trimming could instead be mapped and add to gene counts. The effect of trimming could be beneficial to low-expressed genes for which a small difference in counts may have impact on the final expression value.
Map sample N61311_Dex (SRR1039509) to hg19
The first Dex-teated sample was previously shown to contain linkers as well as present a spot of lower quality reads and was processed using cutadaptor in a separate exercise. The mapping of the raw or trimmed reads was done against the full genome ('all') using tophat and code reproduced below.
tophat_map-all.sh script
#!/bin/bash # tophat_map-all.sh # the script should be called from the base folder # the base folder includes the 'reads' folder named below ## full data # reads="SRP033351_fastq" ## one sample reads="SRR1039509_fastq" ## chr22 subset # reads="chr22_reads" results="tophat2.0.13_results" mkdir -p ${results} # folder and file options BOWTIE2_INDEXES=$BIODATA/bowtie2_indexes fasta=$BOWTIE2_INDEXES/hg19.fa index=$BOWTIE2_INDEXES/hg19 gtffile=$BOWTIE2_INDEXES/hg19_ensGene.gtf gtfindex=$BOWTIE2_INDEXES/hg19_ensGene # how hard can your computer work? (max cpu# that may be used) nthr=1 # using 'nthr' processors maxram="2G" # on a stronger, this will run faster #nthr=24 # using 'nthr' processors #maxram="64G" # map single ends in (un-guided|guided) modes (not|using) the gtf file for fq1 in $(ls ${reads}/*_1.fastq.gz); do # get the base-name without end_suffix name=$(basename "${fq1}" "_1.fastq.gz") echo echo "# mapping the ${name} pair against the whole genome" # deduce second read file name fq2=$(echo "$fq1" | sed -r 's/_1.fastq.gz/_2.fastq.gz/') # unguided mode # execute the command and write summary results in a 'log' file cmd="tophat -p ${nthr} \ -o ${results}/${name}_all-mappings \ --no-coverage-search \ ${index} \ ${fq1} ${fq2} \ 2>&1 | tee -a ${results}/tophat-all-log_${name}.txt" echo "# ${cmd}" eval ${cmd} # if tophat was a success, sort, rename, and index mapping results # run in background with good resource while starting the next loop on 1 thread if [ $? = 0 ]; then # use 4 threads and up to 4GB RAM (default are 1 thread and 768M) samtools sort -@ ${nthr} -m ${maxram} \ ${results}/${name}_all-mappings/accepted_hits.bam \ ${results}/${name}_all-tophat \ && samtools index ${results}/${name}_all-tophat.bam # also perform basic BAM QC on mappings # the code is provided in a separate script 'mapping_metrics.sh' scripts/mapping_metrics.sh ${results}/${name}_all-tophat.bam fi echo # print separator A=$(printf "%50s\n") echo ${A// /#} echo # abort after first loop debug-mode # exit 0 done
tophat_map-all-trimmed.sh script
#!/bin/bash # tophat_map-all-trimmed.sh # the script should be called from the base folder # the base folder includes the 'reads' folder named below ## one sample trimmed reads="trimmed-SRP033351_fastq" # trimmed results results="tophat2.0.13-SRR1039509-trimmed_results" mkdir -p ${results} # folder and file options BOWTIE2_INDEXES=$BIODATA/bowtie2_indexes fasta=$BOWTIE2_INDEXES/hg19.fa index=$BOWTIE2_INDEXES/hg19 gtffile=$BOWTIE2_INDEXES/hg19_ensGene.gtf gtfindex=$BOWTIE2_INDEXES/hg19_ensGene # how hard can your computer work? (max cpu# that may be used) #nthr=1 # using 'nthr' processors #maxram="4G" # on a stronger, this will run faster nthr=24 # using 'nthr' processors maxram="64G" # map single ends in (un-guided|guided) modes (not|using) the gtf file for fq1 in $(ls ${reads}/*_1.fastq.gz); do # get the base-name without end_suffix name=$(basename "${fq1}" "_1.fastq.gz") echo echo "# mapping the ${name} pair against the whole genome" # deduce second read file name fq2=$(echo "$fq1" | sed -r 's/_1.fastq.gz/_2.fastq.gz/') # unguided mode # execute the command and write summary results in a 'log' file cmd="tophat -p ${nthr} \ -o ${results}/${name}_all-mappings \ --no-coverage-search \ ${index} \ ${fq1} ${fq2} \ 2>&1 | tee -a ${results}/tophat-all-log_${name}.txt" echo "# ${cmd}" eval ${cmd} # if tophat was a success, sort, rename, and index mapping results # run in background with good resource while starting the next loop on 1 thread if [ $? = 0 ]; then # use 4 threads and up to 4GB RAM (default are 1 thread and 768M) samtools sort -@ ${nthr} -m ${maxram} \ ${results}/${name}_all-mappings/accepted_hits.bam \ ${results}/${name}_all-tophat \ && samtools index ${results}/${name}_all-tophat.bam # also perform basic BAM QC on mappings # the code is provided in a separate script 'mapping_metrics.sh' scripts/mapping_metrics.sh ${results}/${name}_all-tophat.bam fi echo # print separator A=$(printf "%50s\n") echo ${A// /#} echo # abort after first loop debug-mode # exit 0 done
Compare mapping data
Different packages allow comparing BAM files pairwise, including bedtools, picard, and qualimap. We applied each tool here to provide you code that can be adapted to your needs. Qualimap will be used in a later exercise to compare all-16 samples based on their chr22 data.
bedtools_bam-compare.sh script
#!/bin/bash # bedtools_bam-compare.sh # compare the results of 'all' tophat mapping # between unprocessed and trimmed SRR1039509 reads # parameters orimap="tophat2.0.13-SRR1039509_results/SRR1039509_all-tophat.bam" trimmedmap="tophat2.0.13-SRR1039509-trimmed_results/SRR1039509_all-tophat.bam" results=trimming-effect_SRR1039509/bedtools_bam-compare mkdir -p ${results} # extract the coverage from each bam in bedgraph format file using bedtools genomecov bedtools genomecov -bg \ -ibam ${orimap} \ | bgzip -c > ${results}/orimap.bedgraph.gz bedtools genomecov -bg \ -ibam ${trimmedmap} \ | bgzip -c > ${results}/trimmedmap.bedgraph.gz # process both coverage bedgraph files using bedtools unionbedg hg19idx=$BOWTIE2_INDEXES/hg19.fa.fai bedtools unionbedg -header \ -i ${results}/orimap.bedgraph.gz ${results}/trimmedmap.bedgraph.gz \ -names 'raw-reads' 'trimmed-reads' \ -g ${hg19idx} \ -empty \ | bgzip -c > ${results}/bedtools-unionbedg.bg.gz
picard_bam-compare.sh script
#!/bin/bash # picard_bam-compare.sh # compair 'all' mappings pairwise results=trimming-effect_SRR1039509/picard_bam-compare mkdir -p ${results} # parameters orimap="tophat2.0.13-SRR1039509_results/SRR1039509_all-tophat.bam" trimmedmap="tophat2.0.13-SRR1039509-trimmed_results/SRR1039509_all-tophat.bam" java -Xmx4g -jar $PICARD/picard.jar CompareSAMs \ ${orimap} \ ${trimmedmap} \ >${results}/picard_compare_${met}.txt \ 2>${results}/CompareSAMs_${met}.log
deeptool_bamCorrelate.sh script
#!/bin/bash # deeptool_bamCorrelate.sh # required: # deeptool bamCorrelate # https://github.com/fidelram/deepTools/wiki/QC#bamCorrelate results=trimming-effect_SRR1039509/deeptool_bamCorrelate mkdir -p ${results} echo "# correlation for SRR1039509_all tophat results' # parameters orimap="tophat2.0.13-SRR1039509_results/SRR1039509_all-tophat.bam" trimmedmap="tophat2.0.13-SRR1039509-trimmed_results/SRR1039509_all-tophat.bam" corM=spearman # spearman or pearson outfile=SRR1039509_all-mapping_${corM}-correlation-plot.png bamCorrelate bins\ --bamfiles ${orimap} ${trimmedmap} \ --binSize 10000 \ --labels 'raw-reads' 'trimmed-reads' \ --plotFile ${results}/${outfile} \ --corMethod ${corM}
download exercise files
Download exercise files here.
References:
[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.1 | NGS_RNASeq_DE_Exercise.2 ]