Map all reads to the current reference (hg19) and extract paired-end reads mapping to chr22

[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.1 | NGS_RNASeq_DE_Exercise.2 ]

map sample N61311_Dex (SRR1039509) to hg19

The mapping was performed with the same code that will be used in NGS_RNASeq_DE_Exercise.2 but taking as input only one sample (SRR1039509).

tophat_map.sh script

#!/bin/bash
# tophat_map.sh
 
# the script should be called from the base folder
# the base folder includes the 'reads' folder named below
 
## full data
# reads="SRP033351_fastq"
 
## one sample
reads="SRR1039509_fastq"
 
## chr22 subset
# reads="chr22_reads"
 
results="tophat2.0.13_results"
mkdir -p ${results}
 
# folder and file options
BOWTIE2_INDEXES=$BIODATA/bowtie2_indexes
fasta=$BOWTIE2_INDEXES/hg19.fa
index=$BOWTIE2_INDEXES/hg19
gtffile=$BOWTIE2_INDEXES/hg19_ensGene.gtf
gtfindex=$BOWTIE2_INDEXES/hg19_ensGene
 
# how hard can your computer work? (max cpu# that may be used)
#nthr=1 # using 'nthr' processors
#maxram="4G"
 
# on a stronger, this will run faster
nthr=24 # using 'nthr' processors
maxram="64G"
 
# map single ends in (un-guided|guided) modes (not|using) the gtf file
for fq1 in $(ls ${reads}/*_1.fastq.gz); do
 
 # get the base-name without end_suffix
 name=$(basename "${fq1}" "_1.fastq.gz")
 echo
 echo "# mapping the ${name} pair against the whole genome"
 
 # deduce second read file name
 fq2=$(echo "$fq1" | sed -r 's/_1.fastq.gz/_2.fastq.gz/')
 
 # unguided mode
 # execute the command and write summary results in a 'log' file
 cmd="tophat -p ${nthr} \
 	-o ${results}/${name}_all-mappings \
 	--no-coverage-search \
 	${index} \
 	${fq1} ${fq2} \
 	2>&1 | tee -a ${results}/tophat-all-log_${name}.txt"
 
 echo "# ${cmd}"
 eval ${cmd}
 
 # if tophat was a success, sort, rename, and index mapping results
 # run in background with good resource while starting the next loop on 1 thread
 if [ $? = 0 ]; then
  # use 4 threads and up to 4GB RAM (default are 1 thread and 768M)
  samtools sort -@ ${nthr} -m ${maxram} \
  	${results}/${name}_all-mappings/accepted_hits.bam \
  	${results}/${name}_all-tophat \
	&& samtools index ${results}/${name}_all-tophat.bam
 
  # also perform basic BAM QC on mappings
  # the code is provided in a separate script 'mapping_metrics.sh'
  scripts/mapping_metrics.sh ${results}/${name}_all-tophat.bam
 fi
 
 echo
 echo "# mapping the ${name} pair against the 'ensGene' transcript models"
 
 # guided mode with 'ensGene' as model
 # execute the command and write summary results in a 'log' file
 # first time -G ${gtffile} then --transcriptome-index ${gtfindex}
 cmd="tophat \
 	-p ${nthr} \
 	-o ${results}/${name}_gtf-mappings \
 	--transcriptome-index ${gtfindex} \
 	--no-coverage-search \
 	${index} \
 	${fq1} ${fq2} \
 	2>&1 | tee -a ${results}/tophat-gtf-log_${name}.txt"
 
 echo "# ${cmd}"
 eval ${cmd}
 
 # if tophat was a success, sort, rename, and index mapping results
 if [ $? = 0 ]; then
  # use $nthr threads and up to $maxram RAM (default are 1 thread and 768M)
  samtools sort -@ ${nthr} -m ${maxram} \
  	${results}/${name}_gtf-mappings/accepted_hits.bam \
  	${results}/${name}_gtf-tophat \
  	&& samtools index ${results}/${name}_gtf-tophat.bam
 
  # also perform basic BAM QC on mappings
  # the code is provided in a separate script 'mapping_metrics.sh'
  scripts/mapping_metrics.sh ${results}/${name}_gtf-tophat.bam
 fi
 
# print separator
A=$(printf "%50s\n")
echo ${A// /#}
echo
 
# abort after first loop debug-mode
# exit 0
 
done

results after marking duplicates

# MarkDuplicates results for SRR1039509_all
LIBRARY Unknown Library
UNPAIRED_READS_EXAMINED 1144473
READ_PAIRS_EXAMINED 19431124
UNMAPPED_READS 0
UNPAIRED_READ_DUPLICATES 744108
READ_PAIR_DUPLICATES 2534219
READ_PAIR_OPTICAL_DUPLICATES 0
PERCENT_DUPLICATION 0.145289
ESTIMATED_LIBRARY_SIZE 67865503

# MarkDuplicates results for SRR1039509_gtf
LIBRARY Unknown Library
UNPAIRED_READS_EXAMINED 910020
READ_PAIRS_EXAMINED 19686964
UNMAPPED_READS 0
UNPAIRED_READ_DUPLICATES 620511
READ_PAIR_DUPLICATES 2662715
READ_PAIR_OPTICAL_DUPLICATES 0
PERCENT_DUPLICATION 0.147601
ESTIMATED_LIBRARY_SIZE 66056461

extract paired reads mapping to chr22

This step was required to create a smaller sized dataset that could be used during our very short training session and due to the very limited capacity of the training laptops as compared to the BITS server.

We choose the chr22 with a total size of 51Mb (51,304,566 bp) of which the first ~15Mb (short-arm) do not carry known genes, thereby representing <1.6% of the whole human genome (3,137,144,693 bps for GRCh37). All reads mapping at either end to the chr22 were retained during this step. This ensures a maximal recovery of candidate chr22 sequences and introduces a small amount of noise due to multiple mapping reads that may belong to other chromosome regions

Extracting reads from mappings is not something you will do everyday, but it can be a nice exercise if you find a good dataset published some time ago and only avaiblable as BAM and if you would like to improve the results by re-mapping the reads using a more accurate method or a more recent reference genome. In that case, you will need to extract the reads from the BAM data as described below.

We process here only 'all' mapping data (BAM files obtained with tophat using the full genome as reference and sorted by coordinate). In order to simulate read paired data, we used samtools flag filtering to keep only properly paired mappings. After coordinate filtering, the selected mappings were re-sorted in queryname order before final fastQ extraction using picar, thereby creating a subset resembling to real read data.

extract_chr22_to_fastq.sh script

#!/bin/bash
# extract_chr22_to_fastq.sh
# run on each BAM file
 
# edit here
maxram="4G"
infolder=tophat2.0.13-SRR1039509_results
outfolder=SRR1039509-chr22_fastq
mkdir -p ${outfolder}
 
bam=${infolder}/SRR1039509_all-tophat.bam
sbam=${bam//.bam/_s.bam}
 
echo "# processing ${bam}"
 
# get base name
b=$(basename ${bam} "_all-tophat.bam")
 
# sorted version required for region extraction
java -Xmx${maxram} -jar $PICARD/picard.jar SortSam \
	I=${bam} \
	O=${sbam} \
	SO=coordinate \
	MAX_RECORDS_IN_RAM=2000000
 
# index the obtained data before extraction
# same as: samtools index ${sbam}
java -Xmx${maxram} -jar $PICARD/picard.jar BuildBamIndex \
	I=${sbam} \
	O=${sbam}.bai
 
# extract chr22 mappings
# only mapped and properly paired reads (tag=0x2)
samtools view -b -f 0x2 ${sbam} chr22 \
	> ${infolder}/chr22_${b}_all.bam
 
# re-sort by read names to group reads in pairs
java -Xmx${maxram} -jar $PICARD/picard.jar SortSam \
	I=${infolder}/chr22_${b}_all.bam \
	O=${infolder}/chr22_${b}_all-n.bam \
	SO=queryname \
	MAX_RECORDS_IN_RAM=2000000
 
# convert back to fastq using picard
# be LENIENT for unpaired reads
java -Xmx${maxram} -jar $PICARD/picard.jar SamToFastq \
	I=${infolder}/chr22_${b}_all-n.bam \
	F=>(bgzip -c > ${outfolder}/chr22_${b}_1.fastq.gz) \
	F2=>(bgzip -c > ${outfolder}/chr22_${b}_2.fastq.gz) \
	RE_REVERSE=TRUE \
	VALIDATION_STRINGENCY=LENIENT

This method was based on a number of pages including an excellent Broad-GATK overview ^[1] and a number of notes posted in SeqAnswers ^[2]

download exercise files

Download exercise files here.

Use the right application to open the files present in ex1-files

References:

[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.1 | NGS_RNASeq_DE_Exercise.2 ]

[1] ttp://gatkforums.broadinstitute.org/discussion/2908/howto-revert-a-bam-file-to-fastq-format

[2] ttp://seqanswers.com/forums/showthread.php?t=16433

[1]

[2]

NGS RNASeq DE Exercise.1b

map sample N61311_Dex (SRR1039509) to hg19

extract paired reads mapping to chr22

download exercise files

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox