NGS RNASeq DE Exercise.1b
Map all reads to the current reference (hg19) and extract paired-end reads mapping to chr22
[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.1 | NGS_RNASeq_DE_Exercise.2 ]
map sample N61311_Dex (SRR1039509) to hg19
The mapping was performed with the same code that will be used in NGS_RNASeq_DE_Exercise.2 but taking as input only one sample (SRR1039509).
tophat_map.sh script
#!/bin/bash # tophat_map.sh # the script should be called from the base folder # the base folder includes the 'reads' folder named below ## full data # reads="SRP033351_fastq" ## one sample reads="SRR1039509_fastq" ## chr22 subset # reads="chr22_reads" results="tophat2.0.13_results" mkdir -p ${results} # folder and file options BOWTIE2_INDEXES=$BIODATA/bowtie2_indexes fasta=$BOWTIE2_INDEXES/hg19.fa index=$BOWTIE2_INDEXES/hg19 gtffile=$BOWTIE2_INDEXES/hg19_ensGene.gtf gtfindex=$BOWTIE2_INDEXES/hg19_ensGene # how hard can your computer work? (max cpu# that may be used) #nthr=1 # using 'nthr' processors #maxram="4G" # on a stronger, this will run faster nthr=24 # using 'nthr' processors maxram="64G" # map single ends in (un-guided|guided) modes (not|using) the gtf file for fq1 in $(ls ${reads}/*_1.fastq.gz); do # get the base-name without end_suffix name=$(basename "${fq1}" "_1.fastq.gz") echo echo "# mapping the ${name} pair against the whole genome" # deduce second read file name fq2=$(echo "$fq1" | sed -r 's/_1.fastq.gz/_2.fastq.gz/') # unguided mode # execute the command and write summary results in a 'log' file cmd="tophat -p ${nthr} \ -o ${results}/${name}_all-mappings \ --no-coverage-search \ ${index} \ ${fq1} ${fq2} \ 2>&1 | tee -a ${results}/tophat-all-log_${name}.txt" echo "# ${cmd}" eval ${cmd} # if tophat was a success, sort, rename, and index mapping results # run in background with good resource while starting the next loop on 1 thread if [ $? = 0 ]; then # use 4 threads and up to 4GB RAM (default are 1 thread and 768M) samtools sort -@ ${nthr} -m ${maxram} \ ${results}/${name}_all-mappings/accepted_hits.bam \ ${results}/${name}_all-tophat \ && samtools index ${results}/${name}_all-tophat.bam # also perform basic BAM QC on mappings # the code is provided in a separate script 'mapping_metrics.sh' scripts/mapping_metrics.sh ${results}/${name}_all-tophat.bam fi echo echo "# mapping the ${name} pair against the 'ensGene' transcript models" # guided mode with 'ensGene' as model # execute the command and write summary results in a 'log' file # first time -G ${gtffile} then --transcriptome-index ${gtfindex} cmd="tophat \ -p ${nthr} \ -o ${results}/${name}_gtf-mappings \ --transcriptome-index ${gtfindex} \ --no-coverage-search \ ${index} \ ${fq1} ${fq2} \ 2>&1 | tee -a ${results}/tophat-gtf-log_${name}.txt" echo "# ${cmd}" eval ${cmd} # if tophat was a success, sort, rename, and index mapping results if [ $? = 0 ]; then # use $nthr threads and up to $maxram RAM (default are 1 thread and 768M) samtools sort -@ ${nthr} -m ${maxram} \ ${results}/${name}_gtf-mappings/accepted_hits.bam \ ${results}/${name}_gtf-tophat \ && samtools index ${results}/${name}_gtf-tophat.bam # also perform basic BAM QC on mappings # the code is provided in a separate script 'mapping_metrics.sh' scripts/mapping_metrics.sh ${results}/${name}_gtf-tophat.bam fi # print separator A=$(printf "%50s\n") echo ${A// /#} echo # abort after first loop debug-mode # exit 0 done
results after marking duplicates
LIBRARY Unknown Library
UNPAIRED_READS_EXAMINED 1144473
READ_PAIRS_EXAMINED 19431124
UNMAPPED_READS 0
UNPAIRED_READ_DUPLICATES 744108
READ_PAIR_DUPLICATES 2534219
READ_PAIR_OPTICAL_DUPLICATES 0
PERCENT_DUPLICATION 0.145289
ESTIMATED_LIBRARY_SIZE 67865503
# MarkDuplicates results for SRR1039509_gtf
LIBRARY Unknown Library
UNPAIRED_READS_EXAMINED 910020
READ_PAIRS_EXAMINED 19686964
UNMAPPED_READS 0
UNPAIRED_READ_DUPLICATES 620511
READ_PAIR_DUPLICATES 2662715
READ_PAIR_OPTICAL_DUPLICATES 0
PERCENT_DUPLICATION 0.147601
ESTIMATED_LIBRARY_SIZE 66056461
extract paired reads mapping to chr22
This step was required to create a smaller sized dataset that could be used during our very short training session and due to the very limited capacity of the training laptops as compared to the BITS server.
We choose the chr22 with a total size of 51Mb (51,304,566 bp) of which the first ~15Mb (short-arm) do not carry known genes, thereby representing <1.6% of the whole human genome (3,137,144,693 bps for GRCh37). All reads mapping at either end to the chr22 were retained during this step. This ensures a maximal recovery of candidate chr22 sequences and introduces a small amount of noise due to multiple mapping reads that may belong to other chromosome regions
Extracting reads from mappings is not something you will do everyday, but it can be a nice exercise if you find a good dataset published some time ago and only avaiblable as BAM and if you would like to improve the results by re-mapping the reads using a more accurate method or a more recent reference genome. In that case, you will need to extract the reads from the BAM data as described below.
We process here only 'all' mapping data (BAM files obtained with tophat using the full genome as reference and sorted by coordinate). In order to simulate read paired data, we used samtools flag filtering to keep only properly paired mappings. After coordinate filtering, the selected mappings were re-sorted in queryname order before final fastQ extraction using picar, thereby creating a subset resembling to real read data.
extract_chr22_to_fastq.sh script
#!/bin/bash # extract_chr22_to_fastq.sh # run on each BAM file # edit here maxram="4G" infolder=tophat2.0.13-SRR1039509_results outfolder=SRR1039509-chr22_fastq mkdir -p ${outfolder} bam=${infolder}/SRR1039509_all-tophat.bam sbam=${bam//.bam/_s.bam} echo "# processing ${bam}" # get base name b=$(basename ${bam} "_all-tophat.bam") # sorted version required for region extraction java -Xmx${maxram} -jar $PICARD/picard.jar SortSam \ I=${bam} \ O=${sbam} \ SO=coordinate \ MAX_RECORDS_IN_RAM=2000000 # index the obtained data before extraction # same as: samtools index ${sbam} java -Xmx${maxram} -jar $PICARD/picard.jar BuildBamIndex \ I=${sbam} \ O=${sbam}.bai # extract chr22 mappings # only mapped and properly paired reads (tag=0x2) samtools view -b -f 0x2 ${sbam} chr22 \ > ${infolder}/chr22_${b}_all.bam # re-sort by read names to group reads in pairs java -Xmx${maxram} -jar $PICARD/picard.jar SortSam \ I=${infolder}/chr22_${b}_all.bam \ O=${infolder}/chr22_${b}_all-n.bam \ SO=queryname \ MAX_RECORDS_IN_RAM=2000000 # convert back to fastq using picard # be LENIENT for unpaired reads java -Xmx${maxram} -jar $PICARD/picard.jar SamToFastq \ I=${infolder}/chr22_${b}_all-n.bam \ F=>(bgzip -c > ${outfolder}/chr22_${b}_1.fastq.gz) \ F2=>(bgzip -c > ${outfolder}/chr22_${b}_2.fastq.gz) \ RE_REVERSE=TRUE \ VALIDATION_STRINGENCY=LENIENT
This method was based on a number of pages including an excellent Broad-GATK overview [1] and a number of notes posted in SeqAnswers [2]
download exercise files
Download exercise files here.
References:
- ↑ http://gatkforums.broadinstitute.org/discussion/2908/howto-revert-a-bam-file-to-fastq-format
- ↑ http://seqanswers.com/forums/showthread.php?t=16433
[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.1 | NGS_RNASeq_DE_Exercise.2 ]