NGS-Var Exercise.3
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.2 | NGS-Var Exercise.4 ]
BWA mapping QC using Samtools, Picard, and Qualimap
Contents
Introduction
Each mapping is unique and can be biased by many unknown factors. Performing QC is primordial in each experiment to avoid carryover of errors to the downstream analyses.
summarise mapping results and perform mapping QC using Picard tools
As shown before, the samtools flagstat command provides very basic metrics. For more complex issues, other types of control are necessary and can be performed using specific picard tools.
apply Picard CollectGcBiasMetrics
Monotonous GC/AT rich regions lead to ambiguous mapping and eventually to no-mapping at all if they exceed the mapper capabilities. This is a well known problem in short read sequencing and although paired end reads limit this bias, it is, together together with repeats still responsible for the absence of coverage in highly repetitive regions, telomers, and centromers. The following QC run will show the limits of mappability associated with GC content for a given BAM file and possibly identify a technical issue.
CollectGcBiasMetrics is a High level metrics that capture how biased the coverage in a certain BAM file is.
#! /usr/bin/env bash ## script: 'picard_CollectGcBiasMetrics.sh' ## ©SP-BITS, 2013 v1.0 # last edit: 2016-02-19 # Picard Version: 2.1.0 infolder=hg19_bwa-mapping outfolder=bwa-mappingQC mkdir -p ${outfolder} ref=ref/HiSeq_UCSC_hg19.fa # 10% sample infile=shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam # full data # infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam echo "# processing ${infile}" # send errors log file with '2>' java -jar $PICARD/picard.jar CollectGcBiasMetrics \ VERBOSITY=ERROR \ R=${ref} \ I=${infolder}/${infile} \ O=${outfolder}/${infile%%.bam}_GCbias.txt \ CHART=${outfolder}/${infile%%.bam}_GCbias.pdf \ S=${outfolder}/${infile%%.bam}_GCbias_summary.txt \ 2>${outfolder}/${infile%%.bam}_CollectGcBiasMetrics.err echo
The CollectGcBiasMetrics program produces 2 text files and a corresponding graph in pdf format.
comparison for sample & all datasets | |
---|---|
apply Picard CollectAlignmentSummaryMetrics
This collects High level metrics about the alignment of reads within a SAM file. The result is a table that can be transposed for better readability (below)
#! /usr/bin/env bash ## script: 'picard_CollectAlignmentSummaryMetrics.sh' ## ©SP-BITS, 2013 v1.0 # last edit: 2016-02-19 # Picard Version: 2.1.0 infolder=hg19_bwa-mapping outfolder=bwa-mappingQC mkdir -p ${outfolder} ref=ref/HiSeq_UCSC_hg19.fa # 10% sample infile=shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam # full data # infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam echo "# processing ${infile}" # collect information (java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \ R=${ref} \ I=${infolder}/${infile} \ O=${outfolder}/${infile%%.bam}_AlignmentSummaryMetrics.txt \ 2>${outfolder}/${infile%%.bam}_CollectAlignmentSummaryMetrics.err) # ignore header and empty lines and rotate the remaining content grep -v -E '(^#|^$)' \ ${outfolder}/${infile%%.bam}_AlignmentSummaryMetrics.txt | \ transpose -t | \ column -t > \ ${outfolder}/tr_${infile%%.bam}_AlignmentSummaryMetrics.txt echo
what did we learn here?
more regular expression symbols
^ (Caret) = match expression at the start of a line, as in ^A. $ (Question) = match expression at the end of a line, as in A$. \ (Back Slash) = turn off the special meaning of the next character, as in \^. [ ] (Brackets) = match any one of the enclosed characters, as in [aeiou]. Use Hyphen "-" for a range, as in [0-9]. [^ ] = match any one character except those enclosed in [ ], as in [^0-9]. . (Period) = match a single character of any value, except end of line. * (Asterisk) = match zero or more of the preceding character or expression. \{x,y\} = match x to y occurrences of the preceding. \{x\} = match exactly x occurrences of the preceding. \{x,\} = match x or more occurrences of the preceding.
As often, text results are obfuscating and more clearly shown using the bash utility command column -t after transposition
Counts for sample and all bwa mem analyses
# results from the 10% data CATEGORY FIRST_OF_PAIR SECOND_OF_PAIR PAIR TOTAL_READS 749121 749121 1498242 PF_READS 749121 749121 1498242 PCT_PF_READS 1 1 1 PF_NOISE_READS 0 0 0 PF_READS_ALIGNED 748911 747489 1496400 PCT_PF_READS_ALIGNED 0.99972 0.997821 0.998771 PF_ALIGNED_BASES 74540196 73950911 148491107 PF_HQ_ALIGNED_READS 741931 740277 1482208 PF_HQ_ALIGNED_BASES 73924309 73332239 147256548 PF_HQ_ALIGNED_Q20_BASES 72705827 71322477 144028304 PF_HQ_MEDIAN_MISMATCHES 0 0 0 PF_MISMATCH_RATE 0.003886 0.005246 0.004563 PF_HQ_ERROR_RATE 0.003576 0.004933 0.004251 PF_INDEL_RATE 0.000281 0.000273 0.000277 MEAN_READ_LENGTH 100 100 100 READS_ALIGNED_IN_PAIRS 747339 747339 1494678 PCT_READS_ALIGNED_IN_PAIRS 0.997901 0.999799 0.998849 BAD_CYCLES 0 0 0 STRAND_BALANCE 0.499731 0.500374 0.500052 PCT_CHIMERAS 0.008306 0.008382 0.008344 PCT_ADAPTER 0 0 0 SAMPLE LIBRARY READ_GROUP # and the corresponding results from the full data CATEGORY FIRST_OF_PAIR SECOND_OF_PAIR PAIR TOTAL_READS 7495097 7495097 14990194 PF_READS 7495097 7495097 14990194 PCT_PF_READS 1 1 1 PF_NOISE_READS 0 0 0 PF_READS_ALIGNED 7493007 7478807 14971814 PCT_PF_READS_ALIGNED 0.999721 0.997827 0.998774 PF_ALIGNED_BASES 745813512 739887065 1485700577 PF_HQ_ALIGNED_READS 7423772 7406991 14830763 PF_HQ_ALIGNED_BASES 739710392 733740578 1473450970 PF_HQ_ALIGNED_Q20_BASES 727531910 713698106 1441230016 PF_HQ_MEDIAN_MISMATCHES 0 0 0 PF_MISMATCH_RATE 0.003882 0.005212 0.004544 PF_HQ_ERROR_RATE 0.003577 0.004906 0.004239 PF_INDEL_RATE 0.00028 0.000275 0.000278 MEAN_READ_LENGTH 100 100 100 READS_ALIGNED_IN_PAIRS 7477339 7477339 14954678 PCT_READS_ALIGNED_IN_PAIRS 0.997909 0.999804 0.998855 BAD_CYCLES 0 0 0 STRAND_BALANCE 0.500233 0.499847 0.50004 PCT_CHIMERAS 0.008204 0.008291 0.008248 PCT_ADAPTER 0 0 0 SAMPLE LIBRARY READ_GROUP
apply Picard CollectMultipleMetrics
Alternatively, one can run 4 of the 5 Picard metric QC commands in one go by using the global Picard command CollectMultipleMetrics that will generate PDF plots using R (CollectGcBiasMetrics.jar should be run separately and with a coordinate sorted BAM as input).
#! /usr/bin/env bash # script: 'picard_CollectMultipleMetrics.sh' ## ©SP-BITS, 2013 v1.0 # last edit: 2016-02-19 # Picard Version: 2.1.0 infolder=hg19_bwa-mapping outfolder=bwa-mappingQC mkdir -p ${outfolder} ref=ref/HiSeq_UCSC_hg19.fa # 10% sample infile=shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam # full data # infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam # OPT: PROGRAM={CollectAlignmentSummaryMetrics, CollectInsertSizeMetrics, # QualityScoreDistribution, MeanQualityByCycle} echo "# processing ${infile}" # run in parallel if the computer has enough capacity ( java -Xmx4g -jar $PICARD/picard.jar CollectMultipleMetrics \ R=${ref} \ I=${infolder}/${infile} \ O=${outfolder}/${infile%%.bam} \ PROGRAM=CollectAlignmentSummaryMetrics \ PROGRAM=CollectInsertSizeMetrics \ PROGRAM=QualityScoreDistribution \ PROGRAM=MeanQualityByCycle \ 2>${outfolder}/${infile%%.bam}_CollectMultipleMetrics.err ) echo
Picard results are stored in separate PDF files (tip: use 'evince filename.pdf' in terminal to open PDF files)
how can I convert all those PDF files to lightweight PNG?
for p in *.pdf; do convert -density 300 -depth 8 -quality 100 -resize 25% $p ${p%%.pdf}.png; done
Even better, create a custom function to ease this and add it to your .myfunctions home file (done!)
pdf2png () { # convert PDF to PNG (all pages with suffix) using ImageMagick. # take PDF file from 1 # use PDF file name for PNG with extension swap # or take PNG file name from $2 if [ $# -ge 1 ] then convert -density 300 -depth 8 -quality 100 -resize 25% ${1} ${2:-${1%%.pdf}.png}; else echo "usage: pdf2png <input.pdf> <opt:output.png>" fi }
plots for 10pc samples and full data | |
---|---|
As seen above, both sample and full data give very similar results for all metrics.
Perform integrated BAM QC with Qualimap
Like what FastQC does for FastQ data, Qualimap can integrate several quality control steps of your BAM data in one go. Qualimap[2] is a platform-independent application written in Java and R that provides both a Graphical User Inteface (GUI, online help page) for the shy and hurried among you and a command-line interface (online help page) to facilitate the quality control of alignment sequencing data (PDF manual link). We next will show the use of the CLI version for basic BAM QC but as seen in the hidden documentation below, Qualimap proposes many more very attractive tools that are left for you to explore.
qualimap command line help
################################# ## To show available tools use command: qualimap --help QualiMap v.2.2 Built on 2016-01-29 12:10 usage: qualimap <tool> [options] To launch GUI leave <tool> empty. Available tools: bamqc Evaluate NGS mapping to a reference genome rnaseq Evaluate RNA-seq alignment data counts Counts data analysis (further RNA-seq data evaluation) multi-bamqc Compare QC reports from multiple NGS mappings clustering Cluster epigenomic signals comp-counts Compute feature counts Special arguments: --java-mem-size Use this argument to set Java memory heap size. Example: qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G ### BAM QC ####################################### ## The following command allows to perform BAM QC analysis: QualiMap v.2.2 Built on 2016-01-29 12:10 Selected tool: bamqc ERROR: Unrecognized option: --help usage: qualimap bamqc -bam <arg> [-c] [-gd <arg>] [-gff <arg>] [-hm <arg>] [-ip] [-nr <arg>] [-nt <arg>] [-nw <arg>] [-oc <arg>] [-os] [-outdir <arg>] [-outfile <arg>] [-outformat <arg>] [-p <arg>] [-sd] [-sdmode <arg>] -bam <arg> Input mapping file in BAM format -c,--paint-chromosome-limits Paint chromosome limits inside charts -gd,--genome-gc-distr <arg> Species to compare with genome GC distribution. Possible values: HUMAN - hg19; MOUSE - mm9(default), mm10 -gff,--feature-file <arg> Feature file with regions of interest in GFF/GTF or BED format -hm <arg> Minimum size for a homopolymer to be considered in indel analysis (default is 3) -ip,--collect-overlap-pairs Activate this option to collect statistics of overlapping paired-end reads -nr <arg> Number of reads analyzed in a chunk (default is 1000) -nt <arg> Number of threads (default is 24) -nw <arg> Number of windows (default is 400) -oc,--output-genome-coverage <arg> File to save per base non-zero coverage. Warning: large files are expected for large genomes -os,--outside-stats Report information for the regions outside those defined by feature-file (ignored when -gff option is not set) -outdir <arg> Output folder for HTML report and raw data. -outfile <arg> Output file for PDF report (default value is report.pdf). -outformat <arg> Format of the ouput report (PDF or HTML, default is HTML). -p,--sequencing-protocol <arg> Sequencing library protocol: strand-specific-forward, strand-specific-reverse or non-strand-specific (default) -sd,--skip-duplicated Activate this option to skip duplicated alignments from the analysis. If the duplicates are not flagged in the BAM file, then they will be detected by Qualimap and can be selected for skipping. -sdmode,--skip-dup-mode <arg> Specific type of duplicated alignments to skip (if this option is activated). 0 : only flagged duplicates (default) 1 : only estimated by Qualimap 2 : both flagged and estimated Special arguments: --java-mem-size Use this argument to set Java memory heap size. Example: qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G ### Counts QC ########################################################## ## To perform counts QC analysis (evaluation of RNA-seq data) use the following command: QualiMap v.2.2 Built on 2016-01-29 12:10 Selected tool: counts ERROR: Unrecognized option: --help usage: qualimap counts [-c] -d <arg> [-i <arg>] [-k <arg>] [-outdir <arg>] [-outfile <arg>] [-outformat <arg>] [-R <arg>] [-s <arg>] -c,--compare Perform comparison of conditions. Currently 2 maximum is possible. -d,--data <arg> File describing the input data. Format of the file is a 4-column tab-delimited table. Column 1: sample name Column 2: condition of the sample Column 3: path to the counts data for the sample Column 4: index of the column with counts -i,--info <arg> Path to info file containing genes GC-content, length and type. -k,--threshold <arg> Threshold for the number of counts -outdir <arg> Output folder for HTML report and raw data. -outfile <arg> Output file for PDF report (default value is report.pdf). -outformat <arg> Format of the ouput report (PDF or HTML, default is HTML). -R,--rscriptpath <arg> Path to Rscript executable (by default it is assumed to be available from system $PATH) -s,--species <arg> Use built-in info file for the given species: HUMAN or MOUSE. Special arguments: --java-mem-size Use this argument to set Java memory heap size. Example: qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G ### Clustering ############################################## ## To perform clustering of epigenomic signals use the following command: QualiMap v.2.2 Built on 2016-01-29 12:10 Selected tool: clustering ERROR: Unrecognized option: --help usage: qualimap clustering [-b <arg>] [-c <arg>] -control <arg> [-expr <arg>] [-f <arg>] [-l <arg>] [-name <arg>] [-outdir <arg>] [-outfile <arg>] [-outformat <arg>] [-R <arg>] [-r <arg>] -regions <arg> -sample <arg> [-viz <arg>] -b,--bin-size <arg> Size of the bin (default is 100) -c,--clusters <arg> Comma-separated list of cluster sizes -control <arg> Comma-separated list of control BAM files -expr <arg> Name of the experiment -f,--fragment-length <arg> Smoothing length of a fragment -l <arg> Upstream offset (default is 2000) -name <arg> Comma-separated names of the replicates -outdir <arg> Output folder for HTML report and raw data. -outfile <arg> Output file for PDF report (default value is report.pdf). -outformat <arg> Format of the ouput report (PDF or HTML, default is HTML). -R,--rscriptpath <arg> Path to Rscript executable (by default it is assumed to be available from system $PATH) -r <arg> Downstream offset (default is 500) -regions <arg> Path to regions file -sample <arg> Comma-separated list of sample BAM files -viz <arg> Visualization type: heatmap or line Special arguments: --java-mem-size Use this argument to set Java memory heap size. Example: qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G ### Compute counts ###################################### ## To compute counts from mapping data use the following command: QualiMap v.2.2 Built on 2016-01-29 12:10 Selected tool: comp-counts ERROR: Unrecognized option: --help usage: qualimap comp-counts [-a <arg>] -bam <arg> -gtf <arg> [-id <arg>] [-out <arg>] [-p <arg>] [-pe] [-s] [-type <arg>] -a,--algorithm <arg> Counting algorithm: uniquely-mapped-reads(default) or proportional -bam <arg> Mapping file in BAM format -gtf <arg> Region file in GTF, GFF or BED format. If GTF format is provided, counting is based on attributes, otherwise based on feature name -id <arg> GTF-specific. Attribute of the GTF to be used as feature ID. Regions with the same ID will be aggregated as part of the same feature. Default: gene_id. -out <arg> Output file of coverage report. -p,--sequencing-protocol <arg> Sequencing library protocol: strand-specific-forward, strand-specific-reverse or non-strand-specific (default) -pe,--paired Setting this flag for paired-end experiments will result in counting fragments instead of reads -s,--sorted This flag indicates that the input file is already sorted by name. If not set, additional sorting by name will be performed. Only required for paired-end analysis. -type <arg> GTF-specific. Value of the third column of the GTF considered for counting. Other types will be ignored. Default: exon Special arguments: --java-mem-size Use this argument to set Java memory heap size. Example: qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G ## also see rnaseq and multi-bamqc
The Qualimap GUI is rather simple to use and returns integrated results similarly to FastQC.
A QC on the sample BAM files can be performed in the GUI or with a simple command
Current human GRCh37.75 GTF annotations were downloaded from ensEMBL and converted to hg19 notation
# run the following code from the base folder (~/bits/NGS_DNASeq-training) # get the data from the ensEMBL FTP and save it in the ref folder wget -P ref ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz # add chr and change MT to chrM and recompress zcat ref/Homo_sapiens.GRCh37.75.gtf.gz | awk '{print "chr"$0}' | sed 's/chrMT/chrM/g' | bgzip -c > ref/Homo_sapiens.hg19.gtf.gz # subset chr21 for our purpose and keep it in uncompressed zgrep "^chr21\b" ref/Homo_sapiens.hg19.gtf.gz > ref/Homo_sapiens.hg19_chr21.gtf
We now combine the chr21 GTF subset and both '_mdup.bam' files with Qualimap bamqc.
#! /usr/bin/env bash ## script: 'qualimap_bamqc.sh' ## ©SP-BITS, 2014 v1.0 # last edit: 2014-04-09 # required: # qualimap 0.8 # R dependencies #full chromosome sample #infolder=Final_Results/hg19_bwa-mapping/full_chromosome #outfolder=bwa-mapping-qualimapQC/full chromosome #10 pct sample # process all bam files in 'hg19_bwa-mapping' infolder=Final_Results/hg19_bwa-mapping outfolder=bwa-mapping-qualimapQC mkdir -p ${outfolder} mygtf=ref/Homo_sapiens.hg19_chr21.gtf for b in ${infolder}/*_mdup.bam; do outname=$(basename ${b} .bam) echo "# processing ${b}" # read the full help to find more about other parameters # we output here to HTML but PDF is also available for portability and compactness # we restrict the analysis to chr21 using annotation data from the Qualimap site ## Human genome annotations from Ensembl database (v.64) qualimap bamqc --java-mem-size=4G \ -bam ${b} \ -gd HUMAN \ -c \ -gff ${mygtf} \ -outdir ${outfolder}/${outname}_qualimap-results \ -outformat HTML # for PDF, replace last line by # -outformat PDF # -outfile ${outname}_chr21_report.pdf \ done
Results of Qualimap analysis of 'all' chr21 mappings
download exercise files
Download exercise files here
References:
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.2 | NGS-Var Exercise.4 ]