NGS-Var Exercise.3

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.2 | NGS-Var Exercise.4 ]


BWA mapping QC using Samtools, Picard, and Qualimap


ex03_wf.png

Introduction

Each mapping is unique and can be biased by many unknown factors. Performing QC is primordial in each experiment to avoid carryover of errors to the downstream analyses.

summarise mapping results and perform mapping QC using Picard tools

As shown before, the samtools flagstat command provides very basic metrics. For more complex issues, other types of control are necessary and can be performed using specific picard tools.

apply Picard CollectGcBiasMetrics

Monotonous GC/AT rich regions lead to ambiguous mapping and eventually to no-mapping at all if they exceed the mapper capabilities. This is a well known problem in short read sequencing and although paired end reads limit this bias, it is, together together with repeats still responsible for the absence of coverage in highly repetitive regions, telomers, and centromers. The following QC run will show the limits of mappability associated with GC content for a given BAM file and possibly identify a technical issue.

CollectGcBiasMetrics is a High level metrics that capture how biased the coverage in a certain BAM file is.

#! /usr/bin/env bash
## script: 'picard_CollectGcBiasMetrics.sh'
## ©SP-BITS, 2013 v1.0
# last edit: 2016-02-19
# Picard Version: 2.1.0
 
infolder=hg19_bwa-mapping
outfolder=bwa-mappingQC
mkdir -p ${outfolder}
 
ref=ref/HiSeq_UCSC_hg19.fa
 
# 10% sample
infile=shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam
 
# full data
# infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam
 
echo "# processing ${infile}"
 
# send errors log file with '2>'
java -jar $PICARD/picard.jar CollectGcBiasMetrics \
	VERBOSITY=ERROR \
	R=${ref} \
	I=${infolder}/${infile} \
	O=${outfolder}/${infile%%.bam}_GCbias.txt \
	CHART=${outfolder}/${infile%%.bam}_GCbias.pdf \
	S=${outfolder}/${infile%%.bam}_GCbias_summary.txt \
	2>${outfolder}/${infile%%.bam}_CollectGcBiasMetrics.err
echo

The CollectGcBiasMetrics program produces 2 text files and a corresponding graph in pdf format.

comparison for sample & all datasets
shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem_mdup_GCbias.png
shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup_GCbias.png

apply Picard CollectAlignmentSummaryMetrics

This collects High level metrics about the alignment of reads within a SAM file. The result is a table that can be transposed for better readability (below)

#! /usr/bin/env bash
## script: 'picard_CollectAlignmentSummaryMetrics.sh'
## ©SP-BITS, 2013 v1.0
# last edit: 2016-02-19
# Picard Version: 2.1.0
 
infolder=hg19_bwa-mapping
outfolder=bwa-mappingQC
mkdir -p ${outfolder}
 
ref=ref/HiSeq_UCSC_hg19.fa
 
# 10% sample
infile=shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam
 
# full data
# infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam
 
echo "# processing ${infile}"
 
# collect information
(java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
        R=${ref} \
        I=${infolder}/${infile} \
        O=${outfolder}/${infile%%.bam}_AlignmentSummaryMetrics.txt \
        2>${outfolder}/${infile%%.bam}_CollectAlignmentSummaryMetrics.err)
 
# ignore header and empty lines and rotate the remaining content
grep -v -E '(^#|^$)' \
        ${outfolder}/${infile%%.bam}_AlignmentSummaryMetrics.txt | \
        transpose -t | \
        column -t > \
        ${outfolder}/tr_${infile%%.bam}_AlignmentSummaryMetrics.txt
echo

what did we learn here?

Cli tools.png 'grep -E' (--extended-regexp) allows complex querries

  • eg: grep -v -E '(^#|^$)' will ignore '-v' lines that start with either '#' OR are empty '^$' lines
  • '^$' means here starting '^' and ending '$' with nothing in between

more regular expression symbols

Cli tools.png this list is not exhaustive but already good to know
^ (Caret)        =    match expression at the start of a line, as in ^A.
$ (Question)     =    match expression at the end of a line, as in A$.
\ (Back Slash)   =    turn off the special meaning of the next character, as in \^.
[ ] (Brackets)   =    match any one of the enclosed characters, as in [aeiou].
                      Use Hyphen "-" for a range, as in [0-9].
[^ ]             =    match any one character except those enclosed in [ ], as in [^0-9].
. (Period)       =    match a single character of any value, except end of line.
* (Asterisk)     =    match zero or more of the preceding character or expression.
\{x,y\}          =    match x to y occurrences of the preceding.
\{x\}            =    match exactly x occurrences of the preceding.
\{x,\}           =    match x or more occurrences of the preceding.

As often, text results are obfuscating and more clearly shown using the bash utility command column -t after transposition

Counts for sample and all bwa mem analyses

# results from the 10% data
CATEGORY                    FIRST_OF_PAIR  SECOND_OF_PAIR  PAIR
TOTAL_READS                 749121         749121          1498242
PF_READS                    749121         749121          1498242
PCT_PF_READS                1              1               1
PF_NOISE_READS              0              0               0
PF_READS_ALIGNED            748911         747489          1496400
PCT_PF_READS_ALIGNED        0.99972        0.997821        0.998771
PF_ALIGNED_BASES            74540196       73950911        148491107
PF_HQ_ALIGNED_READS         741931         740277          1482208
PF_HQ_ALIGNED_BASES         73924309       73332239        147256548
PF_HQ_ALIGNED_Q20_BASES     72705827       71322477        144028304
PF_HQ_MEDIAN_MISMATCHES     0              0               0
PF_MISMATCH_RATE            0.003886       0.005246        0.004563
PF_HQ_ERROR_RATE            0.003576       0.004933        0.004251
PF_INDEL_RATE               0.000281       0.000273        0.000277
MEAN_READ_LENGTH            100            100             100
READS_ALIGNED_IN_PAIRS      747339         747339          1494678
PCT_READS_ALIGNED_IN_PAIRS  0.997901       0.999799        0.998849
BAD_CYCLES                  0              0               0
STRAND_BALANCE              0.499731       0.500374        0.500052
PCT_CHIMERAS                0.008306       0.008382        0.008344
PCT_ADAPTER                 0              0               0
SAMPLE
LIBRARY
READ_GROUP
 
# and the corresponding results from the full data
CATEGORY                    FIRST_OF_PAIR  SECOND_OF_PAIR  PAIR
TOTAL_READS                 7495097        7495097         14990194
PF_READS                    7495097        7495097         14990194
PCT_PF_READS                1              1               1
PF_NOISE_READS              0              0               0
PF_READS_ALIGNED            7493007        7478807         14971814
PCT_PF_READS_ALIGNED        0.999721       0.997827        0.998774
PF_ALIGNED_BASES            745813512      739887065       1485700577
PF_HQ_ALIGNED_READS         7423772        7406991         14830763
PF_HQ_ALIGNED_BASES         739710392      733740578       1473450970
PF_HQ_ALIGNED_Q20_BASES     727531910      713698106       1441230016
PF_HQ_MEDIAN_MISMATCHES     0              0               0
PF_MISMATCH_RATE            0.003882       0.005212        0.004544
PF_HQ_ERROR_RATE            0.003577       0.004906        0.004239
PF_INDEL_RATE               0.00028        0.000275        0.000278
MEAN_READ_LENGTH            100            100             100
READS_ALIGNED_IN_PAIRS      7477339        7477339         14954678
PCT_READS_ALIGNED_IN_PAIRS  0.997909       0.999804        0.998855
BAD_CYCLES                  0              0               0
STRAND_BALANCE              0.500233       0.499847        0.50004
PCT_CHIMERAS                0.008204       0.008291        0.008248
PCT_ADAPTER                 0              0               0
SAMPLE
LIBRARY
READ_GROUP

apply Picard CollectMultipleMetrics

Alternatively, one can run 4 of the 5 Picard metric QC commands in one go by using the global Picard command CollectMultipleMetrics that will generate PDF plots using R (CollectGcBiasMetrics.jar should be run separately and with a coordinate sorted BAM as input).

#! /usr/bin/env bash
# script: 'picard_CollectMultipleMetrics.sh'
## ©SP-BITS, 2013 v1.0
# last edit: 2016-02-19
# Picard Version: 2.1.0
 
infolder=hg19_bwa-mapping
outfolder=bwa-mappingQC
mkdir -p ${outfolder}
 
ref=ref/HiSeq_UCSC_hg19.fa
 
# 10% sample
infile=shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam
 
# full data
# infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup.bam
 
 
# OPT: PROGRAM={CollectAlignmentSummaryMetrics, CollectInsertSizeMetrics, 
#                       QualityScoreDistribution, MeanQualityByCycle} 
 
 
echo "# processing ${infile}" 
# run in parallel if the computer has enough capacity
( java -Xmx4g -jar $PICARD/picard.jar CollectMultipleMetrics \
	R=${ref} \
	I=${infolder}/${infile} \
	O=${outfolder}/${infile%%.bam} \
	PROGRAM=CollectAlignmentSummaryMetrics \
	PROGRAM=CollectInsertSizeMetrics \
	PROGRAM=QualityScoreDistribution \
	PROGRAM=MeanQualityByCycle \
	2>${outfolder}/${infile%%.bam}_CollectMultipleMetrics.err )
 
echo

Handicon.png Picard results are stored in separate PDF files (tip: use 'evince filename.pdf' in terminal to open PDF files)

how can I convert all those PDF files to lightweight PNG?

By using the great ImageMagick[1] convert command in a simple little loop
for p in *.pdf; do 
     convert -density 300 -depth 8 -quality 100 -resize 25% $p ${p%%.pdf}.png; 
done

Even better, create a custom function to ease this and add it to your .myfunctions home file (done!)

pdf2png ()
{
# convert PDF to PNG (all pages with suffix) using ImageMagick.
# take PDF file from 1
# use PDF file name for PNG with extension swap
# or take PNG file name from $2 
 
if [ $# -ge 1 ]
then
    convert -density 300 -depth 8 -quality 100 -resize 25% ${1} ${2:-${1%%.pdf}.png};    
else
    echo "usage: pdf2png <input.pdf> <opt:output.png>"
fi
}
plots for 10pc samples and full data
shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem_mdup.insert_size_histogram.png
shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup.insert_size_histogram.png
shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem_mdup.quality_by_cycle.png
shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup.quality_by_cycle.png
shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem_mdup.quality_distribution.png
shuffled_PE_NA18507_GAIIx_100_chr21_mem_mdup.quality_distribution.png

As seen above, both sample and full data give very similar results for all metrics.

Perform integrated BAM QC with Qualimap

start_analysis.png

Like what FastQC does for FastQ data, Qualimap can integrate several quality control steps of your BAM data in one go. Qualimap[2] is a platform-independent application written in Java and R that provides both a Graphical User Inteface (GUI, online help page) for the shy and hurried among you and a command-line interface (online help page) to facilitate the quality control of alignment sequencing data (PDF manual link). We next will show the use of the CLI version for basic BAM QC but as seen in the hidden documentation below, Qualimap proposes many more very attractive tools that are left for you to explore.

qualimap command line help

#################################
## To show available tools use command:
 
qualimap --help
 
QualiMap v.2.2
Built on 2016-01-29 12:10
 
usage: qualimap <tool> [options]
 
To launch GUI leave <tool> empty.
 
Available tools:
 
    bamqc            Evaluate NGS mapping to a reference genome
    rnaseq           Evaluate RNA-seq alignment data
    counts           Counts data analysis (further RNA-seq data evaluation)
    multi-bamqc      Compare QC reports from multiple NGS mappings
    clustering       Cluster epigenomic signals
    comp-counts      Compute feature counts
 
Special arguments: 
 
    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
 
### BAM QC #######################################
## The following command allows to perform BAM QC analysis:
 
QualiMap v.2.2
Built on 2016-01-29 12:10
 
Selected tool: bamqc
 
ERROR: Unrecognized option: --help
 
usage: qualimap bamqc -bam <arg> [-c] [-gd <arg>] [-gff <arg>] [-hm <arg>] [-ip]
       [-nr <arg>] [-nt <arg>] [-nw <arg>] [-oc <arg>] [-os] [-outdir <arg>]
       [-outfile <arg>] [-outformat <arg>] [-p <arg>] [-sd] [-sdmode <arg>]
 -bam <arg>                           Input mapping file in BAM format
 -c,--paint-chromosome-limits         Paint chromosome limits inside charts
 -gd,--genome-gc-distr <arg>          Species to compare with genome GC
                                      distribution. Possible values: HUMAN -
                                      hg19; MOUSE - mm9(default), mm10
 -gff,--feature-file <arg>            Feature file with regions of interest in
                                      GFF/GTF or BED format
 -hm <arg>                            Minimum size for a homopolymer to be
                                      considered in indel analysis (default is
                                      3)
 -ip,--collect-overlap-pairs          Activate this option to collect statistics
                                      of overlapping paired-end reads
 -nr <arg>                            Number of reads analyzed in a chunk
                                      (default is 1000)
 -nt <arg>                            Number of threads (default is 24)
 -nw <arg>                            Number of windows (default is 400)
 -oc,--output-genome-coverage <arg>   File to save per base non-zero coverage.
                                      Warning: large files are expected for
                                      large genomes
 -os,--outside-stats                  Report information for the regions outside
                                      those defined by feature-file  (ignored
                                      when -gff option is not set)
 -outdir <arg>                        Output folder for HTML report and raw
                                      data.
 -outfile <arg>                       Output file for PDF report (default value
                                      is report.pdf).
 -outformat <arg>                     Format of the ouput report (PDF or HTML,
                                      default is HTML).
 -p,--sequencing-protocol <arg>       Sequencing library protocol:
                                      strand-specific-forward,
                                      strand-specific-reverse or
                                      non-strand-specific (default)
 -sd,--skip-duplicated                Activate this option to skip duplicated
                                      alignments from the analysis. If the
                                      duplicates are not flagged in the BAM
                                      file, then they will be detected by
                                      Qualimap and can be selected for skipping.
 -sdmode,--skip-dup-mode <arg>        Specific type of duplicated alignments to
                                      skip (if this option is activated).
                                      0 : only flagged duplicates (default)
                                      1 : only estimated by Qualimap
                                      2 : both flagged and estimated
 
 
 
Special arguments: 
 
    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
 
###  Counts QC  ##########################################################
## To perform counts QC analysis (evaluation of RNA-seq data) use the following command:
 
QualiMap v.2.2
Built on 2016-01-29 12:10
 
Selected tool: counts
 
ERROR: Unrecognized option: --help
 
usage: qualimap counts [-c] -d <arg> [-i <arg>] [-k <arg>] [-outdir <arg>]
       [-outfile <arg>] [-outformat <arg>] [-R <arg>] [-s <arg>]
 -c,--compare             Perform comparison of conditions. Currently 2 maximum
                          is possible.
 -d,--data <arg>          File describing the input data. Format of the file is
                          a 4-column tab-delimited table.
                          Column 1: sample name
                          Column 2: condition of the sample
                          Column 3: path to the counts data for the sample
                          Column 4: index of the column with counts
 -i,--info <arg>          Path to info file containing genes GC-content, length
                          and type.
 -k,--threshold <arg>     Threshold for the number of counts
 -outdir <arg>            Output folder for HTML report and raw data.
 -outfile <arg>           Output file for PDF report (default value is
                          report.pdf).
 -outformat <arg>         Format of the ouput report (PDF or HTML, default is
                          HTML).
 -R,--rscriptpath <arg>   Path to Rscript executable (by default it is assumed
                          to be available from system $PATH)
 -s,--species <arg>       Use built-in info file for the given species: HUMAN or
                          MOUSE.
 
 
 
Special arguments: 
 
    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
 
### Clustering ##############################################
## To perform clustering of epigenomic signals use the following command:
 
QualiMap v.2.2
Built on 2016-01-29 12:10
 
Selected tool: clustering
 
ERROR: Unrecognized option: --help
 
usage: qualimap clustering [-b <arg>] [-c <arg>] -control <arg> [-expr <arg>]
       [-f <arg>] [-l <arg>] [-name <arg>] [-outdir <arg>] [-outfile <arg>]
       [-outformat <arg>] [-R <arg>] [-r <arg>] -regions <arg> -sample <arg>
       [-viz <arg>]
 -b,--bin-size <arg>          Size of the bin (default is 100)
 -c,--clusters <arg>          Comma-separated list of cluster sizes
 -control <arg>               Comma-separated list of control BAM files
 -expr <arg>                  Name of the experiment
 -f,--fragment-length <arg>   Smoothing length of a fragment
 -l <arg>                     Upstream offset (default is 2000)
 -name <arg>                  Comma-separated names of the replicates
 -outdir <arg>                Output folder for HTML report and raw data.
 -outfile <arg>               Output file for PDF report (default value is
                              report.pdf).
 -outformat <arg>             Format of the ouput report (PDF or HTML, default
                              is HTML).
 -R,--rscriptpath <arg>       Path to Rscript executable (by default it is
                              assumed to be available from system $PATH)
 -r <arg>                     Downstream offset (default is 500)
 -regions <arg>               Path to regions file
 -sample <arg>                Comma-separated list of sample BAM files
 -viz <arg>                   Visualization type: heatmap or line
 
 
 
Special arguments: 
 
    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
 
### Compute counts ######################################
## To compute counts from mapping data use the following command:
 
QualiMap v.2.2
Built on 2016-01-29 12:10
 
Selected tool: comp-counts
 
ERROR: Unrecognized option: --help
 
usage: qualimap comp-counts [-a <arg>] -bam <arg> -gtf <arg> [-id <arg>] [-out
       <arg>] [-p <arg>] [-pe] [-s] [-type <arg>]
 -a,--algorithm <arg>             Counting algorithm:
                                  uniquely-mapped-reads(default) or proportional
 -bam <arg>                       Mapping file in BAM format
 -gtf <arg>                       Region file in GTF, GFF or BED format. If GTF
                                  format is provided, counting is based on
                                  attributes, otherwise based on feature name
 -id <arg>                        GTF-specific. Attribute of the GTF to be used
                                  as feature ID. Regions with the same ID will
                                  be aggregated as part of the same feature.
                                  Default: gene_id.
 -out <arg>                       Output file of coverage report.
 -p,--sequencing-protocol <arg>   Sequencing library protocol:
                                  strand-specific-forward,
                                  strand-specific-reverse or non-strand-specific
                                  (default)
 -pe,--paired                     Setting this flag for paired-end experiments
                                  will result in counting fragments instead of
                                  reads
 -s,--sorted                      This flag indicates that the input file is
                                  already sorted by name. If not set, additional
                                  sorting by name will be performed. Only
                                  required for paired-end analysis.
 -type <arg>                      GTF-specific. Value of the third column of the
                                  GTF considered for counting. Other types will
                                  be ignored. Default: exon
 
 
 
Special arguments: 
 
    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
 
## also see rnaseq and multi-bamqc

The Qualimap GUI is rather simple to use and returns integrated results similarly to FastQC.


output.png

A QC on the sample BAM files can be performed in the GUI or with a simple command


Current human GRCh37.75 GTF annotations were downloaded from ensEMBL and converted to hg19 notation

# run the following code from the base folder (~/bits/NGS_DNASeq-training)
# get the data from the ensEMBL FTP and save it in the ref folder
wget -P ref ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
 
# add chr and change MT to chrM and recompress
zcat ref/Homo_sapiens.GRCh37.75.gtf.gz | awk '{print "chr"$0}'  | sed 's/chrMT/chrM/g' | bgzip -c > ref/Homo_sapiens.hg19.gtf.gz
 
# subset chr21 for our purpose and keep it in uncompressed
zgrep "^chr21\b" ref/Homo_sapiens.hg19.gtf.gz > ref/Homo_sapiens.hg19_chr21.gtf

We now combine the chr21 GTF subset and both '_mdup.bam' files with Qualimap bamqc.

#! /usr/bin/env bash
## script: 'qualimap_bamqc.sh'
## ©SP-BITS, 2014 v1.0
# last edit: 2014-04-09
 
# required: 
# qualimap 0.8
# R dependencies
 
#full chromosome sample
#infolder=Final_Results/hg19_bwa-mapping/full_chromosome
#outfolder=bwa-mapping-qualimapQC/full chromosome
 
#10 pct sample
# process all bam files in 'hg19_bwa-mapping'
infolder=Final_Results/hg19_bwa-mapping
outfolder=bwa-mapping-qualimapQC
mkdir -p ${outfolder}
 
mygtf=ref/Homo_sapiens.hg19_chr21.gtf
 
for b in ${infolder}/*_mdup.bam; do
outname=$(basename ${b} .bam)
 
echo "# processing ${b}"
 
# read the full help to find more about other parameters
# we output here to HTML but PDF is also available for portability and compactness
# we restrict the analysis to chr21 using annotation data from the Qualimap site
## Human genome annotations from Ensembl database (v.64) 
qualimap bamqc --java-mem-size=4G \
	-bam ${b} \
	-gd HUMAN \
	-c \
	-gff ${mygtf} \
	-outdir ${outfolder}/${outname}_qualimap-results \
	-outformat HTML
 
# for PDF, replace last line by
#	-outformat PDF
#	-outfile ${outname}_chr21_report.pdf \
 
done

download exercise files

Download exercise files here

Use the right application to open the files present in ex3-files

References:
  1. http://www.imagemagick.org/script/index.php
  2. http://qualimap.bioinfo.cipf.es

[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.2 | NGS-Var Exercise.4 ]