NGS Exercise.3c

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.2 | NGS Exercise.3 | NGS Exercise.3b | NGS Exercise.3d | NGS Exercise.3e ]
# updated 2014 version


Align paired end reads to the human reference genome hg19 using Bowtie2


ex03c_wf.png

AIM of this exercise

During the training session, the rightful question was asked of why BWA was chosen over the faster and more recent Bowtie2. After searching for elements of answer (see Which mapper should I use for DNA sequencing?), we decided to reproduce the BWA analysis on the full read set using Bowtie2 and compare the results of both experiments; This page presents the detailed results of such analysis using ~7.5million real read-pairs rather than synthetic read sets preferred by most bioinformaticians.


Technical.png We acknowledge the fact that our read-set (obtained from the Illumina site) were extracted from a BAM file obtained using CASAVA and therefore do not contain difficult to map or unmapped sequences. This strategy was chosen in order to speed up our experiment and start from a smaller subset of reads while keeping a rather high genome coverage of the human chr21 which is key to a good variant calling step


Any comment on our approach or discussion about its validity are MOST WELCOME to help us improve this document.

Method & Results

BWA is not the only 'mapper' able to quickly align short reads to a reference genome. The second most popular mapper is Bowtie2; we present here the results of full genom emapping of the chr21 reads to compare BWA and Bowtie2 results and verify that the two mappers obtain comparable results with the data used during this training. We proceed with full set mapping only in order to maximize the comparison with BWA data.

Technical.png A lot of the code presented in this exercise was adapted from previous 'BWA' pages. Please refer to the original pages for comments and explanatory text

Map all chr21 paired reads to the human genome using Bowtie2

list of Bowtie 2 version 2.2.2 commands

Bowtie 2 version 2.2.2 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
Usage:
  bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]

  <bt2-idx>  Index filename prefix (minus trailing .X.bt2).
             NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible.
  <m1>       Files with #1 mates, paired with files in <m2>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <m2>       Files with #2 mates, paired with files in <m1>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <r>        Files with unpaired reads.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <sam>      File for SAM output (default: stdout)

  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
  specified many times.  E.g. '-U file1.fq,file2.fq -U file3.fq'.

Options (defaults in parentheses):

 Input:
  -q                 query input files are FASTQ .fq/.fastq (default)
  --qseq             query input files are in Illumina's qseq format
  -f                 query input files are (multi-)FASTA .fa/.mfa
  -r                 query input files are raw one-sequence-per-line
  -c                 <m1>, <m2>, <r> are sequences themselves, not files
  -s/--skip <int>    skip the first <int> reads/pairs in the input (none)
  -u/--upto <int>    stop after first <int> reads/pairs (no limit)
  -5/--trim5 <int>   trim <int> bases from 5'/left end of reads (0)
  -3/--trim3 <int>   trim <int> bases from 3'/right end of reads (0)
  --phred33          qualities are Phred+33 (default)
  --phred64          qualities are Phred+64
  --int-quals        qualities encoded as space-delimited integers

 Presets:                 Same as:
  For --end-to-end:
   --very-fast            -D 5 -R 1 -N 0 -L 22 -i S,0,2.50
   --fast                 -D 10 -R 2 -N 0 -L 22 -i S,0,2.50
   --sensitive            -D 15 -R 2 -N 0 -L 22 -i S,1,1.15 (default)
   --very-sensitive       -D 20 -R 3 -N 0 -L 20 -i S,1,0.50

  For --local:
   --very-fast-local      -D 5 -R 1 -N 0 -L 25 -i S,1,2.00
   --fast-local           -D 10 -R 2 -N 0 -L 22 -i S,1,1.75
   --sensitive-local      -D 15 -R 2 -N 0 -L 20 -i S,1,0.75 (default)
   --very-sensitive-local -D 20 -R 3 -N 0 -L 20 -i S,1,0.50

 Alignment:
  -N <int>           max # mismatches in seed alignment; can be 0 or 1 (0)
  -L <int>           length of seed substrings; must be >3, <32 (22)
  -i <func>          interval between seed substrings w/r/t read len (S,1,1.15)
  --n-ceil <func>    func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)
  --dpad <int>       include <int> extra ref chars on sides of DP table (15)
  --gbar <int>       disallow gaps within <int> nucs of read extremes (4)
  --ignore-quals     treat all quality values as 30 on Phred scale (off)
  --nofw             do not align forward (original) version of read (off)
  --norc             do not align reverse-complement version of read (off)
  --no-1mm-upfront   do not allow 1 mismatch alignments before attempting to
                     scan for the optimal seeded alignments
  --end-to-end       entire read must align; no clipping (on)
   OR
  --local            local alignment; ends might be soft clipped (off)

 Scoring:
  --ma <int>         match bonus (0 for --end-to-end, 2 for --local)
  --mp <int>         max penalty for mismatch; lower qual = lower penalty (6)
  --np <int>         penalty for non-A/C/G/Ts in read/ref (1)
  --rdg <int>,<int>  read gap open, extend penalties (5,3)
  --rfg <int>,<int>  reference gap open, extend penalties (5,3)
  --score-min <func> min acceptable alignment score w/r/t read length
                     (G,20,8 for local, L,-0.6,-0.6 for end-to-end)

 Reporting:
  (default)          look for multiple alignments, report best, with MAPQ
   OR
  -k <int>           report up to <int> alns per read; MAPQ not meaningful
   OR
  -a/--all           report all alignments; very slow, MAPQ not meaningful

 Effort:
  -D <int>           give up extending after <int> failed extends in a row (15)
  -R <int>           for reads w/ repetitive seeds, try <int> sets of seeds (2)

 Paired-end:
  -I/--minins <int>  minimum fragment length (0)
  -X/--maxins <int>  maximum fragment length (500)
  --fr/--rf/--ff     -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)
  --no-mixed         suppress unpaired alignments for paired reads
  --no-discordant    suppress discordant alignments for paired reads
  --no-dovetail      not concordant when mates extend past each other
  --no-contain       not concordant when one mate alignment contains other
  --no-overlap       not concordant when mates overlap at all

 Output:
  -t/--time          print wall-clock time taken by search phases
  --un <path>           write unpaired reads that didn't align to <path>
  --al <path>           write unpaired reads that aligned at least once to <path>
  --un-conc <path>      write pairs that didn't align concordantly to <path>
  --al-conc <path>      write pairs that aligned concordantly at least once to <path>
  (Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.
  --un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)
  --quiet            print nothing to stderr except serious errors
  --met-file <path>  send metrics to file at <path> (off)
  --met-stderr       send metrics to stderr (off)
  --met <int>        report internal counters & metrics every <int> secs (1)
  --no-head          supppress header lines, i.e. lines starting with @
  --no-sq            supppress @SQ header lines
  --rg-id <text>     set read group id, reflected in @RG line and RG:Z: opt field
  --rg <text>        add <text> ("lab:value") to @RG line of SAM header.
                     Note: @RG line only printed when --rg-id is set.
  --omit-sec-seq     put '*' in SEQ and QUAL fields for secondary alignments.

 Performance:
  -p/--threads <int> number of alignment threads to launch (1)
  --reorder          force SAM output order to match order of input reads
  --mm               use memory-mapped I/O for index; many 'bowtie's can share

 Other:
  --qc-filter        filter out reads that are bad according to QSEQ filter
  --seed <int>       seed for random number generator (0)
  --non-deterministic seed rand. gen. arbitrarily instead of using read attributes
  --version          print version information and quit
  -h/--help          print this usage message

prepare the reference genome for Bowtie2 alignment

Bowtie2, just like any other mapper using a 'hash' to map short reads, requires that we provide the reference genome in the correct format. Several reference genome hash files for Bowtie2 can be obtained from several places including the Bowtie2 pages (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml[1]) and the Illumina iGenome site (http://support.illumina.com/sequencing/sequencing_software/igenome.ilmn[2]). We will instead here build our own index files in order to use a reference genome identical to that used in the former exercises (HiSeq_UCSC_hg19.fa). A very good documentation is present on the Bowtie2 site that you should read before proceeding with the code reproduced below.

bowtie2-build command parameters

Bowtie 2 version 2.2.2 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
Usage: bowtie2-build [options]* <reference_in> <bt2_index_base>
    reference_in            comma-separated list of files with ref sequences
    bt2_index_base          write bt2 data to files with this dir/basename
*** Bowtie 2 indexes work only with v2 (not v1).  Likewise for v1 indexes. ***
Options:
    -f                      reference files are Fasta (default)
    -c                      reference sequences given on cmd line (as
                            <reference_in>)
    --large-index           force generated index to be 'large', even if ref
                            has fewer than 4 billion nucleotides
    -a/--noauto             disable automatic -p/--bmax/--dcv memory-fitting
    -p/--packed             use packed strings internally; slower, less memory
    --bmax <int>            max bucket sz for blockwise suffix-array builder
    --bmaxdivn <int>        max bucket sz as divisor of ref len (default: 4)
    --dcv <int>             diff-cover period for blockwise (default: 1024)
    --nodc                  disable diff-cover (algorithm becomes quadratic)
    -r/--noref              don't build .3/.4 index files
    -3/--justref            just build .3/.4 index files
    -o/--offrate <int>      SA is sampled every 2^<int> BWT chars (default: 5)
    -t/--ftabchars <int>    # of chars consumed in initial lookup (default: 10)
    --seed <int>            seed for random number generator
    -q/--quiet              verbose output (for debugging)
    -h/--help               print detailed description of tool and its options
    --usage                 print this usage message
    --version               print version information and quit

Build the Bowtie2 index files from HiSeq_UCSC_hg19.fa

The command creates hash files from the fasta sequence of the reference genome. All files are named by the fasta genome file with different added extensions. These files could be obtained premade from the above cited links but are created for the sake of training you in the underlying commands. As always, you are warmly advised to read the documentation. A new ENV variable waa created in .bashrc and exported that points to the default location for bowtie indexes; this variable was named BOWTIE2_INDEXES as recommended by the bowtie documentation and is used below.

# define variables
infasta=ref/HiSeq_UCSC_hg19.fa
idxfolder=$BOWTIE2_INDEXES
 
# we use the fasta file basename to name the index
name=$(basename ${infasta} ".fa")
 
# create bowtie2 index next to fasta file
bowtie2-build -f ${infasta} ${idx}/${name}
    2>${idxfolder}/${name}_bowtie2-build.err

Control if the obtained Bowtie2 index files are valid

Controlling the obtained indexes before using them is always a good idea.

Handicon.png The recent version 2.2.2 of bowtie2 contained a bug that prevented inspect to work. Downloading and building from github is/was necessary to regain that functionality

bowtie2-inspect command parameters

Bowtie 2 version 2.2.2 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
Usage: bowtie2-inspect [options]* <bt2_base>
  <bt2_base>         bt2 filename minus trailing .1.bt2/.2.bt2
 
  By default, prints FASTA records of the indexed nucleotide sequences to
  standard out.  With -n, just prints names.  With -s, just prints a summary of
  the index parameters and sequences.  With -e, preserves colors if applicable.
 
Options:
  --large-index      force inspection of the 'large' index, even if a
                     'small' one is present.
  -a/--across <int>  Number of characters across in FASTA output (default: 60)
  -n/--names         Print reference sequence names only
  -s/--summary       Print summary incl. ref names, lengths, index properties
  -e/--bt2-ref      Reconstruct reference from .bt2 (slow, preserves colors)
  -v/--verbose       Verbose output (for debugging)
  -h/--help          print detailed description of tool and its options
  --help             print this usage message
# define variables
idxfolder=$BOWTIE2_INDEXES
ref=${idxfolder}/HiSeq_UCSC_hg19
 
# create bowtie2 index next to fasta file
bowtie2-inspect -s ${ref} 2>bowtie2-inspect_HiSeq_UCSC_hg19.err

bowtie2-inspect results

Flags	1
Reverse flags	5
Colorspace	0
2.0-compatible	1
SA-Sample	1 in 16
FTab-Chars	10
Sequence-1	chrM	16571
Sequence-2	chr1	249250621
Sequence-3	chr2	243199373
Sequence-4	chr3	198022430
Sequence-5	chr4	191154276
Sequence-6	chr5	180915260
Sequence-7	chr6	171115067
Sequence-8	chr7	159138663
Sequence-9	chr8	146364022
Sequence-10	chr9	141213431
Sequence-11	chr10	135534747
Sequence-12	chr11	135006516
Sequence-13	chr12	133851895
Sequence-14	chr13	115169878
Sequence-15	chr14	107349540
Sequence-16	chr15	102531392
Sequence-17	chr16	90354753
Sequence-18	chr17	81195210
Sequence-19	chr18	78077248
Sequence-20	chr19	59128983
Sequence-21	chr20	63025520
Sequence-22	chr21	48129895
Sequence-23	chr22	51304566
Sequence-24	chrX	155270560
Sequence-25	chrY	59373566

This looks OK.

Map paired reads to the reference genome using Bowtie2

In order to compare Bowtie2 and BWA, we should ideally use the same parameters in both. Obviously, this is not possible as the two programs are different and do not even have the same parameters. We will therefore do for Bowtie2 as done for BWA and use standard parameters. It is very likely that fine-tuning the cutoffs in either program can lead to optimal mappings not reached with standard parameters used in this demo. However, this would require deep knowledge which we do not have, we will therefore trust the makers and apply what has been defined by them to be adequate for standard conditions. We will however show some of these default parameters to tease you, please have a look in the help text at the variety of cutoffs you could apply.

Handicon.png In order to know what the standard parameters are, run the dry command 'bowtie2' (do the same for 'bwa mem') and look at the statements indicating default values in each

Technical.png By default, Bowtie2 reports only the best alignment of the read (based on the mapping quality\). This can be changes using -a or -k but will considerably slow down the mapping.

#! /usr/bin/env bash
## script: 'mapping_bowtie2.sh'
## ©SP-BITS, 2014 v1.0
# last edit: 2014-05-27
 
# required: 
# bowtie2 version: 2.2.2 (github)
 
# define variables
idxfolder=$BOWTIE2_INDEXES
ref=${idxfolder}/HiSeq_UCSC_hg19
 
# full trainer data
# infolder=Final_Results/reads_results
# f=shuffled_PE_NA18507_GAIIx_100_chr21
# pfx=all
# outfolder=hg19_bowtie2-mapping/full_chromosome
 
# 10 pc sample
infolder=Final_Results/reads
f=shuffled_10pc_PE_NA18507_GAIIx_100_chr21
pfx=sample
fq1=${infolder}/${f}_2_1.fq.gz
fq2=${infolder}/${f}_2_2.fq.gz
 
outfolder=hg19_bowtie2-mapping
mkdir -p ${outfolder}
 
# output name
bowtiepe="bowtie_mapped-reads"
 
# using 'nthr' processors in parallel
# computers with more than 4GB RAM will be primarily limited by cpu 
# you need ~3.5GB of RAM to store the full human reference but multithreading uses the same shared copy
nthr=4
 
echo
echo "# mapping paired reads with **bowtie2**"
 
# we report here some of the options even they are kept as default
# -I 0                 (min insert size)
# -X 500            (max insert size)
# --end-to-end (default)
# --sensitive      -D 15 -R 2 -N 0 -L 22 -i S,1,1.15 (default)
 
## Special parameters
# --seed 2014   (to reproduce these results later on => this is only interesting in the context of a training or for benchmarking)
# -u 100            (to stop after 100 mapping and test that you command is correct => remove after the first test run)
 
cmd="bowtie2 -p ${nthr} \
    -x ${ref} \
    -q \
    -1 ${fq1} \
    -2 ${fq2} \
    --phred33 \
    --fr \
    -I 0 \
    -X 500 \
    --un-gz ${outfolder}/bowtie_unmapped-reads.sam.gz \
    --end-to-end \
    --sensitive \
    --seed 2014 \
    -S  ${outfolder}/${bowtiepe}.sam \
    -u 100"
 
# show it
echo "# ${cmd}"
 
# run it
eval "${cmd} 2>${outfolder}/bowtie2-mapping.err"
 
echo "# finished mapping ${f} reads with Bowtie2"
echo
echo "## the resulting SAM header is:"
head -100 ${outfolder}/${bowtiepe}.sam | grep "^@"

bowtie2 summarized counts

7495097 reads; of these:
  7495097 (100.00%) were paired; of these:
    132563 (1.77%) aligned concordantly 0 times
    6688102 (89.23%) aligned concordantly exactly 1 time
    674432 (9.00%) aligned concordantly >1 times
    ----
    132563 pairs aligned concordantly 0 times; of these:
      9036 (6.82%) aligned discordantly 1 time
    ----
    123527 pairs aligned 0 times concordantly or discordantly; of these:
      247054 mates make up the pairs; of these:
        131691 (53.30%) aligned 0 times
        78251 (31.67%) aligned exactly 1 time
        37112 (15.02%) aligned >1 times
99.12% overall alignment rate

Post process and clean the mapping results

Convert mapped reads from SAM to BAM, sort, and index using samtools

A @RG line is first added to the SAM data using bamutils bam polishbam to describe the library and read group. Then the SAM is sorted, converted to BAM and the BAM version indexed using a Picard command. Finally, samtools flagstat is applied to get basic BAM counts (compare them to those reported by bowtie2 above.

bamutils bam polishbam command

Required parameters:
        -i/--in : input BAM file
        -o/--out : output BAM file
   Optional parameters:
        -v : turn on verbose mode
        -l/--log : writes logfile. <outBamFile>.log will be used if value is unspecified
        --HD : add @HD header line
        --RG : add @RG header line
        --PG : add @PG header line
        -f/--fasta : fasta reference file to compute MD5sums and update SQ tags
        --AS : AS tag for genome assembly identifier
        --UR : UR tag for @SQ tag (if different from --fasta)
        --SP : SP tag for @SQ tag
        --checkSQ : check the consistency of SQ tags (SN and LN) with existing header lines. Must be used with --fasta option
        PhoneHome:
                --noPhoneHome       : disable PhoneHome (default enabled)
                --phoneHomeThinning : adjust the PhoneHome thinning parameter (default 50)
#! /usr/bin/env bash
## script: 'post-process_bowtie2.sh'
## ©SP-BITS, 2014 v1.0
# last edit: 2014-05-27
 
# required:
# bamutils Version: 1.0.11
# Picard Version: 1.112+
# samtools version 0.1.19+
 
 
infolder=hg19_bowtie2-mapping
bowtiepe="bowtie_mapped-reads"
 
echo "# add a RG fields to the data"
 
rgstring=@RG$'\t'ID:NA18507$'\t'LB:lib-NA18507$'\t'PU:unkn-0.0$'\t'PL:ILLUMINA$'\t'SM:GAIIx-chr21-Bowtie2
 
# clean up initial file after success to save space
bam polishBam --RG "${rgstring}" \
	-i ${outfolder}/${bowtiepe}.sam \
	-o ${outfolder}/${bowtiepe}_rg.sam \
	-l ${outfolder}/polishBam_${bowtiepe}.log && rm ${outfolder}/${bowtiepe}.sam
 
echo
echo "# sort, compress, and index"
 
##### convert to sorted bam, index & cleanup
java -Xmx4g -jar $PICARD/SortSam.jar \
	I=${outfolder}/${bowtiepe}_rg.sam \
	O=${outfolder}/${bowtiepe}-pe.bam \
	SO=coordinate \
	CREATE_INDEX=true \
	VALIDATION_STRINGENCY=LENIENT && rm ${outfolder}/${bowtiepe}_rg.sam
 
echo
echo "# get basic stats from the resulting bam file"
 
##### get basic stats from the resulting bam file
samtools flagstat ${outfolder}/${bowtiepe}-pe.bam \
	>${outfolder}/${bowtiepe}_flagstat.txt
 
echo "# finished sorting and indexing"

samtools flagstat results

## results from the Bowtie2 mapping on 10 pc sample
1498242 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
1485061 + 0 mapped (99.12%:-nan%)
1498242 + 0 paired in sequencing
749121 + 0 read1
749121 + 0 read2
1471456 + 0 properly paired (98.21%:-nan%)
1476598 + 0 with itself and mate mapped
8463 + 0 singletons (0.56%:-nan%)
4090 + 0 with mate mapped to a different chr
1910 + 0 with mate mapped to a different chr (mapQ>=5)

## results from the BWA-mem mapping on 10 pc sample
1485462 + 0 in total (QC-passed reads + QC-failed reads)
180 + 0 duplicates
1483826 + 0 mapped (99.89%:-nan%)
1485462 + 0 paired in sequencing
742791 + 0 read1
742671 + 0 read2
1469184 + 0 properly paired (98.90%:-nan%)
1482186 + 0 with itself and mate mapped
1640 + 0 singletons (0.11%:-nan%)
4664 + 0 with mate mapped to a different chr
4291 + 0 with mate mapped to a different chr (mapQ>=5)

Identify duplicate reads with Picard MarkDuplicates

#! /usr/bin/env bash
## script: 'picard_markdup-bowtie2.sh'
## ©SP-BITS, 2013 v1.0
# last edit: 2014-04-08
 
# required: 
# Picard Version: 1.112+
 
# declare variables
infile=hg19_bowtie2-mapping/bowtie_mapped-reads-pe.bam
outname=${infile%%-pe.bam}
 
echo "# marking duplicates in ${infile}"
 
java -Xmx4g -jar $PICARD/MarkDuplicates.jar \
        I=${infile} \
        O=${outname}_mdup.bam \
        ASSUME_SORTED=TRUE \
        REMOVE_DUPLICATES=FALSE \
        CREATE_INDEX=TRUE \
        M=${outname}_duplicate_metrics.txt \
        VALIDATION_STRINGENCY=LENIENT \
        2>${outname}_MarkDuplicates.err
echo
 
echo "# results are"
head -8 ${outname}_duplicate_metrics.txt| tail -2 | transpose -t | column -t
echo

Picard MarkDuplicates counts

##MAPPER##                    Bowtie2      BWA-mem
LIBRARY                       lib-NA18507  lib-NA18507
UNPAIRED_READS_EXAMINED       83867        17268
READ_PAIRS_EXAMINED           7387318      7477196
UNMAPPED_READS                131691       18534
UNPAIRED_READ_DUPLICATES      24266        4725
READ_PAIR_DUPLICATES          3371         5428
READ_PAIR_OPTICAL_DUPLICATES  1110         1310
PERCENT_DUPLICATION           0.002087     0.001041
ESTIMATED_LIBRARY_SIZE        12062126176  6783431981

Handicon.png The duplication rate is not alarming here with 3371 PCR duplicates and 1110 otical duplicates detected in 7'387'318 pairs

Perform integrated BAM QC with Qualimap

Qualimap command parameters

usage: qualimap bamqc -bam <arg> [-c] [-gd <arg>] [-gff <arg>] [-nr <arg>] [-nt
       <arg>] [-nw <arg>] [-os] [-outdir <arg>] [-outformat <arg>]
 -bam <arg>                     input mapping file
 -c,--paint-chromosome-limits   paint chromosome limits inside charts
 -gd <arg>                      compare with genome distribution (possible
                                values: HUMAN or MOUSE)
 -gff <arg>                     region file (in GFF/GTF or BED format)
 -hm <arg>                      minimum size for a homopolymer to be considered
                                in indel analysis (default is 3)
 -nr <arg>                      number of reads in the chunk (default is 500)
 -nt <arg>                      number of threads (default equals the number of cores)
 -nw <arg>                      number of windows (default is 400)
 -os,--outside-stats            compute region outside stats (only with -gff
                                option)
 -outdir <arg>                  output folder
 -outformat <arg>               output report format (PDF or HTML, default is
                                HTML)
 -p <arg>                       specify protocol to calculate correct strand
                                reads (works only with -gff option, possible
                                values are STRAND-SPECIFIC-FORWARD or
                                STRAND-SPECIFIC-REVERSE, default is
                                NON-STRAND-SPECIFIC)
#! /usr/bin/env bash
## script: 'qualimap_bamqc-bowtie2.sh'
## ©SP-BITS, 2014 v1.0
# last edit: 2014-04-09
 
# required: 
# qualimap 0.8
# R dependencies
 
# full chromosome 
# outfolder="bowtie2-mapping-qualimapQC/full_chromosome"
# infile="hg19_bowtie2-mapping/full_chromosome/bowtie_mapped-reads_mdup_chr21.bam"
 
# declare variables
outfolder="bowtie2-mapping-qualimapQC"
mkdir -p ${outfolder}
 
infile="hg19_bowtie2-mapping/bowtie_mapped-reads_mdup.bam"
outname=$(basename "${infile}" ".bam")
mygtf="ref/Homo_sapiens.hg19_chr21.gtf"
# dedicate as much RAM as you can below (default=1200M)
maxram="4096M"
 
echo "# processing ${infile}"
 
# read the full help to find more about other parameters
# we output here to HTML but PDF is also available for portability and compactness
# we restrict the analysis to chr21 using annotation data from the Qualimap site (-gff)
# we also ask analysis outside of regions defined in the 'gtf' file (-os)
## Human genome annotations from Ensembl database (v.64)
qualimap bamqc --java-mem-size=${maxram} \
	-bam ${infile} \
	-gd HUMAN \
	-c \
	-gff ${mygtf} \
	-os \
	-outdir ${outfolder}/${outname}_qualimap-results \
	-outformat HTML
 
# for PDF, replace last line by
#	-outformat PDF
#	-outfile ${outname}_chr21_report.pdf

extract all chr21 reads with samtools view

Because the re-mapping was done against the full human genome, some of the reads may now map to other chromosomes. For each method, we extract all reads mapping at least by one end to chr21 to a new bam file using samtools to get a clear chr21-specific picture out of our results.

#! /usr/bin/env bash
# script: samtools_extract-chr21-bowtie.sh
 
# required:
# samtools Version: 0.1.19+
 
infile=hg19_bowtie2-mapping/bowtie_mapped-reads_mdup.bam
 
echo "# processing ${infile}"
 
# subset and index ONLY chr21 calls
# also produce a text summary with samtools flagstat
samtools view -b ${infile} chr21 > ${infile%%.bam}_chr21.bam && \
 	samtools index ${infile%%.bam}_chr21.bam && \
 	samtools flagstat ${infile%%.bam}_chr21.bam >  ${infile%%.bam}_chr21.flagstat.txt

samtools flagstat counts

14846590 + 0 in total (QC-passed reads + QC-failed reads)
28800 + 0 duplicates
14773537 + 0 mapped (99.51%:-nan%)
14846590 + 0 paired in sequencing
7423315 + 0 read1
7423275 + 0 read2
14672436 + 0 properly paired (98.83%:-nan%)
14700484 + 0 with itself and mate mapped
73053 + 0 singletons (0.49%:-nan%)
18122 + 0 with mate mapped to a different chr
16582 + 0 with mate mapped to a different chr (mapQ>=5)

Compare BWA-mem and Bowtie2 mappings

The Picard tool CompareSAMs compares the headers of the two input SAM or BAM files, and, if possible, the SAMRecords. For SAMRecords, compares only the readUnmapped flag, reference name, start position and strand. Reports the number of SAMRecords that match, differ in alignment, are mapped in only one input, or are missing in one of the files.

Handicon.png Only the subset of hg19 mappings attributed to chr21 is analyzed here using data obtained by the trainer

#! /usr/bin/env bash
# script: picard_CompareSAMs-bowtie2.sh
 
# required: 
# Picard Version: 1.112+
 
# declare variables
bam1=hg19_bwa-mapping/full_chromosome/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam
bam2=hg19_bowtie2-mapping/full_chromosome/bowtie_mapped-reads_mdup_chr21.bam
 
outfolder="hg19_bowtie2-mapping"
name="bwa_vs_bowtie2"
 
java -Xmx4g -jar $PICARD/CompareSAMs.jar \
	${bam1} \
	${bam2} >\
	${outfolder}/CompareSAMs_${name}.txt \
	2>${outfolder}/CompareSAMs_${name}.err

comparison results for bwa-mem & bowtie2 methods

Sample for read group NA18507 differs.
Final_Results/hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam: GAIIx-chr21-BWA.aln
hg19_bowtie2-mapping/bowtie_mapped-reads_mdup_chr21.bam: GAIIx-chr21-Bowtie2
Number of program records differs.
Final_Results/hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam: 3
hg19_bowtie2-mapping/bowtie_mapped-reads_mdup_chr21.bam: 2
Match   14501704
Differ  220357
Unmapped_both   5007
Unmapped_left   8923
Unmapped_right  52603
Missing_left    43683
Missing_right   25068
SAM files differ.

A very basic comparison of the resulting BAM files shows >98% agreement between both methods with apparently the BWA algorithm mapping more reads to chr21 than the bowtie2 method (unmapped 8923 vs 52603 ; using the default parameters!).

Call variants from the Bowtie2 mappings using samtools_0.2.0 and bcftools (htlib)

#! /usr/bin/env bash
# script bcftools_call-variants.sh
 
# required
# samtools_0.2.0 (htslib)
# bcftools_0.2.0 (htslib) call, vcfutils_0.2.0.pl
 
# source user-defined functions (including vcf2index)
. $HOME/.myfunctions
 
# we call from the chr21 bowtie2 mapping results 
infolder="hg19_bowtie2-mapping"
infile="bowtie_mapped-reads_mdup_chr21.bam"
name="NA18507_bowtie2"
 
outfolder="bcftools_htslib_variants-bowtie2"
mkdir -p ${outfolder}
 
ref=ref/HiSeq_UCSC_hg19.fa
 
# call variants from pileup, filter and convert calls to VCF format
# exclude calls where more than 1000 reads support a variation
 
(samtools_0.2.0 mpileup -uf ${ref} ${infolder}/${infile} | \
   bcftools_0.2.0 call -vc -O u > ${outfolder}/${name}_var_htslib.raw.bcf \
   2>${outfolder}/samtools_mpileup_${name}_var_htslib_raw.err) && \
(bcftools_0.2.0 view ${outfolder}/${name}_var_htslib.raw.bcf | \
   vcfutils_0.2.0.pl varFilter -D1000 > \
   ${outfolder}/${name}_var_htslib.flt-D1000.vcf \
   2>${outfolder}/samtools_mpileup_${name}_htslib_filter.err)
 
# Sort, compress, and index the variant VCF files
vcf2index ${outfolder}/${name}_var_htslib.flt-D1000.vcf && \
	rm ${outfolder}/${name}_var_htslib.flt-D1000.vcf
 
# extract the chr21 subset of calls
vcftools --gzvcf ${outfolder}/${name}_var_htslib.flt-D1000.vcf.gz \
	--chr chr21 \
	--out ${outfolder}/chr21_${name}_var_htslib.flt-D1000 \
	--recode
 
# recoding introduces the word recode at the end of the file-name
# remove the recode tag from the name
mv ${outfolder}/chr21_${name}_var_htslib.flt-D1000.recode.vcf \
	${outfolder}/chr21_${name}_var_htslib.flt-D1000.vcf
 
# sort, compress and index using our custom 'vcf2index' function)
vcf2index ${outfolder}/chr21_${name}_var_htslib.flt-D1000.vcf && \
	rm ${outfolder}/chr21_${name}_var_htslib.flt-D1000.vcf

Review samtools bcftools variant calls

# analysis of 'aln' calls
infolder="bcftools_htslib_variants-bowtie2"
infile="NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz"
 
zgrep -v "^#" ${infolder}/${infile} | \
	cut -f 1 | sort | uniq -c | sort -k 2V,2

variant counts for the bowtie data

  77688 chr21

Perform QC on the obtained VCF files

# analysing the chr21 subset of our bwa-calls
 
infolder="bcftools_htslib_variants-bowtie2"
infile="NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz"
 
bcftools_0.2.0 stats ${infolder}/${infile} > \
	${infolder}/chr21_bowtie2_var.check
 
plot-vcfstats_0.2.0 -t "NA18507_bowtie2-calls" \
	 -p ${outfolder}/chr21_bowtie2_plots/ \
	 ${infolder}/chr21_bowtie2_var.check

The summary statistics derived from the VCF file obtained from the bowtie2 mappings is presented below.

hg19 bowtie2 variants PDF
plot-vcfcheck_bowtie2-0.png
plot-vcfcheck_bowtie2-1.png
plot-vcfcheck_bowtie2-2.png
plot-vcfcheck_bowtie2-3.png

Compare the BWA-mem and Bowtie2 variants using vcftools

#! /usr/bin/env bash
 
infile1=bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz
infile2=Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz
 
outfolder=bcftools_htslib_variants-bowtie2/compare_chr21_bowtie2-bwa-mem
mkdir -p ${outfolder}
 
# run vcftools diff to identify differences
vcftools --gzvcf ${infile1} \
	--gzdiff ${infile2} \
	--out ${outfolder}/compare_chr21_NA18507_bowtie2-bwa-mem

comparing bowtie and bwa data

VCFtools - v0.1.12a
(C) Adam Auton and Anthony Marcketta 2009
 
Parameters as interpreted:
	--gzvcf bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz
	--out bcftools_htslib_variants-bowtie2/compare_chr21_bowtie2-bwa-mem/compare_chr21_NA18507_bowtie2-bwa-mem
	--gzdiff Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz
 
Using zlib version: 1.2.3
Versions of zlib >= 1.2.4 will be *much* faster when reading zipped VCF files.
After filtering, kept 1 out of 1 Individuals
Using zlib version: 1.2.3
Versions of zlib >= 1.2.4 will be *much* faster when reading zipped VCF files.
Comparing individuals in VCF files...
N_combined_individuals:	2
N_individuals_common_to_both_files:	0
N_individuals_unique_to_file1:	1
N_individuals_unique_to_file2:	1
Comparing sites in VCF files...
Non-matching REF. Skipping all such sites.
Found 75788 SNPs common to both files.
Found 1792 SNPs only in main file.
Found 8889 SNPs only in second file.
After filtering, kept 77688 out of a possible 77688 Sites
Run Time = 57.00 seconds

Compare bowtie2 & BWA-mem results with vcf-compare

Another tool to compare VCF files is vcf-compare that produces values directly usable to draw VENN diagrams. The command can look at positions only (less stringent) or at position + genotype.

infolder=bcftools_htslib_variants-bowtie2
 
infile1=bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz
infile2=Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz
 
vcf-compare ${infile1} \
	${infile2} | \
	grep ^VN | cut -f 2- | column -t \
	> ${infolder}/vcf-compare_bowtie2_bwa-mem.txt

command results

1792   bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz  (2.3%)
8889   Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz     (10.5%)
75896  Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz     (89.5%) \
 bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz  (97.7%)

The obtained counts were used in 2DVenn.R to plot the next figure. This Venn diagram is only slightly different from the previous one, possibly due to the reported error: 'Non-matching REF. Skipping all such sites.'.

vcftools venn diagram

2DVenn.R -a 1792 -b 8889 -i 75896 -A "bowtie2" -B "BWA mem" \
   -o ${infolder}/bowtie2_bwa-mem_recall -t "bcftools_0.1.19 calls" -x 1 -u 2


bowtie2_bwa-mem_recall.png

Compare bowtie2 & BWA-mem results to the Complete Genomics and HapMap3.3 sets with vcf-compare

What about gold-standard calls?. We add two more important reference sets being Complete Genomics and HapMap. Looking at variants from the hapMap3.3 set indicates true positive but is restricted on variants present in that set. By contrast looking at the Complete Genomics set reveals additional variants of high quality in either BWA or Bowtie lists. This is not all what could be done to truly appreciate the quality of BWA and Bowtie but constitute an honest estimate.

infolder=bcftools_htslib_variants-bowtie2
 
infile1=bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz
infile2=Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz
infile3=public_info/CG_variants/NA18507_CG_SRR822930.chr21.vcf.gz
infile4=public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz
 
vcf-compare ${infile1} \
	${infile2} \
	${infile3} \
	${infile4} | \
	grep ^VN | cut -f 2- | column -t \
	> ${infolder}/vcf-compare4.txt

command results

2       bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz  (0.0%)\
   public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz            (0.0%)
7       Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz     (0.0%)\
   public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz            (0.1%)
11      public_info/CG_variants/NA18507_CG_SRR822930.chr21.vcf.gz                           (0.0%)\
   public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz            (0.1%)
21      Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz     (0.0%)\
   public_info/CG_variants/NA18507_CG_SRR822930.chr21.vcf.gz                           (0.0%)\
   public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz  (0.2%)
70      public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz            (0.7%)
90      Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz     (0.1%)\
   bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz  (0.1%)\
   public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz  (0.9%)
506     bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz  (0.7%)\
   public_info/CG_variants/NA18507_CG_SRR822930.chr21.vcf.gz                           (0.3%)
1284    bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz  (1.7%)
2635    Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz     (3.1%)\
   public_info/CG_variants/NA18507_CG_SRR822930.chr21.vcf.gz                           (1.3%)
6226    Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz     (7.3%)
7193    Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz     (8.5%)\
   bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz  (9.3%)
9851    Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz     (11.6%)\
  bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz  (12.7%)\
  public_info/CG_variants/NA18507_CG_SRR822930.chr21.vcf.gz                 (4.9%)\
   public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz  (98.0%)
58762   Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz     (69.3%)\
  bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz  (75.6%)\
  public_info/CG_variants/NA18507_CG_SRR822930.chr21.vcf.gz                 (29.3%)
128921  public_info/CG_variants/NA18507_CG_SRR822930.chr21.vcf.gz                           (64.2%)

The obtained counts were used in 4DVenn.R to plot the next figure. This Venn diagram is only slightly different from the previous one, possibly due to the reported error: 'Non-matching REF. Skipping all such sites.'.

vcftools venn diagram

4DVenn.R --ab.count 2 --bd.count 7 --cd.count 11 --bcd.count 21 --d.count 70 --abd.count 90 --ac.count 506 \
     --a.count 1284 --bc.count 2635 --b.count 6226 --ab.count 7193 --abcd.count 9851 --abc.count 58762 --c.count 128921 \
     -A "Bowtie2" -B "BWA-mem" -C "CG" -D "HapMap3.3" -t "4-way recall for chr21" -x 1 \
     -o "bcftools_htslib_variants-bowtie2/bowtie2_4wy_recall" -u 2


bowtie2_4way_recall.png

Discussion about BWA, Bowtie2, and this work

It should be noted that the paired end reads used in this workshop are not all the reads obtained by sequencing the NA18507 sample but only those present in the Illumina BAM file (hg18-mapped) downloaded from the Illumina site. Using the original (unmapped) full-read set would have been more correct in that regard but would have likely added more noise than real signal to the comparison and taken more time for the full genome mapping step. All counts and figures reported in our pages were obtained without optimizing the different methods and applying standard parameters for each step which probably leaves room for improvement for expert users. By contrast, the fact that we have used real reads from one full chromosome makes our data probably more 'real' than many reported benchmarks that are based on simulated reads.

# Considering only Bowtie2 and BWA-mem derived calls, we can say that:

  • of the 86577 individual calls merged from both sets; 1792 (2%) are private to bowtie2, 8889 (10%) are found only by BWA-mem, and 75896 (88%) are called from both mapping sets.

Handicon.png BWA-mem seems calls ~5x more private variants than Bowtie2 but both methods share the larger number of calls (80%))

# Considering Complete Genomics as the best available high-quality call-set, we can draw few conclusions about this work.

  • of the 1792 variants private to bowtie2 (see 2DVenn above), 506 (28%) are also called by Complete Genomics while 1284 (72%) are not.
  • of the 8889 variants private to BWA-mem (id), 2656 (30%) are also called by Complete Genomics while 6226 (70%) are not.

Handicon.png Bowtie2 seems a little bit less sensitive than BWA-mem based on this observation (calls 2% less of the high-quality CG variants)

# Considering HapMap3.3 calls as the best validated set available, we can refine this statement by saying that:

  • Bowtie2 does not report calls validated by HapMap and absent in the BWA-set
  • reversely: BWA-mem does report 21+7=28 chr21 calls unknown to bowtie but present in HapMap. This number is relatively small as compared to the 9851+90=9941 calls shared by all three sets.

Handicon.png BWA-mem calls a small number of additional gold-standard variants, undetected in Bowtie2 mappings

# Considering the mapping time required on a robust BITS desktop machine to map all chr21 reads using all 8 cores and up to ~8GB RAM (of the available 16GB).

benchmark command details

## test run for bowtie2
 
# bowtie2 -p 8 \
    -x /data/biodata/bowtie-index/HiSeq_UCSC_hg19 \
    -q \
    -1 Final_Results/reads_results/shuffled_PE_NA18507_GAIIx_100_chr21_2_1.fq.gz \
    -2 Final_Results/reads_results/shuffled_PE_NA18507_GAIIx_100_chr21_2_2.fq.gz \
    --phred33 \
    --fr \
    -S  hg19_bowtie2-mapping-time/bowtie_mapped-reads.sam
 
# finished mapping shuffled_PE_NA18507_GAIIx_100_chr21 reads with Bowtie2
 
real    16m33.786s
user    118m16.500s
sys     3m39.556s
 
## test run for bwa-mem
 
# bwa mem \
    -R "@RG\tID:NA18507\tLB:lib-NA18507\tPU:unkn-0.0\tPL:ILLUMINA\tSM:GAIIx-chr21-BWA.mem" \
    -M \
    -t 8 \
    ref/HiSeq_UCSC_hg19.fa  \
    Final_Results/reads_results/shuffled_PE_NA18507_GAIIx_100_chr21_2_1.fq.gz \
    Final_Results/reads_results/shuffled_PE_NA18507_GAIIx_100_chr21_2_2.fq.gz \
    > hg19_bwa-mapping-time/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe.sam
 
# finished mapping shuffled_PE_NA18507_GAIIx_100_chr21 reads with bwa-mem
 
real    15m32.321s
user    52m40.842s
sys     0m35.390s
plot of the computer activity
bowtie2_log.png
bwa-mem_log.png
The code used to create these plots can be obtained from the BITS github repository[3]
  • bowtie2 used less RAM than bwa-mem at full load (~4GB vs ~6GB).
  • bowtie2 and bwa-mem took about the same time with 15-16 min (real time) including the time to load the reference index in RAM.

Conclusion

We conclude that with the standard parameters used for both BWA-mem (https://github.com/lh3/bwa[4]) and Bowtie2, both methods seem valid and present a small fraction of specific calls. In our hands, Bowtie2 and BWA-MEM are almost as fast with this realistic paired-end read mapping experiment and calls made on Bowtie2 mappings (--sensitive settings) missed a number validated calls made from BWA-mem mappings (standard settings).

Technical.png This is in apparent contradiction with reports claiming that Bowtie is much faster than BWA! Is it possible that such claims report speed obtained with older version(s) of the softwares?

Based on the current report, we advise to use BWA-mem rather than Bowtie2 when mapping paired-end reads to a full genome as we see that this mapper allows calling more gold-standard variants AND is not slower than the competitor program Bowtie2 (both taken from their most recent version with similar settings)

# Please send your reactions to bits@vib.be

download exercise files

Download exercise files here

Use the right application to open the files present in ex3c-files

References:
  1. http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
  2. http://support.illumina.com/sequencing/sequencing_software/igenome.ilmn
  3. https://github.com/BITS-VIB/geek-tools
  4. https://github.com/lh3/bwa

[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.2 | NGS Exercise.3 | NGS Exercise.3b | NGS Exercise.3d | NGS Exercise.3e ]