GATK HaplotypeCaller Analysis of BWA (mem) mapped Illumina reads

last edited: October 4th, 2013 - v1.00)

Practicing GATK with data obtained during the Hands-on introduction to NGS variant analysis

GATK is in constant evolution, please refer to the online documentation for updates and modified BestPractices. The content of this page was edited for version 3.0

Aims

The Broad Institute **GATK** suite is today's high end standard for NGS data analysis. Although a lot of documentation is present on the GATK website, it can still be challenging to apply a full analysis workflow to your own data for the first time.

This document details each steps of a prototype GATK analysis using hg19 mapping data. The data used here was generated using Illumina reads of the HapMap individual **NA18507** as described in 'Hands-On BITS session: *Mapping Illumina reads to the human genome and calling variations on chr21*'

Readers are expected to have done the full analysis using hands-on workflows and have access to some of the data and reference files used there. Additional data is required to use GATK tools including the GATK executable (java) as well as Broads **bundle** files matching the reference genome used in the training as detailed inline.

Please NOTE that GATK can be obtained by academic researchers for research purpose by accessing the Broad platform ^[1]. For profit and industrial users, a special agreement should be made with the Broad Institute prior to getting their code. This training session is meant for non-profit users that are entitled to use the GATK material

Resources and Reference links

!! Resources linked here might be outdated at the time you read this document.

Most of the information used here was copied from pages organized on the GATK online platform, The broad Institute should therefore receive full credit if you use this material elsewhere. In order to cite GATK, please follow instructions on their dedicated page: http://www.broadinstitute.org/gatk/about/citing-gatk

GATK online help is accessible at ^[2].

A PDF version of these pages is available at *http://www.broadinstitute.org//gatk/pdfdocs/*. Look for the latest version at the time your access the page.

The Java code is available for academic at <http://www.broadinstitute.org/gatk/download>) and is installed by simply storing the uncompressed folder in the computer PATH.

The Broad CRAN **gsalib** package is additionally required to generate plots in GATK. The package is found at <http://cran.r-project.org/web/packages/gsalib/index.html> and can be installed directly from within **RStudio** with the simple command '*install.packages("gsalib")*'. After installation, this package should be accessible for the GATK commands, if this is not the case, please refer to the GATK help and set your environment to make the **gsalib** code part of the standard path. The documentation can be accessed on the CRAN pages <http://cran.r-project.org/web/packages/gsalib/gsalib.pdf>.

Finally, Broad **Bundle** files are required for annotating and filtering your data and can be found on the public FTP site. We provide a script to batch download these files in appendix. Additional reference files are also required but were already obtained for BITS NGS training.

SnpEff used to extract variant subsets but that can do much more. Please refer to the online documentation for the full description. <http://snpeff.sourceforge.net/index.html>

vcfTools already used previously to compute VENN diagram from the obtained and reference VCF files. <http://vcftools.sourceforge.net>.

vcftools was updated since the making of this tutorial, command names will be changed if you install the new version

A GATK Best Practice Workflow

The current Best practice for GATK is illustrated in the next figure and constitutes the framework for this tutorial.

The presented workflow is directly inspired from the GATK Best Practice slide reproduced above. Each step was extracted from the corresponding GATK online documentation and adapted to fit the location of our files. The right panel (grey), dedicated to downstream analysis is not part of this tutorial which aims at generating high quality data.

Define file path for the workflows

The path is adapted from that of the BITS hands-on NGS training.

project=/Users/bits/NGS_DNASeq-training
base=${project}/Final_Results
reference=${project}/ref/HiSeq_UCSC_hg19.fa
 
mapping=${base}/hg19_mapping-full
 
# create a folder to receive all results from this GATK workflow
result=${base}/gatk
mkdir -p ${result}

Add some indexing to our reference files (If not already available)

# create reference sequence dictionary if not present
[ -f "${project}/ref/HiSeq_UCSC_hg19.dict" ] || \
java -Xmx6g -jar $PICARD/CreateSequenceDictionary.jar \
	R=${reference} \
	O=${project}/ref/HiSeq_UCSC_hg19.dict \
	GENOME_ASSEMBLY=hg19
 
# create reference index if not present
[ -f "${project}/ref/HiSeq_UCSC_hg19.fa.fai" ] || \
samtools faidx ${project}/ref/HiSeq_UCSC_hg19.fa

Prepare the GATK input data and Workflow elements

Clean the initial BWA bam data to make it Picard & GATK compliant

In this part, we will fix a BAM that is not indexed or not sorted, has not had duplicates marked, or is lacking read group information. These steps can be performed independently of each other but the order proposed here is recommended.

# Use the mapping file obtained with BWA (mem)
ori-bam=${base}/hg19_mapping/NA18507_GAIIx_100_chr21_aln-pe.bam

Correct potential mate pair info and coordinate Sort the reads (requires Picard tools installed and running)

java -Xmx6g -jar $PICARD/FixMateInformation.jar \
	I=${result}/${ori-bam} \
	O=${result}/sorted_reads.bam \
	SO=coordinate \
	VALIDATION_STRINGENCY=LENIENT

Mark duplicate reads (optical duplicates could bias variant detection by adding excessive coverage depth at a variant locus

java -Xmx6g -jar $PICARD/MarkDuplicates.jar \
	I=${result}/sorted_reads.bam \
	O=${result}/dedup_reads.bam \
	M=${result}/duplicate_metrics.txt

The result is a text file starting by:

12:47, 13 March 2014 (UTC)~~
## net.sf.picard.metrics.StringHeader
# net.sf.picard.sam.MarkDuplicates INPUT=[sorted_reads.bam] OUTPUT=dedup_reads.bam METRICS_FILE=metrics.txt    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
## net.sf.picard.metrics.StringHeader
# Started on: Thu Oct 03 12:21:20 CEST 2013
 
## METRICS CLASS        net.sf.picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED     UNMAPPED_READS  UNPAIRED_READ_DUPLICATES        READ_PAIR_DUPLICATES    READ_PAIR_OPTICAL_DUPLICATES    PERCENT_DUPLICATION     ESTIMATED_LIBRARY_SIZE
        37296   7379961 192976  8610    3101    1059    0.001001        13329615355
 
## HISTOGRAM    java.lang.Double
BIN     VALUE
1.0     1.000143
2.0     1.999733
3.0     2.99877
4.0     3.997254
5.0     4.995185
6.0     5.992563
...
12:47, 13 March 2014 (UTC)~~

Add read group information required by GATK (adapted from our BWA command)

# ID:NA18507
# LB=lib-NA18507
# PU=unkn-0.0
# PL:ILLUMINA
# SM:GAIIx-chr21-BWA.mem
java -Xmx6g -jar $PICARD/AddOrReplaceReadGroups.jar \
    INPUT=${result}/dedup_reads.bam \
    OUTPUT=${result}/addrg_reads.bam \
    RGID="NA18507" \
    RGLB="lib-NA18507" \
    RGPL="ILLUMINA" \
    RGPU="unkn-0.0" \
    RGSM="GAIIx-chr21-BWA.mem.gatk"

Index the last file for further use

java -Xmx6g -jar $PICARD/BuildBamIndex.jar \
	INPUT=${result}/addrg_reads.bam

Create symbolic links to reference data required for the GATK workflow

A number of GATK bundle files are necessary to run this workflow. The files for build hg19 were obtained from the GATK ftp site and are defined below. The full bundle can be obtained with the script **wget-bundle.sh** attached in appendix-01

# BUNDLE: locate important files for GATK analysis
ver=2.5
bundle=$BIODATA/bundle_${ver}/hg19
# INDEL gold standards
mills=${bundle}/Mills_and_1000G_gold_standard.indels.hg19.vcf.gz
phase1indel=${bundle}/1000G_phase1.indels.hg19.vcf.gz
# pick one of the former two as gold_indels set
gold_indels=${mills}
# SNP gold standards
dbsnp=${bundle}/dbsnp_137.hg19.vcf.gz
hapmap=${bundle}/hapmap_3.3.hg19.vcf.gz
phase1snp=${bundle}/1000G_phase1.snps.high_confidence.hg19.vcf.gz
omni=${bundle}/1000G_omni2.5.hg19.vcf.gz

Improve mapping by local realignment (indels)

Identify regions for indel local realignment of the selected chromosome

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T RealignerTargetCreator \
	-R ${reference} \
	-I ${result}/addrg_reads.bam \
	-L chr21 \
	-known ${gold_indels} \
	-o ${result}/target_intervals.list

The target_intervals.list starts by:

12:47, 13 March 2014 (UTC)~~
chr21:9442356-9442357
chr21:9466976-9466977
chr21:9467571-9467572
chr21:9467834-9467835
chr21:9470889-9470890
chr21:9474246-9474247
chr21:9476537-9476609
...
12:47, 13 March 2014 (UTC)~~

Perform indel local realignment of the selected chromosome

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T IndelRealigner \
	-R ${reference} \
	-I ${result}/addrg_reads.bam \
	-L chr21 \
	-targetIntervals ${result}/target_intervals.list \
	-known ${gold_indels} \
	-o ${result}/realigned_reads.bam

Analyze patterns of covariation in the sequence dataset

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R ${reference} \
    -I ${result}/realigned_reads.bam \
    -L chr21 \
    -knownSites ${dbsnp} \
    -knownSites ${gold_indels} \
    -o ${result}/recal_data.table

Do a second pass to analyze covariation remaining after recalibration

Use the same BAM data as in the first iteration

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R ${reference} \
    -I ${result}/realigned_reads.bam \
    -L chr21 \
    -knownSites ${dbsnp} \
    -knownSites ${gold_indels} \
    -BQSR ${result}/recal_data.table \
    -o ${result}/post_recal_data.table

Generate before/after plots

Requires install.packages("gsalib") in RStudio and that RScript can be found in the PATH

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T AnalyzeCovariates \
    -R ${reference} \
    -L chr21 \
    -before ${result}/recal_data.table \
    -after ${result}/post_recal_data.table \
    -plots ${result}/recalibration_plots.pdf

Apply the recalibration to your sequence data

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T PrintReads \
    -R ${reference} \
    -L chr21 \
    -I ${result}/realigned_reads.bam \
    -BQSR ${result}/recal_data.table \
    -o ${result}/recal_reads.bam

Compress your sequence data

The BAM file obtained in the last step is about twice as big as the initial BAM. IN this step, the size is brought back to a fraction of this initial size (about half). It should also speed-up analysis by other GATK tools.

The reduction was retired in later GATK versions

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T ReduceReads \
    -R ${reference} \
    -I ${result}/recal_reads.bam \
    -L chr21 \
    -o ${result}/reduced_reads.bam

Call all variants using the HaplotypeCaller

Call variants in your sequence data (diploid genome => HaplotypeCaller)

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R ${reference} \
    -I ${result}/reduced_reads.bam \
    -L chr21 \
    --genotyping_mode DISCOVERY \
    -stand_emit_conf 10 \
    -stand_call_conf 30 \
    -o ${result}/raw_variants.vcf

Recalibrate variant scores in two steps: SNP then INDELS

During this step, additional annotations are added to the data that allow later filtering and selection of subsets. Some of the available annotations are listed here.

Build the SNP recalibration model
also add annotations from sources out of:
BaseQualityRankSumTest (BaseQRankSum)
DepthOfCoverage (DP)
FisherStrand (FS)
HaplotypeScore (HaplotypeScore)
MappingQualityRankSumTest (MQRankSum)
MappingQualityZero (MQ0)
QualByDepth (QD)
ReadPositionRankSumTest (ReadPosRankSum)
RMSMappingQuality (MQ)
SnpEff: Add genomic annotations using the third-party tool SnpEff with VariantAnnotator

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T VariantRecalibrator \
	-R ${reference} \
	-input ${result}/raw_variants.vcf \
	-resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${hapmap} \
	-resource:omni,known=false,training=true,truth=false,prior=12.0 ${omni} \
	-resource:1000G,known=false,training=true,truth=false,prior=10.0 ${phase1snp} \
	-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${dbsnp} \
	-an DP \
	-an QD \
	-an FS \
	-an MQRankSum \
	-an ReadPosRankSum \
	-mode SNP \
	-tranche 100.0 \
	-tranche 99.9 \
	-tranche 99.0 \
	-tranche 90.0 \
	-numBad 1000 \
	-recalFile ${result}/recalibrate_SNP.recal \
	-tranchesFile ${result}/recalibrate_SNP.tranches \
	-rscriptFile ${result}/recalibrate_SNP_plots.R

Again, a number of plots are generated by R

Apply SNP recalibration

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T ApplyRecalibration \
	-R ${reference} \
	-input ${result}/raw_variants.vcf \
	-mode SNP \
	--ts_filter_level 99.0 \
	-recalFile ${result}/recalibrate_SNP.recal \
	-tranchesFile ${result}/recalibrate_SNP.tranches \
	-o ${result}/recalibrated_snps_raw_indels.vcf

Detailed plots are provided for each step of the recalibration. We only reproduce here page #1 of 10 similar pannels

Build the Indel recalibration model

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T VariantRecalibrator \
	-R ${reference} \
	-input ${result}/recalibrated_snps_raw_indels.vcf \
	-resource:mills,known=true,training=true,truth=true,prior=12.0 ${mills} \
	-an DP \
	-an FS \
	-an MQRankSum \
	-an ReadPosRankSum \
	-mode INDEL \
	-tranche 100.0 \
	-tranche 99.9 \
	-tranche 99.0 \
	-tranche 90.0 \
	-numBad 1000 \
	--maxGaussians 4 \
	-recalFile ${result}/recalibrate_INDEL.recal \
	-tranchesFile ${result}/recalibrate_INDEL.tranches \
	-rscriptFile ${result}/recalibrate_INDEL_plots.R

Again, a number of plots are generated by R

Apply the desired level of recalibration to the Indels in the call set

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
	-T ApplyRecalibration \
	-R ${reference} \
	-input ${result}/recalibrated_snps_raw_indels.vcf \
	-mode INDEL \
	--ts_filter_level 99.0 \
	-recalFile ${result}/recalibrate_INDEL.recal \
	-tranchesFile ${result}/recalibrate_INDEL.tranches \
	-o ${result}/recalibrated_variants.vcf

Detailed plots are provided for each step of the recalibration. We only reproduce here page #1 of 10 similar panels

The content of the file can be estimated based on the FILTER column

grep -v "^#" recalibrated_variants.vcf | cut -f 7 | sort | uniq -c
#    940 LowQual
#  77239 PASS
#    365 VQSRTrancheINDEL99.00to99.90
#    836 VQSRTrancheINDEL99.90to100.00
#   4291 VQSRTrancheSNP99.00to99.90
#   6667 VQSRTrancheSNP99.90to100.00

Filter high quality data for comparison analysis

The double-recalibrated file contains a FILTER field (column #7) that can be used to extract a high-quality subset

Filter high quality variants using snpEff

# equivalent to filter "( ! FILTER='LowQual' )" 
cat ${result}/recalibrated_variants.vcf | \
	java -Xmx6g -jar /opt/biotools/snpEff/SnpSift.jar \
	filter "( na FILTER ) | (FILTER = 'PASS' | FILTER =~ 'VQSRT')" \
	| bgzip -c > ${result}/hq_recalibrated_variants.vcf.gz && \
	tabix -p vcf ${result}/hq_recalibrated_variants.vcf.gz

Perform QC on the VCF calls

Two tools can be used that are part of VCFTools and HTSLIB

The older command vcf-stats will generate text files that can be consulted for a given metric.

vcf-stats recalibrated_variants.vcf.gz -p recalibrated_variants_stats/
vcf-stats hq_recalibrated_variants.vcf.gz -p hq_recalibrated_variants_stats/

The current version of HTSLib produces nicer output already seen in the BITS training.

htscmd vcfcheck recalibrated_variants.vcf.gz > recalibrated_variants.chk
htscmd vcfcheck hq_recalibrated_variants.vcf.gz > hq_recalibrated_variants.chk
 
plot-vcfcheck recalibrated_variants.chk -p recalibrated_variants_plots/
plot-vcfcheck hq_recalibrated_variants.chk -p hq_recalibrated_variants_plots/

For all GATK calls

For the High Quality subset

Comparing GATK and Samtools calls with gold standard NA15807 variant lists

intersect results with former Training results

casava=public_info/Illumina_BaseSpace/chr21_NA18507_CASAVA-1.8_hg19.vcf.gz
bcftools=${base}/samtools_variants/chr21_mem_var.flt-D1000.vcf.gz
hapmap=public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz
gatk_raw=${result}/recalibrated_variants.vcf.gz
gatk_hq=${result}/hq_recalibrated_variants.vcf.gz
 
vcf-compare ${casava} \
	${bcftools} \
	${gatk_raw} \
	${gatk_hq} | grep "^VN" > ${result}/compare-4_2gatk.cmp

vcf-compare ${casava} \
	${bcftools} \
	${gatk_hq} \
	${hapmap} | grep "^VN" > ${result}/compare-4_2hapmap.cmp

Discussion and Conclusion

The full processing of the BWA mem mapping data for NA18507 and its comparison with the results available from reference sources and from the BITS mapping and calling experiment are very similar. A few % of the calls are private to one caller or to the public data. The comparison with the HapMap data (re validated information) shows a similar sensitivity for both methods.

The time necessary to perform this full workflow is of about 1/2 day while the corresponding part using **samtools** and **bcftools** runs in a few minutes. This time difference will be even more pronounced when dealing with a full genome analysis.

By contrast, GATK being accepted as the cleanest way to call high quality variant might motivate spending this extra time on the final stages of NGS DNA variant analysis

Another possibility, applied by many groups would be to take the intersection of both methods as a highest possible prediction quality in the case of an unknown genome.

In conclusion, we show here that applying GATK to call variants is a tedious but doable process. It requires time and computer power but can be run on a reasonable computer (strong laptop).

What was not explored here is the possibility to split this task in parallel jobs when doing full genome analysis of when analyzing multiple genomes. The GATK toolkit is provided with an additional toolbox called **QUEUE** that was not tried here but is meant to accelerate routine or heavy load analyses.

Download Hands-on data and results

Download hands-on data from the BITS server

Use the right application to open the files present in hands-on related files

References:

[ Main_Page ]

[1] ttp://www.broadinstitute.org/gatk/

[2] ttp://www.broadinstitute.org/gatk/guide/topic?name=tutorials

[1]

[2]