GATK HaplotypeCaller Analysis of BWA (mem) mapped Illumina reads
last edited: October 4th, 2013 - v1.00)
[ Main_Page ]
Practicing GATK with data obtained during the Hands-on introduction to NGS variant analysis
GATK is in constant evolution, please refer to the online documentation for updates and modified BestPractices. The content of this page was edited for version 3.0
Contents
- 1 Aims
- 2 Resources and Reference links
- 3 A GATK Best Practice Workflow
- 3.1 Define file path for the workflows
- 3.2 Add some indexing to our reference files (If not already available)
- 3.3 Prepare the GATK input data and Workflow elements
- 3.3.1 Clean the initial BWA bam data to make it Picard & GATK compliant
- 3.3.2 Correct potential mate pair info and coordinate Sort the reads (requires Picard tools installed and running)
- 3.3.3 Mark duplicate reads (optical duplicates could bias variant detection by adding excessive coverage depth at a variant locus
- 3.3.4 Add read group information required by GATK (adapted from our BWA command)
- 3.3.5 Index the last file for further use
- 3.3.6 Create symbolic links to reference data required for the GATK workflow
- 3.4 Improve mapping by local realignment (indels)
- 3.4.1 Identify regions for indel local realignment of the selected chromosome
- 3.4.2 Perform indel local realignment of the selected chromosome
- 3.4.3 Analyze patterns of covariation in the sequence dataset
- 3.4.4 Do a second pass to analyze covariation remaining after recalibration
- 3.4.5 Generate before/after plots
- 3.4.6 Apply the recalibration to your sequence data
- 3.4.7 Compress your sequence data
- 3.5 Call all variants using the HaplotypeCaller
- 3.5.1 Call variants in your sequence data (diploid genome => HaplotypeCaller)
- 3.5.2 Recalibrate variant scores in two steps: SNP then INDELS
- 3.5.3 Apply SNP recalibration
- 3.5.4 Build the Indel recalibration model
- 3.5.5 Apply the desired level of recalibration to the Indels in the call set
- 3.5.6 Filter high quality data for comparison analysis
- 3.5.7 Perform QC on the VCF calls
- 3.6 Comparing GATK and Samtools calls with gold standard NA15807 variant lists
- 4 Discussion and Conclusion
- 5 Download Hands-on data and results
Aims
The Broad Institute **GATK** suite is today's high end standard for NGS data analysis. Although a lot of documentation is present on the GATK website, it can still be challenging to apply a full analysis workflow to your own data for the first time.
This document details each steps of a prototype GATK analysis using hg19 mapping data. The data used here was generated using Illumina reads of the HapMap individual **NA18507** as described in 'Hands-On BITS session: *Mapping Illumina reads to the human genome and calling variations on chr21*'
Readers are expected to have done the full analysis using hands-on workflows and have access to some of the data and reference files used there. Additional data is required to use GATK tools including the GATK executable (java) as well as Broads **bundle** files matching the reference genome used in the training as detailed inline.
Please NOTE that GATK can be obtained by academic researchers for research purpose by accessing the Broad platform [1]. For profit and industrial users, a special agreement should be made with the Broad Institute prior to getting their code. This training session is meant for non-profit users that are entitled to use the GATK material
Resources and Reference links
!! Resources linked here might be outdated at the time you read this document.
Most of the information used here was copied from pages organized on the GATK online platform, The broad Institute should therefore receive full credit if you use this material elsewhere. In order to cite GATK, please follow instructions on their dedicated page: http://www.broadinstitute.org/gatk/about/citing-gatk
- GATK online help is accessible at [2].
- A PDF version of these pages is available at *http://www.broadinstitute.org//gatk/pdfdocs/*. Look for the latest version at the time your access the page.
- The Java code is available for academic at <http://www.broadinstitute.org/gatk/download>) and is installed by simply storing the uncompressed folder in the computer PATH.
- The Broad CRAN **gsalib** package is additionally required to generate plots in GATK. The package is found at <http://cran.r-project.org/web/packages/gsalib/index.html> and can be installed directly from within **RStudio** with the simple command '*install.packages("gsalib")*'. After installation, this package should be accessible for the GATK commands, if this is not the case, please refer to the GATK help and set your environment to make the **gsalib** code part of the standard path. The documentation can be accessed on the CRAN pages <http://cran.r-project.org/web/packages/gsalib/gsalib.pdf>.
- Finally, Broad **Bundle** files are required for annotating and filtering your data and can be found on the public FTP site. We provide a script to batch download these files in appendix. Additional reference files are also required but were already obtained for BITS NGS training.
- SnpEff used to extract variant subsets but that can do much more. Please refer to the online documentation for the full description. <http://snpeff.sourceforge.net/index.html>
- vcfTools already used previously to compute VENN diagram from the obtained and reference VCF files. <http://vcftools.sourceforge.net>.
A GATK Best Practice Workflow
The current Best practice for GATK is illustrated in the next figure and constitutes the framework for this tutorial.
The presented workflow is directly inspired from the GATK Best Practice slide reproduced above. Each step was extracted from the corresponding GATK online documentation and adapted to fit the location of our files. The right panel (grey), dedicated to downstream analysis is not part of this tutorial which aims at generating high quality data.
Define file path for the workflows
The path is adapted from that of the BITS hands-on NGS training.
project=/Users/bits/NGS_DNASeq-training base=${project}/Final_Results reference=${project}/ref/HiSeq_UCSC_hg19.fa mapping=${base}/hg19_mapping-full # create a folder to receive all results from this GATK workflow result=${base}/gatk mkdir -p ${result}
Add some indexing to our reference files (If not already available)
# create reference sequence dictionary if not present [ -f "${project}/ref/HiSeq_UCSC_hg19.dict" ] || \ java -Xmx6g -jar $PICARD/CreateSequenceDictionary.jar \ R=${reference} \ O=${project}/ref/HiSeq_UCSC_hg19.dict \ GENOME_ASSEMBLY=hg19 # create reference index if not present [ -f "${project}/ref/HiSeq_UCSC_hg19.fa.fai" ] || \ samtools faidx ${project}/ref/HiSeq_UCSC_hg19.fa
Prepare the GATK input data and Workflow elements
Clean the initial BWA bam data to make it Picard & GATK compliant
In this part, we will fix a BAM that is not indexed or not sorted, has not had duplicates marked, or is lacking read group information. These steps can be performed independently of each other but the order proposed here is recommended.
# Use the mapping file obtained with BWA (mem) ori-bam=${base}/hg19_mapping/NA18507_GAIIx_100_chr21_aln-pe.bam
Correct potential mate pair info and coordinate Sort the reads (requires Picard tools installed and running)
java -Xmx6g -jar $PICARD/FixMateInformation.jar \ I=${result}/${ori-bam} \ O=${result}/sorted_reads.bam \ SO=coordinate \ VALIDATION_STRINGENCY=LENIENT
Mark duplicate reads (optical duplicates could bias variant detection by adding excessive coverage depth at a variant locus
java -Xmx6g -jar $PICARD/MarkDuplicates.jar \ I=${result}/sorted_reads.bam \ O=${result}/dedup_reads.bam \ M=${result}/duplicate_metrics.txt
The result is a text file starting by:
12:47, 13 March 2014 (UTC)~~ ## net.sf.picard.metrics.StringHeader # net.sf.picard.sam.MarkDuplicates INPUT=[sorted_reads.bam] OUTPUT=dedup_reads.bam METRICS_FILE=metrics.txt PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false ## net.sf.picard.metrics.StringHeader # Started on: Thu Oct 03 12:21:20 CEST 2013 ## METRICS CLASS net.sf.picard.sam.DuplicationMetrics LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE 37296 7379961 192976 8610 3101 1059 0.001001 13329615355 ## HISTOGRAM java.lang.Double BIN VALUE 1.0 1.000143 2.0 1.999733 3.0 2.99877 4.0 3.997254 5.0 4.995185 6.0 5.992563 ... 12:47, 13 March 2014 (UTC)~~
Add read group information required by GATK (adapted from our BWA command)
# ID:NA18507 # LB=lib-NA18507 # PU=unkn-0.0 # PL:ILLUMINA # SM:GAIIx-chr21-BWA.mem java -Xmx6g -jar $PICARD/AddOrReplaceReadGroups.jar \ INPUT=${result}/dedup_reads.bam \ OUTPUT=${result}/addrg_reads.bam \ RGID="NA18507" \ RGLB="lib-NA18507" \ RGPL="ILLUMINA" \ RGPU="unkn-0.0" \ RGSM="GAIIx-chr21-BWA.mem.gatk"
Index the last file for further use
java -Xmx6g -jar $PICARD/BuildBamIndex.jar \ INPUT=${result}/addrg_reads.bam
Create symbolic links to reference data required for the GATK workflow
A number of GATK bundle files are necessary to run this workflow. The files for build hg19 were obtained from the GATK ftp site and are defined below. The full bundle can be obtained with the script **wget-bundle.sh** attached in appendix-01
# BUNDLE: locate important files for GATK analysis ver=2.5 bundle=$BIODATA/bundle_${ver}/hg19 # INDEL gold standards mills=${bundle}/Mills_and_1000G_gold_standard.indels.hg19.vcf.gz phase1indel=${bundle}/1000G_phase1.indels.hg19.vcf.gz # pick one of the former two as gold_indels set gold_indels=${mills} # SNP gold standards dbsnp=${bundle}/dbsnp_137.hg19.vcf.gz hapmap=${bundle}/hapmap_3.3.hg19.vcf.gz phase1snp=${bundle}/1000G_phase1.snps.high_confidence.hg19.vcf.gz omni=${bundle}/1000G_omni2.5.hg19.vcf.gz
Improve mapping by local realignment (indels)
Identify regions for indel local realignment of the selected chromosome
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \ -T RealignerTargetCreator \ -R ${reference} \ -I ${result}/addrg_reads.bam \ -L chr21 \ -known ${gold_indels} \ -o ${result}/target_intervals.list
The target_intervals.list starts by:
12:47, 13 March 2014 (UTC)~~ chr21:9442356-9442357 chr21:9466976-9466977 chr21:9467571-9467572 chr21:9467834-9467835 chr21:9470889-9470890 chr21:9474246-9474247 chr21:9476537-9476609 ... 12:47, 13 March 2014 (UTC)~~
Perform indel local realignment of the selected chromosome
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \ -T IndelRealigner \ -R ${reference} \ -I ${result}/addrg_reads.bam \ -L chr21 \ -targetIntervals ${result}/target_intervals.list \ -known ${gold_indels} \ -o ${result}/realigned_reads.bam
Analyze patterns of covariation in the sequence dataset
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \ -T BaseRecalibrator \ -R ${reference} \ -I ${result}/realigned_reads.bam \ -L chr21 \ -knownSites ${dbsnp} \ -knownSites ${gold_indels} \ -o ${result}/recal_data.table
Do a second pass to analyze covariation remaining after recalibration
Use the same BAM data as in the first iteration
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \ -T BaseRecalibrator \ -R ${reference} \ -I ${result}/realigned_reads.bam \ -L chr21 \ -knownSites ${dbsnp} \ -knownSites ${gold_indels} \ -BQSR ${result}/recal_data.table \ -o ${result}/post_recal_data.table
Generate before/after plots
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \ -T AnalyzeCovariates \ -R ${reference} \ -L chr21 \ -before ${result}/recal_data.table \ -after ${result}/post_recal_data.table \ -plots ${result}/recalibration_plots.pdf
Apply the recalibration to your sequence data
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \ -T PrintReads \ -R ${reference} \ -L chr21 \ -I ${result}/realigned_reads.bam \ -BQSR ${result}/recal_data.table \ -o ${result}/recal_reads.bam
Compress your sequence data
The BAM file obtained in the last step is about twice as big as the initial BAM. IN this step, the size is brought back to a fraction of this initial size (about half). It should also speed-up analysis by other GATK tools.
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \ -T ReduceReads \ -R ${reference} \ -I ${result}/recal_reads.bam \ -L chr21 \ -o ${result}/reduced_reads.bam
Call all variants using the HaplotypeCaller
Call variants in your sequence data (diploid genome => HaplotypeCaller)
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R ${reference} \ -I ${result}/reduced_reads.bam \ -L chr21 \ --genotyping_mode DISCOVERY \ -stand_emit_conf 10 \ -stand_call_conf 30 \ -o ${result}/raw_variants.vcf
Recalibrate variant scores in two steps: SNP then INDELS
During this step, additional annotations are added to the data that allow later filtering and selection of subsets. Some of the available annotations are listed here.
- Build the SNP recalibration model
- also add annotations from sources out of:
- BaseQualityRankSumTest (BaseQRankSum)
- DepthOfCoverage (DP)
- FisherStrand (FS)
- HaplotypeScore (HaplotypeScore)
- MappingQualityRankSumTest (MQRankSum)
- MappingQualityZero (MQ0)
- QualByDepth (QD)
- ReadPositionRankSumTest (ReadPosRankSum)
- RMSMappingQuality (MQ)
- SnpEff: Add genomic annotations using the third-party tool SnpEff with VariantAnnotator
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \ -T VariantRecalibrator \ -R ${reference} \ -input ${result}/raw_variants.vcf \ -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${hapmap} \ -resource:omni,known=false,training=true,truth=false,prior=12.0 ${omni} \ -resource:1000G,known=false,training=true,truth=false,prior=10.0 ${phase1snp} \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${dbsnp} \ -an DP \ -an QD \ -an FS \ -an MQRankSum \ -an ReadPosRankSum \ -mode SNP \ -tranche 100.0 \ -tranche 99.9 \ -tranche 99.0 \ -tranche 90.0 \ -numBad 1000 \ -recalFile ${result}/recalibrate_SNP.recal \ -tranchesFile ${result}/recalibrate_SNP.tranches \ -rscriptFile ${result}/recalibrate_SNP_plots.R
Again, a number of plots are generated by R
Apply SNP recalibration
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \ -T ApplyRecalibration \ -R ${reference} \ -input ${result}/raw_variants.vcf \ -mode SNP \ --ts_filter_level 99.0 \ -recalFile ${result}/recalibrate_SNP.recal \ -tranchesFile ${result}/recalibrate_SNP.tranches \ -o ${result}/recalibrated_snps_raw_indels.vcf
Detailed plots are provided for each step of the recalibration. We only reproduce here page #1 of 10 similar pannels
Build the Indel recalibration model
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \ -T VariantRecalibrator \ -R ${reference} \ -input ${result}/recalibrated_snps_raw_indels.vcf \ -resource:mills,known=true,training=true,truth=true,prior=12.0 ${mills} \ -an DP \ -an FS \ -an MQRankSum \ -an ReadPosRankSum \ -mode INDEL \ -tranche 100.0 \ -tranche 99.9 \ -tranche 99.0 \ -tranche 90.0 \ -numBad 1000 \ --maxGaussians 4 \ -recalFile ${result}/recalibrate_INDEL.recal \ -tranchesFile ${result}/recalibrate_INDEL.tranches \ -rscriptFile ${result}/recalibrate_INDEL_plots.R
Again, a number of plots are generated by R
Apply the desired level of recalibration to the Indels in the call set
java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \ -T ApplyRecalibration \ -R ${reference} \ -input ${result}/recalibrated_snps_raw_indels.vcf \ -mode INDEL \ --ts_filter_level 99.0 \ -recalFile ${result}/recalibrate_INDEL.recal \ -tranchesFile ${result}/recalibrate_INDEL.tranches \ -o ${result}/recalibrated_variants.vcf
Detailed plots are provided for each step of the recalibration. We only reproduce here page #1 of 10 similar panels
The content of the file can be estimated based on the FILTER column
grep -v "^#" recalibrated_variants.vcf | cut -f 7 | sort | uniq -c # 940 LowQual # 77239 PASS # 365 VQSRTrancheINDEL99.00to99.90 # 836 VQSRTrancheINDEL99.90to100.00 # 4291 VQSRTrancheSNP99.00to99.90 # 6667 VQSRTrancheSNP99.90to100.00
Filter high quality data for comparison analysis
The double-recalibrated file contains a FILTER field (column #7) that can be used to extract a high-quality subset
Filter high quality variants using snpEff
# equivalent to filter "( ! FILTER='LowQual' )" cat ${result}/recalibrated_variants.vcf | \ java -Xmx6g -jar /opt/biotools/snpEff/SnpSift.jar \ filter "( na FILTER ) | (FILTER = 'PASS' | FILTER =~ 'VQSRT')" \ | bgzip -c > ${result}/hq_recalibrated_variants.vcf.gz && \ tabix -p vcf ${result}/hq_recalibrated_variants.vcf.gz
Perform QC on the VCF calls
Two tools can be used that are part of VCFTools and HTSLIB
The older command vcf-stats will generate text files that can be consulted for a given metric.
vcf-stats recalibrated_variants.vcf.gz -p recalibrated_variants_stats/ vcf-stats hq_recalibrated_variants.vcf.gz -p hq_recalibrated_variants_stats/
The current version of HTSLib produces nicer output already seen in the BITS training.
htscmd vcfcheck recalibrated_variants.vcf.gz > recalibrated_variants.chk htscmd vcfcheck hq_recalibrated_variants.vcf.gz > hq_recalibrated_variants.chk plot-vcfcheck recalibrated_variants.chk -p recalibrated_variants_plots/ plot-vcfcheck hq_recalibrated_variants.chk -p hq_recalibrated_variants_plots/
For all GATK calls
For the High Quality subset
Comparing GATK and Samtools calls with gold standard NA15807 variant lists
intersect results with former Training results
casava=public_info/Illumina_BaseSpace/chr21_NA18507_CASAVA-1.8_hg19.vcf.gz bcftools=${base}/samtools_variants/chr21_mem_var.flt-D1000.vcf.gz hapmap=public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz gatk_raw=${result}/recalibrated_variants.vcf.gz gatk_hq=${result}/hq_recalibrated_variants.vcf.gz vcf-compare ${casava} \ ${bcftools} \ ${gatk_raw} \ ${gatk_hq} | grep "^VN" > ${result}/compare-4_2gatk.cmp
vcf-compare ${casava} \ ${bcftools} \ ${gatk_hq} \ ${hapmap} | grep "^VN" > ${result}/compare-4_2hapmap.cmp
Discussion and Conclusion
The full processing of the BWA mem mapping data for NA18507 and its comparison with the results available from reference sources and from the BITS mapping and calling experiment are very similar. A few % of the calls are private to one caller or to the public data. The comparison with the HapMap data (re validated information) shows a similar sensitivity for both methods.
The time necessary to perform this full workflow is of about 1/2 day while the corresponding part using **samtools** and **bcftools** runs in a few minutes. This time difference will be even more pronounced when dealing with a full genome analysis.
By contrast, GATK being accepted as the cleanest way to call high quality variant might motivate spending this extra time on the final stages of NGS DNA variant analysis
Another possibility, applied by many groups would be to take the intersection of both methods as a highest possible prediction quality in the case of an unknown genome.
In conclusion, we show here that applying GATK to call variants is a tedious but doable process. It requires time and computer power but can be run on a reasonable computer (strong laptop).
What was not explored here is the possibility to split this task in parallel jobs when doing full genome analysis of when analyzing multiple genomes. The GATK toolkit is provided with an additional toolbox called **QUEUE** that was not tried here but is meant to accelerate routine or heavy load analyses.
Download Hands-on data and results
Download hands-on data from the BITS server
References:
- ↑ http://www.broadinstitute.org/gatk/
- ↑ http://www.broadinstitute.org/gatk/guide/topic?name=tutorials
[ Main_Page ]