NGS-Var Exercise.5
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.4 | NGS-Var Exercise.6 ]
Compare and intersect VCF files using VcfTools and bedtools
Contents
compare variants called by bcftools and varscan with vcftools
The BWA 'mem' algorithms was used to align the NA18507 reads and generated BAM mapping fed to samtools+bcftools and to samtools+varscan to call variants. The resulting variants VCF-files were sorted, compressed, indexed are finally compared here using vcftools.
#! /usr/bin/env bash # vcftools_compare-vcf.sh infile1=bcftools_htslib_variants/chr21_NA18507_var_bcftools.flt-D1000.vcf.gz infile2=varscan2_variants/chr21_NA18507_varscan.vcf.gz outfolder=VCF_comparison mkdir -p ${outfolder} # run vcftools diff to identify differences vcftools --gzvcf ${infile1} \ --gzdiff ${infile2} \ --diff-site \ --out ${outfolder}/chr21_NA18507_bcftools-vs-varscan
command results
> cat ${outfolder}/chr21_NA18507_bcftools-vs-varscan.log VCFtools - 0.1.14 (C) Adam Auton and Anthony Marcketta 2009 Parameters as interpreted: --gzvcf bcftools_htslib_variants/chr21_NA18507_var_bcftools.flt-D1000.vcf.gz --out VCF_comparison/chr21_NA18507_bcftools-vs-varscan --gzdiff varscan2_variants/chr21_NA18507_varscan.vcf.gz --diff-site Using zlib version: 1.2.7 After filtering, kept 1 out of 1 Individuals Using zlib version: 1.2.7 Comparing sites in VCF files... Found 70285 sites common to both files. Found 7218 sites only in main file. Found 3814 sites only in second file. Found 7361 non-matching overlapping sites. After filtering, kept 84864 out of a possible 84864 Sites Run Time = 1.00 seconds
- The concordance between both calling methods for chr21 calls is 70285/77503 (90.7% of bcftools data) and 70285/74099 (94.9% of varscan2 data) respectively.
- The bcftools method called slightly more variants than the varscan2 method (may differ for other filtering parameters)
# a detailed report is provided for differences in tabular format infolder=VCF_comparison head ${infolder}/chr21_NA18507_bcftools-vs-varscan.diff.sites_in_files | \ column -t
command results
# the 'IN_FILE' column tells in which file(s) a call is found (1, 2, Both) # a simple grep command is easy to isolate either of the subsets CHROM POS1 POS2 IN_FILE REF1 REF2 ALT1 ALT2 chr21 9467416 9467416 B C C T T chr21 9467417 9467417 B A A C C chr21 9471670 . 1 A . G . chr21 9472902 . 1 T . C . chr21 9472931 . 1 T . G . chr21 9473159 . 1 A . G . chr21 9473186 . 1 T . C . chr21 9473193 . 1 T . G . chr21 9474141 . 1 T . G .
Another example extracting variants specific to file 2 (varscan2 calls)
gawk 'BEGIN{FS="\t"; OFS="\t"}{if (($3 ~/IN_FILE/) || ($3~/2/)) print $0}' \ ${infolder}/chr21_NA18507_bcftools-vs-varscan.diff.sites_in_files | head -5
command results
CHROM POS1 POS2 IN_FILE REF1 REF2 ALT1 ALT2 chr21 9476620 9476620 B G G A A chr21 9476621 9476621 B C C T T chr21 9476732 9476732 B T T C C chr21 9477002 9477002 B T T C C
It is clear that the choice of mapping algorithms as well as filtering for calls affects the results. Inspection will probably often reveal lower confidence for the divergent calls than for identical calls.
compare bcftools & varscan2 results with vcf-compare
Another tool to compare VCF files is vcf-compare that produces values directly usable to draw VENN diagrams. The command can look at positions only (less stringent) or at position + genotype. A useful R package to plot such Venn diagrams is presented in appendix but not used in this document.
The simple comparison without genotype is applied below.
vcf-compare, compare VCF files
Usage: vcf-compare [OPTIONS] file1.vcf file2.vcf ...
vcf-compare -p plots chr1.cmp chr2.cmp ...
Options:
-a, --apply-filters Ignore lines where FILTER column is anything else than PASS or '.'
-c, --chromosomes <list|file> Same as -r, left for backward compatibility. Please do not use as it will be dropped in the future.
-d, --debug Debugging information. Giving the option multiple times increases verbosity
-g, --cmp-genotypes Compare genotypes, not only positions
--ignore-indels Exclude sites containing indels from genotype comparison
-m, --name-mapping <list|file> Use with -g when comparing files with differing column names. The argument to this options is a
comma-separated list or one mapping per line in a file. The names are colon separated and must
appear in the same order as the files on the command line.
--INFO <string> [<int>] Calculate genotype errors by INFO. Use zero based indexes if field has more than one value. Can be
given multiple times.
-p, --plot <prefix> Create plots. Multiple files (e.g. per-chromosome outputs from vcf-compare) can be given.
-R, --refseq <file> Compare the actual sequence, not just positions. Use with -w to compare indels.
-r, --regions <list|file> Process the given regions (comma-separated list or one region per line in a file).
-s, --samples <list|file> Process only the listed samples. Excluding unwanted samples may increase performance considerably.
-t, --title <string> Title for graphs (see also -p)
-w, --win <int> In repetitive sequences, the same indel can be called at different positions. Consider
records this far apart as matching (be it a SNP or an indel).
-h, -?, --help This help message.
#! /usr/bin/env bash # vcf-compare_compare-vcf.sh infile1=bcftools_htslib_variants/chr21_NA18507_var_bcftools.flt-D1000.vcf.gz infile2=varscan2_variants/chr21_NA18507_varscan.vcf.gz outfolder=VCF_comparison mkdir -p ${outfolder} vcf-compare ${infile1} \ ${infile2} | \ grep ^VN | cut -f 2- | column -t \ > ${outfolder}/vcf-compare_bcftools-vs-varscan2.txt
command results
4080 varscan2_variants/chr21_NA18507_varscan.vcf.gz (5.0%) 7484 bcftools_htslib_variants/chr21_NA18507_var_bcftools.flt-D1000.vcf.gz (8.8%) 77380 bcftools_htslib_variants/chr21_NA18507_var_bcftools.flt-D1000.vcf.gz (91.2%) varscan2_variants/chr21_NA18507_varscan.vcf.gz (95.0%)
The obtained counts were used in 2DVenn.R to plot the next figure. This Venn diagram is only slightly different from the previous one, possibly due to the reported error: 'Non-matching REF. Skipping all such sites.'.
vcftools venn diagram
2DVenn.R -a 7484 -b 4080 -i 77380 -t "VcfTools comparison" \ -x 1 -u 2 -A bcftools -B varscan2 -o ${outfolder}/vcftools_venn
compare bcftools & varscan2 results with bedtools
Using BedTools is also possible and exclusively based on position information. To reproduce the former vcftools query, we need three runs; one for each side-specific region and one for the intersect.
#! /usr/bin/env bash # bedtools_compare-vcf.sh afile=bcftools_htslib_variants/chr21_NA18507_var_bcftools.flt-D1000.vcf.gz bfile=varscan2_variants/chr21_NA18507_varscan.vcf.gz outfolder=VCF_comparison mkdir -p ${outfolder} # finding intersect calls bedtools intersect -header -a ${afile} -b ${bfile} \ > ${outfolder}/common_ab.vcf # finding calls that are unique for each mapper (based on coordinates). bedtools subtract -header -a ${afile} -b ${outfolder}/common_ab.vcf \ > ${outfolder}/unique_a.vcf bedtools subtract -header -a ${bfile} -b ${outfolder}/common_ab.vcf \ > ${outfolder}/unique_b.vcf # report each counts from chr21 calls (excluding comment lines) # note the reformatting trick using the 'column' command ## 'column -t' gets with '<' the output of the whole pipeline between '()' as input column -t <( for f in ${afile} \ ${bfile} \ ${outfolder}/common_ab.vcf \ ${outfolder}/unique_a.vcf \ ${outfolder}/unique_b.vcf; do echo "> "$(basename ${f})" => "$(zgrep -c '^chr21' ${f}) done )
This code Returns data used in 2DVenn.R to plot the next figure; Note that bedtools works with coordinate overlap and that totals is not always the sum of the variant counts (eg a SNP and an overlapping indel do return '1' as overlap width but the indel can be several bps wide)
command results
> chr21_NA18507_var_bcftools.flt-D1000.vcf.gz => 84864 > chr21_NA18507_varscan.vcf.gz => 81460 > common_ab.vcf => 77903 > unique_a.vcf => 7217 > unique_b.vcf => 3681
bedtools venn diagram
download exercise files
Download exercise files here
References:
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.4 | NGS-Var Exercise.6 ]