NGS-Var Exercise.5

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.4 | NGS-Var Exercise.6 ]


Compare and intersect VCF files using VcfTools and bedtools


ex05_wf.png

compare variants called by bcftools and varscan with vcftools

The BWA 'mem' algorithms was used to align the NA18507 reads and generated BAM mapping fed to samtools+bcftools and to samtools+varscan to call variants. The resulting variants VCF-files were sorted, compressed, indexed are finally compared here using vcftools.

#! /usr/bin/env bash
 
# vcftools_compare-vcf.sh
infile1=bcftools_htslib_variants/chr21_NA18507_var_bcftools.flt-D1000.vcf.gz
infile2=varscan2_variants/chr21_NA18507_varscan.vcf.gz
 
outfolder=VCF_comparison
mkdir -p ${outfolder}
 
# run vcftools diff to identify differences
vcftools --gzvcf ${infile1} \
	--gzdiff ${infile2} \
	--diff-site \
	--out ${outfolder}/chr21_NA18507_bcftools-vs-varscan

command results

> cat  ${outfolder}/chr21_NA18507_bcftools-vs-varscan.log
 
VCFtools - 0.1.14
(C) Adam Auton and Anthony Marcketta 2009
 
Parameters as interpreted:
        --gzvcf bcftools_htslib_variants/chr21_NA18507_var_bcftools.flt-D1000.vcf.gz
        --out VCF_comparison/chr21_NA18507_bcftools-vs-varscan
        --gzdiff varscan2_variants/chr21_NA18507_varscan.vcf.gz
        --diff-site
 
Using zlib version: 1.2.7
After filtering, kept 1 out of 1 Individuals
Using zlib version: 1.2.7
Comparing sites in VCF files...
Found 70285 sites common to both files.
Found 7218 sites only in main file.
Found 3814 sites only in second file.
Found 7361 non-matching overlapping sites.
After filtering, kept 84864 out of a possible 84864 Sites
Run Time = 1.00 seconds
Error creating thumbnail: Unable to save thumbnail to destination
  • The concordance between both calling methods for chr21 calls is 70285/77503 (90.7% of bcftools data) and 70285/74099 (94.9% of varscan2 data) respectively.
  • The bcftools method called slightly more variants than the varscan2 method (may differ for other filtering parameters)
# a detailed report is provided for differences in tabular format
infolder=VCF_comparison
 
head ${infolder}/chr21_NA18507_bcftools-vs-varscan.diff.sites_in_files | \
	column -t

command results

# the 'IN_FILE' column tells in which file(s) a call is found (1, 2, Both)
# a simple grep command is easy to isolate either of the subsets
CHROM  POS1     POS2     IN_FILE  REF1  REF2  ALT1  ALT2
chr21  9467416  9467416  B        C     C     T     T
chr21  9467417  9467417  B        A     A     C     C
chr21  9471670  .        1        A     .     G     .
chr21  9472902  .        1        T     .     C     .
chr21  9472931  .        1        T     .     G     .
chr21  9473159  .        1        A     .     G     .
chr21  9473186  .        1        T     .     C     .
chr21  9473193  .        1        T     .     G     .
chr21  9474141  .        1        T     .     G     .

Another example extracting variants specific to file 2 (varscan2 calls)

gawk 'BEGIN{FS="\t"; OFS="\t"}{if (($3 ~/IN_FILE/) || ($3~/2/)) print $0}' \
   ${infolder}/chr21_NA18507_bcftools-vs-varscan.diff.sites_in_files | head -5

command results

CHROM   POS1    POS2    IN_FILE REF1    REF2    ALT1    ALT2
chr21   9476620 9476620 B       G       G       A       A
chr21   9476621 9476621 B       C       C       T       T
chr21   9476732 9476732 B       T       T       C       C
chr21   9477002 9477002 B       T       T       C       C

It is clear that the choice of mapping algorithms as well as filtering for calls affects the results. Inspection will probably often reveal lower confidence for the divergent calls than for identical calls.

compare bcftools & varscan2 results with vcf-compare

Another tool to compare VCF files is vcf-compare that produces values directly usable to draw VENN diagrams. The command can look at positions only (less stringent) or at position + genotype. A useful R package to plot such Venn diagrams is presented in appendix but not used in this document.

The simple comparison without genotype is applied below.


vcf-compare, compare VCF files

About: Compare bgzipped and tabix indexed VCF files. (E.g. bgzip file.vcf; tabix -p vcf file.vcf.gz)
Usage: vcf-compare [OPTIONS] file1.vcf file2.vcf ...
       vcf-compare -p plots chr1.cmp chr2.cmp ...
Options:
   -a, --apply-filters                 Ignore lines where FILTER column is anything else than PASS or '.'
   -c, --chromosomes <list|file>       Same as -r, left for backward compatibility. Please do not use as it will be dropped in the future.
   -d, --debug                         Debugging information. Giving the option multiple times increases verbosity
   -g, --cmp-genotypes                 Compare genotypes, not only positions
       --ignore-indels                 Exclude sites containing indels from genotype comparison
   -m, --name-mapping <list|file>      Use with -g when comparing files with differing column names. The argument to this options is a
                                           comma-separated list or one mapping per line in a file. The names are colon separated and must
                                           appear in the same order as the files on the command line.
       --INFO <string> [<int>]         Calculate genotype errors by INFO. Use zero based indexes if field has more than one value. Can be
                                           given multiple times.
   -p, --plot <prefix>                 Create plots. Multiple files (e.g. per-chromosome outputs from vcf-compare) can be given.
   -R, --refseq <file>                 Compare the actual sequence, not just positions. Use with -w to compare indels.
   -r, --regions <list|file>           Process the given regions (comma-separated list or one region per line in a file).
   -s, --samples <list|file>           Process only the listed samples. Excluding unwanted samples may increase performance considerably.
   -t, --title <string>                Title for graphs (see also -p)
   -w, --win <int>                     In repetitive sequences, the same indel can be called at different positions. Consider
                                           records this far apart as matching (be it a SNP or an indel).
   -h, -?, --help                      This help message.
#! /usr/bin/env bash
# vcf-compare_compare-vcf.sh
 
infile1=bcftools_htslib_variants/chr21_NA18507_var_bcftools.flt-D1000.vcf.gz
infile2=varscan2_variants/chr21_NA18507_varscan.vcf.gz
 
outfolder=VCF_comparison
mkdir -p ${outfolder}
 
vcf-compare ${infile1} \
	${infile2} | \
	grep ^VN | cut -f 2- | column -t \
	> ${outfolder}/vcf-compare_bcftools-vs-varscan2.txt

command results

4080   varscan2_variants/chr21_NA18507_varscan.vcf.gz                        (5.0%)
7484   bcftools_htslib_variants/chr21_NA18507_var_bcftools.flt-D1000.vcf.gz  (8.8%)
77380  bcftools_htslib_variants/chr21_NA18507_var_bcftools.flt-D1000.vcf.gz  (91.2%)  
	varscan2_variants/chr21_NA18507_varscan.vcf.gz  (95.0%)

The obtained counts were used in 2DVenn.R to plot the next figure. This Venn diagram is only slightly different from the previous one, possibly due to the reported error: 'Non-matching REF. Skipping all such sites.'.

vcftools venn diagram

2DVenn.R -a 7484 -b 4080 -i 77380 -t "VcfTools comparison" \
	-x 1 -u 2 -A bcftools -B varscan2 -o ${outfolder}/vcftools_venn


vcftools_venn.png

compare bcftools & varscan2 results with bedtools

Using BedTools is also possible and exclusively based on position information. To reproduce the former vcftools query, we need three runs; one for each side-specific region and one for the intersect.

Error creating thumbnail: Unable to save thumbnail to destination
We must add back a valid header to the result which otherwise comes as data-only and triggers errors in later commands, this is done by adding the '-header' argument to the bedtools commands
#! /usr/bin/env bash
# bedtools_compare-vcf.sh
 
afile=bcftools_htslib_variants/chr21_NA18507_var_bcftools.flt-D1000.vcf.gz
bfile=varscan2_variants/chr21_NA18507_varscan.vcf.gz
 
outfolder=VCF_comparison
mkdir -p ${outfolder}
 
# finding intersect calls
bedtools intersect -header -a ${afile} -b ${bfile} \
	> ${outfolder}/common_ab.vcf 
 
# finding calls that are unique for each mapper (based on coordinates).
bedtools subtract -header -a ${afile} -b ${outfolder}/common_ab.vcf \
	> ${outfolder}/unique_a.vcf
 
bedtools subtract -header -a ${bfile} -b ${outfolder}/common_ab.vcf \
	> ${outfolder}/unique_b.vcf
 
# report each counts from chr21 calls (excluding comment lines)
# note the reformatting trick using the 'column' command
## 'column -t' gets with '<' the output of the whole pipeline between '()' as input
column -t <( for f in ${afile} \
	${bfile} \
	${outfolder}/common_ab.vcf \
	${outfolder}/unique_a.vcf \
	${outfolder}/unique_b.vcf; do
	echo "> "$(basename ${f})" => "$(zgrep -c '^chr21' ${f})
	done )

This code Returns data used in 2DVenn.R to plot the next figure; Note that bedtools works with coordinate overlap and that totals is not always the sum of the variant counts (eg a SNP and an overlapping indel do return '1' as overlap width but the indel can be several bps wide)

command results

>  chr21_NA18507_var_bcftools.flt-D1000.vcf.gz  =>  84864
>  chr21_NA18507_varscan.vcf.gz                 =>  81460
>  common_ab.vcf                                =>  77903
>  unique_a.vcf                                 =>  7217
>  unique_b.vcf                                 =>  3681

bedtools venn diagram

2DVenn.R -a 7217 -b 3681 -i 77903 -A bcftools -B varscan2 -t "Bedtools comparison" -o ${outfolder}/bedtools_venn -u 2 -x 1


bedtools_venn.png

download exercise files

Download exercise files here

Use the right application to open the files present in ex5-files

References:

[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.4 | NGS-Var Exercise.6 ]