NGS Exercise.6

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.5 | NGS Exercise.5_GATK | NGS Exercise.7 ]
# updated 2014 version


Compare and intersect VCF files


ex06_wf.png

Evaluate platform overlap for all available variant calls

compare two BWA mapping methods with the online hg18-mapped data

We first operate a rapid inspection of the different BAM files using samtools flagstat. Illumina provided chr21 read mapping obtained with their GA IIx deep sequencing platform <ftp://webdata:webdata@ussd-ftp.illumina.com/Data/SequencingRuns/NA18507_GAIIx_100_chr21.bam>, aligned to the b36/hg18 reference genome)

samtools flagstat for the hg18 data

infolder=ori-Illumina_hg18_mapping
outfolder=hg19_bwa-mapping
mkdir -p ${outfolder}
 
samtools flagstat ${infolder}/NA18507_GAIIx_100_chr21.bam > \
	${outfolder}/NA18507_GAIIx_100_chr21_ex1_flagstats.txt

original hg18 results

14990194 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
14085578 + 0 mapped (93.97%:-nan%)
14990194 + 0 paired in sequencing
7495097 + 0 read1
7495097 + 0 read2
12911388 + 0 properly paired (86.13%:-nan%)
13180962 + 0 with itself and mate mapped
904616 + 0 singletons (6.03%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Technical.png note the '904616' singletons resulting from the omission of a read from many pairs in the hg18 data

The same command applied to our hg19-remapped data (chr21 mappings only) gives similar results.

samtools flagstat for the BWA mappings using the aln method

infolder=hg19_bwa-mapping/full_chromosome
outfolder=hg19_bwa-mapping/full_chromosome
 
samtools flagstat ${infolder}/shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_chr21.bam > \
	${outfolder}/NA18507_GAIIx_100_chr21_aln-pe_mdup_flagstats.txt

BWA aln results

14770127 + 0 in total (QC-passed reads + QC-failed reads)
14124 + 0 duplicates
14740138 + 0 mapped (99.80%:nan%)
14770127 + 0 paired in sequencing
7385096 + 0 read1
7385031 + 0 read2
14690200 + 0 properly paired (99.46%:nan%)
14710149 + 0 with itself and mate mapped
29989 + 0 singletons (0.20%:nan%)
11843 + 0 with mate mapped to a different chr
11667 + 0 with mate mapped to a different chr (mapQ>=5)

Handicon.png Note that duplicates have been identified that were not marked in the starting material

samtools flagstat for the BWA mappings using the mem method

infolder=hg19_bwa-mapping/full_chromosome
outfolder=hg19_bwa-mapping/full_chromosome
 
samtools flagstat ${infolder}/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam > \
	${outfolder}/NA18507_GAIIx_100_chr21_mem-pe_mdup_flagstats.txt

BWA mem results

14861665 + 0 in total (QC-passed reads + QC-failed reads)
13154 + 0 duplicates
14845463 + 0 mapped (99.89%:nan%)
14861665 + 0 paired in sequencing
7431627 + 0 read1
7430038 + 0 read2
14700219 + 0 properly paired (98.91%:nan%)
14829215 + 0 with itself and mate mapped
16248 + 0 singletons (0.11%:nan%)
46189 + 0 with mate mapped to a different chr
42441 + 0 with mate mapped to a different chr (mapQ>=5)

We count 98% paired mapped reads with hg19 reference against 86% for the hg18 original data which is not really a surprise because of the improved nature of this more recent reference genome (additional loci and alternate gene alleles have been edited). The results for the mem algorithm are slightly higher (98.45% vs 98.25%) as a matter of mapping yield (whatever this means!)

Handicon.png Take home message: the choice of the reference genome is key to your final result. A good practice is to use a reasonably recent reference genome for which detailed annotation data is available

compare variants called from BWA 'aln' and BWA 'mem' mappings with vcftools

The BWA 'aln' and 'mem' algorithms were both aligned to the NA18507 reads and generated two mapping files fed to samtools to call variants. The resulting variants VCF-files were sorted, compressed, indexed and finally compared using vcftools.

#! /usr/bin/env bash
infolder=bcftools_samtools_variants
 
# run vcftools diff to identify differences
vcftools --gzvcf ${infolder}/chr21_NA18507_aln_var.flt-D1000.vcf.gz \
--gzdiff ${infolder}/chr21_NA18507_mem_var.flt-D1000.vcf.gz \
--diff-site \
--out compare_chr21_NA18507_aln-mem
mv compare* bcftools_samtools_variants/

command results

VCFtools - v0.1.13
(C) Adam Auton and Anthony Marcketta 2009
 
Parameters as interpreted:
	--gzvcf bcftools_samtools_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz
	--out bcftools_samtools_variants/compare_chr21_aln-mem/compare_chr21_NA18507_aln-mem
	--gzdiff bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz
	--diff-site
 
Using zlib version: 1.2.8
After filtering, kept 1 out of 1 Individuals
Using zlib version: 1.2.8
Comparing sites in VCF files...
Found 81235 sites common to both files.
Found 1748 sites only in main file.
Found 3114 sites only in second file.
Found 436 non-matching overlapping sites.
After filtering, kept 83419 out of a possible 83419 Sites
Run Time = 1.00 seconds
Handicon.png
  • The concordance between both mapping methods for chr21 calls is 81240/83416 (97.4% of aln data) and 81240/84799 (95.8% of mem data) respectively.
  • The mem method called slightly more variants than the aln method (known fact)
# a detailed report is provided for differences in tabular format
infolder=bcftools_samtools_variants
 
head ${infolder}/compare_chr21_NA18507_aln-mem.diff.sites_in_files | \
	column -t

command results

# the 'IN_FILE' column tells in which file(s) a call is found (1, 2, Both)
# a simple grep command is easy to isolate either of the subsets
CHROM  POS1     POS2     IN_FILE  REF1  REF2  ALT1  ALT2
chr21  9467416  9467416  B        C     C     T     T
chr21  9467417  9467417  B        A     A     C     C
chr21  9471670  9471670  B        A     A     G     G
chr21  9472902  9472902  B        T     T     C     C
chr21  9472931  9472931  B        T     T     G     G
chr21  9473159  9473159  B        A     A     G     G
chr21  9473186  9473186  B        T     T     C     C
chr21  9473193  9473193  B        T     T     G     G
chr21  9473939  .        1        T     .     C     .

Another example extracting variants specific to file 2 (mem method)

gawk 'BEGIN{FS="\t"; OFS="\t"}{if (($3 ~/IN_FILE/) || ($3~/2/)) print $0}' \
   ${infolder}/compare_chr21_NA18507_aln-mem.diff.sites_in_files | head -5

command results

CHROM	POS1	POS2	IN_FILE	REF1	REF2	ALT1	ALT2
chr21	9472902	9472902	B	T	T	C	C
chr21	9472931	9472931	B	T	T	G	G
chr21	9476521	9476521	B	G	G	T	T
chr21	9476620	9476620	B	G	G	A	A

It is clear that the choice of mapping algorithms as well as filtering for calls affects the results. Inspection will probably often reveal lower confidence for the divergent calls than for identical calls.

compare aln & mem results with vcf-compare

Another tool to compare VCF files is vcf-compare that produces values directly usable to draw VENN diagrams. The command can look at positions only (less stringent) or at position + genotype. A useful R package to plot such Venn diagrams is presented in appendix but not used in this document. The first method is shown below.

vcf-compare, compare VCF files

About: Compare bgzipped and tabix indexed VCF files. (E.g. bgzip file.vcf; tabix -p vcf file.vcf.gz)
Usage: vcf-compare [OPTIONS] file1.vcf file2.vcf ...
       vcf-compare -p plots chr1.cmp chr2.cmp ...
Options:
   -a, --apply-filters                 Ignore lines where FILTER column is anything else than PASS or '.'
   -c, --chromosomes <list|file>       Same as -r, left for backward compatibility. Please do not use as it will be dropped in the future.
   -d, --debug                         Debugging information. Giving the option multiple times increases verbosity
   -g, --cmp-genotypes                 Compare genotypes, not only positions
       --ignore-indels                 Exclude sites containing indels from genotype comparison
   -m, --name-mapping <list|file>      Use with -g when comparing files with differing column names. The argument to this options is a
                                           comma-separated list or one mapping per line in a file. The names are colon separated and must
                                           appear in the same order as the files on the command line.
       --INFO <string> [<int>]         Calculate genotype errors by INFO. Use zero based indecies if field has more than one value. Can be
                                           given multiple times.
   -p, --plot <prefix>                 Create plots. Multiple files (e.g. per-chromosome outputs from vcf-compare) can be given.
   -R, --refseq <file>                 Compare the actual sequence, not just positions. Use with -w to compare indels.
   -r, --regions <list|file>           Process the given regions (comma-separated list or one region per line in a file).
   -s, --samples <list|file>           Process only the listed samples. Excluding unwanted samples may increase performance considerably.
   -t, --title <string>                Title for graphs (see also -p)
   -w, --win <int>                     In repetitive sequences, the same indel can be called at different positions. Consider
                                           records this far apart as matching (be it a SNP or an indel).
   -h, -?, --help                      This help message.
infolder=bcftools_samtools_variants
 
aln=${infolder}/chr21_NA18507_aln_var.flt-D1000.vcf.gz
mem=${infolder}/chr21_NA18507_mem_var.flt-D1000.vcf.gz
 
vcf-compare ${aln} \
	${mem} | \
	grep ^VN | cut -f 2- | column -t \
	> ${infolder}/vcf-compare_aln-mem.txt

command results

2110   chr21_NA18507_aln_var.flt-D1000.vcf.gz  (2.5%)
3476   chr21_NA18507_mem_var.flt-D1000.vcf.gz  (4.1%)
81309  chr21_NA18507_aln_var.flt-D1000.vcf.gz  (97.5%)  \
       chr21_NA18507_mem_var.flt-D1000.vcf.gz  (95.9%)

The obtained counts were used in 2DVenn.R to plot the next figure. This Venn diagram is only slightly different from the previous one, possibly due to the reported error: 'Non-matching REF. Skipping all such sites.'.

vcftools venn diagram

2DVenn.R -a 2110 -b 3476 -i 81309 -A "BWA aln" -B "BWA mem" \
   -o ${infolder}/bcftools_0.1.19_calls -t "bcftools_0.1.19 calls" -x 1 -u 2


bcftools_0.1.19_calls.png

compare aln & mem results with bedtools

Using BedTools is also possible and exclusively based on position information. To reproduce the former vcftools query, we need three runs; one for each side-specific region and one for the intersect.

Handicon.png We must add back a valid header to the result which otherwise comes as data-only and triggers errors in later commands, this is done by adding the '-header' argument to the bedtools commands

# uses bedtools intersect and bedtools subtract
infolder=bcftools_samtools_variants
outfolder=vcf-venn/bedtools_cmp
mkdir -p ${outfolder}
 
afile=${infolder}/chr21_NA18507_aln_var.flt-D1000.vcf.gz
bfile=${infolder}/chr21_NA18507_mem_var.flt-D1000.vcf.gz
 
# finding intersect calls
bedtools intersect -header -a ${afile} -b ${bfile} > ${outfolder}/common_ab.vcf
 
# finding calls that are unique for each mapper (based on coordinates).
bedtools subtract -header -a ${afile} -b ${outfolder}/common_ab.vcf > \
	${outfolder}/unique_a.vcf
 
bedtools subtract -header -a ${bfile} -b ${outfolder}/common_ab.vcf > \
	${outfolder}/unique_b.vcf
 
# report each counts from chr21 calls (excluding comment lines)
# note the reformatting trick using the 'column' command
## 'column -t' gets with '<' the output of the whole pipeline between '()' as input
column -t <( for f in ${afile} \
		${bfile} \
		${outfolder}/common_ab.vcf \
		${outfolder}/unique_a.vcf \
		${outfolder}/unique_b.vcf; do
		echo "> "$(basename ${f})" => "$(zgrep -c '^chr21' ${f})
		done )

This code Returns data used in 2DVenn.R to plot the next figure; Note that bedtools works with coordinate overlap and that totals is not always the sum of the variant counts (eg a SNP and an overlapping indel do return '1' as overlap width but the indel can be several bps wide)

command results

>  bcftools_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz  =>  83416
>  bcftools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz  =>  84799
>  vcf-venn/bedtools_cmp/common_ab.vcf                       =>  81766
>  vcf-venn/bedtools_cmp/unique_a.vcf                        =>  1667
>  vcf-venn/bedtools_cmp/unique_b.vcf                        =>  3590

bedtools venn diagram

bedtools_venn.png

REM: Our calls were compared to calls made by the GATK standard haplotypeCaller pipeline. The results are presented in a separate Tutorial for your information (GATK HaplotypeCaller Analyzing of BWA (mem) mapped Illumina reads).

compare bcftools_samtools, bcftools_htslib, and Varscan2 with vcf-compare

infolder1=bcftools_samtools_variants
infolder2=bcftools_htslib_variants-bowtie2
infolder3=varscan2_variants
 
# we should make this a loop, try if you like ;-)
 
vcf-compare ${infolder1}/chr21_NA18507_aln_var.flt-D1000.vcf.gz \
	${infolder2}/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz \
	${infolder3}/NA18507_aln_varscan_chr21.vcf.gz | \
	grep ^VN | cut -f 2- | column -t \
	> ${infolder3}/compare3_${met}.txt
 
<syntaxhighlight lang="text" sep="div">
29     bcftools_htslib_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz    (0.0%)   varscan2_variants/NA18507_aln_varscan_chr21.vcf.gz                 (0.0%)
33     bcftools_samtools_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz  (0.0%)   varscan2_variants/NA18507_aln_varscan_chr21.vcf.gz                 (0.0%)
35     bcftools_htslib_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz    (0.0%)
60     bcftools_samtools_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz  (0.1%)
3944   varscan2_variants/NA18507_aln_varscan_chr21.vcf.gz                 (4.9%)
6635   bcftools_htslib_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz    (8.0%)   bcftools_samtools_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz  (8.0%)
76691  bcftools_htslib_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz    (92.0%)  bcftools_samtools_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz  (91.9%)  varscan2_variants/NA18507_aln_varscan_chr21.vcf.gz  (95.0%)

This code Returns data used in 3DVenn.R to plot the next figure

varscan2 vs bcftools and htslib- venn diagrams
compare3-aln.png

3DVenn.R -a 60 -b 35 -c 3944 -d 6635 -e 33 -f 29 -i 76691 -A "bcftools_0.1.19" -B "bcftools_0.2.0" -C "varscan2" -t "compare aln calls" -u 2 -x 1 -o compare3-aln

Technical.png As seen from the Venn diagram, varscan2 calls additional (private) variants as compared to the samtools-bcftools pipeline and 'misses' twice as many samtools private variants. This is normal since varscan2 takes a non-probabilistic approach. It should be noted here that varscan2 was applied with standard filtering limits (built-in parameters) and that more stringent runs could be performed that will likely reduce the number of private calls found here

Estimate variant concordance between different sets

We have here BWA re-mapped data (BITS set), Illumina calls (as provided), and the 1000g calls (1kg download) for the same individual and with hg19 as reference genome. In addition to our remapped data, we collected results obtained for DNA from the same cell-line (Coriel EBV-transformed B-Lymphocyte GM18507) by two main NGS platform/providers.

Technical.png We do not include here calls made by varscan2 but you have all necessary data to do it in Final_Results and are welcome to do this additional comparison

Complete Genomics high quality variant calls

Data obtained from the Complete Genomics re-sequencing project is available from the 1000genomes ftp repository (<http://www.1000genomes.org/announcements/complete-genomics-data-release-2013-07-26>). Usually, Complete Genomics is of high quality due to their sequencing method and unique technology.

A folder is dedicated to NA18507 from which a vcf file was fetched and the chr21 subset extracted and analyzed.

# print some QC info
infolder=public_info/CG_variants
 
bcftools stats ${infolder}/NA18507_CG_SRR822930.chr21.vcf.gz > \
	${infolder}/NA18507_CG.chk
 
plot-vcfstats  -t "NA18507_CG-calls" \
	-p ${infolder}/CG_plots/ \
	${infolder}/NA18507_CG.chk
Complete Genomics variants
plot-vcfcheck_NA18507_CG-0.png
plot-vcfcheck_NA18507_CG-1.png
plot-vcfcheck_NA18507_CG-2.png
plot-vcfcheck_NA18507_CG-3.png

The ts/tv ratio observed in CG calls is above 2.0 (2.17) which is a very good score confirming the known high quality of the CG calls in general and in this experiment.

Illumina BaseSpace demo data and variant calls

Ilumina BaseSpace can be accessed or free and contains data for NA18507. <https://basespace.illumina.com/project/7/appsession/199/files/tree/Resequencing-NA18507/BySample/1/data/intensities/basecalls/Alignment/genome.vcf?appResultId=2169&id=965975>.

The NA18507 VCF call file with 80513 calls on chr21 can be downloaded from the web interface.

# print some QC info for the illumina data
infolder=public_info/Illumina_BaseSpace
 
bcftools_0.2.0 stats ${infolder}/chr21_NA18507_CASAVA-1.8_hg19.vcf.gz > \
	${infolder}/basespace_all.chk
 
plot-vcfstats_0.2.0  -t "NA18507_basespace-calls" \
	-p ${infolder}/basespace_all_plots/ \
	${infolder}/basespace_all.chk
Illumina_BaseSpace variants
plot-vcfcheck_Illumina_BaseSpace-0.png
plot-vcfcheck_Illumina_BaseSpace-1.png
plot-vcfcheck_Illumina_BaseSpace-2.png
plot-vcfcheck_Illumina_BaseSpace-3.png

The obtained ts/tv value of 2.04 is rather good.

The official 1000 genome data for NA18507

The 1000 genome project allowed identifying a huge number of variants in genomes from healthy donnors originating from different populations. When applicable, filtering was done to retain only variants that show mendelian inheritance in children (thereby of better confidence but not comprising de-novo calls). We extract here the subset of validated variants called in NA18506 by the genome project.

Data is available in the Broad bundle and were obtained previously by analyzing calls within trios and keeping only those that were clearly transmitted from parent(s) and therefor represent best quality candidates. The calls for the NA18507 genome can be extracted as follows.

Please do not run the next command, it takes several minutes to run and the result is provided

# get NA18507 calls (-e => non-ref)
# this to avoid storing reference calls for that genome
infolder=public_info/NA18507_hapmap_3.3.hg19
 
zcat ${infolder}/hapmap_3.3.hg19.vcf.gz | \
	vcf-subset -e -c NA18507 | bgzip -c > \
	${infolder}/NA18507_hapmap_3.3.hg19.vcf.gz && \
	tabix -f -p vcf ${infolder}/NA18507_hapmap_3.3.hg19.vcf.gz

In order to extract only calls on chr21, we need to parse the variants and do some filtering. In the first command, the zgrep -E '(^#|^chr21\b)' ... allows to keep both lines starting (^) with # OR with chr21. Another goodie with (z)grep that can use regular expressions.

# extract the chr21 subset from the full-genome list
# keep only lines starting by '#' of first field=='chr21'
infolder=public_info/NA18507_hapmap_3.3.hg19
 
zgrep -E '(^#|^chr21\b)' ${infolder}/NA18507_hapmap_3.3.hg19.vcf.gz | \
	bgzip -c > ${infolder}/chr21_NA18507_hapmap_3.3.hg19.vcf.gz && \
	tabix -p vcf ${infolder}/chr21_NA18507_hapmap_3.3.hg19.vcf.gz

As usual, there is another way to do this is by using vcftools as already introduced at the end of the script calling variants.

infolder=public_info
infile=NA18507_hapmap_3.3.hg19
 
vcftools --gzvcf ${infolder}/${infile}.vcf.gz \
	--chr chr21 \
	--out ${infolder}/chr21_${infile} \
	--recode
 
# remove the recode tag from the name
mv ${infolder}/chr21_${infile}.recode.vcf ${infolder}/chr21_${infile}.vcf
 
# sort, compress and index using our custom 'vcf2index' function)
# get the content of the custom function by calling it with 'declare -f <funcname>'
vcf2index ${infolder}/chr21_${infile}.vcf

As previsouly, we can analyse this subset with the new bcftools (htscmd)

# print some QC info
infolder=public_info/NA18507_hapmap_3.3.hg19
 
bcftools_0.2.0 stats ${infolder}/chr21_NA18507_hapmap_3.3.hg19.vcf.gz > \
	${infolder}/chr21_NA18507_hapmap_3.3.chk
 
plot-vcfstats_0.2.0  -t "NA18507_hapmap_3.3-calls" \
	-p ${infolder}/NA18507_hapmap_3.3_plots/ \
	${infolder}/chr21_NA18507_hapmap_3.3.chk
NA18507_hapmap_3.3 variants
plot-vcfcheck_NA18507_hapmap_3.3-0.png
plot-vcfcheck_NA18507_hapmap_3.3-1.png
plot-vcfcheck_NA18507_hapmap_3.3-2.png

The analyzed HapMap calls have an abnormally high ts/tv value of 2.89!

Create a VENN diagram for the 4 mappings

Comparing platform and reference genome effects is a common task for NGS projects. It may reveil bias in some datasets or give confidence to users that their analysis workflow did the job.

Online pages exist that will draw nice venn from lists or numbers; a coupe of R packages also provide this kind of service. We found easier to create three scripts to plot such graphs from the counts returned by eg vcftools (see example below).

Handicon.png Please refer to Create Venn plots from counts for more information and code to create Venn plots from your own results.

We have accumulated 5 VCF file for the chr21 part of NA18507:

  • A=chr21_NA18507_aln_var.flt-D1000.vcf.gz
  • B=chr21_NA18507_mem_var.flt-D1000.vcf.gz
  • C=chr21_all_NA18507_CASAVA-1.8_hg19.vcf.gz
  • D=chr21_NA18507_CG_SRR822930.vcf.gz [not used here, read below]
  • E=chr21_NA18507_hapmap_3.3.hg19.vcf.gz (!! this is a gold-standard subset obtained by stringent filtering)

A 5-way comparison is possible including the Complete Genomics calls but as they include a huge number of additional calls, this would render the analysis more difficult. Instead, we perform a 4-way comparison only. Users can later compare any of the intersecting parts with the Complete Genomics data to see how good the isolated calls are as compared to this high quality data. Note that the vcf-compare command computes intersections between the 4 main VCF files based on their coordinates.

infolder=Final_Results/bcftools_samtools_variants
outfolder=vcf-venn/compare-4
mkdir -p ${outfolder}
 
aln=${infolder}/chr21_NA18507_aln_var.flt-D1000.vcf.gz
mem=${infolder}/chr21_NA18507_mem_var.flt-D1000.vcf.gz
casava=public_info/Illumina_BaseSpace/chr21_NA18507_CASAVA-1.8_hg19.vcf.gz
hapmap=public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz
 
# compare all four files and store counts for Venn
vcf-compare -r chr21\
	${aln} \
	${mem} \
	${casava} \
	${hapmap} | \
	grep "^VN" > ${outfolder}/compare-4_pos.cmp

comparison results

1      bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz         (0.0%)   public_info/Illumina_BaseSpace/chr21_NA18507_CASAVA-1.8_hg19.vcf.gz       (0.0%)   public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz  (0.0%)
1      bcftools_samtools_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz         (0.0%)   public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz  (0.0%)
2      bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz         (0.0%)   public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz  (0.0%)
21     public_info/Illumina_BaseSpace/chr21_NA18507_CASAVA-1.8_hg19.vcf.gz       (0.0%)   public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz  (0.2%)
52     bcftools_samtools_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz         (0.1%)   bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz         (0.1%)   public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz  (0.5%)
61     public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz  (0.6%)
366    bcftools_samtools_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz         (0.4%)   public_info/Illumina_BaseSpace/chr21_NA18507_CASAVA-1.8_hg19.vcf.gz       (0.5%)
440    bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz         (0.5%)   public_info/Illumina_BaseSpace/chr21_NA18507_CASAVA-1.8_hg19.vcf.gz       (0.5%)
1743   bcftools_samtools_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz         (2.1%)
3033   bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz         (3.6%)
9914   bcftools_samtools_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz         (11.9%)  bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz         (11.7%)  public_info/Illumina_BaseSpace/chr21_NA18507_CASAVA-1.8_hg19.vcf.gz       (12.3%)  public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz  (98.6%)
10003  public_info/Illumina_BaseSpace/chr21_NA18507_CASAVA-1.8_hg19.vcf.gz       (12.4%)
11621  bcftools_samtools_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz         (13.9%)  bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz         (13.7%)
59722  bcftools_samtools_variants/chr21_NA18507_aln_var.flt-D1000.vcf.gz         (71.6%)  bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz         (70.4%)  public_info/Illumina_BaseSpace/chr21_NA18507_CASAVA-1.8_hg19.vcf.gz       (74.2%)

The counts listed above were fed to the 4DVenn.R script as detailed below to get the corresponding figure.

4-way venn diagram

4DVenn.R -a 1743 -b 3033 -c 10003 -d 61 -e 11621 -f 366 -G 1 -j 440 -k 2 -l 21 -m 59722 -n 52 -q 1 -i 9914 \
   -A BWA_aln -B BWA_mem -C Casava -D HapMap -o compare-4 -x 2 -t "gold standard recall" -u 2
4-way-pos.png


Handicon.png The venn diagram tells us several nice things:

  • The highest counts are for variants found in both our BWA mapping AND the Ilumina CASAVA results (59722+9914, 71.8% of all calls). Out of these, 14% (9914) are also present in the HapMap gold-standard.
  • BWA calls absent in CASAVA are quite some (11621+52) and only 52 are confirmed by the HapMap gold-standard (~0.4%)
  • Ilumina CASAVA private calls are numerous (10003+21) but only 21 calls are confirmed in the HapMap gold-standard (~0.2%).
  • Importantly; 98% of the gold-standard hapmap calls (9914) are confirmed by both BWA methods AND present in CASAVA calls

MORE READINGS: Private categories in such Venn diagram are often referred as false positive and false negative. Several additional metrics can be defined using these numbers and include precision, recall, specificity, sensitivity.

We can conclude that both mapping and variant calling are very satisfactory based on these results. Of course this type of control wil not be possible when you sequence de-novo but these results at least confirm that he methods used here are correct and lead to expected results.

Create a recurrent list of variants called by both BWA and Casava

In order to continue with the highest possible confidence, we isolate now the calls that overlap between both BWA sets AND with the CASAVA data (59722+9914). This is done using vcftools vcf-isec as follows with specific parameters:

'-f' to ignore the fact that sample names are differents
'-o' to name the sample by the first file
'-n =3' to keep calls overlapping in all three samples
'-w 1' to look at overlap with 1bp window between calls

Generate the gold-set from all recalled variants between three sources.

infolder=Final_Results/bcftools_samtools_variants
outfolder=vcf-venn/gold-set
mkdir -p ${outfolder}
 
aln=${infolder}/chr21_NA18507_aln_var.flt-D1000.vcf.gz
mem=${infolder}/chr21_NA18507_mem_var.flt-D1000.vcf.gz
casava=public_info/Illumina_BaseSpace/chr21_NA18507_CASAVA-1.8_hg19.vcf.gz
 
# tripple-intersect, compress, and index
# use -f to ignore mismatched names
vcf-isec -f -o -n =3 -w 1 \
	${mem} \
	${aln} \
	${casava} | \
	bgzip -c > ${outfolder}/chr21_common_NA18507.vcf.gz && \
	tabix -p vcf ${outfolder}/chr21_common_NA18507.vcf.gz
 
# count the obtained calls
zgrep -c "^chr21" ${outfolder}/chr21_common_NA18507.vcf.gz

An now a rapid QC check on resulting 69845 calls (as usual!)

# print some QC info
infolder=vcf-venn/gold-set
 
bcftools_0.2.0 stats ${infolder}/chr21_common_NA18507.vcf.gz > \
	${infolder}/chr21_common_NA18507.chk
 
plot-vcfstats_0.2.0  -t "goldset-chr21_NA18507-calls" \
	-p ${infolder}/chr21_common_NA18507_plots/ \
	${infolder}/chr21_common_NA18507.chk
chr21_common_NA18507 variants
plot-vcfcheck_chr21_common_NA18507-0.png
plot-vcfcheck_chr21_common_NA18507-1.png
plot-vcfcheck_chr21_common_NA18507-2.png
plot-vcfcheck_chr21_common_NA18507-3.png

The result is, as expected, showing ts/tv ratio of 2.13, instead of 2.05 and 2.04 for the individual BWA datasets and 2.04 for Casava, which suggest a better quality for the intersection subset and motivates using intersected variant calls in general.

Handicon.png Results for the tripple intersect list of variants with a good ts/tv value of 2.13 : Intersection makes strength !

A noteworthy 2014 paper in Nature Biotechnology

'Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls' (Zook et al. Nature Biotechnology 32, 246–251 (2014) doi:10.1038/nbt.2835) reports multiple comparison and re-analysis of genome calls on the CEPH genome NA12878 (Utah individual of european origin[1]) that can be used to evaluate true positive calls made by other methods.[2]. The accompanying web-site allows registration and custom analyses http://www.genomeinabottle.org[3]

download exercise files

Download exercise files here

Use the right application to open the files present in ex6-files

References:
  1. http://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM12878
  2. Justin M Zook, Brad Chapman, Jason Wang, David Mittelman, Oliver Hofmann, Winston Hide, Marc Salit
    Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls.
    Nat Biotechnol: 2014, 32(3);246-51
    [PubMed:24531798] ##WORLDCAT## [DOI] (I p)

  3. http://www.genomeinabottle.org

[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.5 | NGS Exercise.7 ]