Hands-on introduction to NGS variant analysis

Dates: May 23 & 26, 2014; November 29 2013 & December 17 2013

A shorter and rewritten version of this original two-days training is available as a separate page: Hands-on introduction to NGS variant analysis-2016

The most recent version of this training (2018) can be found at (Hands-on_introduction_to_NGS_variant_analysis-2018)

Hands-on introduction to NGS variant analysis-laptop-file-list

1 Aims of the NGS DNA variant analysis 2-days session
- 1.1 Not covered during this training
- 1.2 More BITS Training Info
2 Summary
3 Prerequisites
4 Complementary readings
5 Training Material
6 Today's Hands-On Training
- 6.1 Housekeeping rules
- 6.2 The Exercises
  - 6.2.1 Code developed in these exercises
  - 6.2.2 Bonus material
7 Conclusion

Aims of the NGS DNA variant analysis 2-days session

Using a full publicly available chromosome read-set from one of the 1000 genomes^[1] samples:

Perform a complete analysis workflow including QC, read mapping, read coverage analysis, and variant calling against the human reference genome.
Use command line and popular open source software to evaluate each step of a classical NGS variant workflow and feel the complexity of the task.
Compare the obtained results with gold standard public data.
Identify bottlenecks and propose possible downstream analyses.
Become familiar with bash scripting and practice some elementary rules to write easy to recycle scripts (more on bash scripting here)

The skills acquired during this session should allow reproduction of the workflow by participants using their own reads and a reasonably-sized personal computer running Unix.

All scripts as well as the most important result files are put online to allow offline practice to trainees and web visitors

Not covered during this training

This is not a Galaxy^[2] training. The popular open-source graphical web platform can perform most steps of this training in an interactive and recordable manner but this is not today's topic. For more information about Galaxy, please refer to the official web-site and follow our training pages. An experimental BITS Galaxy mirror can also be found (when active) at http://galaxy.bits.vib.be^[3].
Unfortunately, this is not a GATK^[4] training. Although excellent and largely used by the NGS community, the Broad Genome Analysis Tool Kit (GATK) is refered to several times in this documentation but is not used by the participants due to license limitations. Academic users can download the GAKT source after registring on the site with their University credentials. For more information about GATK, please refer to the GATK official pages.

More BITS Training Info

On the VIB website: http://www.vib.be/en/training/research-training/courses/Pages/Hands-On-introduction-to-NGS-variant-analysis.aspx
On the BITS website: https://www.bits.vib.be/index.php/training/201-ngs-variant-analysis

Summary

This training gives an introduction to the use of popular NGS analysis software packages at the command line. It reviews several exchangeable tools and provides hints to evaluate quality and content of Genome-Seq data. The code used in the training pages can easily be recycled and adapted, and will constitute the basis of your own workflows.

The sequencing data used in this session was obtained from gDNA extracted from EBV-transformed B-lymphocytes of a healthy Nigerian individual (NA18507). More information is available for that sample from the Coriell repository from which 1000g gDNA can be obtained ^[5].

This training does not cover all currently available methods. It does not aim at bringing users to a professional NGS analyst level but provides enough information to allow motivated biologists understand what DNA sequencing practically is, to read and reuse command-line code, and when necessary to communicate knowingly with NGS experts for more in-depth needs.

Prerequisites

Skills required to follow this training:

Linux command line basic skills are absolutely required to understand the code used throughout the session
basic knowledge of human genome structure and nomenclature is necessary to estimate the training tasks
basic knowledge of Illumina NGS read structure is also required for the same reason
Inspiring NGS variant analysis custom functions and code that you can use and reuse

People who have no experience with Linux command line should first follow the 'Linux for bioinformatics' training (^[6]) or they will slow down the full class. All attendees who lack knowledge about Illumina reads should follow the preliminary one day 'Introduction to the analysis of NGS data' training (^[7]).

Complementary readings

In addition to today's menu, read these pages to refresh your memory and learn about important formats and conventions used during this training.

Introduction of NGS-formats used in classical NGS applications and used today in the hands-on
Practical remarks about NGS variant analysis: laptop configuration and files presented in the sessions

A survey of tools for variant analysis of next-generation genome sequencing data published early 2013^[8] and still very actual.

Training Material

Most of the content produced during today's session was added on the BITS platform (see links at bottom of each exercise page)

Additional files were also stored that relate to:

Documentation on the BITS server, including a PDF handout of the 2013 version of this training NGS-DNASeq_chr21-session.pdf
Reference files obtained from public sources and relative to the same NA19507 genome.

Today's main tools

In addition to the many built-in unix command, originating from the GNU consortium for a good part (please read the GNU coreutils manual for more ^[9]), we will use today several dedicated toolboxes including:

• bwa, the Burrow Wheeler Aligner for mapping reads to the reference genome

Version: 0.7.5a-r405
Contact: Heng Li <lh3@sanger.ac.uk>

Usage: bwa <command> [options]

Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries

fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'. There are
three alignment algorithms in BWA: `mem', `bwasw' and `aln/samse/sampe'. If
you are not sure which to use, try `bwa mem' first. Please `man ./bwa.1' for
for the manual.

• bamUtil tools to edit BAM data

Set of tools for operating on SAM/BAM files.
Version: 1.0.9; Built: Thu Oct 31 11:02:01 CET 2013 by root
 
Tools to Rewrite SAM/BAM Files:
 convert - Convert SAM/BAM to SAM/BAM
 writeRegion - Write a file with reads in the specified region and/or have the specified read name
 splitChromosome - Split BAM by Chromosome
 splitBam - Split a BAM file into multiple BAM files based on ReadGroup
 findCigars - Output just the reads that contain any of the specified CIGAR operations.
 
Tools to Modify & write SAM/BAM Files:
 clipOverlap - Clip overlapping read pairs in a SAM/BAM File already sorted 
    by Coordinate or ReadName
 filter - Filter reads by clipping ends with too high of a mismatch percentage 
    and by marking reads unmapped if the quality of mismatches is too high
 revert - Revert SAM/BAM replacing the specified fields with their previous values 
    (if known) and removes specified tags
 squeeze -  reduces files size by dropping OQ fields, duplicates, & specified tags, 
    using '=' when a base matches the reference, 
    binning quality scores, and replacing readNames with unique integers
 trimBam - Trim the ends of reads in a SAM/BAM file changing read ends to 'N' 
    and quality to '!'
 mergeBam - merge multiple BAMs and headers appending ReadGroupIDs 
    if necessary
 polishBam - adds/updates header lines & adds the RG tag to each record
 dedup - Mark Duplicates
 recab - Recalibrate
 
Informational Tools
 validate - Validate a SAM/BAM File
 diff - Diff 2 coordinate sorted SAM/BAM files.
 stats - Stats a SAM/BAM File
 gapInfo - Print information on the gap between read pairs in a SAM/BAM File.
 
Tools to Print Information In Readable Format
 dumpHeader - Print SAM/BAM Header
 dumpRefInfo - Print SAM/BAM Reference Name Information
 dumpIndex - Print BAM Index File in English
 readReference - Print the reference string for the specified region
 explainFlags - Describe flags
 
Additional Tools
 bam2FastQ - Convert the specified BAM file to fastQs.
 
Dummy/Example Tools
 readReference - Print the reference string for the specified region
 
Usage:  
        bam <tool> [<tool arguments>]
The usage for each tool is described by specifying the tool with no arguments.

• samtools, mother of all NGS tools and dealing with SAM/BAM formats and with calling of variants (bcftools)

Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.19+

Usage: samtools <command> [options]

Command: view SAM<->BAM conversion
sort sort alignment file
mpileup multi-way pileup
depth compute the depth
faidx index/extract FASTA
tview text alignment viewer
index index alignment
idxstats BAM index stats (r595 or later)
fixmate fix mate information
flagstat simple stats
calmd recalculate MD/NM tags and '=' bases
merge merge sorted alignments
rmdup remove PCR duplicates
reheader replace BAM header
cat concatenate BAMs
bedcov read depth per BED region
targetcut cut fosmid regions (for fosmid pool only)
phase phase heterozygotes
bamshuf shuffle and group alignments by name

• bcftools (samtools) the classical variant caller used on samtools pileup results

The command was re-baptized bcftools_samtools to differentiate it from the more recent version baptized here bcftools

Program: bcftools (Tools for data in the VCF/BCF formats)
Version: 0.1.19-44428cd

Usage: bcftools <command> <arguments>

Command: view print, extract, convert and call SNPs from BCF
index index BCF
cat concatenate BCFs
ld compute all-pair r^2
ldpair compute r^2 between requested pairs

• vcfutils.pl (samtools) the vcf toolbox used to filter bcftools results

The command was re-baptized vcfutils_samtools.pl to differentiate it from the bcftools copy vcfutils.pl which may differ at some point

Usage: vcfutils.pl <command> [<arguments>]

Command: subsam get a subset of samples
listsam list the samples
fillac fill the allele count field
qstats SNP stats stratified by QUAL

hapmap2vcf convert the hapmap format to VCF
ucscsnp2vcf convert UCSC SNP SQL dump to VCF

varFilter filtering short variants (*)
vcf2fq VCF->fastq (**)

Notes: Commands with description ending with (*) may need bcftools
specific annotations.

• Picard (2.x) ^[10], the 'broadly' used java wrapper for samtools with added functionalities

## how to run Picard tools in recent versions ?
# provided PICARD is defined and points to the place where the picard.jar file is located, you can type:
java -jar $PICARD/picard.jar <command-name> <command-options...>

## current sorted list of Picard commands:
AddCommentsToBam.jar, AddOrReplaceReadGroups.jar, BamIndexStats.jar, BamToBfq.jar, BuildBamIndex.jar,
CalculateHsMetrics.jar, CheckIlluminaDirectory.jar, CleanSam.jar, CollectAlignmentSummaryMetrics.jar,
CollectGcBiasMetrics.jar, CollectInsertSizeMetrics.jar, CollectMultipleMetrics.jar, CollectRnaSeqMetrics.jar,
CollectTargetedPcrMetrics.jar, CollectWgsMetrics.jar, CompareSAMs.jar, CreateSequenceDictionary.jar,
DownsampleSam.jar, EstimateLibraryComplexity.jar, ExtractIlluminaBarcodes.jar, ExtractSequences.jar,
FastqToSam.jar, FilterSamReads.jar, FixMateInformation.jar, GatherBamFiles.jar, IlluminaBasecallsToFastq.jar,
IlluminaBasecallsToSam.jar, IntervalListTools.jar, MakeSitesOnlyVcf.jar, MarkDuplicates.jar, MarkIlluminaAdapters.jar,
MeanQualityByCycle.jar, MergeBamAlignment.jar, MergeSamFiles.jar, MergeVcfs.jar, NormalizeFasta.jar,
QualityScoreDistribution.jar, ReorderSam.jar, ReplaceSamHeader.jar, RevertOriginalBaseQualitiesAndAddMateCigar.jar,
RevertSam.jar, SamFormatConverter.jar, SamToFastq.jar, SortSam.jar, SplitVcfs.jar, ValidateSamFile.jar,
VcfFormatConverter.jar, ViewSam.jar

## get help in Picard by typing
java -jar $PICARD picard.jar <command> -h

• samtools (htslib), the developer version of samtools

Program: samtools (Tools for alignments in the SAM format)

Version: 0.2.0-rc7-99-gd60458f (using htslib 0.2.0-rc7-37-gdb8055a)

Usage: samtools <command> [options]

Commands:

 -- indexing
        faidx       index/extract FASTA
        index       index alignment
 -- editing
        calmd       recalculate MD/NM tags and '=' bases
        fixmate     fix mate information
        reheader    replace BAM header
        rmdup       remove PCR duplicates
        targetcut   cut fosmid regions (for fosmid pool only)
 -- file operations
        bamshuf     shuffle and group alignments by name
        cat         concatenate BAMs
        merge       merge sorted alignments
        mpileup     multi-way pileup
        sort        sort alignment file
        split       splits a file by read group
 -- stats
        bedcov      read depth per BED region
        depth       compute the depth
        flagstat    simple stats
        idxstats    BAM index stats
        phase       phase heterozygotes
        stats       generate stats (former bamcheck)
 -- viewing
        flags       explain BAM flags
        tview       text alignment viewer
        view        SAM<->BAM<->CRAM conversion

</syntaxhighlight>

• bcftools (htslib), the samtools (dev) accompanying toolbox

Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
Version: 0.2.0-rc7-52-ge3fe732 (using htslib 0.2.0-rc7-37-gdb8055a)

Usage: bcftools <command> <argument>

Commands:

-- Indexing
index index VCF/BCF files

-- VCF/BCF manipulation
annotate annotate and edit VCF/BCF files
concat concatenate VCF/BCF files from the same set of samples
isec intersections of VCF/BCF files
merge merge VCF/BCF files files from non-overlapping sample sets
norm left-align and normalize indels
query transform VCF/BCF into user-defined formats
view VCF/BCF conversion, view, subset and filter VCF/BCF files

-- VCF/BCF analysis
call SNP/indel calling
filter filter VCF/BCF files using fixed thresholds
gtcheck check sample concordance, detect sample swaps and contamination
roh identify runs of autozygosity (HMM)
stats produce VCF/BCF stats

Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
automatically even when streaming from a pipe. Indexed VCF and BCF will work
in all situations. Un-indexed VCF and BCF and streams will work in most but
not all situations.

• bedtools, a multiusage BED/SAM/VCF manipulation suite

Get the full BedTools PDF manual from https://media.readthedocs.org/pdf/bedtools/latest/bedtools.pdf ^[11]

Read it online at http://bedtools.readthedocs.org/ ^[12]

bedtools: flexible tools for genome arithmetic and DNA sequence analysis.
usage: bedtools <subcommand> [options]

The bedtools sub-commands include:

[ Genome arithmetic ]
intersect Find overlapping intervals in various ways.
window Find overlapping intervals within a window around an interval.
closest Find the closest, potentially non-overlapping interval.
coverage Compute the coverage over defined intervals.
map Apply a function to a column for each overlapping interval.
genomecov Compute the coverage over an entire genome.
merge Combine overlapping/nearby intervals into a single interval.
cluster Cluster (but don't merge) overlapping/nearby intervals.
complement Extract intervals _not_ represented by an interval file.
subtract Remove intervals based on overlaps b/w two files.
slop Adjust the size of intervals.
flank Create new intervals from the flanks of existing intervals.
sort Order the intervals in a file.
random Generate random intervals in a genome.
shuffle Randomly redistrubute intervals in a genome.
annotate Annotate coverage of features from multiple files.

[ Multi-way file comparisons ]
multiinter Identifies common intervals among multiple interval files.
unionbedg Combines coverage intervals from multiple BEDGRAPH files.

[ Paired-end manipulation ]
pairtobed Find pairs that overlap intervals in various ways.
pairtopair Find pairs that overlap other pairs in various ways.

[ Format conversion ]
bamtobed Convert BAM alignments to BED (& other) formats.
bedtobam Convert intervals to BAM records.
bamtofastq Convert BAM records to FASTQ records.
bedpetobam Convert BEDPE intervals to BAM records.
bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals.

[ Fasta manipulation ]
getfasta Use intervals to extract sequences from a FASTA file.
maskfasta Use intervals to mask sequences from a FASTA file.
nuc Profile the nucleotide content of intervals in a FASTA file.

[ BAM focused tools ]
multicov Counts coverage from multiple BAMs at specific intervals.
tag Tag BAM alignments based on overlaps with interval files.

[ Statistical relationships ]
jaccard Calculate the Jaccard statistic b/w two sets of intervals.
reldist Calculate the distribution of relative distances b/w two files.

[ Miscellaneous tools ]
overlap Computes the amount of overlap from two intervals.
igv Create an IGV snapshot batch script.
links Create a HTML page of links to UCSC locations.
makewindows Make interval "windows" across a genome.
groupby Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
expand Replicate lines based on lists of values in columns.

[ General help ]
--help Print this help menu.
--version What version of bedtools are you using?.
--contact Feature requests, bugs, mailing lists, etc.

• IGVtools, the toolbox used to prepare files for IGV (compress, index, and more)

Program: igvtools. IGV Version 2.3.32 (31)03/17/2014 09:01 PM

Usage: igvtools [command] [options] [input file/dir] [other arguments]

Command: version print the version number
sort sort an alignment file by start position.
index index an alignment file
toTDF convert an input file (cn, gct, wig) to tiled data format (tdf)
count compute coverage density for an alignment file
formatexp center, scale, and log2 normalize an expression file
gui Start the gui
help <command> display this help message, or help on a specific command
See http://www.broadinstitute.org/software/igv/igvtools_commandline for more detailed help

• annotate_variation.pl; most used Annovar command (but not only!)

Usage:
     annotate_variation.pl [arguments] <query-file|table-name> <database-location>
 
     Optional arguments:
            -h, --help                      print help message
            -m, --man                       print complete documentation
            -v, --verbose                   use verbose output
 
            Arguments to download databases or perform annotations
                --downdb                    download annotation database
                --geneanno                  annotate variants by gene-based annotation (infer functional consequence on genes)
                --regionanno                annotate variants by region-based annotation (find overlapped regions in database)
                --filter                    annotate variants by filter-based annotation (find identical variants in database)
 
            Arguments to control input and output
                --outfile <file>            output file prefix
                --webfrom <string>          specify the source of database (ucsc or annovar or URL) (downdb operation)
                --dbtype <string>           specify database type
                --buildver <string>         specify genome build version (default: hg18 for human)
                --time                      print out local time during program run
                --comment                   print out comment line (those starting with #) in output files 
                --exonsort                  sort the exon number in output line (gene-based annotation)
                --transcript_function       use transcript name rather than gene name (gene-based annotation)
                --hgvs                      use HGVS format for exonic annotation (c.122C>T rather than c.C122T) (gene-based annotation)
                --separate                  separately print out all functions of a variant in several lines (gene-based annotation)
                --seq_padding               create a new file with cDNA sequence padded by this much either side (gene-based annotation)
                --(no)firstcodondel         treat first codon deletion as wholegene deletion (default: ON) (gene-based annotation)
                --aamatrix <file>           specify an amino acid substitution matrix file (gene-based annotation)
                --colsWanted <string>       specify which columns to output by comma-delimited numbers (region-based annotation)
                --scorecolumn <int>         the column with scores in DB file (region-based annotation)
                --gff3dbfile <file>         specify a DB file in GFF3 format (region-based annotation)
                --bedfile <file>            specify a DB file in BED format file (region-based annotation)
                --gff3attribute             output all fields in GFF3 attribute (default: ID and score only)
                --genericdbfile <file>      specify a DB file in generic format (filter-based annotation)
                --vcfdbfile <file>          specify a DB file in VCF format (filter-based annotation)
                --otherinfo                 print out additional columns in database file (filter-based annotation)
                --infoasscore               use INFO field in VCF file as score in output (filter-based annotation)
 
 
            Arguments to fine-tune the annotation procedure
                --batchsize <int>           batch size for processing variants per batch (default: 5m)
                --genomebinsize <int>       bin size to speed up search (default: 100k for -geneanno, 10k for -regionanno)
                --expandbin <int>           check nearby bin to find neighboring genes (default: 2m/genomebinsize)
                --neargene <int>            distance threshold to define upstream/downstream of a gene
                --exonicsplicing            report exonic variants near exon/intron boundary as 'exonic;splicing' variants
                --score_threshold <float>   minimum score of DB regions to use in annotation
                --reverse                   reverse directionality to compare to score_threshold
                --normscore_threshold <float> minimum normalized score of DB regions to use in annotation
                --rawscore                  output includes the raw score (not normalized score) in UCSC Browser Track
                --minqueryfrac <float>      minimum percentage of query overlap to define match to DB (default: 0)
                --splicing_threshold <int>  distance between splicing variants and exon/intron boundary (default: 2)
                --indel_splicing_threshold <int>    if set, use this value for allowed indel size for splicing variants (default: --splicing_threshold)
                --maf_threshold <float>     filter 1000G variants with MAF above this threshold (default: 0)
                --sift_threshold <float>    SIFT threshold for deleterious prediction for -dbtype avsift (default: 0.05)
                --precedence <string>       comma-delimited to specify precedence of variant function (default: exonic>intronic...)
                --indexfilter_threshold <float>     controls whether filter-based annotation use index if this fraction of bins need to be scanned (default: 0.9)
 
           Arguments to control memory usage
                --memfree <int>             ensure minimum amount of free system memory (default: 0)
                --memtotal <int>            limit total amount of memory used by ANNOVAR (default: 0, unlimited, in the order of kb)
                --chromosome <string>       examine these specific chromosomes in database file
 
 
     Function: annotate a list of genetic variants against genome annotation 
     databases stored at local disk.
 
     Example: #download gene annotation database (for hg18 build) and save to humandb/ directory
              annotate_variation.pl -downdb refGene humandb/
              annotate_variation.pl -buildver mm9 -downdb  mousedb/
              annotate_variation.pl -downdb -webfrom annovar esp6500si_all humandb/
 
              #gene-based annotation of variants in the varlist file (by default --geneanno is ON)
              annotate_variation.pl ex1.human humandb/
 
              #region-based annotate variants
              annotate_variation.pl -regionanno -dbtype cytoBand ex1.human humandb/
              annotate_variation.pl -regionanno -dbtype gff3 -gff3dbfile tfbs.gff3 ex1.human humandb/
 
              #filter rare or unreported variants (in 1000G/dbSNP) or predicted deleterious variants
              annotate_variation.pl -filter -dbtype 1000g2012apr_all -maf 0.01 ex1.human humandb/
              annotate_variation.pl -filter -dbtype snp130 ex1.human humandb/
              annotate_variation.pl -filter -dbtype avsift ex1.human humandb/
 
     Version: $LastChangedDate: 2013-08-23 11:32:41 -0700 (Fri, 23 Aug 2013) $

• snpEff commands used to add annotations to VCF data

snpEff version SnpEff 3.4 (build 2013-11-23), by Pablo Cingolani
Usage: snpEff [command] [options] [files]
 
Available commands: 
   [eff]           : Calculate effect of variants. Default: eff (no command or 'eff').
   build           : Build a SnpEff database.
   buildNextProt   : Build a SnpEff for NextProt (using NextProt's XML files).
   cds             : Compare CDS sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
   closest         : Annotate the closest genomic region.
   count           : Count how many intervals (from a BAM, BED or VCF file) overlap with each genomic interval.
   databases       : Show currently available databases (from local config file).
   download        : Download a SnpEff database.
   dump            : Dump to STDOUT a SnpEff database (mostly used for debugging).
   genes2bed       : Create a bed file from a genes list.
   len             : Calculate total genomic length for each marker type.
   protein         : Compare protein sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
   spliceAnalysis  : Perform an analysis of splice sites. Experimental feature.
 
Run 'java -jar snpEff.jar command' for help on each specific command
 
Generic options:
	-c , -config     : Specify config file
	-d , -debug      : Debug mode (very verbose).
	-dataDir <path>  : Override data_dir parameter from config file.
	-h , -help       : Show this help and exit
	-noLog           : Do not report usage statistics to server
	-q , -quiet      : Quiet mode (do not show any messages or errors)
	-v , -verbose    : Verbose mode

• snpSift commands used to filter snpEff-annotated VCF data

SnpSift version SnpSift 3.4 (build 2013-11-23), by Pablo Cingolani
Usage: java -jar SnpSift.jar [command] params...
Command is one of:
alleleMat : Create an allele matrix output.
annotate : Annotate 'ID' from a database (e.g. dbSnp). Assumes entries are sorted.
annMem : Annotate 'ID' from a database (e.g. dbSnp). Loads db in memory. Does not assume sorted entries.
caseControl : Compare how many variants are in 'case' and in 'control' groups; calculate p-values.
ccs : Case control summary. Case and control summaries by region, allele frequency and variant's functional effect.
coevolution : Co-evoution case control model (this feature is alpha).
concordance : Concordance metrics between two VCF files.
covMat : Create an covariance matrix output (allele matrix as input).
dbnsfp : Annotate with multiple entries from dbNSFP. <EXPERIMENTAL>
extractFields : Extract fields from VCF file into tab separated format.
filter : Filter using arbitrary expressions
geneSets : Annotate using MSigDb gene sets (MSigDb includes: GO, KEGG, Reactome, BioCarta, etc.)
gt : Add Genotype to INFO fields and remove genotype fields when possible.
gwasCat : Annotate using GWAS catalog
hwe : Calculate Hardy-Weimberg parameters and perform a godness of fit test.
intersect : Intersect intervals (genomic regions).
intervals : Keep variants that intersect with intervals.
intIdx : Keep variants that intersect with intervals. Index-based method: Used for large VCF file and a few intervals to retrieve
join : Join files by genomic region.
phastCons : Annotate using conservation scores (phastCons).
private : Annotate if a variant is private to a family or group.
rmRefGen : Remove reference genotypes.
rmInfo : Remove INFO fields.
sift : Annotate using SIFT scores from a VCF file.
split : Split VCF by chromosome.
tstv : Calculate transiton to transversion ratio.
varType : Annotate variant type (SNP,MNP,INS,DEL or MIXED).
vcf2tped : Convert VCF to TPED.

Version number reported for each tool correspond to the lowest version used to prepare this training and may differ from the current version. Older versions may as well do!

Additional tools were installed from downloads or from source and are referenced to web links in the respective pages

Tools in the spotlight

Some tools found during the preparation of this training but not covered today include:

seqtk if you need to convert between FastA <-> FastQ (by Heng Li)

overview

Usage: seqtk <command> <arguments>
Version: 1.0-r32

Command: seq common transformation of FASTA/Q
comp get the nucleotide composition of FASTA/Q
sample subsample sequences
subseq extract subsequences from FASTA/Q
trimfq trim FASTQ using the Phred algorithm

hety regional heterozygosity
mutfa point mutate FASTA at specified positions
mergefa merge two FASTA/Q files
randbase choose a random base from hets
cutN cut sequence at long N
listhet extract the position of each het

bioawk, a modified AWK to dig into BED/VCF/SAM/... (by Heng Li)

overview

usage: bioawk [-F fs] [-v var=value] [-c fmt] [-tH] [-f progfile | 'prog'] [file ...]

###Examples

1. List the supported formats:

# bioawk -c help
bed: 1:chrom 2:start 3:end 4:name 5:score 6:strand 7:thickstart 8:thickend 9:rgb 10:blockcount 11:blocksizes 12:blockstarts
sam 1:qname 2:flag 3:rname 4:pos 5:mapq 6:cigar 7:rnext 8:pnext 9:tlen 10:seq 11:qual
vcf: 1:chrom 2:pos 3:id 4:ref 5:alt 6:qual 7:filter 8:info
gff: 1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute
fastx: 1:name 2:seq 3:qual 4:comment

2. Extract unmapped reads without header:

bioawk -c sam 'and($flag,4)' aln.sam.gz

2b.Extract unmapped read pairs with header: potential novel sequences absent from the reference genome and/or unmappable reads due to sequence content

bioawk -Hc sam 'and($flag,13)' als.sam

3. Extract mapped reads with header:

bioawk -Hc sam '!and($flag,4)'

4. Reverse complement FASTA:

bioawk -c fastx '{print ">"$name;print revcomp($seq)}' seq.fa.gz

5. Create FASTA from SAM (uses revcomp if FLAG & 16)

samtools view aln.bam | \
bioawk -c sam '{s=$seq; if(and($flag, 16)) {s=revcomp($seq)} print ">"$qname"\n"s}'

6. Print the genotypes of sample `foo` and `bar` from a VCF:

grep -v ^## in.vcf | bioawk -tc hdr '{print $foo,$bar}'

BEDOPS, a serious challenger for bedtools and is certainly worth a try. Get the current PDF manual from https://media.readthedocs.org/pdf/bedops/latest/bedops.pdf^[13]

Find more tools and answers

There are many tools out there, finding them is often the easiest part. You are welcome to try as many as you wish and improve results obtained with our selected toolbox. When seeking advice, please consider using:

SeqAnswers^[14] for all what relates to NGS
BioStar^[15] for questions about biocomputing and scripting for biologists
stackoverflow^[16] for questions related to coding.

bioinformatics.ca directory^[17] to find bioinformatics tools sorted by categories.

The EBI online training propose many very nice training sessions with slides and exercises.

Today's Hands-On Training

Housekeeping rules

All exercises were performed by the trainer on one of the BITS laptops and the results moved to a new subfolder called Final_Results on the BITS laptops. This ensures that you can take up any exercise and will find the necessary input data in the corresponding folder even if you did not perform the previous exercises.

The results that You will generate will be stored in similarly named folders created directly under your working folder and will not overwrite the 'good' versions stored in Final_Results.

Additionally, key result files were copied on the BITS file server and can be downloaded from links present at the bottom of each exercise page.

All scripts described during the training are store in the scripts folder. These scripts, as well as all shorter command blocks presented in the exercises expect that your terminal points to the training folder. The training folder is known by the laptop as BASE. You should always run your commands while being in the BASE-camp. To make sure you are at the right address, type at any time: cd $BASE ⌅ Enter.

The Exercises

Code developed in these exercises

Most scripts included in the exercises below are reproduced and maintained on our dedicated GitHub page https://github.com/splaisan/NGS-Variant-Analysis-training ^[18] (please feed-us back about any issue with this page).

Exercises with a coloured text background are long and not required for today. You can keep them for later.

NGS Exercise.1: Download Illumina data, perform QC and correct SAM/BAM formatting errors
NGS Exercise.2: Extract fastQ reads from BAM and perform basic read quality control
NGS Exercise.3: Map paired end reads to the human reference genome hg19 using the Burrow Wheeler Aligner (BWA)
- NGS Exercise.3b: Perform QC on the BWA alignment data
- Added material: NGS Exercise.3c: Map paired end reads to the human reference genome hg19 using Bowtie2
- Soon to be added material: NGS Exercise.3d: Map paired end reads to the human reference genome hg19 using STAR
- Soon to be added material: NGS Exercise.3e: Map paired end reads to the human reference genome hg19 using RTG
NGS Exercise.4: Compute read coverage from the mapping results
- NGS Exercise.4b: Retrieve unsequenced regions of chr21 for Sanger re-sequencing
NGS Exercise.5: Call variations by comparing the aligned reads with the reference sequence
- NGS Exercise.5_GATK: Call variations using the GATK (not covered today)
NGS Exercise.6: Compare and intersect VCF files
NGS Exercise.7: Add annotations to VCF files
- NGS_Exercise.7_annovar after converting the data
- NGS_Exercise.7_vcfCodingSnps inline using the veteran app
- NGS_Exercise.7_SnpEff do it using the emerging standard
NGS Exercise.8: Visualize results in the Interactive Genome Viewer (IGV)
NGS Exercise.9: Extract SOD1 locus specific data from large SAM/BAM, BED, or VCF data files

Bonus material

Call_variants_with_samtools_1.0 Call chr21 variants from newly mapped BAM data using samtools (htslib) and bcftools (htslib) OR varscan2 (added: 09-2014)

Conclusion

This hands-on training session was aimed at presenting a simplified although complete workflow for NGS variant analysis and to train users in taking advantage of the many command-line interface resources.

The number of research groups generating NGS data is increasing upon time and the demand for data analysis is getting greater than the capacity of the facilities within the Institute and too expensive through outsourced commercial help. For these reason, more and more students and staff scientists feel the 'urgent need' to perform their own NGS data analysis.

Although a certain level of computer infrastructure is required in order to perform this analysis in ideal conditions, knowledge and training on NGS analysis seem to be today's main limiting factors in research groups.

This session demonstrates that a minimum of training is sufficient to allow classical NGS analysis on a laptop computer. For higher throughput an more complex analysis, additional learning and more resources will be required but a lot can be initiated using the many available online resources and communities.

Your feedback to this two-days general command-line based NGS data analysis session is very important to us and will be used to improve this content for later sessions. If you need more training of this kind, please contact us and we will organize additional hands-on based on your requests. More advance sessions will depend on the availability of expert users within VIB that will accept to prepare specified material.

PS: We tried to cite resources and people that made this training possible. If you feel that some citations are missing, please inform us and we will correct this omission.

contact us

References:

↑ http://www.1000genomes.org
↑ http://main.g2.bx.psu.edu
↑ http://galaxy.bits.vib.be
↑ http://www.broadinstitute.org/gatk/
↑ http://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM18507
↑ http://wiki.bits.vib.be/index.php/Introduction_to_Linux_for_bioinformatics
↑ http://wiki.bits.vib.be/index.php/NGS_data_analysis
↑
Stephan Pabinger, Andreas Dander, Maria Fischer, Rene Snajder, Michael Sperk, Mirjana Efremova, Birgit Krabichler, Michael R Speicher, Johannes Zschocke, Zlatko Trajanoski
A survey of tools for variant analysis of next-generation genome sequencing data.
Brief Bioinform: 2014, 15(2);256-78
[PubMed:23341494] ##WORLDCAT## [DOI] (I p)
↑ http://www.gnu.org/software/coreutils/manual/coreutils.pdf
↑ http://picard.sourceforge.net
↑ https://media.readthedocs.org/pdf/bedtools/latest/bedtools.pdf
↑ http://bedtools.readthedocs.org/
↑ https://media.readthedocs.org/pdf/bedops/latest/bedops.pdf
↑ http://seqanswers.com SeqAnswers
↑ http://www.biostars.org BioStar
↑ http://stackoverflow.com stackoverflow
↑ http://bioinformatics.ca/links_directory
↑ https://github.com/splaisan/NGS-Variant-Analysis-training

[ Main_Page ]

[1] ttp://www.1000genomes.org

[2] ttp://main.g2.bx.psu.edu

[3] ttp://galaxy.bits.vib.be

[4] ttp://www.broadinstitute.org/gatk/

[5] ttp://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM18507

[6] ttp://wiki.bits.vib.be/index.php/Introduction_to_Linux_for_bioinformatics

[7] ttp://wiki.bits.vib.be/index.php/NGS_data_analysis

[8] 
Stephan Pabinger, Andreas Dander, Maria Fischer, Rene Snajder, Michael Sperk, Mirjana Efremova, Birgit Krabichler, Michael R Speicher, Johannes Zschocke, Zlatko Trajanoski
A survey of tools for variant analysis of next-generation genome sequencing data.
Brief Bioinform: 2014, 15(2);256-78
[PubMed:23341494] ##WORLDCAT## [DOI] (I p)

[9] ttp://www.gnu.org/software/coreutils/manual/coreutils.pdf

[10] ttp://picard.sourceforge.net

[11] ttps://media.readthedocs.org/pdf/bedtools/latest/bedtools.pdf

[12] ttp://bedtools.readthedocs.org/

[13] ttps://media.readthedocs.org/pdf/bedops/latest/bedops.pdf

[14] ttp://seqanswers.com SeqAnswers

[15] ttp://www.biostars.org BioStar

[16] ttp://stackoverflow.com stackoverflow

[17] ttp://bioinformatics.ca/links_directory

[18] ttps://github.com/splaisan/NGS-Variant-Analysis-training

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

Hands-on introduction to NGS variant analysis

Contents

Aims of the NGS DNA variant analysis 2-days session

Not covered during this training

More BITS Training Info

Summary

Prerequisites

Complementary readings

Training Material

Today's main tools

Tools in the spotlight

Find more tools and answers

Today's Hands-On Training

Housekeeping rules

The Exercises

Code developed in these exercises

Bonus material

Conclusion

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox