Hands-on introduction to NGS variant analysis

From BITS wiki
Jump to: navigation, search

Dates: May 23 & 26, 2014; November 29 2013 & December 17 2013

[ Main_Page | NGS_data_analysis ]

Technical.png A shorter and rewritten version of this original two-days training is available as a separate page: Hands-on introduction to NGS variant analysis-2016

Technical.png The most recent version of this training (2018) can be found at (Hands-on_introduction_to_NGS_variant_analysis-2018)


Hands-on introduction to NGS variant analysis-laptop-file-list

Aims of the NGS DNA variant analysis 2-days session

Using a full publicly available chromosome read-set from one of the 1000 genomes[1] samples:

  • Perform a complete analysis workflow including QC, read mapping, read coverage analysis, and variant calling against the human reference genome.
  • Use command line and popular open source software to evaluate each step of a classical NGS variant workflow and feel the complexity of the task.
  • Compare the obtained results with gold standard public data.
  • Identify bottlenecks and propose possible downstream analyses.
  • Become familiar with bash scripting and practice some elementary rules to write easy to recycle scripts (more on bash scripting here)


The skills acquired during this session should allow reproduction of the workflow by participants using their own reads and a reasonably-sized personal computer running Unix.

Technical.png All scripts as well as the most important result files are put online to allow offline practice to trainees and web visitors

Not covered during this training

  • This is not a Galaxy[2] training. The popular open-source graphical web platform can perform most steps of this training in an interactive and recordable manner but this is not today's topic. For more information about Galaxy, please refer to the official web-site and follow our training pages. An experimental BITS Galaxy mirror can also be found (when active) at http://galaxy.bits.vib.be[3].
  • Unfortunately, this is not a GATK[4] training. Although excellent and largely used by the NGS community, the Broad Genome Analysis Tool Kit (GATK) is refered to several times in this documentation but is not used by the participants due to license limitations. Academic users can download the GAKT source after registring on the site with their University credentials. For more information about GATK, please refer to the GATK official pages.

More BITS Training Info


This training gives an introduction to the use of popular NGS analysis software packages at the command line. It reviews several exchangeable tools and provides hints to evaluate quality and content of Genome-Seq data. The code used in the training pages can easily be recycled and adapted, and will constitute the basis of your own workflows.

The sequencing data used in this session was obtained from gDNA extracted from EBV-transformed B-lymphocytes of a healthy Nigerian individual (NA18507). More information is available for that sample from the Coriell repository from which 1000g gDNA can be obtained [5].

This training does not cover all currently available methods. It does not aim at bringing users to a professional NGS analyst level but provides enough information to allow motivated biologists understand what DNA sequencing practically is, to read and reuse command-line code, and when necessary to communicate knowingly with NGS experts for more in-depth needs.


Skills required to follow this training:

  • Linux command line basic skills are absolutely required to understand the code used throughout the session
  • basic knowledge of human genome structure and nomenclature is necessary to estimate the training tasks
  • basic knowledge of Illumina NGS read structure is also required for the same reason
  • Inspiring NGS variant analysis custom functions and code that you can use and reuse

People who have no experience with Linux command line should first follow the 'Linux for bioinformatics' training​ ([6]) or they will slow down the full class. All attendees who lack knowledge about Illumina reads should follow the preliminary one day 'Introduction to the analysis of NGS data' training ([7]).

Complementary readings

In addition to today's menu, read these pages to refresh your memory and learn about important formats and conventions used during this training.

  • A survey of tools for variant analysis of next-generation genome sequencing data published early 2013[8] and still very actual.

Training Material

Handicon.png Most of the content produced during today's session was added on the BITS platform (see links at bottom of each exercise page)

Additional files were also stored that relate to:

Today's main tools

In addition to the many built-in unix command, originating from the GNU consortium for a good part (please read the GNU coreutils manual for more [9]), we will use today several dedicated toolboxes including:

• bwa, the Burrow Wheeler Aligner for mapping reads to the reference genome

Version: 0.7.5a-r405
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'. There are
      three alignment algorithms in BWA: `mem', `bwasw' and `aln/samse/sampe'. If
      you are not sure which to use, try `bwa mem' first. Please `man ./bwa.1' for
      for the manual.

• bamUtil tools to edit BAM data

Set of tools for operating on SAM/BAM files.
Version: 1.0.9; Built: Thu Oct 31 11:02:01 CET 2013 by root
Tools to Rewrite SAM/BAM Files:
 convert - Convert SAM/BAM to SAM/BAM
 writeRegion - Write a file with reads in the specified region and/or have the specified read name
 splitChromosome - Split BAM by Chromosome
 splitBam - Split a BAM file into multiple BAM files based on ReadGroup
 findCigars - Output just the reads that contain any of the specified CIGAR operations.
Tools to Modify & write SAM/BAM Files:
 clipOverlap - Clip overlapping read pairs in a SAM/BAM File already sorted 
    by Coordinate or ReadName
 filter - Filter reads by clipping ends with too high of a mismatch percentage 
    and by marking reads unmapped if the quality of mismatches is too high
 revert - Revert SAM/BAM replacing the specified fields with their previous values 
    (if known) and removes specified tags
 squeeze -  reduces files size by dropping OQ fields, duplicates, & specified tags, 
    using '=' when a base matches the reference, 
    binning quality scores, and replacing readNames with unique integers
 trimBam - Trim the ends of reads in a SAM/BAM file changing read ends to 'N' 
    and quality to '!'
 mergeBam - merge multiple BAMs and headers appending ReadGroupIDs 
    if necessary
 polishBam - adds/updates header lines & adds the RG tag to each record
 dedup - Mark Duplicates
 recab - Recalibrate
Informational Tools
 validate - Validate a SAM/BAM File
 diff - Diff 2 coordinate sorted SAM/BAM files.
 stats - Stats a SAM/BAM File
 gapInfo - Print information on the gap between read pairs in a SAM/BAM File.
Tools to Print Information In Readable Format
 dumpHeader - Print SAM/BAM Header
 dumpRefInfo - Print SAM/BAM Reference Name Information
 dumpIndex - Print BAM Index File in English
 readReference - Print the reference string for the specified region
 explainFlags - Describe flags
Additional Tools
 bam2FastQ - Convert the specified BAM file to fastQs.
Dummy/Example Tools
 readReference - Print the reference string for the specified region
        bam <tool> [<tool arguments>]
The usage for each tool is described by specifying the tool with no arguments.

• samtools, mother of all NGS tools and dealing with SAM/BAM formats and with calling of variants (bcftools)

Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.19+

    Usage:   samtools <command> [options]

Command: view        SAM<->BAM conversion
         sort        sort alignment file
         mpileup     multi-way pileup
         depth       compute the depth
         faidx       index/extract FASTA
         tview       text alignment viewer
         index       index alignment
         idxstats    BAM index stats (r595 or later)
         fixmate     fix mate information
         flagstat    simple stats
         calmd       recalculate MD/NM tags and '=' bases
         merge       merge sorted alignments
         rmdup       remove PCR duplicates
         reheader    replace BAM header
         cat         concatenate BAMs
         bedcov      read depth per BED region
         targetcut   cut fosmid regions (for fosmid pool only)
         phase       phase heterozygotes
         bamshuf     shuffle and group alignments by name

• bcftools (samtools) the classical variant caller used on samtools pileup results

The command was re-baptized bcftools_samtools to differentiate it from the more recent version baptized here bcftools
Program: bcftools (Tools for data in the VCF/BCF formats)
Version: 0.1.19-44428cd

Usage:   bcftools <command> <arguments>

Command: view      print, extract, convert and call SNPs from BCF
         index     index BCF
         cat       concatenate BCFs
         ld        compute all-pair r^2
         ldpair    compute r^2 between requested pairs

• vcfutils.pl (samtools) the vcf toolbox used to filter bcftools results

The command was re-baptized vcfutils_samtools.pl to differentiate it from the bcftools copy vcfutils.pl which may differ at some point
Usage:   vcfutils.pl <command> [<arguments>]

Command: subsam       get a subset of samples
         listsam      list the samples
         fillac       fill the allele count field
         qstats       SNP stats stratified by QUAL

         hapmap2vcf   convert the hapmap format to VCF
         ucscsnp2vcf  convert UCSC SNP SQL dump to VCF

         varFilter    filtering short variants (*)
         vcf2fq       VCF->fastq (**)

Notes: Commands with description ending with (*) may need bcftools
       specific annotations.

• Picard (2.x) [10], the 'broadly' used java wrapper for samtools with added functionalities

## how to run Picard tools in recent versions ?
# provided PICARD is defined and points to the place where the picard.jar file is located, you can type:
java -jar $PICARD/picard.jar <command-name> <command-options...>

## current sorted list of Picard commands:
AddCommentsToBam.jar, AddOrReplaceReadGroups.jar, BamIndexStats.jar, BamToBfq.jar, BuildBamIndex.jar,
CalculateHsMetrics.jar, CheckIlluminaDirectory.jar, CleanSam.jar, CollectAlignmentSummaryMetrics.jar,
CollectGcBiasMetrics.jar, CollectInsertSizeMetrics.jar, CollectMultipleMetrics.jar, CollectRnaSeqMetrics.jar,
CollectTargetedPcrMetrics.jar, CollectWgsMetrics.jar, CompareSAMs.jar, CreateSequenceDictionary.jar,
DownsampleSam.jar, EstimateLibraryComplexity.jar, ExtractIlluminaBarcodes.jar, ExtractSequences.jar,
FastqToSam.jar, FilterSamReads.jar, FixMateInformation.jar, GatherBamFiles.jar, IlluminaBasecallsToFastq.jar,
IlluminaBasecallsToSam.jar, IntervalListTools.jar, MakeSitesOnlyVcf.jar, MarkDuplicates.jar, MarkIlluminaAdapters.jar,
MeanQualityByCycle.jar, MergeBamAlignment.jar, MergeSamFiles.jar, MergeVcfs.jar, NormalizeFasta.jar,
QualityScoreDistribution.jar, ReorderSam.jar, ReplaceSamHeader.jar, RevertOriginalBaseQualitiesAndAddMateCigar.jar,
RevertSam.jar, SamFormatConverter.jar, SamToFastq.jar, SortSam.jar, SplitVcfs.jar, ValidateSamFile.jar,
VcfFormatConverter.jar, ViewSam.jar

## get help in Picard by typing
java -jar $PICARD picard.jar <command> -h

• samtools (htslib), the developer version of samtools

Program: samtools (Tools for alignments in the SAM format)

Version: 0.2.0-rc7-99-gd60458f (using htslib 0.2.0-rc7-37-gdb8055a)

Usage: samtools <command> [options]


 -- indexing
        faidx       index/extract FASTA
        index       index alignment
 -- editing
        calmd       recalculate MD/NM tags and '=' bases
        fixmate     fix mate information
        reheader    replace BAM header
        rmdup       remove PCR duplicates
        targetcut   cut fosmid regions (for fosmid pool only)
 -- file operations
        bamshuf     shuffle and group alignments by name
        cat         concatenate BAMs
        merge       merge sorted alignments
        mpileup     multi-way pileup
        sort        sort alignment file
        split       splits a file by read group
 -- stats
        bedcov      read depth per BED region
        depth       compute the depth
        flagstat    simple stats
        idxstats    BAM index stats
        phase       phase heterozygotes
        stats       generate stats (former bamcheck)
 -- viewing
        flags       explain BAM flags
        tview       text alignment viewer
        view        SAM<->BAM<->CRAM conversion

• bcftools (htslib), the samtools (dev) accompanying toolbox

Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
Version: 0.2.0-rc7-52-ge3fe732 (using htslib 0.2.0-rc7-37-gdb8055a)

Usage:   bcftools <command> <argument>


 -- Indexing
    index        index VCF/BCF files

 -- VCF/BCF manipulation
    annotate     annotate and edit VCF/BCF files
    concat       concatenate VCF/BCF files from the same set of samples
    isec         intersections of VCF/BCF files
    merge        merge VCF/BCF files files from non-overlapping sample sets
    norm         left-align and normalize indels
    query        transform VCF/BCF into user-defined formats
    view         VCF/BCF conversion, view, subset and filter VCF/BCF files

 -- VCF/BCF analysis
    call         SNP/indel calling
    filter       filter VCF/BCF files using fixed thresholds
    gtcheck      check sample concordance, detect sample swaps and contamination
    roh          identify runs of autozygosity (HMM)
    stats        produce VCF/BCF stats

 Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
 automatically even when streaming from a pipe. Indexed VCF and BCF will work
 in all situations. Un-indexed VCF and BCF and streams will work in most but
 not all situations.

• bedtools, a multiusage BED/SAM/VCF manipulation suite

Get the full BedTools PDF manual from https://media.readthedocs.org/pdf/bedtools/latest/bedtools.pdf [11]

Read it online at http://bedtools.readthedocs.org/ [12]

bedtools: flexible tools for genome arithmetic and DNA sequence analysis.
usage:    bedtools <subcommand> [options]

The bedtools sub-commands include:

[ Genome arithmetic ]
    intersect     Find overlapping intervals in various ways.
    window        Find overlapping intervals within a window around an interval.
    closest       Find the closest, potentially non-overlapping interval.
    coverage      Compute the coverage over defined intervals.
    map           Apply a function to a column for each overlapping interval.
    genomecov     Compute the coverage over an entire genome.
    merge         Combine overlapping/nearby intervals into a single interval.
    cluster       Cluster (but don't merge) overlapping/nearby intervals.
    complement    Extract intervals _not_ represented by an interval file.
    subtract      Remove intervals based on overlaps b/w two files.
    slop          Adjust the size of intervals.
    flank         Create new intervals from the flanks of existing intervals.
    sort          Order the intervals in a file.
    random        Generate random intervals in a genome.
    shuffle       Randomly redistrubute intervals in a genome.
    annotate      Annotate coverage of features from multiple files.

[ Multi-way file comparisons ]
    multiinter    Identifies common intervals among multiple interval files.
    unionbedg     Combines coverage intervals from multiple BEDGRAPH files.

[ Paired-end manipulation ]
    pairtobed     Find pairs that overlap intervals in various ways.
    pairtopair    Find pairs that overlap other pairs in various ways.

[ Format conversion ]
    bamtobed      Convert BAM alignments to BED (& other) formats.
    bedtobam      Convert intervals to BAM records.
    bamtofastq    Convert BAM records to FASTQ records.
    bedpetobam    Convert BEDPE intervals to BAM records.
    bed12tobed6   Breaks BED12 intervals into discrete BED6 intervals.

[ Fasta manipulation ]
    getfasta      Use intervals to extract sequences from a FASTA file.
    maskfasta     Use intervals to mask sequences from a FASTA file.
    nuc           Profile the nucleotide content of intervals in a FASTA file.

[ BAM focused tools ]
    multicov      Counts coverage from multiple BAMs at specific intervals.
    tag           Tag BAM alignments based on overlaps with interval files.

[ Statistical relationships ]
    jaccard       Calculate the Jaccard statistic b/w two sets of intervals.
    reldist       Calculate the distribution of relative distances b/w two files.

[ Miscellaneous tools ]
    overlap       Computes the amount of overlap from two intervals.
    igv           Create an IGV snapshot batch script.
    links         Create a HTML page of links to UCSC locations.
    makewindows   Make interval "windows" across a genome.
    groupby       Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
    expand        Replicate lines based on lists of values in columns.

[ General help ]
    --help        Print this help menu.
    --version     What version of bedtools are you using?.
    --contact     Feature requests, bugs, mailing lists, etc.

• IGVtools, the toolbox used to prepare files for IGV (compress, index, and more)

Program: igvtools. IGV Version 2.3.32 (31)03/17/2014 09:01 PM

Usage: igvtools [command] [options] [input file/dir] [other arguments]

Command: version print the version number
         sort    sort an alignment file by start position.
         index   index an alignment file
         toTDF    convert an input file (cn, gct, wig) to tiled data format (tdf)
         count   compute coverage density for an alignment file
         formatexp  center, scale, and log2 normalize an expression file
         gui      Start the gui
         help <command>     display this help message, or help on a specific command
         See http://www.broadinstitute.org/software/igv/igvtools_commandline for more detailed help

• annotate_variation.pl; most used Annovar command (but not only!)

     annotate_variation.pl [arguments] <query-file|table-name> <database-location>
     Optional arguments:
            -h, --help                      print help message
            -m, --man                       print complete documentation
            -v, --verbose                   use verbose output
            Arguments to download databases or perform annotations
                --downdb                    download annotation database
                --geneanno                  annotate variants by gene-based annotation (infer functional consequence on genes)
                --regionanno                annotate variants by region-based annotation (find overlapped regions in database)
                --filter                    annotate variants by filter-based annotation (find identical variants in database)
            Arguments to control input and output
                --outfile <file>            output file prefix
                --webfrom <string>          specify the source of database (ucsc or annovar or URL) (downdb operation)
                --dbtype <string>           specify database type
                --buildver <string>         specify genome build version (default: hg18 for human)
                --time                      print out local time during program run
                --comment                   print out comment line (those starting with #) in output files 
                --exonsort                  sort the exon number in output line (gene-based annotation)
                --transcript_function       use transcript name rather than gene name (gene-based annotation)
                --hgvs                      use HGVS format for exonic annotation (c.122C>T rather than c.C122T) (gene-based annotation)
                --separate                  separately print out all functions of a variant in several lines (gene-based annotation)
                --seq_padding               create a new file with cDNA sequence padded by this much either side (gene-based annotation)
                --(no)firstcodondel         treat first codon deletion as wholegene deletion (default: ON) (gene-based annotation)
                --aamatrix <file>           specify an amino acid substitution matrix file (gene-based annotation)
                --colsWanted <string>       specify which columns to output by comma-delimited numbers (region-based annotation)
                --scorecolumn <int>         the column with scores in DB file (region-based annotation)
                --gff3dbfile <file>         specify a DB file in GFF3 format (region-based annotation)
                --bedfile <file>            specify a DB file in BED format file (region-based annotation)
                --gff3attribute             output all fields in GFF3 attribute (default: ID and score only)
                --genericdbfile <file>      specify a DB file in generic format (filter-based annotation)
                --vcfdbfile <file>          specify a DB file in VCF format (filter-based annotation)
                --otherinfo                 print out additional columns in database file (filter-based annotation)
                --infoasscore               use INFO field in VCF file as score in output (filter-based annotation)
            Arguments to fine-tune the annotation procedure
                --batchsize <int>           batch size for processing variants per batch (default: 5m)
                --genomebinsize <int>       bin size to speed up search (default: 100k for -geneanno, 10k for -regionanno)
                --expandbin <int>           check nearby bin to find neighboring genes (default: 2m/genomebinsize)
                --neargene <int>            distance threshold to define upstream/downstream of a gene
                --exonicsplicing            report exonic variants near exon/intron boundary as 'exonic;splicing' variants
                --score_threshold <float>   minimum score of DB regions to use in annotation
                --reverse                   reverse directionality to compare to score_threshold
                --normscore_threshold <float> minimum normalized score of DB regions to use in annotation
                --rawscore                  output includes the raw score (not normalized score) in UCSC Browser Track
                --minqueryfrac <float>      minimum percentage of query overlap to define match to DB (default: 0)
                --splicing_threshold <int>  distance between splicing variants and exon/intron boundary (default: 2)
                --indel_splicing_threshold <int>    if set, use this value for allowed indel size for splicing variants (default: --splicing_threshold)
                --maf_threshold <float>     filter 1000G variants with MAF above this threshold (default: 0)
                --sift_threshold <float>    SIFT threshold for deleterious prediction for -dbtype avsift (default: 0.05)
                --precedence <string>       comma-delimited to specify precedence of variant function (default: exonic>intronic...)
                --indexfilter_threshold <float>     controls whether filter-based annotation use index if this fraction of bins need to be scanned (default: 0.9)
           Arguments to control memory usage
                --memfree <int>             ensure minimum amount of free system memory (default: 0)
                --memtotal <int>            limit total amount of memory used by ANNOVAR (default: 0, unlimited, in the order of kb)
                --chromosome <string>       examine these specific chromosomes in database file
     Function: annotate a list of genetic variants against genome annotation 
     databases stored at local disk.
     Example: #download gene annotation database (for hg18 build) and save to humandb/ directory
              annotate_variation.pl -downdb refGene humandb/
              annotate_variation.pl -buildver mm9 -downdb  mousedb/
              annotate_variation.pl -downdb -webfrom annovar esp6500si_all humandb/
              #gene-based annotation of variants in the varlist file (by default --geneanno is ON)
              annotate_variation.pl ex1.human humandb/
              #region-based annotate variants
              annotate_variation.pl -regionanno -dbtype cytoBand ex1.human humandb/
              annotate_variation.pl -regionanno -dbtype gff3 -gff3dbfile tfbs.gff3 ex1.human humandb/
              #filter rare or unreported variants (in 1000G/dbSNP) or predicted deleterious variants
              annotate_variation.pl -filter -dbtype 1000g2012apr_all -maf 0.01 ex1.human humandb/
              annotate_variation.pl -filter -dbtype snp130 ex1.human humandb/
              annotate_variation.pl -filter -dbtype avsift ex1.human humandb/
     Version: $LastChangedDate: 2013-08-23 11:32:41 -0700 (Fri, 23 Aug 2013) $

• snpEff commands used to add annotations to VCF data

snpEff version SnpEff 3.4 (build 2013-11-23), by Pablo Cingolani
Usage: snpEff [command] [options] [files]
Available commands: 
   [eff]           : Calculate effect of variants. Default: eff (no command or 'eff').
   build           : Build a SnpEff database.
   buildNextProt   : Build a SnpEff for NextProt (using NextProt's XML files).
   cds             : Compare CDS sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
   closest         : Annotate the closest genomic region.
   count           : Count how many intervals (from a BAM, BED or VCF file) overlap with each genomic interval.
   databases       : Show currently available databases (from local config file).
   download        : Download a SnpEff database.
   dump            : Dump to STDOUT a SnpEff database (mostly used for debugging).
   genes2bed       : Create a bed file from a genes list.
   len             : Calculate total genomic length for each marker type.
   protein         : Compare protein sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
   spliceAnalysis  : Perform an analysis of splice sites. Experimental feature.
Run 'java -jar snpEff.jar command' for help on each specific command
Generic options:
	-c , -config     : Specify config file
	-d , -debug      : Debug mode (very verbose).
	-dataDir <path>  : Override data_dir parameter from config file.
	-h , -help       : Show this help and exit
	-noLog           : Do not report usage statistics to server
	-q , -quiet      : Quiet mode (do not show any messages or errors)
	-v , -verbose    : Verbose mode

• snpSift commands used to filter snpEff-annotated VCF data

SnpSift version SnpSift 3.4 (build 2013-11-23), by Pablo Cingolani
Usage: java -jar SnpSift.jar [command] params...
Command is one of:
	alleleMat     : Create an allele matrix output.
	annotate      : Annotate 'ID' from a database (e.g. dbSnp). Assumes entries are sorted.
	annMem        : Annotate 'ID' from a database (e.g. dbSnp). Loads db in memory. Does not assume sorted entries.
	caseControl   : Compare how many variants are in 'case' and in 'control' groups; calculate p-values.
	ccs           : Case control summary. Case and control summaries by region, allele frequency and variant's functional effect.
	coevolution   : Co-evoution case control model (this feature is alpha).
	concordance   : Concordance metrics between two VCF files.
	covMat        : Create an covariance matrix output (allele matrix as input).
	dbnsfp        : Annotate with multiple entries from dbNSFP. <EXPERIMENTAL>
	extractFields : Extract fields from VCF file into tab separated format.
	filter        : Filter using arbitrary expressions
	geneSets      : Annotate using MSigDb gene sets (MSigDb includes: GO, KEGG, Reactome, BioCarta, etc.)
	gt            : Add Genotype to INFO fields and remove genotype fields when possible.
	gwasCat       : Annotate using GWAS catalog
	hwe           : Calculate Hardy-Weimberg parameters and perform a godness of fit test.
	intersect     : Intersect intervals (genomic regions).
	intervals     : Keep variants that intersect with intervals.
	intIdx        : Keep variants that intersect with intervals. Index-based method: Used for large VCF file and a few intervals to retrieve
	join          : Join files by genomic region.
	phastCons     : Annotate using conservation scores (phastCons).
	private       : Annotate if a variant is private to a family or group.
	rmRefGen      : Remove reference genotypes.
	rmInfo        : Remove INFO fields.
	sift          : Annotate using SIFT scores from a VCF file.
	split         : Split VCF by chromosome.
	tstv          : Calculate transiton to transversion ratio.
	varType       : Annotate variant type (SNP,MNP,INS,DEL or MIXED).
	vcf2tped      : Convert VCF to TPED.


Technical.png Version number reported for each tool correspond to the lowest version used to prepare this training and may differ from the current version. Older versions may as well do!

Technical.png Additional tools were installed from downloads or from source and are referenced to web links in the respective pages


Tools in the spotlight

Some tools found during the preparation of this training but not covered today include:

  • seqtk if you need to convert between FastA <-> FastQ (by Heng Li)


Usage:   seqtk <command> <arguments>
Version: 1.0-r32

Command: seq       common transformation of FASTA/Q
         comp      get the nucleotide composition of FASTA/Q
         sample    subsample sequences
         subseq    extract subsequences from FASTA/Q
         trimfq    trim FASTQ using the Phred algorithm

         hety      regional heterozygosity
         mutfa     point mutate FASTA at specified positions
         mergefa   merge two FASTA/Q files
         randbase  choose a random base from hets
         cutN      cut sequence at long N
         listhet   extract the position of each het
  • bioawk, a modified AWK to dig into BED/VCF/SAM/... (by Heng Li)


usage: bioawk [-F fs] [-v var=value] [-c fmt] [-tH] [-f progfile | 'prog'] [file ...]

1. List the supported formats:
        # bioawk -c help
        bed: 1:chrom 2:start 3:end 4:name 5:score 6:strand 7:thickstart 8:thickend 9:rgb 10:blockcount 11:blocksizes 12:blockstarts
        sam 1:qname 2:flag 3:rname 4:pos 5:mapq 6:cigar 7:rnext 8:pnext 9:tlen 10:seq 11:qual
        vcf: 1:chrom 2:pos 3:id 4:ref 5:alt 6:qual 7:filter 8:info
        gff: 1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute
        fastx: 1:name 2:seq 3:qual 4:comment

2. Extract unmapped reads without header:
        bioawk -c sam 'and($flag,4)' aln.sam.gz
2b.Extract unmapped read pairs with header: potential novel sequences absent from the reference genome and/or unmappable reads due to sequence content

         bioawk -Hc sam 'and($flag,13)' als.sam

3. Extract mapped reads with header:
        bioawk -Hc sam '!and($flag,4)'
4. Reverse complement FASTA:
        bioawk -c fastx '{print ">"$name;print revcomp($seq)}' seq.fa.gz
5. Create FASTA from SAM (uses revcomp if FLAG & 16)
        samtools view aln.bam | \
            bioawk -c sam '{s=$seq; if(and($flag, 16)) {s=revcomp($seq)} print ">"$qname"\n"s}'
6. Print the genotypes of sample `foo` and `bar` from a VCF:
        grep -v ^## in.vcf | bioawk -tc hdr '{print $foo,$bar}'

Find more tools and answers

There are many tools out there, finding them is often the easiest part. You are welcome to try as many as you wish and improve results obtained with our selected toolbox. When seeking advice, please consider using:

Today's Hands-On Training

Housekeeping rules

All exercises were performed by the trainer on one of the BITS laptops and the results moved to a new subfolder called Final_Results on the BITS laptops. This ensures that you can take up any exercise and will find the necessary input data in the corresponding folder even if you did not perform the previous exercises.

The results that You will generate will be stored in similarly named folders created directly under your working folder and will not overwrite the 'good' versions stored in Final_Results.

Additionally, key result files were copied on the BITS file server and can be downloaded from links present at the bottom of each exercise page.

Technical.png All scripts described during the training are store in the scripts folder. These scripts, as well as all shorter command blocks presented in the exercises expect that your terminal points to the training folder. The training folder is known by the laptop as BASE. You should always run your commands while being in the BASE-camp. To make sure you are at the right address, type at any time: cd $BASE Enter.

The Exercises

Code developed in these exercises

Most scripts included in the exercises below are reproduced and maintained on our dedicated GitHub page https://github.com/splaisan/NGS-Variant-Analysis-training [18] (please feed-us back about any issue with this page).

Exercises with a coloured text background are long and not required for today. You can keep them for later.

Bonus material


This hands-on training session was aimed at presenting a simplified although complete workflow for NGS variant analysis and to train users in taking advantage of the many command-line interface resources.

The number of research groups generating NGS data is increasing upon time and the demand for data analysis is getting greater than the capacity of the facilities within the Institute and too expensive through outsourced commercial help. For these reason, more and more students and staff scientists feel the 'urgent need' to perform their own NGS data analysis.

Although a certain level of computer infrastructure is required in order to perform this analysis in ideal conditions, knowledge and training on NGS analysis seem to be today's main limiting factors in research groups.

This session demonstrates that a minimum of training is sufficient to allow classical NGS analysis on a laptop computer. For higher throughput an more complex analysis, additional learning and more resources will be required but a lot can be initiated using the many available online resources and communities.

Your feedback to this two-days general command-line based NGS data analysis session is very important to us and will be used to improve this content for later sessions. If you need more training of this kind, please contact us and we will organize additional hands-on based on your requests. More advance sessions will depend on the availability of expert users within VIB that will accept to prepare specified material.

PS: We tried to cite resources and people that made this training possible. If you feel that some citations are missing, please inform us and we will correct this omission.

contact us

  1. http://www.1000genomes.org
  2. http://main.g2.bx.psu.edu
  3. http://galaxy.bits.vib.be
  4. http://www.broadinstitute.org/gatk/
  5. http://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM18507
  6. http://wiki.bits.vib.be/index.php/Introduction_to_Linux_for_bioinformatics
  7. http://wiki.bits.vib.be/index.php/NGS_data_analysis
  8. Stephan Pabinger, Andreas Dander, Maria Fischer, Rene Snajder, Michael Sperk, Mirjana Efremova, Birgit Krabichler, Michael R Speicher, Johannes Zschocke, Zlatko Trajanoski
    A survey of tools for variant analysis of next-generation genome sequencing data.
    Brief Bioinform: 2014, 15(2);256-78
    [PubMed:23341494] ##WORLDCAT## [DOI] (I p)

  9. http://www.gnu.org/software/coreutils/manual/coreutils.pdf
  10. http://picard.sourceforge.net
  11. https://media.readthedocs.org/pdf/bedtools/latest/bedtools.pdf
  12. http://bedtools.readthedocs.org/
  13. https://media.readthedocs.org/pdf/bedops/latest/bedops.pdf
  14. http://seqanswers.com SeqAnswers
  15. http://www.biostars.org BioStar
  16. http://stackoverflow.com stackoverflow
  17. http://bioinformatics.ca/links_directory
  18. https://github.com/splaisan/NGS-Variant-Analysis-training

[ Main_Page ]