Hands-on introduction to NGS variant analysis-2016

From BITS wiki
Jump to: navigation, search

[ Main_Page | NGS_data_analysis ]
# one-day training 2016 session


Technical.png This session is a simplified and rewritten version of a more complete and exploratory training given in 2013 and 2014 (Hands-on introduction to NGS variant analysis)

Technical.png The most recent version of this training (2018) can be found at (Hands-on_introduction_to_NGS_variant_analysis-2018)

A PDF printout of the training pages was generated for your convenience. When possible, favour using the Wiki which may contain updated information as compared to the assembled document.


Aims of the NGS DNA variant analysis 1-day session

Using a full publicly available chromosome read-set from one of the 1000 genomes[1] samples:

  • Perform a complete analysis workflow including QC, read mapping, read coverage analysis, and variant calling against the human reference genome.
  • Use command line and popular open source software to evaluate each step of a classical NGS variant workflow and feel the complexity of the task.
  • Compare the obtained results with gold standard public data.
  • Visualise selected results in IGV


The skills acquired during this session should allow reproduction of the workflow by participants using their own reads and a reasonably-sized personal computer running Unix.

Technical.png All scripts as well as the most important result files are put online to allow offline practice to trainees and web visitors

More BITS Training Info


This training gives an introduction to the use of popular NGS analysis software packages at the command line. It reviews several exchangeable tools and provides hints to evaluate quality and content of Genome-Seq data. The code used in the training pages can easily be recycled and adapted, and will constitute the basis of your own workflows.

The sequencing data used in this session was obtained from gDNA extracted from EBV-transformed B-lymphocytes of a healthy Nigerian individual (NA18507). More information is available for that sample from the Coriell repository from which 1000g gDNA can be obtained [2].

This training does not cover all currently available methods. It does not aim at bringing users to a professional NGS analyst level but provides enough information to allow motivated biologists understand what DNA sequencing practically is, to read and reuse command-line code, and when necessary to communicate knowingly with NGS experts for more in-depth needs.


Skills required to follow this training:

  • Linux command line basic skills are absolutely required to understand the code used throughout the session
  • basic knowledge of human genome structure and nomenclature is necessary to estimate the training tasks
  • basic knowledge of Illumina NGS read structure is also required for the same reason

Software used during this training:

  • All programs used in this training session were installed on the BITS laptops and are freely available for non-commercial use from links added to the training pages
  • Users with a minimal knowledge in unix command-line should be able to install the programs on their lab desktop or strong laptop computer to reproduce commands used in this tutorial

In addition to the many built-in unix command, originating from the GNU consortium for a good part (please read the GNU coreutils manual for more [3]), we will use today several dedicated toolboxes including:

• fastQC, the very popular NGS reads QC application

            FastQC - A high throughput sequence QC analysis tool


        fastqc seqfile1 seqfile2 .. seqfileN

    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
           [-c contaminant file] seqfile1 .. seqfileN


    FastQC reads a set of sequence files and produces from each one a quality
    control report consisting of a number of different modules, each one of
    which will help to identify a different potential type of problem in your
    If no files to process are specified on the command line then the program
    will start as an interactive graphical application.  If files are provided
    on the command line then the program will run with no user interaction
    required.  In this mode it is suitable for inclusion into a standardised
    analysis pipeline.
    The options for the program as as follows:
    -h --help       Print this help file and exit
    -v --version    Print the version of the program and exit
    -o --outdir     Create all output files in the specified output directory.
                    Please note that this directory must exist as the program
                    will not create it.  If this option is not set then the
                    output file for each sequence file is created in the same
                    directory as the sequence file which was processed.
    --casava        Files come from raw casava output. Files in the same sample
                    group (differing only by the group number) will be analysed
                    as a set rather than individually. Sequences with the filter
                    flag set in the header will be excluded from the analysis.
                    Files must have the same names given to them by casava
                    (including being gzipped and ending with .gz) otherwise they
                    won't be grouped together correctly.
    --nano          Files come from naopore sequences and are in fast5 format. In
                    this mode you can pass in directories to process and the program
                    will take in all fast5 files within those directories and produce
                    a single output file from the sequences found in all files.                    
    --nofilter      If running with --casava then don't remove read flagged by
                    casava as poor quality when performing the QC analysis.
    --extract       If set then the zipped output file will be uncompressed in
                    the same directory after it has been created.  By default
                    this option will be set if fastqc is run in non-interactive
    -j --java       Provides the full path to the java binary you want to use to
                    launch fastqc. If not supplied then java is assumed to be in
                    your path.
    --noextract     Do not uncompress the output file after creating it.  You
                    should set this option if you do not wish to uncompress
                    the output when running in non-interactive mode.
    --nogroup       Disable grouping of bases for reads >50bp. All reports will
                    show data for every base in the read.  WARNING: Using this
                    option will cause fastqc to crash and burn if you use it on
                    really long reads, and your plots may end up a ridiculous size.
                    You have been warned!
    -f --format     Bypasses the normal sequence file format detection and
                    forces the program to use the specified format.  Valid
                    formats are bam,sam,bam_mapped,sam_mapped and fastq
    -t --threads    Specifies the number of files which can be processed
                    simultaneously.  Each thread will be allocated 250MB of
                    memory so you shouldn't run more threads than your
                    available memory will cope with, and not more than
                    6 threads on a 32 bit machine
    -c              Specifies a non-default file which contains the list of
    --contaminants  contaminants to screen overrepresented sequences against.
                    The file must contain sets of named contaminants in the
                    form name[tab]sequence.  Lines prefixed with a hash will
                    be ignored.

    -a              Specifies a non-default file which contains the list of
    --adapters      adapter sequences which will be explicity searched against
                    the library. The file must contain sets of named adapters
                    in the form name[tab]sequence.  Lines prefixed with a hash
                    will be ignored.
    -l              Specifies a non-default file which contains a set of criteria
    --limits        which will be used to determine the warn/error limits for the
                    various modules.  This file can also be used to selectively
                    remove some modules from the output all together.  The format
                    needs to mirror the default limits.txt file found in the
                    Configuration folder.
   -k --kmers       Specifies the length of Kmer to look for in the Kmer content
                    module. Specified Kmer length must be between 2 and 10. Default
                    length is 7 if not specified.
   -q --quiet       Supress all progress messages on stdout and only report errors.
   -d --dir         Selects a directory to be used for temporary files written when
                    generating report images. Defaults to system temp directory if
                    not specified.

    Any bugs in fastqc should be reported either to simon.andrews@babraham.ac.uk
    or in www.bioinformatics.babraham.ac.uk/bugzilla/

• bwa, the Burrow Wheeler Aligner for mapping reads to the reference genome

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.12-r1039
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         shm           manage indices in shared memory
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
      There are three alignment algorithms in BWA: `mem', `bwasw', and
      `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
      first. Please `man ./bwa.1' for the manual.

• Picard (2.x) [4], the 'broadly' used java wrapper for samtools with added functionalities

## how to run Picard tools in recent versions ?
# provided PICARD is defined and points to the place where the picard.jar file is located, you can type:
java -jar $PICARD/picard.jar <command-name> <command-options...>

## current sorted list of Picard commands:
AddCommentsToBam.jar, AddOrReplaceReadGroups.jar, BamIndexStats.jar, BamToBfq.jar, BuildBamIndex.jar,
CalculateHsMetrics.jar, CheckIlluminaDirectory.jar, CleanSam.jar, CollectAlignmentSummaryMetrics.jar,
CollectGcBiasMetrics.jar, CollectInsertSizeMetrics.jar, CollectMultipleMetrics.jar, CollectRnaSeqMetrics.jar,
CollectTargetedPcrMetrics.jar, CollectWgsMetrics.jar, CompareSAMs.jar, CreateSequenceDictionary.jar,
DownsampleSam.jar, EstimateLibraryComplexity.jar, ExtractIlluminaBarcodes.jar, ExtractSequences.jar,
FastqToSam.jar, FilterSamReads.jar, FixMateInformation.jar, GatherBamFiles.jar, IlluminaBasecallsToFastq.jar,
IlluminaBasecallsToSam.jar, IntervalListTools.jar, MakeSitesOnlyVcf.jar, MarkDuplicates.jar, MarkIlluminaAdapters.jar,
MeanQualityByCycle.jar, MergeBamAlignment.jar, MergeSamFiles.jar, MergeVcfs.jar, NormalizeFasta.jar,
QualityScoreDistribution.jar, ReorderSam.jar, ReplaceSamHeader.jar, RevertOriginalBaseQualitiesAndAddMateCigar.jar,
RevertSam.jar, SamFormatConverter.jar, SamToFastq.jar, SortSam.jar, SplitVcfs.jar, ValidateSamFile.jar,
VcfFormatConverter.jar, ViewSam.jar

## get help in Picard by typing
java -jar $PICARD picard.jar <command> -h

• samtools 1.x, the new (htslib) version of samtools

Program: samtools (Tools for alignments in the SAM format)
Version: 1.3.1 (using htslib 1.3.1)

Usage:   samtools <command> [options]

  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     rmdup          remove PCR duplicates
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA

  -- Statistics
     bedcov         read depth per BED region
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

• bcftools 1.x (htslib), the samtools (htslib) accompanying toolbox

Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
Version: 1.3.1 (using htslib 1.3.1)

Usage:   bcftools [--version|--version-only] [--help] <command> <argument>


 -- Indexing
    index        index VCF/BCF files

 -- VCF/BCF manipulation
    annotate     annotate and edit VCF/BCF files
    concat       concatenate VCF/BCF files from the same set of samples
    convert      convert VCF/BCF files to different formats and back
    isec         intersections of VCF/BCF files
    merge        merge VCF/BCF files files from non-overlapping sample sets
    norm         left-align and normalize indels
    plugin       user-defined plugins
    query        transform VCF/BCF into user-defined formats
    reheader     modify VCF/BCF header, change sample names
    view         VCF/BCF conversion, view, subset and filter VCF/BCF files

 -- VCF/BCF analysis
    call         SNP/indel calling
    consensus    create consensus sequence by applying VCF variants
    cnv          HMM CNV calling
    filter       filter VCF/BCF files using fixed thresholds
    gtcheck      check sample concordance, detect sample swaps and contamination
    roh          identify runs of autozygosity (HMM)
    stats        produce VCF/BCF stats

 Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
 automatically even when streaming from a pipe. Indexed VCF and BCF will work
 in all situations. Un-indexed VCF and BCF and streams will work in most but
 not all situations.

• qualimap, evaluating next generation sequencing alignment data

## To show available tools use command:
qualimap --help
QualiMap v.2.2
Built on 2016-01-29 12:10
usage: qualimap <tool> [options]
To launch GUI leave <tool> empty.
Available tools:
    bamqc            Evaluate NGS mapping to a reference genome
    rnaseq           Evaluate RNA-seq alignment data
    counts           Counts data analysis (further RNA-seq data evaluation)
    multi-bamqc      Compare QC reports from multiple NGS mappings
    clustering       Cluster epigenomic signals
    comp-counts      Compute feature counts
Special arguments: 
    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
### BAM QC #######################################
## The following command allows to perform BAM QC analysis:
QualiMap v.2.2
Built on 2016-01-29 12:10
Selected tool: bamqc
ERROR: Unrecognized option: --help
usage: qualimap bamqc -bam <arg> [-c] [-gd <arg>] [-gff <arg>] [-hm <arg>] [-ip]
       [-nr <arg>] [-nt <arg>] [-nw <arg>] [-oc <arg>] [-os] [-outdir <arg>]
       [-outfile <arg>] [-outformat <arg>] [-p <arg>] [-sd] [-sdmode <arg>]
 -bam <arg>                           Input mapping file in BAM format
 -c,--paint-chromosome-limits         Paint chromosome limits inside charts
 -gd,--genome-gc-distr <arg>          Species to compare with genome GC
                                      distribution. Possible values: HUMAN -
                                      hg19; MOUSE - mm9(default), mm10
 -gff,--feature-file <arg>            Feature file with regions of interest in
                                      GFF/GTF or BED format
 -hm <arg>                            Minimum size for a homopolymer to be
                                      considered in indel analysis (default is
 -ip,--collect-overlap-pairs          Activate this option to collect statistics
                                      of overlapping paired-end reads
 -nr <arg>                            Number of reads analyzed in a chunk
                                      (default is 1000)
 -nt <arg>                            Number of threads (default is 4)
 -nw <arg>                            Number of windows (default is 400)
 -oc,--output-genome-coverage <arg>   File to save per base non-zero coverage.
                                      Warning: large files are expected for
                                      large genomes
 -os,--outside-stats                  Report information for the regions outside
                                      those defined by feature-file  (ignored
                                      when -gff option is not set)
 -outdir <arg>                        Output folder for HTML report and raw
 -outfile <arg>                       Output file for PDF report (default value
                                      is report.pdf).
 -outformat <arg>                     Format of the ouput report (PDF or HTML,
                                      default is HTML).
 -p,--sequencing-protocol <arg>       Sequencing library protocol:
                                      strand-specific-reverse or
                                      non-strand-specific (default)
 -sd,--skip-duplicated                Activate this option to skip duplicated
                                      alignments from the analysis. If the
                                      duplicates are not flagged in the BAM
                                      file, then they will be detected by
                                      Qualimap and can be selected for skipping.
 -sdmode,--skip-dup-mode <arg>        Specific type of duplicated alignments to
                                      skip (if this option is activated).
                                      0 : only flagged duplicates (default)
                                      1 : only estimated by Qualimap
                                      2 : both flagged and estimated
Special arguments: 
    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
###  Counts QC  ##########################################################
## To perform counts QC analysis (evaluation of RNA-seq data) use the following command:
QualiMap v.2.2
Built on 2016-01-29 12:10
Selected tool: counts
ERROR: Unrecognized option: --help
usage: qualimap counts [-c] -d <arg> [-i <arg>] [-k <arg>] [-outdir <arg>]
       [-outfile <arg>] [-outformat <arg>] [-R <arg>] [-s <arg>]
 -c,--compare             Perform comparison of conditions. Currently 2 maximum
                          is possible.
 -d,--data <arg>          File describing the input data. Format of the file is
                          a 4-column tab-delimited table.
                          Column 1: sample name
                          Column 2: condition of the sample
                          Column 3: path to the counts data for the sample
                          Column 4: index of the column with counts
 -i,--info <arg>          Path to info file containing genes GC-content, length
                          and type.
 -k,--threshold <arg>     Threshold for the number of counts
 -outdir <arg>            Output folder for HTML report and raw data.
 -outfile <arg>           Output file for PDF report (default value is
 -outformat <arg>         Format of the ouput report (PDF or HTML, default is
 -R,--rscriptpath <arg>   Path to Rscript executable (by default it is assumed
                          to be available from system $PATH)
 -s,--species <arg>       Use built-in info file for the given species: HUMAN or
Special arguments: 
    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
### Clustering ##############################################
## To perform clustering of epigenomic signals use the following command:
QualiMap v.2.2
Built on 2016-01-29 12:10
Selected tool: clustering
ERROR: Unrecognized option: --help
usage: qualimap clustering [-b <arg>] [-c <arg>] -control <arg> [-expr <arg>]
       [-f <arg>] [-l <arg>] [-name <arg>] [-outdir <arg>] [-outfile <arg>]
       [-outformat <arg>] [-R <arg>] [-r <arg>] -regions <arg> -sample <arg>
       [-viz <arg>]
 -b,--bin-size <arg>          Size of the bin (default is 100)
 -c,--clusters <arg>          Comma-separated list of cluster sizes
 -control <arg>               Comma-separated list of control BAM files
 -expr <arg>                  Name of the experiment
 -f,--fragment-length <arg>   Smoothing length of a fragment
 -l <arg>                     Upstream offset (default is 2000)
 -name <arg>                  Comma-separated names of the replicates
 -outdir <arg>                Output folder for HTML report and raw data.
 -outfile <arg>               Output file for PDF report (default value is
 -outformat <arg>             Format of the ouput report (PDF or HTML, default
                              is HTML).
 -R,--rscriptpath <arg>       Path to Rscript executable (by default it is
                              assumed to be available from system $PATH)
 -r <arg>                     Downstream offset (default is 500)
 -regions <arg>               Path to regions file
 -sample <arg>                Comma-separated list of sample BAM files
 -viz <arg>                   Visualization type: heatmap or line
Special arguments: 
    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
### Compute counts ######################################
## To compute counts from mapping data use the following command:
QualiMap v.2.2
Built on 2016-01-29 12:10
Selected tool: comp-counts
ERROR: Unrecognized option: --help
usage: qualimap comp-counts [-a <arg>] -bam <arg> -gtf <arg> [-id <arg>] [-out
       <arg>] [-p <arg>] [-pe] [-s] [-type <arg>]
 -a,--algorithm <arg>             Counting algorithm:
                                  uniquely-mapped-reads(default) or proportional
 -bam <arg>                       Mapping file in BAM format
 -gtf <arg>                       Region file in GTF, GFF or BED format. If GTF
                                  format is provided, counting is based on
                                  attributes, otherwise based on feature name
 -id <arg>                        GTF-specific. Attribute of the GTF to be used
                                  as feature ID. Regions with the same ID will
                                  be aggregated as part of the same feature.
                                  Default: gene_id.
 -out <arg>                       Output file of coverage report.
 -p,--sequencing-protocol <arg>   Sequencing library protocol:
                                  strand-specific-reverse or non-strand-specific
 -pe,--paired                     Setting this flag for paired-end experiments
                                  will result in counting fragments instead of
 -s,--sorted                      This flag indicates that the input file is
                                  already sorted by name. If not set, additional
                                  sorting by name will be performed. Only
                                  required for paired-end analysis.
 -type <arg>                      GTF-specific. Value of the third column of the
                                  GTF considered for counting. Other types will
                                  be ignored. Default: exon
Special arguments: 
    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
## also see 'rnaseq' and 'multi-bamqc'

• bedtools, a multiusage BED/SAM/VCF manipulation suite

Get the full BedTools PDF manual from https://media.readthedocs.org/pdf/bedtools/latest/bedtools.pdf [5]

Read it online at http://bedtools.readthedocs.org/ [6]

bedtools: flexible tools for genome arithmetic and DNA sequence analysis.
usage:    bedtools <subcommand> [options]

The bedtools sub-commands include:

[ Genome arithmetic ]
    intersect     Find overlapping intervals in various ways.
    window        Find overlapping intervals within a window around an interval.
    closest       Find the closest, potentially non-overlapping interval.
    coverage      Compute the coverage over defined intervals.
    map           Apply a function to a column for each overlapping interval.
    genomecov     Compute the coverage over an entire genome.
    merge         Combine overlapping/nearby intervals into a single interval.
    cluster       Cluster (but don't merge) overlapping/nearby intervals.
    complement    Extract intervals _not_ represented by an interval file.
    subtract      Remove intervals based on overlaps b/w two files.
    slop          Adjust the size of intervals.
    flank         Create new intervals from the flanks of existing intervals.
    sort          Order the intervals in a file.
    random        Generate random intervals in a genome.
    shuffle       Randomly redistrubute intervals in a genome.
    sample        Sample random records from file using reservoir sampling.
    spacing       Report the gap lengths between intervals in a file.
    annotate      Annotate coverage of features from multiple files.

[ Multi-way file comparisons ]
    multiinter    Identifies common intervals among multiple interval files.
    unionbedg     Combines coverage intervals from multiple BEDGRAPH files.

[ Paired-end manipulation ]
    pairtobed     Find pairs that overlap intervals in various ways.
    pairtopair    Find pairs that overlap other pairs in various ways.

[ Format conversion ]
    bamtobed      Convert BAM alignments to BED (& other) formats.
    bedtobam      Convert intervals to BAM records.
    bamtofastq    Convert BAM records to FASTQ records.
    bedpetobam    Convert BEDPE intervals to BAM records.
    bed12tobed6   Breaks BED12 intervals into discrete BED6 intervals.

[ Fasta manipulation ]
    getfasta      Use intervals to extract sequences from a FASTA file.
    maskfasta     Use intervals to mask sequences from a FASTA file.
    nuc           Profile the nucleotide content of intervals in a FASTA file.

[ BAM focused tools ]
    multicov      Counts coverage from multiple BAMs at specific intervals.
    tag           Tag BAM alignments based on overlaps with interval files.

[ Statistical relationships ]
    jaccard       Calculate the Jaccard statistic b/w two sets of intervals.
    reldist       Calculate the distribution of relative distances b/w two files.
    fisher        Calculate Fisher statistic b/w two feature files.

[ Miscellaneous tools ]
    overlap       Computes the amount of overlap from two intervals.
    igv           Create an IGV snapshot batch script.
    links         Create a HTML page of links to UCSC locations.
    makewindows   Make interval "windows" across a genome.
    groupby       Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
    expand        Replicate lines based on lists of values in columns.
    split         Split a file into multiple files with equal records or base pairs.

[ General help ]
    --help        Print this help menu.
    --version     What version of bedtools are you using?.
    --contact     Feature requests, bugs, mailing lists, etc.

• IGVtools, the toolbox used to prepare files for IGV (compress, index, and more)

Program: igvtools. IGV Version 2.3.72 (89)04/01/2016 11:43 AM

Usage: igvtools [command] [options] [input file/dir] [other arguments]

Command: version print the version number
         sort    sort an alignment file by start position.
         index   index an alignment file
         toTDF    convert an input file (cn, gct, wig) to tiled data format (tdf)
         count   compute coverage density for an alignment file
         formatexp  center, scale, and log2 normalize an expression file
         gui      Start the gui
         help <command>     display this help message, or help on a specific command
         See http://www.broadinstitute.org/software/igv/igvtools_commandline for more detailed help

• snpEff commands used to add annotations to VCF data

java -jar $SNPEFF/snpEff.jar -h
SnpEff version SnpEff 4.2 (build 2015-12-05), by Pablo Cingolani
Usage: snpEff [command] [options] [files]
Run 'java -jar snpEff.jar command' for help on each specific command
Available commands: 
	[eff|ann]                    : Annotate variants / calculate effects (you can use either 'ann' or 'eff', they mean the same). Default: ann (no command or 'ann').
	build                        : Build a SnpEff database.
	buildNextProt                : Build a SnpEff for NextProt (using NextProt's XML files).
	cds                          : Compare CDS sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
	closest                      : Annotate the closest genomic region.
	count                        : Count how many intervals (from a BAM, BED or VCF file) overlap with each genomic interval.
	databases                    : Show currently available databases (from local config file).
	download                     : Download a SnpEff database.
	dump                         : Dump to STDOUT a SnpEff database (mostly used for debugging).
	genes2bed                    : Create a bed file from a genes list.
	len                          : Calculate total genomic length for each marker type.
	pdb                          : Build interaction database (based on PDB data).
	protein                      : Compare protein sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
	show                         : Show a text representation of genes or transcripts coordiantes, DNA sequence and protein sequence.
Generic options:
	-c , -config                 : Specify config file
	-configOption name=value     : Override a config file option
	-d , -debug                  : Debug mode (very verbose).
	-dataDir <path>              : Override data_dir parameter from config file.
	-download                    : Download a SnpEff database, if not available locally. Default: true
	-nodownload                  : Do not download a SnpEff database, if not available locally.
	-h , -help                   : Show this help and exit
	-noLog                       : Do not report usage statistics to server
	-t                           : Use multiple threads (implies '-noStats'). Default 'off'
	-q , -quiet                  : Quiet mode (do not show any messages or errors)
	-v , -verbose                : Verbose mode
	-version                     : Show version number and exit
Database options:
	-canon                       : Only use canonical transcripts.
	-interaction                 : Annotate using inteactions (requires interaciton database). Default: true
	-interval <file>             : Use a custom intervals in TXT/BED/BigBed/VCF/GFF file (you may use this option many times)
	-maxTSL <TSL_number>         : Only use transcripts having Transcript Support Level lower than <TSL_number>.
	-motif                       : Annotate using motifs (requires Motif database). Default: true
	-nextProt                    : Annotate using NextProt (requires NextProt database).
	-noGenome                    : Do not load any genomic database (e.g. annotate using custom files).
	-noInteraction               : Disable inteaction annotations
	-noMotif                     : Disable motif annotations.
	-noNextProt                  : Disable NextProt annotations.
	-onlyReg                     : Only use regulation tracks.
	-onlyProtein                 : Only use protein coding transcripts. Default: false
	-onlyTr <file.txt>           : Only use the transcripts in this file. Format: One transcript ID per line.
	-reg <name>                  : Regulation track to use (this option can be used add several times).
	-ss , -spliceSiteSize <int>  : Set size for splice sites (donor and acceptor) in bases. Default: 2
	-spliceRegionExonSize <int>  : Set size for splice site region within exons. Default: 3 bases
	-spliceRegionIntronMin <int> : Set minimum number of bases for splice site region within intron. Default: 3 bases
	-spliceRegionIntronMax <int> : Set maximum number of bases for splice site region within intron. Default: 8 bases
	-strict                      : Only use 'validated' transcripts (i.e. sequence has been checked). Default: false
	-ud , -upDownStreamLen <int> : Set upstream downstream interval length (in bases)

• snpSift commands used to filter snpEff-annotated VCF data

java -jar $SNPEFF/SnpSift.jar -h
SnpSift version 4.2 (build 2015-12-05), by Pablo Cingolani
Usage: java -jar SnpSift.jar [command] params...
Command is one of:
	alleleMat     : Create an allele matrix output.
	annotate      : Annotate 'ID' from a database (e.g. dbSnp). Assumes entries are sorted.
	annMem        : Annotate 'ID' from a database (e.g. dbSnp). Loads db in memory. Does not assume sorted entries.
	caseControl   : Compare how many variants are in 'case' and in 'control' groups; calculate p-values.
	ccs           : Case control summary. Case and control summaries by region, allele frequency and variant's functional effect.
	concordance   : Concordance metrics between two VCF files.
	covMat        : Create an covariance matrix output (allele matrix as input).
	dbnsfp        : Annotate with multiple entries from dbNSFP.
	extractFields : Extract fields from VCF file into tab separated format.
	filter        : Filter using arbitrary expressions
	geneSets      : Annotate using MSigDb gene sets (MSigDb includes: GO, KEGG, Reactome, BioCarta, etc.)
	gt            : Add Genotype to INFO fields and remove genotype fields when possible.
	gtfilter      : Filter genotype using arbitrary expressions.
	gwasCat       : Annotate using GWAS catalog
	hwe           : Calculate Hardy-Weimberg parameters and perform a godness of fit test.
	intersect     : Intersect intervals (genomic regions).
	intervals     : Keep variants that intersect with intervals.
	intIdx        : Keep variants that intersect with intervals. Index-based method: Used for large VCF file and a few intervals to retrieve
	join          : Join files by genomic region.
	phastCons     : Annotate using conservation scores (phastCons).
	private       : Annotate if a variant is private to a family or group.
	rmRefGen      : Remove reference genotypes.
	rmInfo        : Remove INFO fields.
	split         : Split VCF by chromosome.
	tstv          : Calculate transiton to transversion ratio.
	varType       : Annotate variant type (SNP,MNP,INS,DEL or MIXED).
	vcfCheck      : Check that VCF file is well formed.
	vcf2tped      : Convert VCF to TPED.
Options common to all SnpSift commands:
	-d                   : Debug.
	-download            : Download database, if not available locally. Default: true.
	-noDownload          : Do not download a database, if not available locally.
	-noLog               : Do not report usage statistics to server.
	-h                   : Help.
	-v                   : Verbose.


Technical.png Version number reported for each tool correspond to the lowest version used to prepare this training and may differ from the current version. Older versions may as well do!

Technical.png Additional tools were installed from downloads or from source and are referenced to web links in the respective pages


People who have no experience with Linux command line should first follow the 'Linux for bioinformatics' training​ ([7]) or they will slow down the full class. All attendees who lack knowledge about Illumina reads should follow the preliminary one day 'Introduction to the analysis of NGS data' training ([8]).

Today's Hands-On Training

The Exercises

Code developed in these exercises

Most scripts included in the exercises below are reproduced and maintained on our dedicated GitHub page NGS-Variant-Analysis-training-2016 [9] (please feed-us back about any issue with this page).

Find more tools and answers

There are many tools out there, finding them is often the easiest part. You are welcome to try as many as you wish and improve results obtained with our selected toolbox. When seeking advice, please consider using:


Your feedback to this general command-line based NGS data analysis session is very important to us and will be used to improve this content for later sessions. If you need more training of this kind, please contact us and we will organise additional hands-on based on your requests. More advance sessions will depend on the availability of expert users within VIB that will accept to prepare specified material.

PS: We tried to cite resources and people that made this training possible. If you feel that some citations are missing, please inform us and we will correct this omission.

contact us

  1. http://www.1000genomes.org
  2. http://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM18507
  3. http://www.gnu.org/software/coreutils/manual/coreutils.pdf
  4. http://picard.sourceforge.net
  5. https://media.readthedocs.org/pdf/bedtools/latest/bedtools.pdf
  6. http://bedtools.readthedocs.org/
  7. http://wiki.bits.vib.be/index.php/Introduction_to_Linux_for_bioinformatics
  8. http://wiki.bits.vib.be/index.php/NGS_data_analysis
  9. https://github.com/BITS-VIB/NGS-Variant-Analysis-training-2016
  10. http://seqanswers.com SeqAnswers
  11. http://www.biostars.org BioStar
  12. http://stackoverflow.com stackoverflow
  13. http://tvap.genome.wustl.edu/
  14. http://bioinformatics.ca/links_directory
  15. https://broadinstitute.github.io/picard/explain-flags.html

[ Main_Page ]