Hands-on introduction to NGS variant analysis-2016

[ Main_Page | NGS_data_analysis ]
# one-day training 2016 session

This session is a simplified and rewritten version of a more complete and exploratory training given in 2013 and 2014 (Hands-on introduction to NGS variant analysis)

The most recent version of this training (2018) can be found at (Hands-on_introduction_to_NGS_variant_analysis-2018)

A PDF printout of the training pages was generated for your convenience. When possible, favour using the Wiki which may contain updated information as compared to the assembled document.

Aims of the NGS DNA variant analysis 1-day session

Using a full publicly available chromosome read-set from one of the 1000 genomes^[1] samples:

Perform a complete analysis workflow including QC, read mapping, read coverage analysis, and variant calling against the human reference genome.
Use command line and popular open source software to evaluate each step of a classical NGS variant workflow and feel the complexity of the task.
Compare the obtained results with gold standard public data.
Visualise selected results in IGV

The skills acquired during this session should allow reproduction of the workflow by participants using their own reads and a reasonably-sized personal computer running Unix.

All scripts as well as the most important result files are put online to allow offline practice to trainees and web visitors

More BITS Training Info

On the VIB website: http://www.vib.be/en/training/research-training/courses/Pages/Hands-On-introduction-to-NGS-variant-analysis.aspx
On the BITS website: https://www.bits.vib.be/index.php/training/201-ngs-variant-analysis

Summary

This training gives an introduction to the use of popular NGS analysis software packages at the command line. It reviews several exchangeable tools and provides hints to evaluate quality and content of Genome-Seq data. The code used in the training pages can easily be recycled and adapted, and will constitute the basis of your own workflows.

The sequencing data used in this session was obtained from gDNA extracted from EBV-transformed B-lymphocytes of a healthy Nigerian individual (NA18507). More information is available for that sample from the Coriell repository from which 1000g gDNA can be obtained ^[2].

This training does not cover all currently available methods. It does not aim at bringing users to a professional NGS analyst level but provides enough information to allow motivated biologists understand what DNA sequencing practically is, to read and reuse command-line code, and when necessary to communicate knowingly with NGS experts for more in-depth needs.

Prerequisites

Skills required to follow this training:

Linux command line basic skills are absolutely required to understand the code used throughout the session
basic knowledge of human genome structure and nomenclature is necessary to estimate the training tasks
basic knowledge of Illumina NGS read structure is also required for the same reason

Software used during this training:

All programs used in this training session were installed on the BITS laptops and are freely available for non-commercial use from links added to the training pages
Users with a minimal knowledge in unix command-line should be able to install the programs on their lab desktop or strong laptop computer to reproduce commands used in this tutorial

In addition to the many built-in unix command, originating from the GNU consortium for a good part (please read the GNU coreutils manual for more ^[3]), we will use today several dedicated toolboxes including:

• fastQC, the very popular NGS reads QC application

FastQC - A high throughput sequence QC analysis tool

SYNOPSIS

fastqc seqfile1 seqfile2 .. seqfileN

fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
[-c contaminant file] seqfile1 .. seqfileN

DESCRIPTION

FastQC reads a set of sequence files and produces from each one a quality
control report consisting of a number of different modules, each one of
which will help to identify a different potential type of problem in your
data.

If no files to process are specified on the command line then the program
will start as an interactive graphical application. If files are provided
on the command line then the program will run with no user interaction
required. In this mode it is suitable for inclusion into a standardised
analysis pipeline.

The options for the program as as follows:

-h --help Print this help file and exit

-v --version Print the version of the program and exit

-o --outdir Create all output files in the specified output directory.
Please note that this directory must exist as the program
will not create it. If this option is not set then the
output file for each sequence file is created in the same
directory as the sequence file which was processed.

--casava Files come from raw casava output. Files in the same sample
group (differing only by the group number) will be analysed
as a set rather than individually. Sequences with the filter
flag set in the header will be excluded from the analysis.
Files must have the same names given to them by casava
(including being gzipped and ending with .gz) otherwise they
won't be grouped together correctly.

--nano Files come from naopore sequences and are in fast5 format. In
this mode you can pass in directories to process and the program
will take in all fast5 files within those directories and produce
a single output file from the sequences found in all files.

--nofilter If running with --casava then don't remove read flagged by
casava as poor quality when performing the QC analysis.

--extract If set then the zipped output file will be uncompressed in
the same directory after it has been created. By default
this option will be set if fastqc is run in non-interactive
mode.

-j --java Provides the full path to the java binary you want to use to
launch fastqc. If not supplied then java is assumed to be in
your path.

--noextract Do not uncompress the output file after creating it. You
should set this option if you do not wish to uncompress
the output when running in non-interactive mode.

--nogroup Disable grouping of bases for reads >50bp. All reports will
show data for every base in the read. WARNING: Using this
option will cause fastqc to crash and burn if you use it on
really long reads, and your plots may end up a ridiculous size.
You have been warned!

-f --format Bypasses the normal sequence file format detection and
forces the program to use the specified format. Valid
formats are bam,sam,bam_mapped,sam_mapped and fastq

-t --threads Specifies the number of files which can be processed
simultaneously. Each thread will be allocated 250MB of
memory so you shouldn't run more threads than your
available memory will cope with, and not more than
6 threads on a 32 bit machine

-c Specifies a non-default file which contains the list of
--contaminants contaminants to screen overrepresented sequences against.
The file must contain sets of named contaminants in the
form name[tab]sequence. Lines prefixed with a hash will
be ignored.

-a Specifies a non-default file which contains the list of
--adapters adapter sequences which will be explicity searched against
the library. The file must contain sets of named adapters
in the form name[tab]sequence. Lines prefixed with a hash
will be ignored.

-l Specifies a non-default file which contains a set of criteria
--limits which will be used to determine the warn/error limits for the
various modules. This file can also be used to selectively
remove some modules from the output all together. The format
needs to mirror the default limits.txt file found in the
Configuration folder.

-k --kmers Specifies the length of Kmer to look for in the Kmer content
module. Specified Kmer length must be between 2 and 10. Default
length is 7 if not specified.

-q --quiet Supress all progress messages on stdout and only report errors.

-d --dir Selects a directory to be used for temporary files written when
generating report images. Defaults to system temp directory if
not specified.

BUGS

Any bugs in fastqc should be reported either to simon.andrews@babraham.ac.uk
or in www.bioinformatics.babraham.ac.uk/bugzilla/

• bwa, the Burrow Wheeler Aligner for mapping reads to the reference genome

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.12-r1039
Contact: Heng Li <lh3@sanger.ac.uk>

Usage: bwa <command> [options]

Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries

shm manage indices in shared memory
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
There are three alignment algorithms in BWA: `mem', `bwasw', and
`aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
first. Please `man ./bwa.1' for the manual.

• Picard (2.x) ^[4], the 'broadly' used java wrapper for samtools with added functionalities

## how to run Picard tools in recent versions ?
# provided PICARD is defined and points to the place where the picard.jar file is located, you can type:
java -jar $PICARD/picard.jar <command-name> <command-options...>

## current sorted list of Picard commands:
AddCommentsToBam.jar, AddOrReplaceReadGroups.jar, BamIndexStats.jar, BamToBfq.jar, BuildBamIndex.jar,
CalculateHsMetrics.jar, CheckIlluminaDirectory.jar, CleanSam.jar, CollectAlignmentSummaryMetrics.jar,
CollectGcBiasMetrics.jar, CollectInsertSizeMetrics.jar, CollectMultipleMetrics.jar, CollectRnaSeqMetrics.jar,
CollectTargetedPcrMetrics.jar, CollectWgsMetrics.jar, CompareSAMs.jar, CreateSequenceDictionary.jar,
DownsampleSam.jar, EstimateLibraryComplexity.jar, ExtractIlluminaBarcodes.jar, ExtractSequences.jar,
FastqToSam.jar, FilterSamReads.jar, FixMateInformation.jar, GatherBamFiles.jar, IlluminaBasecallsToFastq.jar,
IlluminaBasecallsToSam.jar, IntervalListTools.jar, MakeSitesOnlyVcf.jar, MarkDuplicates.jar, MarkIlluminaAdapters.jar,
MeanQualityByCycle.jar, MergeBamAlignment.jar, MergeSamFiles.jar, MergeVcfs.jar, NormalizeFasta.jar,
QualityScoreDistribution.jar, ReorderSam.jar, ReplaceSamHeader.jar, RevertOriginalBaseQualitiesAndAddMateCigar.jar,
RevertSam.jar, SamFormatConverter.jar, SamToFastq.jar, SortSam.jar, SplitVcfs.jar, ValidateSamFile.jar,
VcfFormatConverter.jar, ViewSam.jar

## get help in Picard by typing
java -jar $PICARD picard.jar <command> -h

• samtools 1.x, the new (htslib) version of samtools

Program: samtools (Tools for alignments in the SAM format)
Version: 1.3.1 (using htslib 1.3.1)

Usage: samtools <command> [options]

Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
index index alignment

-- Editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
rmdup remove PCR duplicates
targetcut cut fosmid regions (for fosmid pool only)
addreplacerg adds or replaces RG tags

-- File operations
collate shuffle and group alignments by name
cat concatenate BAMs
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
quickcheck quickly check if SAM/BAM/CRAM file appears intact
fastq converts a BAM to a FASTQ
fasta converts a BAM to a FASTA

-- Statistics
bedcov read depth per BED region
depth compute the depth
flagstat simple stats
idxstats BAM index stats
phase phase heterozygotes
stats generate stats (former bamcheck)

-- Viewing
flags explain BAM flags
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM

• bcftools 1.x (htslib), the samtools (htslib) accompanying toolbox

Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
Version: 1.3.1 (using htslib 1.3.1)

Usage: bcftools [--version|--version-only] [--help] <command> <argument>

Commands:

-- Indexing
index index VCF/BCF files

-- VCF/BCF manipulation
annotate annotate and edit VCF/BCF files
concat concatenate VCF/BCF files from the same set of samples
convert convert VCF/BCF files to different formats and back
isec intersections of VCF/BCF files
merge merge VCF/BCF files files from non-overlapping sample sets
norm left-align and normalize indels
plugin user-defined plugins
query transform VCF/BCF into user-defined formats
reheader modify VCF/BCF header, change sample names
view VCF/BCF conversion, view, subset and filter VCF/BCF files

-- VCF/BCF analysis
call SNP/indel calling
consensus create consensus sequence by applying VCF variants
cnv HMM CNV calling
filter filter VCF/BCF files using fixed thresholds
gtcheck check sample concordance, detect sample swaps and contamination
roh identify runs of autozygosity (HMM)
stats produce VCF/BCF stats

Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
automatically even when streaming from a pipe. Indexed VCF and BCF will work
in all situations. Un-indexed VCF and BCF and streams will work in most but
not all situations.

• qualimap, evaluating next generation sequencing alignment data

#################################
## To show available tools use command:
 
qualimap --help
 
QualiMap v.2.2
Built on 2016-01-29 12:10
 
usage: qualimap <tool> [options]
 
To launch GUI leave <tool> empty.
 
Available tools:
 
    bamqc            Evaluate NGS mapping to a reference genome
    rnaseq           Evaluate RNA-seq alignment data
    counts           Counts data analysis (further RNA-seq data evaluation)
    multi-bamqc      Compare QC reports from multiple NGS mappings
    clustering       Cluster epigenomic signals
    comp-counts      Compute feature counts
 
Special arguments: 
 
    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
 
### BAM QC #######################################
## The following command allows to perform BAM QC analysis:
 
QualiMap v.2.2
Built on 2016-01-29 12:10
 
Selected tool: bamqc
 
ERROR: Unrecognized option: --help
 
usage: qualimap bamqc -bam <arg> [-c] [-gd <arg>] [-gff <arg>] [-hm <arg>] [-ip]
       [-nr <arg>] [-nt <arg>] [-nw <arg>] [-oc <arg>] [-os] [-outdir <arg>]
       [-outfile <arg>] [-outformat <arg>] [-p <arg>] [-sd] [-sdmode <arg>]
 -bam <arg>                           Input mapping file in BAM format
 -c,--paint-chromosome-limits         Paint chromosome limits inside charts
 -gd,--genome-gc-distr <arg>          Species to compare with genome GC
                                      distribution. Possible values: HUMAN -
                                      hg19; MOUSE - mm9(default), mm10
 -gff,--feature-file <arg>            Feature file with regions of interest in
                                      GFF/GTF or BED format
 -hm <arg>                            Minimum size for a homopolymer to be
                                      considered in indel analysis (default is
                                      3)
 -ip,--collect-overlap-pairs          Activate this option to collect statistics
                                      of overlapping paired-end reads
 -nr <arg>                            Number of reads analyzed in a chunk
                                      (default is 1000)
 -nt <arg>                            Number of threads (default is 4)
 -nw <arg>                            Number of windows (default is 400)
 -oc,--output-genome-coverage <arg>   File to save per base non-zero coverage.
                                      Warning: large files are expected for
                                      large genomes
 -os,--outside-stats                  Report information for the regions outside
                                      those defined by feature-file  (ignored
                                      when -gff option is not set)
 -outdir <arg>                        Output folder for HTML report and raw
                                      data.
 -outfile <arg>                       Output file for PDF report (default value
                                      is report.pdf).
 -outformat <arg>                     Format of the ouput report (PDF or HTML,
                                      default is HTML).
 -p,--sequencing-protocol <arg>       Sequencing library protocol:
                                      strand-specific-forward,
                                      strand-specific-reverse or
                                      non-strand-specific (default)
 -sd,--skip-duplicated                Activate this option to skip duplicated
                                      alignments from the analysis. If the
                                      duplicates are not flagged in the BAM
                                      file, then they will be detected by
                                      Qualimap and can be selected for skipping.
 -sdmode,--skip-dup-mode <arg>        Specific type of duplicated alignments to
                                      skip (if this option is activated).
                                      0 : only flagged duplicates (default)
                                      1 : only estimated by Qualimap
                                      2 : both flagged and estimated
 
 
 
Special arguments: 
 
    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
 
###  Counts QC  ##########################################################
## To perform counts QC analysis (evaluation of RNA-seq data) use the following command:
 
QualiMap v.2.2
Built on 2016-01-29 12:10
 
Selected tool: counts
 
ERROR: Unrecognized option: --help
 
usage: qualimap counts [-c] -d <arg> [-i <arg>] [-k <arg>] [-outdir <arg>]
       [-outfile <arg>] [-outformat <arg>] [-R <arg>] [-s <arg>]
 -c,--compare             Perform comparison of conditions. Currently 2 maximum
                          is possible.
 -d,--data <arg>          File describing the input data. Format of the file is
                          a 4-column tab-delimited table.
                          Column 1: sample name
                          Column 2: condition of the sample
                          Column 3: path to the counts data for the sample
                          Column 4: index of the column with counts
 -i,--info <arg>          Path to info file containing genes GC-content, length
                          and type.
 -k,--threshold <arg>     Threshold for the number of counts
 -outdir <arg>            Output folder for HTML report and raw data.
 -outfile <arg>           Output file for PDF report (default value is
                          report.pdf).
 -outformat <arg>         Format of the ouput report (PDF or HTML, default is
                          HTML).
 -R,--rscriptpath <arg>   Path to Rscript executable (by default it is assumed
                          to be available from system $PATH)
 -s,--species <arg>       Use built-in info file for the given species: HUMAN or
                          MOUSE.
 
 
 
Special arguments: 
 
    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
 
### Clustering ##############################################
## To perform clustering of epigenomic signals use the following command:
 
QualiMap v.2.2
Built on 2016-01-29 12:10
 
Selected tool: clustering
 
ERROR: Unrecognized option: --help
 
usage: qualimap clustering [-b <arg>] [-c <arg>] -control <arg> [-expr <arg>]
       [-f <arg>] [-l <arg>] [-name <arg>] [-outdir <arg>] [-outfile <arg>]
       [-outformat <arg>] [-R <arg>] [-r <arg>] -regions <arg> -sample <arg>
       [-viz <arg>]
 -b,--bin-size <arg>          Size of the bin (default is 100)
 -c,--clusters <arg>          Comma-separated list of cluster sizes
 -control <arg>               Comma-separated list of control BAM files
 -expr <arg>                  Name of the experiment
 -f,--fragment-length <arg>   Smoothing length of a fragment
 -l <arg>                     Upstream offset (default is 2000)
 -name <arg>                  Comma-separated names of the replicates
 -outdir <arg>                Output folder for HTML report and raw data.
 -outfile <arg>               Output file for PDF report (default value is
                              report.pdf).
 -outformat <arg>             Format of the ouput report (PDF or HTML, default
                              is HTML).
 -R,--rscriptpath <arg>       Path to Rscript executable (by default it is
                              assumed to be available from system $PATH)
 -r <arg>                     Downstream offset (default is 500)
 -regions <arg>               Path to regions file
 -sample <arg>                Comma-separated list of sample BAM files
 -viz <arg>                   Visualization type: heatmap or line
 
 
 
Special arguments: 
 
    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
 
### Compute counts ######################################
## To compute counts from mapping data use the following command:
 
QualiMap v.2.2
Built on 2016-01-29 12:10
 
Selected tool: comp-counts
 
ERROR: Unrecognized option: --help
 
usage: qualimap comp-counts [-a <arg>] -bam <arg> -gtf <arg> [-id <arg>] [-out
       <arg>] [-p <arg>] [-pe] [-s] [-type <arg>]
 -a,--algorithm <arg>             Counting algorithm:
                                  uniquely-mapped-reads(default) or proportional
 -bam <arg>                       Mapping file in BAM format
 -gtf <arg>                       Region file in GTF, GFF or BED format. If GTF
                                  format is provided, counting is based on
                                  attributes, otherwise based on feature name
 -id <arg>                        GTF-specific. Attribute of the GTF to be used
                                  as feature ID. Regions with the same ID will
                                  be aggregated as part of the same feature.
                                  Default: gene_id.
 -out <arg>                       Output file of coverage report.
 -p,--sequencing-protocol <arg>   Sequencing library protocol:
                                  strand-specific-forward,
                                  strand-specific-reverse or non-strand-specific
                                  (default)
 -pe,--paired                     Setting this flag for paired-end experiments
                                  will result in counting fragments instead of
                                  reads
 -s,--sorted                      This flag indicates that the input file is
                                  already sorted by name. If not set, additional
                                  sorting by name will be performed. Only
                                  required for paired-end analysis.
 -type <arg>                      GTF-specific. Value of the third column of the
                                  GTF considered for counting. Other types will
                                  be ignored. Default: exon
 
 
 
Special arguments: 
 
    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
 
## also see 'rnaseq' and 'multi-bamqc'

• bedtools, a multiusage BED/SAM/VCF manipulation suite

Get the full BedTools PDF manual from https://media.readthedocs.org/pdf/bedtools/latest/bedtools.pdf ^[5]

Read it online at http://bedtools.readthedocs.org/ ^[6]

bedtools: flexible tools for genome arithmetic and DNA sequence analysis.
usage: bedtools <subcommand> [options]

The bedtools sub-commands include:

[ Genome arithmetic ]
intersect Find overlapping intervals in various ways.
window Find overlapping intervals within a window around an interval.
closest Find the closest, potentially non-overlapping interval.
coverage Compute the coverage over defined intervals.
map Apply a function to a column for each overlapping interval.
genomecov Compute the coverage over an entire genome.
merge Combine overlapping/nearby intervals into a single interval.
cluster Cluster (but don't merge) overlapping/nearby intervals.
complement Extract intervals _not_ represented by an interval file.
subtract Remove intervals based on overlaps b/w two files.
slop Adjust the size of intervals.
flank Create new intervals from the flanks of existing intervals.
sort Order the intervals in a file.
random Generate random intervals in a genome.
shuffle Randomly redistrubute intervals in a genome.
sample Sample random records from file using reservoir sampling.
spacing Report the gap lengths between intervals in a file.
annotate Annotate coverage of features from multiple files.

[ Multi-way file comparisons ]
multiinter Identifies common intervals among multiple interval files.
unionbedg Combines coverage intervals from multiple BEDGRAPH files.

[ Paired-end manipulation ]
pairtobed Find pairs that overlap intervals in various ways.
pairtopair Find pairs that overlap other pairs in various ways.

[ Format conversion ]
bamtobed Convert BAM alignments to BED (& other) formats.
bedtobam Convert intervals to BAM records.
bamtofastq Convert BAM records to FASTQ records.
bedpetobam Convert BEDPE intervals to BAM records.
bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals.

[ Fasta manipulation ]
getfasta Use intervals to extract sequences from a FASTA file.
maskfasta Use intervals to mask sequences from a FASTA file.
nuc Profile the nucleotide content of intervals in a FASTA file.

[ BAM focused tools ]
multicov Counts coverage from multiple BAMs at specific intervals.
tag Tag BAM alignments based on overlaps with interval files.

[ Statistical relationships ]
jaccard Calculate the Jaccard statistic b/w two sets of intervals.
reldist Calculate the distribution of relative distances b/w two files.
fisher Calculate Fisher statistic b/w two feature files.

[ Miscellaneous tools ]
overlap Computes the amount of overlap from two intervals.
igv Create an IGV snapshot batch script.
links Create a HTML page of links to UCSC locations.
makewindows Make interval "windows" across a genome.
groupby Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
expand Replicate lines based on lists of values in columns.
split Split a file into multiple files with equal records or base pairs.

[ General help ]
--help Print this help menu.
--version What version of bedtools are you using?.
--contact Feature requests, bugs, mailing lists, etc.

• IGVtools, the toolbox used to prepare files for IGV (compress, index, and more)

Program: igvtools. IGV Version 2.3.72 (89)04/01/2016 11:43 AM

Usage: igvtools [command] [options] [input file/dir] [other arguments]

Command: version print the version number
sort sort an alignment file by start position.
index index an alignment file
toTDF convert an input file (cn, gct, wig) to tiled data format (tdf)
count compute coverage density for an alignment file
formatexp center, scale, and log2 normalize an expression file
gui Start the gui
help <command> display this help message, or help on a specific command
See http://www.broadinstitute.org/software/igv/igvtools_commandline for more detailed help

• snpEff commands used to add annotations to VCF data

java -jar $SNPEFF/snpEff.jar -h
SnpEff version SnpEff 4.2 (build 2015-12-05), by Pablo Cingolani
Usage: snpEff [command] [options] [files]
 
Run 'java -jar snpEff.jar command' for help on each specific command
 
Available commands: 
	[eff|ann]                    : Annotate variants / calculate effects (you can use either 'ann' or 'eff', they mean the same). Default: ann (no command or 'ann').
	build                        : Build a SnpEff database.
	buildNextProt                : Build a SnpEff for NextProt (using NextProt's XML files).
	cds                          : Compare CDS sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
	closest                      : Annotate the closest genomic region.
	count                        : Count how many intervals (from a BAM, BED or VCF file) overlap with each genomic interval.
	databases                    : Show currently available databases (from local config file).
	download                     : Download a SnpEff database.
	dump                         : Dump to STDOUT a SnpEff database (mostly used for debugging).
	genes2bed                    : Create a bed file from a genes list.
	len                          : Calculate total genomic length for each marker type.
	pdb                          : Build interaction database (based on PDB data).
	protein                      : Compare protein sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness.
	show                         : Show a text representation of genes or transcripts coordiantes, DNA sequence and protein sequence.
 
Generic options:
	-c , -config                 : Specify config file
	-configOption name=value     : Override a config file option
	-d , -debug                  : Debug mode (very verbose).
	-dataDir <path>              : Override data_dir parameter from config file.
	-download                    : Download a SnpEff database, if not available locally. Default: true
	-nodownload                  : Do not download a SnpEff database, if not available locally.
	-h , -help                   : Show this help and exit
	-noLog                       : Do not report usage statistics to server
	-t                           : Use multiple threads (implies '-noStats'). Default 'off'
	-q , -quiet                  : Quiet mode (do not show any messages or errors)
	-v , -verbose                : Verbose mode
	-version                     : Show version number and exit
 
Database options:
	-canon                       : Only use canonical transcripts.
	-interaction                 : Annotate using inteactions (requires interaciton database). Default: true
	-interval <file>             : Use a custom intervals in TXT/BED/BigBed/VCF/GFF file (you may use this option many times)
	-maxTSL <TSL_number>         : Only use transcripts having Transcript Support Level lower than <TSL_number>.
	-motif                       : Annotate using motifs (requires Motif database). Default: true
	-nextProt                    : Annotate using NextProt (requires NextProt database).
	-noGenome                    : Do not load any genomic database (e.g. annotate using custom files).
	-noInteraction               : Disable inteaction annotations
	-noMotif                     : Disable motif annotations.
	-noNextProt                  : Disable NextProt annotations.
	-onlyReg                     : Only use regulation tracks.
	-onlyProtein                 : Only use protein coding transcripts. Default: false
	-onlyTr <file.txt>           : Only use the transcripts in this file. Format: One transcript ID per line.
	-reg <name>                  : Regulation track to use (this option can be used add several times).
	-ss , -spliceSiteSize <int>  : Set size for splice sites (donor and acceptor) in bases. Default: 2
	-spliceRegionExonSize <int>  : Set size for splice site region within exons. Default: 3 bases
	-spliceRegionIntronMin <int> : Set minimum number of bases for splice site region within intron. Default: 3 bases
	-spliceRegionIntronMax <int> : Set maximum number of bases for splice site region within intron. Default: 8 bases
	-strict                      : Only use 'validated' transcripts (i.e. sequence has been checked). Default: false
	-ud , -upDownStreamLen <int> : Set upstream downstream interval length (in bases)

• snpSift commands used to filter snpEff-annotated VCF data

java -jar $SNPEFF/SnpSift.jar -h
SnpSift version 4.2 (build 2015-12-05), by Pablo Cingolani
Usage: java -jar SnpSift.jar [command] params...
Command is one of:
alleleMat : Create an allele matrix output.
annotate : Annotate 'ID' from a database (e.g. dbSnp). Assumes entries are sorted.
annMem : Annotate 'ID' from a database (e.g. dbSnp). Loads db in memory. Does not assume sorted entries.
caseControl : Compare how many variants are in 'case' and in 'control' groups; calculate p-values.
ccs : Case control summary. Case and control summaries by region, allele frequency and variant's functional effect.
concordance : Concordance metrics between two VCF files.
covMat : Create an covariance matrix output (allele matrix as input).
dbnsfp : Annotate with multiple entries from dbNSFP.
extractFields : Extract fields from VCF file into tab separated format.
filter : Filter using arbitrary expressions
geneSets : Annotate using MSigDb gene sets (MSigDb includes: GO, KEGG, Reactome, BioCarta, etc.)
gt : Add Genotype to INFO fields and remove genotype fields when possible.
gtfilter : Filter genotype using arbitrary expressions.
gwasCat : Annotate using GWAS catalog
hwe : Calculate Hardy-Weimberg parameters and perform a godness of fit test.
intersect : Intersect intervals (genomic regions).
intervals : Keep variants that intersect with intervals.
intIdx : Keep variants that intersect with intervals. Index-based method: Used for large VCF file and a few intervals to retrieve
join : Join files by genomic region.
phastCons : Annotate using conservation scores (phastCons).
private : Annotate if a variant is private to a family or group.
rmRefGen : Remove reference genotypes.
rmInfo : Remove INFO fields.
split : Split VCF by chromosome.
tstv : Calculate transiton to transversion ratio.
varType : Annotate variant type (SNP,MNP,INS,DEL or MIXED).
vcfCheck : Check that VCF file is well formed.
vcf2tped : Convert VCF to TPED.
Options common to all SnpSift commands:
-d : Debug.
-download : Download database, if not available locally. Default: true.
-noDownload : Do not download a database, if not available locally.
-noLog : Do not report usage statistics to server.
-h : Help.
-v : Verbose.

Version number reported for each tool correspond to the lowest version used to prepare this training and may differ from the current version. Older versions may as well do!

Additional tools were installed from downloads or from source and are referenced to web links in the respective pages

People who have no experience with Linux command line should first follow the 'Linux for bioinformatics' training (^[7]) or they will slow down the full class. All attendees who lack knowledge about Illumina reads should follow the preliminary one day 'Introduction to the analysis of NGS data' training (^[8]).

Today's Hands-On Training

The Exercises

NGS-Var Exercise.1: QC paired end reads using fastQC
NGS-Var Exercise.2: Map a sample of the original paired end reads to the human reference genome hg19 using the Burrow Wheeler Aligner (BWA)
NGS-Var Exercise.3: BWA mapping QC using Samtools, Picard, and Qualimap
NGS-Var Exercise.4: Call variants as compared to the human reference genome (hg19) with samtools|bcftools or samtools|varscan
NGS-Var Exercise.5: Compare and intersect VCF files using VcfTools and bedtools
NGS-Var Exercise.6: Annotate and filter VCF variant lists with SNPSift and SnpEff
NGS-Var Exercise.7: Review variants and annotations in IGV

Code developed in these exercises

Most scripts included in the exercises below are reproduced and maintained on our dedicated GitHub page NGS-Variant-Analysis-training-2016 ^[9] (please feed-us back about any issue with this page).

Find more tools and answers

There are many tools out there, finding them is often the easiest part. You are welcome to try as many as you wish and improve results obtained with our selected toolbox. When seeking advice, please consider using:

SeqAnswers^[10] for all what relates to NGS
BioStar^[11] for questions about biocomputing and scripting for biologists
stackoverflow^[12] for questions related to coding.

The Turnkey Variant Analysis Project^[13] to find Variant analysis resources.
bioinformatics.ca directory^[14] to find bioinformatics tools sorted by categories.

The EBI online training propose many very nice training sessions with slides and exercises.

explain SAM BAM flags^[15] (mirrored at BITS)

Conclusion

Your feedback to this general command-line based NGS data analysis session is very important to us and will be used to improve this content for later sessions. If you need more training of this kind, please contact us and we will organise additional hands-on based on your requests. More advance sessions will depend on the availability of expert users within VIB that will accept to prepare specified material.

PS: We tried to cite resources and people that made this training possible. If you feel that some citations are missing, please inform us and we will correct this omission.

contact us

References:

[ Main_Page ]

[1] ttp://www.1000genomes.org

[2] ttp://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM18507

[3] ttp://www.gnu.org/software/coreutils/manual/coreutils.pdf

[4] ttp://picard.sourceforge.net

[5] ttps://media.readthedocs.org/pdf/bedtools/latest/bedtools.pdf

[6] ttp://bedtools.readthedocs.org/

[7] ttp://wiki.bits.vib.be/index.php/Introduction_to_Linux_for_bioinformatics

[8] ttp://wiki.bits.vib.be/index.php/NGS_data_analysis

[9] ttps://github.com/BITS-VIB/NGS-Variant-Analysis-training-2016

[10] ttp://seqanswers.com SeqAnswers

[11] ttp://www.biostars.org BioStar

[12] ttp://stackoverflow.com stackoverflow

[13] ttp://tvap.genome.wustl.edu/

[14] ttp://bioinformatics.ca/links_directory

[15] ttps://broadinstitute.github.io/picard/explain-flags.html

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

Hands-on introduction to NGS variant analysis-2016

Contents

Aims of the NGS DNA variant analysis 1-day session

More BITS Training Info

Summary

Prerequisites

Today's Hands-On Training

The Exercises

Code developed in these exercises

Find more tools and answers

Conclusion

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox