Hands-on introduction to NGS variant analysis-2016
[ Main_Page | NGS_data_analysis ]
# one-day training 2016 session
This session is a simplified and rewritten version of a more complete and exploratory training given in 2013 and 2014 (Hands-on introduction to NGS variant analysis)
The most recent version of this training (2018) can be found at (Hands-on_introduction_to_NGS_variant_analysis-2018)
A PDF printout of the training pages was generated for your convenience. When possible, favour using the Wiki which may contain updated information as compared to the assembled document.
Contents
Aims of the NGS DNA variant analysis 1-day session
Using a full publicly available chromosome read-set from one of the 1000 genomes[1] samples:
- Perform a complete analysis workflow including QC, read mapping, read coverage analysis, and variant calling against the human reference genome.
- Use command line and popular open source software to evaluate each step of a classical NGS variant workflow and feel the complexity of the task.
- Compare the obtained results with gold standard public data.
- Visualise selected results in IGV
The skills acquired during this session should allow reproduction of the workflow by participants using their own reads and a reasonably-sized personal computer running Unix.
All scripts as well as the most important result files are put online to allow offline practice to trainees and web visitors
More BITS Training Info
- On the VIB website: http://www.vib.be/en/training/research-training/courses/Pages/Hands-On-introduction-to-NGS-variant-analysis.aspx
- On the BITS website: https://www.bits.vib.be/index.php/training/201-ngs-variant-analysis
Summary
This training gives an introduction to the use of popular NGS analysis software packages at the command line. It reviews several exchangeable tools and provides hints to evaluate quality and content of Genome-Seq data. The code used in the training pages can easily be recycled and adapted, and will constitute the basis of your own workflows.
The sequencing data used in this session was obtained from gDNA extracted from EBV-transformed B-lymphocytes of a healthy Nigerian individual (NA18507). More information is available for that sample from the Coriell repository from which 1000g gDNA can be obtained [2].
This training does not cover all currently available methods. It does not aim at bringing users to a professional NGS analyst level but provides enough information to allow motivated biologists understand what DNA sequencing practically is, to read and reuse command-line code, and when necessary to communicate knowingly with NGS experts for more in-depth needs.
Prerequisites
Skills required to follow this training:
- Linux command line basic skills are absolutely required to understand the code used throughout the session
- basic knowledge of human genome structure and nomenclature is necessary to estimate the training tasks
- basic knowledge of Illumina NGS read structure is also required for the same reason
Software used during this training:
- All programs used in this training session were installed on the BITS laptops and are freely available for non-commercial use from links added to the training pages
- Users with a minimal knowledge in unix command-line should be able to install the programs on their lab desktop or strong laptop computer to reproduce commands used in this tutorial
In addition to the many built-in unix command, originating from the GNU consortium for a good part (please read the GNU coreutils manual for more [3]), we will use today several dedicated toolboxes including:
• fastQC, the very popular NGS reads QC application
SYNOPSIS
fastqc seqfile1 seqfile2 .. seqfileN
fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
[-c contaminant file] seqfile1 .. seqfileN
DESCRIPTION
FastQC reads a set of sequence files and produces from each one a quality
control report consisting of a number of different modules, each one of
which will help to identify a different potential type of problem in your
data.
If no files to process are specified on the command line then the program
will start as an interactive graphical application. If files are provided
on the command line then the program will run with no user interaction
required. In this mode it is suitable for inclusion into a standardised
analysis pipeline.
The options for the program as as follows:
-h --help Print this help file and exit
-v --version Print the version of the program and exit
-o --outdir Create all output files in the specified output directory.
Please note that this directory must exist as the program
will not create it. If this option is not set then the
output file for each sequence file is created in the same
directory as the sequence file which was processed.
--casava Files come from raw casava output. Files in the same sample
group (differing only by the group number) will be analysed
as a set rather than individually. Sequences with the filter
flag set in the header will be excluded from the analysis.
Files must have the same names given to them by casava
(including being gzipped and ending with .gz) otherwise they
won't be grouped together correctly.
--nano Files come from naopore sequences and are in fast5 format. In
this mode you can pass in directories to process and the program
will take in all fast5 files within those directories and produce
a single output file from the sequences found in all files.
--nofilter If running with --casava then don't remove read flagged by
casava as poor quality when performing the QC analysis.
--extract If set then the zipped output file will be uncompressed in
the same directory after it has been created. By default
this option will be set if fastqc is run in non-interactive
mode.
-j --java Provides the full path to the java binary you want to use to
launch fastqc. If not supplied then java is assumed to be in
your path.
--noextract Do not uncompress the output file after creating it. You
should set this option if you do not wish to uncompress
the output when running in non-interactive mode.
--nogroup Disable grouping of bases for reads >50bp. All reports will
show data for every base in the read. WARNING: Using this
option will cause fastqc to crash and burn if you use it on
really long reads, and your plots may end up a ridiculous size.
You have been warned!
-f --format Bypasses the normal sequence file format detection and
forces the program to use the specified format. Valid
formats are bam,sam,bam_mapped,sam_mapped and fastq
-t --threads Specifies the number of files which can be processed
simultaneously. Each thread will be allocated 250MB of
memory so you shouldn't run more threads than your
available memory will cope with, and not more than
6 threads on a 32 bit machine
-c Specifies a non-default file which contains the list of
--contaminants contaminants to screen overrepresented sequences against.
The file must contain sets of named contaminants in the
form name[tab]sequence. Lines prefixed with a hash will
be ignored.
-a Specifies a non-default file which contains the list of
--adapters adapter sequences which will be explicity searched against
the library. The file must contain sets of named adapters
in the form name[tab]sequence. Lines prefixed with a hash
will be ignored.
-l Specifies a non-default file which contains a set of criteria
--limits which will be used to determine the warn/error limits for the
various modules. This file can also be used to selectively
remove some modules from the output all together. The format
needs to mirror the default limits.txt file found in the
Configuration folder.
-k --kmers Specifies the length of Kmer to look for in the Kmer content
module. Specified Kmer length must be between 2 and 10. Default
length is 7 if not specified.
-q --quiet Supress all progress messages on stdout and only report errors.
-d --dir Selects a directory to be used for temporary files written when
generating report images. Defaults to system temp directory if
not specified.
BUGS
Any bugs in fastqc should be reported either to simon.andrews@babraham.ac.uk
or in www.bioinformatics.babraham.ac.uk/bugzilla/
• bwa, the Burrow Wheeler Aligner for mapping reads to the reference genome
Version: 0.7.12-r1039
Contact: Heng Li <lh3@sanger.ac.uk>
Usage: bwa <command> [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries
shm manage indices in shared memory
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ
Note: To use BWA, you need to first index the genome with `bwa index'.
There are three alignment algorithms in BWA: `mem', `bwasw', and
`aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
first. Please `man ./bwa.1' for the manual.
• Picard (2.x) [4], the 'broadly' used java wrapper for samtools with added functionalities
# provided PICARD is defined and points to the place where the picard.jar file is located, you can type:
java -jar $PICARD/picard.jar <command-name> <command-options...>
## current sorted list of Picard commands:
AddCommentsToBam.jar, AddOrReplaceReadGroups.jar, BamIndexStats.jar, BamToBfq.jar, BuildBamIndex.jar,
CalculateHsMetrics.jar, CheckIlluminaDirectory.jar, CleanSam.jar, CollectAlignmentSummaryMetrics.jar,
CollectGcBiasMetrics.jar, CollectInsertSizeMetrics.jar, CollectMultipleMetrics.jar, CollectRnaSeqMetrics.jar,
CollectTargetedPcrMetrics.jar, CollectWgsMetrics.jar, CompareSAMs.jar, CreateSequenceDictionary.jar,
DownsampleSam.jar, EstimateLibraryComplexity.jar, ExtractIlluminaBarcodes.jar, ExtractSequences.jar,
FastqToSam.jar, FilterSamReads.jar, FixMateInformation.jar, GatherBamFiles.jar, IlluminaBasecallsToFastq.jar,
IlluminaBasecallsToSam.jar, IntervalListTools.jar, MakeSitesOnlyVcf.jar, MarkDuplicates.jar, MarkIlluminaAdapters.jar,
MeanQualityByCycle.jar, MergeBamAlignment.jar, MergeSamFiles.jar, MergeVcfs.jar, NormalizeFasta.jar,
QualityScoreDistribution.jar, ReorderSam.jar, ReplaceSamHeader.jar, RevertOriginalBaseQualitiesAndAddMateCigar.jar,
RevertSam.jar, SamFormatConverter.jar, SamToFastq.jar, SortSam.jar, SplitVcfs.jar, ValidateSamFile.jar,
VcfFormatConverter.jar, ViewSam.jar
## get help in Picard by typing
java -jar $PICARD picard.jar <command> -h
• samtools 1.x, the new (htslib) version of samtools
Version: 1.3.1 (using htslib 1.3.1)
Usage: samtools <command> [options]
Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
index index alignment
-- Editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
rmdup remove PCR duplicates
targetcut cut fosmid regions (for fosmid pool only)
addreplacerg adds or replaces RG tags
-- File operations
collate shuffle and group alignments by name
cat concatenate BAMs
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
quickcheck quickly check if SAM/BAM/CRAM file appears intact
fastq converts a BAM to a FASTQ
fasta converts a BAM to a FASTA
-- Statistics
bedcov read depth per BED region
depth compute the depth
flagstat simple stats
idxstats BAM index stats
phase phase heterozygotes
stats generate stats (former bamcheck)
-- Viewing
flags explain BAM flags
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM
• bcftools 1.x (htslib), the samtools (htslib) accompanying toolbox
Version: 1.3.1 (using htslib 1.3.1)
Usage: bcftools [--version|--version-only] [--help] <command> <argument>
Commands:
-- Indexing
index index VCF/BCF files
-- VCF/BCF manipulation
annotate annotate and edit VCF/BCF files
concat concatenate VCF/BCF files from the same set of samples
convert convert VCF/BCF files to different formats and back
isec intersections of VCF/BCF files
merge merge VCF/BCF files files from non-overlapping sample sets
norm left-align and normalize indels
plugin user-defined plugins
query transform VCF/BCF into user-defined formats
reheader modify VCF/BCF header, change sample names
view VCF/BCF conversion, view, subset and filter VCF/BCF files
-- VCF/BCF analysis
call SNP/indel calling
consensus create consensus sequence by applying VCF variants
cnv HMM CNV calling
filter filter VCF/BCF files using fixed thresholds
gtcheck check sample concordance, detect sample swaps and contamination
roh identify runs of autozygosity (HMM)
stats produce VCF/BCF stats
Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
automatically even when streaming from a pipe. Indexed VCF and BCF will work
in all situations. Un-indexed VCF and BCF and streams will work in most but
not all situations.
• qualimap, evaluating next generation sequencing alignment data
################################# ## To show available tools use command: qualimap --help QualiMap v.2.2 Built on 2016-01-29 12:10 usage: qualimap <tool> [options] To launch GUI leave <tool> empty. Available tools: bamqc Evaluate NGS mapping to a reference genome rnaseq Evaluate RNA-seq alignment data counts Counts data analysis (further RNA-seq data evaluation) multi-bamqc Compare QC reports from multiple NGS mappings clustering Cluster epigenomic signals comp-counts Compute feature counts Special arguments: --java-mem-size Use this argument to set Java memory heap size. Example: qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G ### BAM QC ####################################### ## The following command allows to perform BAM QC analysis: QualiMap v.2.2 Built on 2016-01-29 12:10 Selected tool: bamqc ERROR: Unrecognized option: --help usage: qualimap bamqc -bam <arg> [-c] [-gd <arg>] [-gff <arg>] [-hm <arg>] [-ip] [-nr <arg>] [-nt <arg>] [-nw <arg>] [-oc <arg>] [-os] [-outdir <arg>] [-outfile <arg>] [-outformat <arg>] [-p <arg>] [-sd] [-sdmode <arg>] -bam <arg> Input mapping file in BAM format -c,--paint-chromosome-limits Paint chromosome limits inside charts -gd,--genome-gc-distr <arg> Species to compare with genome GC distribution. Possible values: HUMAN - hg19; MOUSE - mm9(default), mm10 -gff,--feature-file <arg> Feature file with regions of interest in GFF/GTF or BED format -hm <arg> Minimum size for a homopolymer to be considered in indel analysis (default is 3) -ip,--collect-overlap-pairs Activate this option to collect statistics of overlapping paired-end reads -nr <arg> Number of reads analyzed in a chunk (default is 1000) -nt <arg> Number of threads (default is 4) -nw <arg> Number of windows (default is 400) -oc,--output-genome-coverage <arg> File to save per base non-zero coverage. Warning: large files are expected for large genomes -os,--outside-stats Report information for the regions outside those defined by feature-file (ignored when -gff option is not set) -outdir <arg> Output folder for HTML report and raw data. -outfile <arg> Output file for PDF report (default value is report.pdf). -outformat <arg> Format of the ouput report (PDF or HTML, default is HTML). -p,--sequencing-protocol <arg> Sequencing library protocol: strand-specific-forward, strand-specific-reverse or non-strand-specific (default) -sd,--skip-duplicated Activate this option to skip duplicated alignments from the analysis. If the duplicates are not flagged in the BAM file, then they will be detected by Qualimap and can be selected for skipping. -sdmode,--skip-dup-mode <arg> Specific type of duplicated alignments to skip (if this option is activated). 0 : only flagged duplicates (default) 1 : only estimated by Qualimap 2 : both flagged and estimated Special arguments: --java-mem-size Use this argument to set Java memory heap size. Example: qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G ### Counts QC ########################################################## ## To perform counts QC analysis (evaluation of RNA-seq data) use the following command: QualiMap v.2.2 Built on 2016-01-29 12:10 Selected tool: counts ERROR: Unrecognized option: --help usage: qualimap counts [-c] -d <arg> [-i <arg>] [-k <arg>] [-outdir <arg>] [-outfile <arg>] [-outformat <arg>] [-R <arg>] [-s <arg>] -c,--compare Perform comparison of conditions. Currently 2 maximum is possible. -d,--data <arg> File describing the input data. Format of the file is a 4-column tab-delimited table. Column 1: sample name Column 2: condition of the sample Column 3: path to the counts data for the sample Column 4: index of the column with counts -i,--info <arg> Path to info file containing genes GC-content, length and type. -k,--threshold <arg> Threshold for the number of counts -outdir <arg> Output folder for HTML report and raw data. -outfile <arg> Output file for PDF report (default value is report.pdf). -outformat <arg> Format of the ouput report (PDF or HTML, default is HTML). -R,--rscriptpath <arg> Path to Rscript executable (by default it is assumed to be available from system $PATH) -s,--species <arg> Use built-in info file for the given species: HUMAN or MOUSE. Special arguments: --java-mem-size Use this argument to set Java memory heap size. Example: qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G ### Clustering ############################################## ## To perform clustering of epigenomic signals use the following command: QualiMap v.2.2 Built on 2016-01-29 12:10 Selected tool: clustering ERROR: Unrecognized option: --help usage: qualimap clustering [-b <arg>] [-c <arg>] -control <arg> [-expr <arg>] [-f <arg>] [-l <arg>] [-name <arg>] [-outdir <arg>] [-outfile <arg>] [-outformat <arg>] [-R <arg>] [-r <arg>] -regions <arg> -sample <arg> [-viz <arg>] -b,--bin-size <arg> Size of the bin (default is 100) -c,--clusters <arg> Comma-separated list of cluster sizes -control <arg> Comma-separated list of control BAM files -expr <arg> Name of the experiment -f,--fragment-length <arg> Smoothing length of a fragment -l <arg> Upstream offset (default is 2000) -name <arg> Comma-separated names of the replicates -outdir <arg> Output folder for HTML report and raw data. -outfile <arg> Output file for PDF report (default value is report.pdf). -outformat <arg> Format of the ouput report (PDF or HTML, default is HTML). -R,--rscriptpath <arg> Path to Rscript executable (by default it is assumed to be available from system $PATH) -r <arg> Downstream offset (default is 500) -regions <arg> Path to regions file -sample <arg> Comma-separated list of sample BAM files -viz <arg> Visualization type: heatmap or line Special arguments: --java-mem-size Use this argument to set Java memory heap size. Example: qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G ### Compute counts ###################################### ## To compute counts from mapping data use the following command: QualiMap v.2.2 Built on 2016-01-29 12:10 Selected tool: comp-counts ERROR: Unrecognized option: --help usage: qualimap comp-counts [-a <arg>] -bam <arg> -gtf <arg> [-id <arg>] [-out <arg>] [-p <arg>] [-pe] [-s] [-type <arg>] -a,--algorithm <arg> Counting algorithm: uniquely-mapped-reads(default) or proportional -bam <arg> Mapping file in BAM format -gtf <arg> Region file in GTF, GFF or BED format. If GTF format is provided, counting is based on attributes, otherwise based on feature name -id <arg> GTF-specific. Attribute of the GTF to be used as feature ID. Regions with the same ID will be aggregated as part of the same feature. Default: gene_id. -out <arg> Output file of coverage report. -p,--sequencing-protocol <arg> Sequencing library protocol: strand-specific-forward, strand-specific-reverse or non-strand-specific (default) -pe,--paired Setting this flag for paired-end experiments will result in counting fragments instead of reads -s,--sorted This flag indicates that the input file is already sorted by name. If not set, additional sorting by name will be performed. Only required for paired-end analysis. -type <arg> GTF-specific. Value of the third column of the GTF considered for counting. Other types will be ignored. Default: exon Special arguments: --java-mem-size Use this argument to set Java memory heap size. Example: qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G ## also see 'rnaseq' and 'multi-bamqc'
• bedtools, a multiusage BED/SAM/VCF manipulation suite
Read it online at http://bedtools.readthedocs.org/ [6]
usage: bedtools <subcommand> [options]
The bedtools sub-commands include:
[ Genome arithmetic ]
intersect Find overlapping intervals in various ways.
window Find overlapping intervals within a window around an interval.
closest Find the closest, potentially non-overlapping interval.
coverage Compute the coverage over defined intervals.
map Apply a function to a column for each overlapping interval.
genomecov Compute the coverage over an entire genome.
merge Combine overlapping/nearby intervals into a single interval.
cluster Cluster (but don't merge) overlapping/nearby intervals.
complement Extract intervals _not_ represented by an interval file.
subtract Remove intervals based on overlaps b/w two files.
slop Adjust the size of intervals.
flank Create new intervals from the flanks of existing intervals.
sort Order the intervals in a file.
random Generate random intervals in a genome.
shuffle Randomly redistrubute intervals in a genome.
sample Sample random records from file using reservoir sampling.
spacing Report the gap lengths between intervals in a file.
annotate Annotate coverage of features from multiple files.
[ Multi-way file comparisons ]
multiinter Identifies common intervals among multiple interval files.
unionbedg Combines coverage intervals from multiple BEDGRAPH files.
[ Paired-end manipulation ]
pairtobed Find pairs that overlap intervals in various ways.
pairtopair Find pairs that overlap other pairs in various ways.
[ Format conversion ]
bamtobed Convert BAM alignments to BED (& other) formats.
bedtobam Convert intervals to BAM records.
bamtofastq Convert BAM records to FASTQ records.
bedpetobam Convert BEDPE intervals to BAM records.
bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals.
[ Fasta manipulation ]
getfasta Use intervals to extract sequences from a FASTA file.
maskfasta Use intervals to mask sequences from a FASTA file.
nuc Profile the nucleotide content of intervals in a FASTA file.
[ BAM focused tools ]
multicov Counts coverage from multiple BAMs at specific intervals.
tag Tag BAM alignments based on overlaps with interval files.
[ Statistical relationships ]
jaccard Calculate the Jaccard statistic b/w two sets of intervals.
reldist Calculate the distribution of relative distances b/w two files.
fisher Calculate Fisher statistic b/w two feature files.
[ Miscellaneous tools ]
overlap Computes the amount of overlap from two intervals.
igv Create an IGV snapshot batch script.
links Create a HTML page of links to UCSC locations.
makewindows Make interval "windows" across a genome.
groupby Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
expand Replicate lines based on lists of values in columns.
split Split a file into multiple files with equal records or base pairs.
[ General help ]
--help Print this help menu.
--version What version of bedtools are you using?.
--contact Feature requests, bugs, mailing lists, etc.
• IGVtools, the toolbox used to prepare files for IGV (compress, index, and more)
Usage: igvtools [command] [options] [input file/dir] [other arguments]
Command: version print the version number
sort sort an alignment file by start position.
index index an alignment file
toTDF convert an input file (cn, gct, wig) to tiled data format (tdf)
count compute coverage density for an alignment file
formatexp center, scale, and log2 normalize an expression file
gui Start the gui
help <command> display this help message, or help on a specific command
See http://www.broadinstitute.org/software/igv/igvtools_commandline for more detailed help
• snpEff commands used to add annotations to VCF data
java -jar $SNPEFF/snpEff.jar -h SnpEff version SnpEff 4.2 (build 2015-12-05), by Pablo Cingolani Usage: snpEff [command] [options] [files] Run 'java -jar snpEff.jar command' for help on each specific command Available commands: [eff|ann] : Annotate variants / calculate effects (you can use either 'ann' or 'eff', they mean the same). Default: ann (no command or 'ann'). build : Build a SnpEff database. buildNextProt : Build a SnpEff for NextProt (using NextProt's XML files). cds : Compare CDS sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness. closest : Annotate the closest genomic region. count : Count how many intervals (from a BAM, BED or VCF file) overlap with each genomic interval. databases : Show currently available databases (from local config file). download : Download a SnpEff database. dump : Dump to STDOUT a SnpEff database (mostly used for debugging). genes2bed : Create a bed file from a genes list. len : Calculate total genomic length for each marker type. pdb : Build interaction database (based on PDB data). protein : Compare protein sequences calculated form a SnpEff database to the one in a FASTA file. Used for checking databases correctness. show : Show a text representation of genes or transcripts coordiantes, DNA sequence and protein sequence. Generic options: -c , -config : Specify config file -configOption name=value : Override a config file option -d , -debug : Debug mode (very verbose). -dataDir <path> : Override data_dir parameter from config file. -download : Download a SnpEff database, if not available locally. Default: true -nodownload : Do not download a SnpEff database, if not available locally. -h , -help : Show this help and exit -noLog : Do not report usage statistics to server -t : Use multiple threads (implies '-noStats'). Default 'off' -q , -quiet : Quiet mode (do not show any messages or errors) -v , -verbose : Verbose mode -version : Show version number and exit Database options: -canon : Only use canonical transcripts. -interaction : Annotate using inteactions (requires interaciton database). Default: true -interval <file> : Use a custom intervals in TXT/BED/BigBed/VCF/GFF file (you may use this option many times) -maxTSL <TSL_number> : Only use transcripts having Transcript Support Level lower than <TSL_number>. -motif : Annotate using motifs (requires Motif database). Default: true -nextProt : Annotate using NextProt (requires NextProt database). -noGenome : Do not load any genomic database (e.g. annotate using custom files). -noInteraction : Disable inteaction annotations -noMotif : Disable motif annotations. -noNextProt : Disable NextProt annotations. -onlyReg : Only use regulation tracks. -onlyProtein : Only use protein coding transcripts. Default: false -onlyTr <file.txt> : Only use the transcripts in this file. Format: One transcript ID per line. -reg <name> : Regulation track to use (this option can be used add several times). -ss , -spliceSiteSize <int> : Set size for splice sites (donor and acceptor) in bases. Default: 2 -spliceRegionExonSize <int> : Set size for splice site region within exons. Default: 3 bases -spliceRegionIntronMin <int> : Set minimum number of bases for splice site region within intron. Default: 3 bases -spliceRegionIntronMax <int> : Set maximum number of bases for splice site region within intron. Default: 8 bases -strict : Only use 'validated' transcripts (i.e. sequence has been checked). Default: false -ud , -upDownStreamLen <int> : Set upstream downstream interval length (in bases)
• snpSift commands used to filter snpEff-annotated VCF data
java -jar $SNPEFF/SnpSift.jar -h SnpSift version 4.2 (build 2015-12-05), by Pablo Cingolani Usage: java -jar SnpSift.jar [command] params... Command is one of: alleleMat : Create an allele matrix output. annotate : Annotate 'ID' from a database (e.g. dbSnp). Assumes entries are sorted. annMem : Annotate 'ID' from a database (e.g. dbSnp). Loads db in memory. Does not assume sorted entries. caseControl : Compare how many variants are in 'case' and in 'control' groups; calculate p-values. ccs : Case control summary. Case and control summaries by region, allele frequency and variant's functional effect. concordance : Concordance metrics between two VCF files. covMat : Create an covariance matrix output (allele matrix as input). dbnsfp : Annotate with multiple entries from dbNSFP. extractFields : Extract fields from VCF file into tab separated format. filter : Filter using arbitrary expressions geneSets : Annotate using MSigDb gene sets (MSigDb includes: GO, KEGG, Reactome, BioCarta, etc.) gt : Add Genotype to INFO fields and remove genotype fields when possible. gtfilter : Filter genotype using arbitrary expressions. gwasCat : Annotate using GWAS catalog hwe : Calculate Hardy-Weimberg parameters and perform a godness of fit test. intersect : Intersect intervals (genomic regions). intervals : Keep variants that intersect with intervals. intIdx : Keep variants that intersect with intervals. Index-based method: Used for large VCF file and a few intervals to retrieve join : Join files by genomic region. phastCons : Annotate using conservation scores (phastCons). private : Annotate if a variant is private to a family or group. rmRefGen : Remove reference genotypes. rmInfo : Remove INFO fields. split : Split VCF by chromosome. tstv : Calculate transiton to transversion ratio. varType : Annotate variant type (SNP,MNP,INS,DEL or MIXED). vcfCheck : Check that VCF file is well formed. vcf2tped : Convert VCF to TPED. Options common to all SnpSift commands: -d : Debug. -download : Download database, if not available locally. Default: true. -noDownload : Do not download a database, if not available locally. -noLog : Do not report usage statistics to server. -h : Help. -v : Verbose.
Version number reported for each tool correspond to the lowest version used to prepare this training and may differ from the current version. Older versions may as well do!
Additional tools were installed from downloads or from source and are referenced to web links in the respective pages
People who have no experience with Linux command line should first follow the 'Linux for bioinformatics' training ([7]) or they will slow down the full class. All attendees who lack knowledge about Illumina reads should follow the preliminary one day 'Introduction to the analysis of NGS data' training ([8]).
Today's Hands-On Training
The Exercises
- NGS-Var Exercise.1: QC paired end reads using fastQC
- NGS-Var Exercise.2: Map a sample of the original paired end reads to the human reference genome hg19 using the Burrow Wheeler Aligner (BWA)
- NGS-Var Exercise.3: BWA mapping QC using Samtools, Picard, and Qualimap
- NGS-Var Exercise.4: Call variants as compared to the human reference genome (hg19) with samtools|bcftools or samtools|varscan
- NGS-Var Exercise.5: Compare and intersect VCF files using VcfTools and bedtools
- NGS-Var Exercise.6: Annotate and filter VCF variant lists with SNPSift and SnpEff
- NGS-Var Exercise.7: Review variants and annotations in IGV
Code developed in these exercises
Most scripts included in the exercises below are reproduced and maintained on our dedicated GitHub page NGS-Variant-Analysis-training-2016 [9] (please feed-us back about any issue with this page).
Find more tools and answers
There are many tools out there, finding them is often the easiest part. You are welcome to try as many as you wish and improve results obtained with our selected toolbox. When seeking advice, please consider using:
- SeqAnswers[10] for all what relates to NGS
- BioStar[11] for questions about biocomputing and scripting for biologists
- stackoverflow[12] for questions related to coding.
- The Turnkey Variant Analysis Project[13] to find Variant analysis resources.
- bioinformatics.ca directory[14] to find bioinformatics tools sorted by categories.
- The EBI online training propose many very nice training sessions with slides and exercises.
Conclusion
Your feedback to this general command-line based NGS data analysis session is very important to us and will be used to improve this content for later sessions. If you need more training of this kind, please contact us and we will organise additional hands-on based on your requests. More advance sessions will depend on the availability of expert users within VIB that will accept to prepare specified material.
PS: We tried to cite resources and people that made this training possible. If you feel that some citations are missing, please inform us and we will correct this omission.
References:
- ↑ http://www.1000genomes.org
- ↑ http://ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM18507
- ↑ http://www.gnu.org/software/coreutils/manual/coreutils.pdf
- ↑ http://picard.sourceforge.net
- ↑ https://media.readthedocs.org/pdf/bedtools/latest/bedtools.pdf
- ↑ http://bedtools.readthedocs.org/
- ↑ http://wiki.bits.vib.be/index.php/Introduction_to_Linux_for_bioinformatics
- ↑ http://wiki.bits.vib.be/index.php/NGS_data_analysis
- ↑ https://github.com/BITS-VIB/NGS-Variant-Analysis-training-2016
- ↑ http://seqanswers.com SeqAnswers
- ↑ http://www.biostars.org BioStar
- ↑ http://stackoverflow.com stackoverflow
- ↑ http://tvap.genome.wustl.edu/
- ↑ http://bioinformatics.ca/links_directory
- ↑ https://broadinstitute.github.io/picard/explain-flags.html
[ Main_Page ]