Hands-on introduction to NGS RNASeq DE analysis
# last edit, April 24 2015
[ Main_Page | NGS_data_analysis ]
Hands-on introduction to NGS RNASeq DE analysis-laptop-file-list
PDF handout of these pages (might contain outdated information!)
Contents
Aims of the NGS RNASeq differential expression analysis 2-days session
Using publicly available Illumina RNASeq samples to:
- Walk through a complete analysis workflow including read QC, read mapping to the human hg19 genome, and counting the reads mapping to each human transcript.
- Perform differential expression analysis using several popular R/Bioconductor tools and obtain gene-level data ready for functional analysis and biological interpretation.
The skills acquired during this session should allow reproduction of the workflow on a reasonably-sized personal computer running Unix (including mac OSX)
All scripts as well as the most important result files are put online to allow offline practice to trainees and web readers
Not covered during this training
This RNASeq basic training will fit the needs of all scientists who wish to compare pairwise experimental conditions to identify differentially transcribed genes and relate their results to a pre-defined phenotype. Other HT-Seq applications including the analysis of Gene regulation at the chromatin level, by transcription modulation or the transcript regulation by ncRNA or miRNA are not covered here.
All CLI applications used during the session are free for research use and can be installed on a personal computer with a minimum of hassle, we will be happy to support you in the installation of the necessary tools upon request.
This session does not use the popular 'Galaxy' system because of the absence of a dedicated server and of the time overhead associated with data input and output under this system. We nevertheless recommend using Galaxy if your aim is to practice NGS tools at small scale on a desktop computer used by you and few colleagues
- This training is restricted to contrasts between pairwise experimental conditions. Multiple Anova or other global statistical methods will not be used here.
- This protocol is not valid to analyze time series experiments
- This protocol is not valid for dose response analysis
- This protocol is not valid to analyze survival effects associated with gene expression
- This protocol does not de-novo assemble reads but instead maps reads to a known transcriptome model (it is therefore not directly applicable to organisms for which no transcriptome is available)
- Although this workflow uses transcript information during the mapping, this protocol does not consider splice variants and maps all results to Gene models for further functional analysis. Splice variant analysis is still in its infancy today and is limited by the read-length not allowing yet to sequence full transcripts in one piece and by the very little known about specific transcript isoforms at functional level (protein).
For each of these additional applications, specific software and approaches are required that fall far beyond the scope of this introductory training session.
- The training stops with the definition of differentially expressed genes and does not include downstream biological interpretation of the results at the functional level. For the latest, please refer to follow-up sessions covering the use of Ingenuity Pathway Analysis or related tools.
More BITS Training Info
- On the BITS website: https://www.bits.vib.be/index.php/training/
Summary
This training gives an introduction to the use of popular NGS analysis software packages at the command line tools for more advanced Users or Users having needs exceeding the capacity of the defunct BITS Galaxy platform. All code fragments used in the training pages can easily be recycled and adapted, and will constitute the basis of your own workflows. This training does not cover all currently available methods. It does not aim at bringing users to a professional NGS analyst level but provides enough information to allow motivated biologists understand what DNA sequencing practically is, to read and reuse command-line code, and when necessary to communicate knowingly with NGS experts for more in-depth needs.
The sequencing data used in this session was obtained from the SRA under the reference SRP012167 [1]. The dedicated page Download read information and FASTQ data from the SRA details how to download and reformat data from SRA (ENA) and can be used as guide to obtain other published datasets
Prerequisites
Skills required to follow this training:
- Linux command line basic skills are absolutely required to understand the code used throughout the CLI part of this session
- Basic knowledge of human genome structure, transcriptome structure, and nomenclature is necessary to evaluate and understand the training steps
- Basic knowledge of Illumina NGS read structure is also required for the same reason
People who have no experience with Linux command line should first follow the 'Linux for bioinformatics' training ([2]) or they will slow down the full class. All attendees who lack knowledge about Illumina reads should follow the preliminary one day 'Introduction to the analysis of NGS data' training ([3]). Self-training on Galaxy is advised to take full advantage on this session (please refer to the BITS Galaxy pages).
Hands-On Training
Housekeeping rules
- All Command Line Interface exercises were performed by the trainer on a similar laptops and the results moved to a new subfolder called Final_Results. Your own results will be stored in similarly named folders but created directly under your working folder and should not overwrite the 'documented' version stored in Final_Results.
- Additionally, key results were copied on the BITS server to allow you practice the sessions at later points in time on your own computer and for our web visitors. These measures ensure that you can take up any exercise and will find the necessary input data in the corresponding folder even if you did not perform the previous exercises.
- Finally, key result files were copied on the BITS file server and can be downloaded from links present at the bottom of each exercise page.
The Exercises
Exercise pages below are split into two groups. i) Exercises performed by the trainer to prepare today's data. These steps do not need be repeated during our sessions and are quite lengthy but will be necessary for visitors who want to reproduce this training on their own machine. ii) exercises repeated during the session and running faster to make you feel the real job. Note that we do not detail installation of all tools which represents in itself a small challenge for the apprentice but can be achieved with some efforts and after reading carefully the instructions associated with each package.
Code developed in these exercises
Most scripts included in the exercises below are reproduced and maintained on our dedicated GitHub page https://github.com/BITS-VIB/RNAseq-training [4] (please feed-us back about any issue with this page).
Required to prepare data for this session but not covered
- NGS_RNASeq_DE Exercise.1: Download Illumina data from SRA and perform read QC
- NGS_RNASeq_DE Exercise.1a: Download the bowtie2 reference genome index for hg19 and related files
- NGS_RNASeq_DE Exercise.1b: Map SRR1039509 reads to the current reference (hg19) and extract paired-end reads mapping to chr22
- NGS_RNASeq_DE Exercise.1c: Effect of trimming SRR1039509 reads on their ability to map to the reference
In the upper exercises, the original read data was downloaded, cleaned, and mapped to the reference genome. Mapping address of the obtained BAM data was used to extract all reads aligning with chr22. This small subset of the original experiment was prepared as input material for the remaining of the hands-on session detailed below.
Hands-On Session exercises
- NGS_RNASeq_DE Exercise.2: Map paired-end chr22 reads to the human hg19 reference genome using Tophat2
- NGS_RNASeq_DE Exercise.3: Control mappings with RSeQC, or Qualimap, and mark Duplicates with Picard
- NGS_RNASeq_DE Exercise.4: Count reads at gene level using HTSeq
Differential expression analysis in R-bioconductor and using GUI apps
- NGS_RNASeq_DE Exercise.5: The rnaseqGene: DESeq2 R-workflow [5]
- NGS_RNASeq_DE Exercise.6: Alternate R-workflow using EdgeR [6]
- NGS_RNASeq_DE Exercise.7: Compute DE using RobiNA[7] (GUI wrapper to DESeq1 and EdgeR) and gene counts files
- NGS_RNASeq_DE Exercise.8: Wrap-Up comparison of RNASeq results obtained in the different exercises
Training Material
Most of the content produced during today's session was added on the BITS platform (see links at bottom of each exercise page)
Today's main tools
In addition to the many built-in unix command, originating from the GNU consortium for a good part (please read the GNU coreutils manual for more [8]), we will cite or use today several dedicated toolboxes listed below in order of appearance. Version numbers are indicative and may have progressed since the last edit of this page.
• fastQC (v0.11.2)[9], 'the' tool for reads QC
splaisan@stelap:/work/TUTORIALS/NGS_RNASeqDE-training2015/final_test$ fastQC --help
FastQC - A high throughput sequence QC analysis tool
SYNOPSIS
fastqc seqfile1 seqfile2 .. seqfileN
fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
[-c contaminant file] seqfile1 .. seqfileN
DESCRIPTION
FastQC reads a set of sequence files and produces from each one a quality
control report consisting of a number of different modules, each one of
which will help to identify a different potential type of problem in your
data.
If no files to process are specified on the command line then the program
will start as an interactive graphical application. If files are provided
on the command line then the program will run with no user interaction
required. In this mode it is suitable for inclusion into a standardised
analysis pipeline.
The options for the program as as follows:
-h --help Print this help file and exit
-v --version Print the version of the program and exit
-o --outdir Create all output files in the specified output directory.
Please note that this directory must exist as the program
will not create it. If this option is not set then the
output file for each sequence file is created in the same
directory as the sequence file which was processed.
--casava Files come from raw casava output. Files in the same sample
group (differing only by the group number) will be analysed
as a set rather than individually. Sequences with the filter
flag set in the header will be excluded from the analysis.
Files must have the same names given to them by casava
(including being gzipped and ending with .gz) otherwise they
won't be grouped together correctly.
--nofilter If running with --casava then don't remove read flagged by
casava as poor quality when performing the QC analysis.
--extract If set then the zipped output file will be uncompressed in
the same directory after it has been created. By default
this option will be set if fastqc is run in non-interactive
mode.
-j --java Provides the full path to the java binary you want to use to
launch fastqc. If not supplied then java is assumed to be in
your path.
--noextract Do not uncompress the output file after creating it. You
should set this option if you do not wish to uncompress
the output when running in non-interactive mode.
--nogroup Disable grouping of bases for reads >50bp. All reports will
show data for every base in the read. WARNING: Using this
option will cause fastqc to crash and burn if you use it on
really long reads, and your plots may end up a ridiculous size.
You have been warned!
-f --format Bypasses the normal sequence file format detection and
forces the program to use the specified format. Valid
formats are bam,sam,bam_mapped,sam_mapped and fastq
-t --threads Specifies the number of files which can be processed
simultaneously. Each thread will be allocated 250MB of
memory so you shouldn't run more threads than your
available memory will cope with, and not more than
6 threads on a 32 bit machine
-c Specifies a non-default file which contains the list of
--contaminants contaminants to screen overrepresented sequences against.
The file must contain sets of named contaminants in the
form name[tab]sequence. Lines prefixed with a hash will
be ignored.
-a Specifies a non-default file which contains the list of
--adapters adapter sequences which will be explicity searched against
the library. The file must contain sets of named adapters
in the form name[tab]sequence. Lines prefixed with a hash
will be ignored.
-l Specifies a non-default file which contains a set of criteria
--limits which will be used to determine the warn/error limits for the
various modules. This file can also be used to selectively
remove some modules from the output all together. The format
needs to mirror the default limits.txt file found in the
Configuration folder.
-k --kmers Specifies the length of Kmer to look for in the Kmer content
module. Specified Kmer length must be between 2 and 10. Default
length is 7 if not specified.
-q --quiet Supress all progress messages on stdout and only report errors.
-d --dir Selects a directory to be used for temporary files written when
generating report images. Defaults to system temp directory if
not specified.
BUGS
Any bugs in fastqc should be reported either to simon.andrews@babraham.ac.uk
or in www.bioinformatics.babraham.ac.uk/bugzilla/
• cutadapt (v1.7.1) [10], removes adapter sequences from DNA high-throughput sequencing data.
Copyright (C) 2010-2014 Marcel Martin <marcel.martin@scilifelab.se>
cutadapt removes adapter sequences from high-throughput sequencing reads.
Usage:
cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq
Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard
characters are supported. The reverse complement is *not* automatically
searched. All reads from input.fastq will be written to output.fastq with the
adapter sequence removed. Adapter matching is error-tolerant. Multiple adapter
sequences can be given (use further -a options), but only the best-matching
adapter will be removed.
Input may also be in FASTA format. Compressed input and output is supported and
auto-detected from the file name (.gz or .bz2). Use the file name '-' for
standard input/output. Without the -o option, output is sent to standard output.
Some other available features are:
* Various other adapter types (5' adapters, "mixed" 5'/3' adapters etc.)
* Trimming a fixed number of bases
* Quality trimming
* Trimming paired-end reads
* Trimming colorspace reads
* Filtering reads by various criteria
Use "cutadapt --help" to see all command-line options.
See http://cutadapt.readthedocs.org/ for the full documentation.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-f FORMAT, --format=FORMAT
Input file format; can be either 'fasta', 'fastq' or
'sra-fastq'. Ignored when reading csfasta/qual files
(default: auto-detect from file name extension).
Options that influence how the adapters are found:
Each of the following three parameters (-a, -b, -g) can be used
multiple times and in any combination to search for an entire set of
adapters of possibly different types. Only the best matching adapter
is trimmed from each read (but see the --times option). Instead of
giving an adapter directly, you can also write file:FILE and the
adapter sequences will be read from the given FILE (which must be in
FASTA format).
-a ADAPTER, --adapter=ADAPTER
Sequence of an adapter that was ligated to the 3' end.
The adapter itself and anything that follows is
trimmed. If the adapter sequence ends with the '$'
character, the adapter is anchored to the end of the
read and only found if it is a suffix of the read.
-g ADAPTER, --front=ADAPTER
Sequence of an adapter that was ligated to the 5' end.
If the adapter sequence starts with the character '^',
the adapter is 'anchored'. An anchored adapter must
appear in its entirety at the 5' end of the read (it
is a prefix of the read). A non-anchored adapter may
appear partially at the 5' end, or it may occur within
the read. If it is found within a read, the sequence
preceding the adapter is also trimmed. In all cases,
the adapter itself is trimmed.
-b ADAPTER, --anywhere=ADAPTER
Sequence of an adapter that was ligated to the 5' or
3' end. If the adapter is found within the read or
overlapping the 3' end of the read, the behavior is
the same as for the -a option. If the adapter overlaps
the 5' end (beginning of the read), the initial
portion of the read matching the adapter is trimmed,
but anything that follows is kept.
-e ERROR_RATE, --error-rate=ERROR_RATE
Maximum allowed error rate (no. of errors divided by
the length of the matching region) (default: 0.1)
--no-indels Do not allow indels in the alignments (allow only
mismatches). Currently only supported for anchored
adapters. (default: allow both mismatches and indels)
-n COUNT, --times=COUNT
Try to remove adapters at most COUNT times. Useful
when an adapter gets appended multiple times (default:
1).
-O LENGTH, --overlap=LENGTH
Minimum overlap length. If the overlap between the
read and the adapter is shorter than LENGTH, the read
is not modified. This reduces the no. of bases trimmed
purely due to short random adapter matches (default:
3).
--match-read-wildcards
Allow IUPAC wildcards in reads (default: False).
-N, --no-match-adapter-wildcards
Do not interpret IUPAC wildcards in adapters.
Options for filtering of processed reads:
--discard-trimmed, --discard
Discard reads that contain the adapter instead of
trimming them. Also use -O in order to avoid throwing
away too many randomly matching reads!
--discard-untrimmed, --trimmed-only
Discard reads that do not contain the adapter.
-m LENGTH, --minimum-length=LENGTH
Discard trimmed reads that are shorter than LENGTH.
Reads that are too short even before adapter removal
are also discarded. In colorspace, an initial primer
is not counted (default: 0).
-M LENGTH, --maximum-length=LENGTH
Discard trimmed reads that are longer than LENGTH.
Reads that are too long even before adapter removal
are also discarded. In colorspace, an initial primer
is not counted (default: no limit).
--no-trim Match and redirect reads to output/untrimmed-output as
usual, but do not remove adapters.
--mask-adapter Mask adapters with 'N' characters instead of trimming
them.
Options that influence what gets output to where:
--quiet Do not print a report at the end.
-o FILE, --output=FILE
Write modified reads to FILE. FASTQ or FASTA format is
chosen depending on input. The summary report is sent
to standard output. Use '{name}' in FILE to
demultiplex reads into multiple files. (default:
trimmed reads are written to standard output)
-p FILE, --paired-output=FILE
Write reads from the paired-end input to FILE.
--info-file=FILE Write information about each read and its adapter
matches into FILE. See the documentation for the file
format.
-r FILE, --rest-file=FILE
When the adapter matches in the middle of a read,
write the rest (after the adapter) into FILE.
--wildcard-file=FILE
When the adapter has wildcard bases ('N's), write
adapter bases matching wildcard positions to FILE.
When there are indels in the alignment, this will
often not be accurate.
--too-short-output=FILE
Write reads that are too short (according to length
specified by -m) to FILE. (default: discard reads)
--too-long-output=FILE
Write reads that are too long (according to length
specified by -M) to FILE. (default: discard reads)
--untrimmed-output=FILE
Write reads that do not contain the adapter to FILE.
(default: output to same file as trimmed reads)
--untrimmed-paired-output=FILE
Write the second read in a pair to this FILE when no
adapter was found in the first read. Use this option
together with --untrimmed-output when trimming paired-
end reads. (Default: output to same file as trimmed
reads.)
Additional modifications to the reads:
-u LENGTH, --cut=LENGTH
Remove bases from the beginning or end of each read.
If LENGTH is positive, the bases are removed from the
beginning of each read. If LENGTH is negative, the
bases are removed from the end of each read. This
option can be specified twice if the LENGTHs have
different signs.
-q CUTOFF, --quality-cutoff=CUTOFF
Trim low-quality bases from 3' ends of reads before
adapter removal. The algorithm is the same as the one
used by BWA (Subtract CUTOFF from all qualities;
compute partial sums from all indices to the end of
the sequence; cut sequence at the index at which the
sum is minimal) (default: 0)
--quality-base=QUALITY_BASE
Assume that quality values are encoded as
ascii(quality + QUALITY_BASE). The default (33) is
usually correct, except for reads produced by some
versions of the Illumina pipeline, where this should
be set to 64. (Default: 33)
-x PREFIX, --prefix=PREFIX
Add this prefix to read names
-y SUFFIX, --suffix=SUFFIX
Add this suffix to read names
--strip-suffix=STRIP_SUFFIX
Remove this suffix from read names if present. Can be
given multiple times.
-c, --colorspace Colorspace mode: Also trim the color that is adjacent
to the found adapter.
-d, --double-encode
When in colorspace, double-encode colors (map
0,1,2,3,4 to A,C,G,T,N).
-t, --trim-primer When in colorspace, trim primer base and the first
color (which is the transition to the first
nucleotide)
--strip-f3 For colorspace: Strip the _F3 suffix of read names
--maq, --bwa MAQ- and BWA-compatible colorspace output. This
enables -c, -d, -t, --strip-f3 and -y '/1'.
--length-tag=TAG Search for TAG followed by a decimal number in the
name of the read (description/comment field of the
FASTA or FASTQ file). Replace the decimal number with
the correct length of the trimmed read. For example,
use --length-tag 'length=' to correct fields like
'length=123'.
--no-zero-cap Do not change negative quality values to zero.
Colorspace quality values of -1 would appear as spaces
in the output FASTQ file. Since many tools have
problems with that, negative qualities are converted
to zero when trimming colorspace data. Use this option
to keep negative qualities.
-z, --zero-cap Change negative quality values to zero. This is
enabled by default when -c/--colorspace is also
enabled. Use the above option to disable it.
• PRINSEQ (PRINSEQ-lite 0.20.4)[11][12], PReprocessing and INformation of SEQuence data (a great perl tool replacing Galaxy Groomer+FastQ quality trimmer+...+many more options are available)
perl prinseq-lite.pl [-h] [-help] [-version] [-man] [-verbose] [-fastq
input_fastq_file] [-fasta input_fasta_file] [-fastq2 input_fastq_file_2]
[-fasta2 input_fasta_file_2] [-qual input_quality_file] [-min_len
int_value] [-max_len int_value] [-range_len ranges] [-min_gc int_value]
[-max_gc int_value] [-range_gc ranges] [-min_qual_score int_value]
[-max_qual_score int_value] [-min_qual_mean int_value] [-max_qual_mean
int_value] [-ns_max_p int_value] [-ns_max_n int_value] [-noniupac]
[-seq_num int_value] [-derep int_value] [-derep_min int_value]
[-lc_method method_name] [-lc_threshold int_value] [-trim_to_len
int_value] [-trim_left int_value] [-trim_right int_value] [-trim_left_p
int_value] [-trim_right_p int_value] [-trim_ns_left int_value]
[-trim_ns_right int_value] [-trim_tail_left int_value] [-trim_tail_right
int_value] [-trim_qual_left int_value] [-trim_qual_right int_value]
[-trim_qual_type type] [-trim_qual_rule rule] [-trim_qual_window
int_value] [-trim_qual_step int_value] [-seq_case case] [-dna_rna type]
[-line_width int_value] [-rm_header] [-seq_id id_string] [-out_format
int_value] [-out_good filename_prefix] [-out_bad filename_prefix]
[-phred64] [-stats_info] [-stats_len] [-stats_dinuc] [-stats_tag]
[-stats_dupl] [-stats_ns] [-stats_assembly] [-stats_all] [-aa]
[-graph_data file] [-graph_stats string] [-qual_noscale]
[-no_qual_header] [-exact_only] [-log file] [-custom_params string]
[-params file] [-seq_id_mappings file]
Options:
-help | -h
Print the help message; ignore other arguments.
-man Print the full documentation; ignore other arguments.
-version
Print program version; ignore other arguments.
-verbose
Prints status and info messages during processing.
***** INPUT OPTIONS *****
-fastq <file>
Input file in FASTQ format that contains the sequence and
quality data. Use stdin instead of a file name to read from
STDIN (-fasta stdin). This can be useful to process compressed
files using Unix pipes.
-fasta <file>
Input file in FASTA format that contains the sequence data. Use
stdin instead of a file name to read from STDIN (-fastq stdin).
This can be useful to process compressed files using Unix pipes.
-qual <file>
Input file in QUAL format that contains the quality data.
-fastq2 <file>
For paired-end data only. Input file in FASTQ format that
contains the sequence and quality data. The sequence identifiers
for two matching paired-end sequences in separate files can be
marked by /1 and /2, or _L and _R, or _left and _right, or must
have the exact same identifier in both input files. The input
sequences must be sorted by their sequence identifiers.
Singletons are allowed in the input files.
-fasta2 <file>
For paired-end data only. Input file in FASTA format that
contains the sequence data. The sequence identifiers for two
matching paired-end sequences in separate files can be marked by
/1 and /2, or _L and _R, or _left and _right, or must have the
exact same identifier in both input files. The input sequences
must be sorted by their sequence identifiers. Singletons are
allowed in the input files.
-params <file>
Input file in text format that contains PRINSEQ parameters. Each
parameter should be specified on a new line and arguments should
be separated by spaces or tabs. Comments can be specified on
lines starting with the # sign. Can be combined with command
line parameters. Parameters specified on the command line will
overwrite the arguments in the file (if any).
-si13 This option was replaced by option -phred64.
-phred64
Quality data in FASTQ file is in Phred+64 format
(http://en.wikipedia.org/wiki/FASTQ_format#Encoding). Not
required for Illumina 1.8+, Sanger, Roche/454, Ion Torrent,
PacBio data.
-aa Input is amino acid (protein) sequences instead of nucleic acid
(DNA or RNA) sequences. Allowed amino acid characters:
ABCDEFGHIKLMNOPQRSTUVWYZXabcdefghiklmmopqrstuvwyzx*- and allowed
nucleic acid characters: ACGTURYKMSWBDHVNXacgturykmswbdhvnx-
The following options are ignored for -aa:
stats_dinuc,stats_tag,stats_ns,dna_rna
***** OUTPUT OPTIONS *****
-out_format <integer>
To change the output format, use one of the following options.
If not defined, the output format will be the same as the input
format.
1 (FASTA only), 2 (FASTA and QUAL), 3 (FASTQ), 4 (FASTQ and
FASTA), or 5 (FASTQ, FASTA and QUAL)
-out_good <string>
By default, the output files are created in the same directory
as the input file containing the sequence data with an
additional "_prinseq_good_XXXX" in their name (where XXXX is
replaced by random characters to prevent overwriting previous
files). To change the output filename and location, specify the
filename using this option. The file extension will be added
automatically (either .fasta, .qual, or .fastq). For paired-end
data, filenames contain additionally "_1", "_1_singletons",
"_2", and "_2_singletons" before the file extension. Use
"-out_good null" to prevent the program from generating the
output file(s) for data passing all filters. Use "-out_good
stdout" to write data passing all filters to STDOUT (only for
FASTA or FASTQ output files).
Example: use "file_passed" to generate the output file
file_passed.fasta in the current directory
-out_bad <string>
By default, the output files are created in the same directory
as the input file containing the sequence data with an
additional "_prinseq_bad_XXXX" in their name (where XXXX is
replaced by random characters to prevent overwriting previous
files). To change the output filename and location, specify the
filename using this option. The file extension will be added
automatically (either .fasta, .qual, or .fastq). For paired-end
data, filenames contain additionally "_1" and "_2" before the
file extension. Use "-out_bad null" to prevent the program from
generating the output file(s) for data not passing any filter.
Use "-out_bad stdout" to write data not passing any filter to
STDOUT (only for FASTA or FASTQ output files).
Example: use "file_filtered" to generate the output file
file_filtered.fasta in the current directory
Example: "-out_good stdout -out_bad null" will write data
passing filters to STDOUT and data not passing any filter will
be ignored
-log <file>
Log file to keep track of parameters, errors, etc. The log file
name is optional. If no file name is given, the log file name
will be "inputname.log". If the log file already exists, new
content will be added to the file.
-graph_data <file>
File that contains the necessary information to generate the
graphs similar to the ones in the web version. The file name is
optional. If no file name is given, the file name will be
"inputname.gd". If the file already exists, new content will
overwrite the file. Use "-out_good null -out_bad null" to
prevent generating any additional outputs. (See below for more
options related to the graph data.)
The graph data can be used as input for the prinseq-graphs.pl
file to generate the PNG graph files or an HTML report file. If
you have trouble installing the required prinseq-graphs.pl
modules or want to see an output example report, upload the
graph data file at: http://edwards.sdsu.edu/prinseq/ -> Choose
"Get Report"
-graph_stats <string>
Use this option to select what statistics should be calculated
and included in the graph_data file. This is useful if you e.g.
do not need sequence complexity information, which requires a
lot of computation. Requires to have graph_data specified.
Default is all selected.
Allowed option are (separate multiple by comma with no spaces):
ld (Length distribution), gc (GC content distribution), qd (Base
quality distribution), ns (Occurence of N), pt (Poly-A/T tails),
ts (Tag sequence check), aq (Assembly quality measure), de
(Sequence duplication - exact only), da (Sequence duplication -
exact + 5'/3'), sc (Sequence complexity), dn (Dinucleotide odds
ratios, includes the PCA plots)
Example use: -graph_stats ld,gc,qd,de
-qual_noscale
Use this option if all your sequences are shorter than 100bp as
they do not require to scale quality data to 100 data points in
the graph. By default, quality scores of sequences shorter than
100bp or longer than 100bp are fit to 100 data points. (To
retrieve this information and calculate the graph data would
otherwise require to parse the data two times or store all the
quality data in memory.)
-no_qual_header
In order to reduce the file size, this option will generate an
empty header line for the quality data in FASTQ files. Instead
of +header, only the + sign will be output. The header of the
sequence data will be left unchanged. This option applies to
FASTQ output files only.
-exact_only
Use this option to check for exact (forward and reverse)
duplicates only when generating the graph data. This allows to
keep the memory requirements low for large input files and is
faster. This option will automatically be applied when using
-derep options 1 and/or 4 only. Specify option -derep 1 or
-derep 4 if you do not want to apply both at the same time.
-seq_id_mappings <file>
Text file containing the old and new (specified with -seq_id)
identifiers for later reference. This option is useful if e.g. a
renamed sequence has to be identified based on the new sequence
identifier. The file name is optional. If no file name is given,
the file name will be "inputname_prinseq_good.ids" (only good
sequences are renamed). If a file with the same name already
exists, new content will overwrite the old file. The text file
contains one sequence identifier pair per line, separated by
tabs (old-tab-new). Requires option -seq_id.
***** FILTER OPTIONS *****
-min_len <integer>
Filter sequence shorter than min_len.
-max_len <integer>
Filter sequence longer than max_len.
-range_len <string>
Filter sequence by length range. Multiple range values should be
separated by comma without spaces.
Example: -range_len 50-100,250-300
-min_gc <integer>
Filter sequence with GC content below min_gc.
-max_gc <integer>
Filter sequence with GC content above max_gc.
-range_gc <string>
Filter sequence by GC content range. Multiple range values
should be separated by comma without spaces.
Example: -range_gc 50-60,75-90
-min_qual_score <integer>
Filter sequence with at least one quality score below
min_qual_score.
-max_qual_score <integer>
Filter sequence with at least one quality score above
max_qual_score.
-min_qual_mean <integer>
Filter sequence with quality score mean below min_qual_mean.
-max_qual_mean <integer>
Filter sequence with quality score mean above max_qual_mean.
-ns_max_p <integer>
Filter sequence with more than ns_max_p percentage of Ns.
-ns_max_n <integer>
Filter sequence with more than ns_max_n Ns.
-noniupac
Filter sequence with characters other than A, C, G, T or N.
-seq_num <integer>
Only keep the first seq_num number of sequences (that pass all
other filters).
-derep <integer>
Type of duplicates to filter. Allowed values are 1, 2, 3, 4 and
5. Use integers for multiple selections (e.g. 124 to use type 1,
2 and 4). The order does not matter. Option 2 and 3 will set 1
and option 5 will set 4 as these are subsets of the other
option.
1 (exact duplicate), 2 (5' duplicate), 3 (3' duplicate), 4
(reverse complement exact duplicate), 5 (reverse complement
5'/3' duplicate)
-derep_min <integer>
This option specifies the number of allowed duplicates. If you
want to remove sequence duplicates that occur more than x times,
then you would specify x+1 as the -derep_min values. For
examples, to remove sequences that occur more than 5 times, you
would specify -derep_min 6. This option can only be used in
combination with -derep 1 and/or 4 (forward and/or reverse exact
duplicates). [default : 2]
-lc_method <string>
Method to filter low complexity sequences. The current options
are "dust" and "entropy". Use "-lc_method dust" to calculate the
complexity using the dust method.
-lc_threshold <integer>
The threshold value (between 0 and 100) used to filter sequences
by sequence complexity. The dust method uses this as maximum
allowed score and the entropy method as minimum allowed value.
-custom_params <string>
Can be used to specify additional filters. The current set of
possible rules is limited and has to follow the specifications
below. The custom parameters have to be specified within quotes
(either ' or ").
Please separate parameter values with a space and separate new
parameter sets with semicolon (;). Parameters are defined by two
values: (1) the pattern (any combination of the letters
"ACGTN"), (2) the number of repeats or percentage of occurence
Percentage values are defined by a number followed by the %-sign
(without space). If no %-sign is given, it is assumed that the
given number specifies the number of repeats of the pattern.
Examples: "AAT 10" (filters out sequences containing
AATAATAATAATAATAATAATAATAATAAT anywhere in the sequence), "T
70%" (filters out sequences with more than 70% Ts in the
sequence), "A 15" (filters out sequences containing
AAAAAAAAAAAAAAA anywhere in the sequence), "AAT 10;T 70%;A 15"
(apply all three filters)
***** TRIM OPTIONS *****
-trim_to_len <integer>
Trim all sequence from the 3'-end to result in sequence with
this length.
-trim_left <integer>
Trim sequence at the 5'-end by trim_left positions.
-trim_right <integer>
Trim sequence at the 3'-end by trim_right positions.
-trim_left_p <integer>
Trim sequence at the 5'-end by trim_left_p percentage of read
length. The trim length is rounded towards the lower integer
(e.g. 143.6 is rounded to 143 positions). Use an integer between
1 and 100 for the percentage value.
-trim_right_p <integer>
Trim sequence at the 3'-end by trim_right_p percentage of read
length. The trim length is rounded towards the lower integer
(e.g. 143.6 is rounded to 143 positions). Use an integer between
1 and 100 for the percentage value.
-trim_tail_left <integer>
Trim poly-A/T tail with a minimum length of trim_tail_left at
the 5'-end.
-trim_tail_right <integer>
Trim poly-A/T tail with a minimum length of trim_tail_right at
the 3'-end.
-trim_ns_left <integer>
Trim poly-N tail with a minimum length of trim_ns_left at the
5'-end.
-trim_ns_right <integer>
Trim poly-N tail with a minimum length of trim_ns_right at the
3'-end.
-trim_qual_left <integer>
Trim sequence by quality score from the 5'-end with this
threshold score.
-trim_qual_right <integer>
Trim sequence by quality score from the 3'-end with this
threshold score.
-trim_qual_type <string>
Type of quality score calculation to use. Allowed options are
min, mean, max and sum. [default: min]
-trim_qual_rule <string>
Rule to use to compare quality score to calculated value.
Allowed options are lt (less than), gt (greater than) and et
(equal to). [default: lt]
-trim_qual_window <integer>
The sliding window size used to calculate quality score by type.
To stop at the first base that fails the rule defined, use a
window size of 1. [default: 1]
-trim_qual_step <integer>
Step size used to move the sliding window. To move the window
over all quality scores without missing any, the step size
should be less or equal to the window size. [default: 1]
***** REFORMAT OPTIONS *****
-seq_case <string>
Changes sequence character case to upper or lower case. Allowed
options are "upper" and "lower". Use this option to remove
soft-masking from your sequences.
-dna_rna <string>
Convert sequence between DNA and RNA. Allowed options are "dna"
(convert from RNA to DNA) and "rna" (convert from DNA to RNA).
-line_width <integer>
Sequence characters per line. Use 0 if you want each sequence in
a single line. Use 80 for line breaks every 80 characters. Note
that this option only applies to FASTA output files, since FASTQ
files store sequences without additional line breaks. [default:
60]
-rm_header
Remove the sequence header. This includes everything after the
sequence identifier (which is kept unchanged).
-seq_id <string>
Rename the sequence identifier. A counter is added to each
identifier to assure its uniqueness. Use option -seq_id_mappings
to generate a file containing the old and new identifiers for
later reference.
Example: "mySeq_10" will generate the IDs (in FASTA format)
>mySeq_101, >mySeq_102, >mySeq_103, ...
***** SUMMARY STATISTIC OPTIONS *****
The summary statistic values are written to STDOUT in the form:
"parameter_name statistic_name value" (without the quotes). For
example, "stats_info reads 10000" or "stats_len max 500". Only
one statistic is written per line and values are separated by
tabs.
If you specify any statistic option, no other ouput will be
generated. To preprocess data, do not specify a statistics
option.
-stats_info
Outputs basic information such as number of reads (reads) and
total bases (bases).
-stats_len
Outputs minimum (min), maximum (max), range (range), mean
(mean), standard deviation (stddev), mode (mode) and mode value
(modeval), and median (median) for read length.
-stats_dinuc
Outputs the dinucleotide odds ratio for AA/TT (aatt), AC/GT
(acgt), AG/CT (agct), AT (at), CA/TG (catg), CC/GG (ccgg), CG
(cg), GA/TC (gatc), GC (gc) and TA (ta).
-stats_tag
Outputs the probability of a tag sequence at the 5'-end (prob5)
and 3'-end (prob3) in percentage (0..100). Provides the number
of predefined MIDs (midnum) and the MID sequences (midseq,
separated by comma, only provided if midnum > 0) that occur in
more than 34/100 (approx. 3%) of the reads.
-stats_dupl
Outputs the number of exact duplicates (exact), 5' duplicates
(5), 3' duplicates (3), exact duplicates with reverse
complements (exactrevcom) and 5'/3' duplicates with reverse
complements (revcomp), and total number of duplicates (total).
The maximum number of duplicates is given under the value name
with an additional "maxd" (e.g. exactmaxd or 5maxd).
-stats_ns
Outputs the number of reads with ambiguous base N (seqswithn),
the maximum number of Ns per read (maxn) and the maximum
percentage of Ns per read (maxp). The maxn and maxp value are
not necessary from the same sequence.
-stats_assembly
Outputs the N50, N90, etc contig sizes. The Nxx contig size is a
weighted median that is defined as the length of the smallest
contig C in the sorted list of all contigs where the cumulative
length from the largest contig to contig C is at least xx% of
the total length (sum of contig lengths).
-stats_all
Outputs all available summary statistics.
***** ORDER OF PROCESSING *****
The available options are processed in the following order:
seq_num, trim_left, trim_right, trim_left_p, trim_right_p,
trim_qual_left, trim_qual_right, trim_tail_left,
trim_tail_right, trim_ns_left, trim_ns_right, trim_to_len,
min_len, max_len, range_len, min_qual_score, max_qual_score,
min_qual_mean, max_qual_mean, min_gc, max_gc, range_gc,
ns_max_p, ns_max_n, noniupac, lc_method, derep, seq_id,
seq_case, dna_rna, out_format
• Groomer [13], a python script, developed for and part of Galaxy, and used to convert between fastq formats
## Otherwise, please consider 'Prinseq-lite.pl' which allows many more filtering steps from a single command.
# fastq_groomer.py
# expected parameters from the code header. The 'advanced' galaxy defaults are shown in every second line
$1 input_filename = sys.argv[1] (fastq file)
File to groom: <from your history>
$2 input_type = sys.argv[2] (phred solexa, sanger, illumina, cssanger ...)
Input FASTQ quality scores type: 'Sanger'
$3 output_filename = sys.argv[3] (fastq file)
<not shown => new history object>
$4 output_type = sys.argv[4] (phred solexa, sanger, illumina, cssanger ...)
Output FASTQ quality scores type: 'Sanger (recommended)'
$5 force_quality_encoding = sys.argv[5] | (None, 'ascii', 'decimal')
Force Quality Score encoding: 'ASCII'
$6 summarize_input = sys.argv[6] == 'summarize_input' (summarize_input, 'None' or any other text for none)
Summarize input data: 'Summarized Input'
## example command
### => finds 'fastq-groomer.py' in the PATH to avoid having to give its full location
python2.7 $(which fastq_groomer.py) \
some-illumina-data.fastq \
illumina \
groomed.fastq \
sanger \
None \
summarize_input
• fastq-mcf (v1.04.676), fastQ filtering toolbox part of the Google-code package ea-utils [14]
The **fastq-mcf** command expects the following input and output parameters
Usage: fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...]
Version: 1.04.676
Detects levels of adapter presence, computes likelihoods and
locations (start, end) of the adapters. Removes the adapter
sequences from the fastq file(s).
Stats go to stderr, unless -o is specified.
Specify -0 to turn off all default settings
If you specify multiple 'paired-end' inputs, then a -o option is
required for each. IE: -o read1.clip.q -o read2.clip.fq
Options:
-h This help
-o FIL Output file (stats to stdout)
-s N.N Log scale for adapter minimum-length-match (2.2)
-t N % occurance threshold before adapter clipping (0.25)
-m N Minimum clip length, overrides scaled auto (1)
-p N Maximum adapter difference percentage (10)
-l N Minimum remaining sequence length (19)
-L N Maximum remaining sequence length (none)
-D N Remove duplicate reads : Read_1 has an identical N bases (0)
-k N sKew percentage-less-than causing cycle removal (2)
-x N 'N' (Bad read) percentage causing cycle removal (20)
-q N quality threshold causing base removal (10)
-w N window-size for quality trimming (1)
-H remove >95% homopolymer reads (no)
-X remove low complexity reads (no)
-0 Set all default parameters to zero/do nothing
-U|u Force disable/enable Illumina PF filtering (auto)
-P N Phred-scale (auto)
-R Don't remove N's from the fronts/ends of reads
-n Don't clip, just output what would be done
-C N Number of reads to use for subsampling (300k)
-S Save all discarded reads to '.skip' files
-d Output lots of random debugging stuff
Quality adjustment options:
--cycle-adjust CYC,AMT Adjust cycle CYC (negative = offset from end) by amount AMT
--phred-adjust SCORE,AMT Adjust score SCORE by amount AMT
--phred-adjust-max SCORE Adjust scores > SCORE to SCOTE
Filtering options*:
--[mate-]qual-mean NUM Minimum mean quality score
--[mate-]qual-gt NUM,THR At least NUM quals > THR
--[mate-]max-ns NUM Maxmium N-calls in a read (can be a %)
--[mate-]min-len NUM Minimum remaining length (same as -l)
--homopolymer-pct PCT Homopolymer filter percent (95)
--lowcomplex-pct PCT Complexity filter percent (95)
If mate- prefix is used, then applies to second non-barcode read only
Adapter files are 'fasta' formatted:
Specify n/a to turn off adapter clipping, and just use filters
Increasing the scale makes recognition-lengths longer, a scale
of 100 will force full-length recognition of adapters.
Adapter sequences with _5p in their label will match 'end's,
and sequences with _3p in their label will match 'start's,
otherwise the 'end' is auto-determined.
Skew is when one cycle is poor, 'skewed' toward a particular base.
If any nucleotide is less than the skew percentage, then the
whole cycle is removed. Disable for methyl-seq, etc.
Set the skew (-k) or N-pct (-x) to 0 to turn it off (should be done
for miRNA, amplicon and other low-complexity situations!)
Duplicate read filtering is appropriate for assembly tasks, and
never when read length < expected coverage. -D 50 will use
4.5GB RAM on 100m DNA reads - be careful. Great for RNA assembly.
*Quality filters are evaluated after clipping/trimming
Homopolymer filtering is a subset of low-complexity, but will not
be separately tracked unless both are turned on.
• tophat (TopHat 2.0.13 release 10/2/2014) [15]', a tool for mapping split reads to a reference genome
Usage:
tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]] \
[quals1,[quals2,...]] [quals1[,quals2,...]]
Options:
-v/--version
-o/--output-dir <string> [ default: ./tophat_out ]
--bowtie1 [ default: bowtie2 ]
-N/--read-mismatches <int> [ default: 2 ]
--read-gap-length <int> [ default: 2 ]
--read-edit-dist <int> [ default: 2 ]
--read-realign-edit-dist <int> [ default: "read-edit-dist" + 1 ]
-a/--min-anchor <int> [ default: 8 ]
-m/--splice-mismatches <0-2> [ default: 0 ]
-i/--min-intron-length <int> [ default: 50 ]
-I/--max-intron-length <int> [ default: 500000 ]
-g/--max-multihits <int> [ default: 20 ]
--suppress-hits
-x/--transcriptome-max-hits <int> [ default: 60 ]
-M/--prefilter-multihits ( for -G/--GTF option, enable
an initial bowtie search
against the genome )
--max-insertion-length <int> [ default: 3 ]
--max-deletion-length <int> [ default: 3 ]
--solexa-quals
--solexa1.3-quals (same as phred64-quals)
--phred64-quals (same as solexa1.3-quals)
-Q/--quals
--integer-quals
-C/--color (Solid - color space)
--color-out
--library-type <string> (fr-unstranded, fr-firststrand,
fr-secondstrand)
-p/--num-threads <int> [ default: 1 ]
-R/--resume <out_dir> ( try to resume execution )
-G/--GTF <filename> (GTF/GFF with known transcripts)
--transcriptome-index <bwtidx> (transcriptome bowtie index)
-T/--transcriptome-only (map only to the transcriptome)
-j/--raw-juncs <filename>
--insertions <filename>
--deletions <filename>
-r/--mate-inner-dist <int> [ default: 50 ]
--mate-std-dev <int> [ default: 20 ]
--no-novel-juncs
--no-novel-indels
--no-gtf-juncs
--no-coverage-search
--coverage-search
--microexon-search
--keep-tmp
--tmp-dir <dirname> [ default: <output_dir>/tmp ]
-z/--zpacker <program> [ default: gzip ]
-X/--unmapped-fifo [use mkfifo to compress more temporary
files for color space reads]
Advanced Options:
--report-secondary-alignments
--no-discordant
--no-mixed
--segment-mismatches <int> [ default: 2 ]
--segment-length <int> [ default: 25 ]
--bowtie-n [ default: bowtie -v ]
--min-coverage-intron <int> [ default: 50 ]
--max-coverage-intron <int> [ default: 20000 ]
--min-segment-intron <int> [ default: 50 ]
--max-segment-intron <int> [ default: 500000 ]
--no-sort-bam (Output BAM is not coordinate-sorted)
--no-convert-bam (Do not output bam format.
Output is <output_dir>/accepted_hit.sam)
--keep-fasta-order
--allow-partial-mapping
Bowtie2 related options:
Preset options in --end-to-end mode (local alignment is not used in TopHat2)
--b2-very-fast
--b2-fast
--b2-sensitive
--b2-very-sensitive
Alignment options
--b2-N <int> [ default: 0 ]
--b2-L <int> [ default: 20 ]
--b2-i <func> [ default: S,1,1.25 ]
--b2-n-ceil <func> [ default: L,0,0.15 ]
--b2-gbar <int> [ default: 4 ]
Scoring options
--b2-mp <int>,<int> [ default: 6,2 ]
--b2-np <int> [ default: 1 ]
--b2-rdg <int>,<int> [ default: 5,3 ]
--b2-rfg <int>,<int> [ default: 5,3 ]
--b2-score-min <func> [ default: L,-0.6,-0.6 ]
Effort options
--b2-D <int> [ default: 15 ]
--b2-R <int> [ default: 2 ]
Fusion related options:
--fusion-search
--fusion-anchor-length <int> [ default: 20 ]
--fusion-min-dist <int> [ default: 10000000 ]
--fusion-read-mismatches <int> [ default: 2 ]
--fusion-multireads <int> [ default: 2 ]
--fusion-multipairs <int> [ default: 2 ]
--fusion-ignore-chromosomes <list> [ e.g, <chrM,chrX> ]
--fusion-do-not-resolve-conflicts [this is for test purposes ]
SAM Header Options (for embedding sequencing run metadata in output):
--rg-id <string> (read group ID)
--rg-sample <string> (sample ID)
--rg-library <string> (library ID)
--rg-description <string> (descriptive string, no tabs allowed)
--rg-platform-unit <string> (e.g Illumina lane ID)
--rg-center <string> (sequencing center name)
--rg-date <string> (ISO 8601 date of the sequencing run)
--rg-platform <string> (Sequencing platform descriptor)
for detailed help see http://tophat.cbcb.umd.edu/manual.html
• samtools (v1.2.1 using htslib) [16], THE canonical SAM/BAM manipulation suite
Version: 1.2 (using htslib 1.2.1)
Usage: samtools <command> [options]
Commands:
-- indexing
faidx index/extract FASTA
index index alignment
-- editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
rmdup remove PCR duplicates
targetcut cut fosmid regions (for fosmid pool only)
-- file operations
bamshuf shuffle and group alignments by name
cat concatenate BAMs
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
bam2fq converts a BAM to a FASTQ
-- stats
bedcov read depth per BED region
depth compute the depth
flagstat simple stats
idxstats BAM index stats
phase phase heterozygotes
stats generate stats (former bamcheck)
-- viewing
flags explain BAM flags
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
• Picard (1.119) [17], the 'broadly' used java wrapper for samtools with added functionalities
AddCommentsToBam.jar, AddOrReplaceReadGroups.jar, BamIndexStats.jar, BamToBfq.jar, BuildBamIndex.jar,
CalculateHsMetrics.jar, CheckIlluminaDirectory.jar, CleanSam.jar, CollectAlignmentSummaryMetrics.jar,
CollectGcBiasMetrics.jar, CollectInsertSizeMetrics.jar, CollectMultipleMetrics.jar, CollectRnaSeqMetrics.jar,
CollectTargetedPcrMetrics.jar, CollectWgsMetrics.jar, CompareSAMs.jar, CreateSequenceDictionary.jar,
DownsampleSam.jar, EstimateLibraryComplexity.jar, ExtractIlluminaBarcodes.jar, ExtractSequences.jar,
FastqToSam.jar, FilterSamReads.jar, FixMateInformation.jar, GatherBamFiles.jar, IlluminaBasecallsToFastq.jar,
IlluminaBasecallsToSam.jar, IntervalListTools.jar, MakeSitesOnlyVcf.jar, MarkDuplicates.jar, MarkIlluminaAdapters.jar,
MeanQualityByCycle.jar, MergeBamAlignment.jar, MergeSamFiles.jar, MergeVcfs.jar, NormalizeFasta.jar,
QualityScoreDistribution.jar, ReorderSam.jar, ReplaceSamHeader.jar, RevertOriginalBaseQualitiesAndAddMateCigar.jar,
RevertSam.jar, SamFormatConverter.jar, SamToFastq.jar, SortSam.jar, SplitVcfs.jar, ValidateSamFile.jar,
VcfFormatConverter.jar, ViewSam.jar
## get help in Picard by typing
java -jar $PICARD <command.jar> -h
• RSeQC (v2.6.1) [18] package provides a number of useful modules that can comprehensively evaluate high throughput sequence data especially RNA-seq data.
bam2wig.py
bam2fq.py
bam_stat.py
clipping_profile.py
geneBody_coverage.py
geneBody_coverage2.py
infer_experiment.py
inner_distance.py
junction_annotation.py
junction_saturation.py
overlay_bigwig.py
normalize_bigwig.py
read_distribution.py
read_duplication.py
read_GC.py
read_NVC.py
read_quality.py
read_hexamer.py
RPKM_count.py
RPKM_saturation.py
spilt_bam.py
## help for a module is generally obtained by typing the module name followed by 'enter', eg: 'bam2wig.py'
• HTSeq (v0.6.1) [19] is a Python package that provides infrastructure to process data from high-throughput sequencing assays (used here to summarize read counts per gene)
## Quality Assessment with 'htseq-qa'
Given a FASTQ or SAM file, this script produces a PDF file with plots depicting the base calls and base-call qualities by position in the read. This is useful to assess the technical quality of a sequencing run.
Usage: htseq-qa [options] read_file
This script take a file with high-throughput sequencing reads (supported
formats: SAM, Solexa _export.txt, FASTQ, Solexa _sequence.txt) and performs a
simply quality assessment by producing plots showing the distribution of
called bases and base-call quality scores by position within the reads. The
plots are output as a PDF file.
Options:
-h, --help show this help message and exit
-t TYPE, --type=TYPE type of read_file (one of: sam [default], bam, solexa-
export, fastq, solexa-fastq)
-o OUTFILE, --outfile=OUTFILE
output filename (default is <read_file>.pdf)
-r READLEN, --readlength=READLEN
the maximum read length (when not specified, the
script guesses from the file
-g GAMMA, --gamma=GAMMA
the gamma factor for the contrast adjustment of the
quality score plot
-n, --nosplit do not split reads in unaligned and aligned ones
-m MAXQUAL, --maxqual=MAXQUAL
the maximum quality score that appears in the data
(default: 41)
Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology
Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General
Public License v3. Part of the 'HTSeq' framework, version 0.6.1p1.
## Counting reads in features with 'htseq-count' (++ USED IN THIS TRAINING)
Given a SAM file with alignments and a GFF file with genomic features, this script counts how many reads map to each feature.
Usage: htseq-count [options] alignment_file gff_file
This script takes an alignment file in SAM/BAM format and a feature file in
GFF format and calculates for each feature the number of reads mapping to it.
See http://www-huber.embl.de/users/anders/HTSeq/doc/count.html for details.
Options:
-h, --help show this help message and exit
-f SAMTYPE, --format=SAMTYPE
type of <alignment_file> data, either 'sam' or 'bam'
(default: sam)
-r ORDER, --order=ORDER
'pos' or 'name'. Sorting order of <alignment_file>
(default: name). Paired-end sequencing data must be
sorted either by position or by read name, and the
sorting order must be specified. Ignored for single-
end data.
-s STRANDED, --stranded=STRANDED
whether the data is from a strand-specific assay.
Specify 'yes', 'no', or 'reverse' (default: yes).
'reverse' means 'yes' with reversed strand
interpretation
-a MINAQUAL, --minaqual=MINAQUAL
skip all reads with alignment quality lower than the
given minimum value (default: 10)
-t FEATURETYPE, --type=FEATURETYPE
feature type (3rd column in GFF file) to be used, all
features of other type are ignored (default, suitable
for Ensembl GTF files: exon)
-i IDATTR, --idattr=IDATTR
GFF attribute to be used as feature ID (default,
suitable for Ensembl GTF files: gene_id)
-m MODE, --mode=MODE mode to handle reads overlapping more than one feature
(choices: union, intersection-strict, intersection-
nonempty; default: union)
-o SAMOUT, --samout=SAMOUT
write out all SAM alignment records into an output SAM
file called SAMOUT, annotating each line with its
feature assignment (as an optional field with tag
'XF')
-q, --quiet suppress progress report
Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology
Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General
Public License v3. Part of the 'HTSeq' framework, version 0.6.1p1.
• Qualimap (v.2.0.2)[20][21], a Java suite for QC of mapping data.
Built on 2015-02-10 11:39
usage: qualimap <tool> [options]
To launch GUI leave <tool> empty.
Available tools:
bamqc Evaluate NGS mapping to a reference genome
rnaseq Evaluate RNA-seq alignment data
counts Counts data analysis (further RNA-seq data evaluation)
multi-bamqc Compare QC reports from multiple NGS mappings
clustering Cluster epigenomic signals
comp-counts Compute feature counts
Special arguments:
--java-mem-size Use this argument to set Java memory heap size. Example:
qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
• Deeptools [22], a python toolbox able to manipulate and compare mapping data (also available as a live Galaxy server[23])
deepTools offers multiple methods for highly-customizable data visualization that immensely aid hypothesis generation and data interpretation. It also offers all the tools needed to create coverage files in standard bedGraph and bigWig file formats allowing various normalization procedures and comparisons between two files (for example, treatment and control).
Installation page: https://github.com/fidelram/deepTools/wiki/Installing-deepTools
• R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"[24] - RStudio (Version 0.98.1102)[25] - and Bioconductor (v3.0)[26] packages, THE statistical Biocomputing tool-suite
Statistic analysis requires knowledge and tools for which simple programs do exist (eg Prism provided by VIB). However, when it comes to high throughput data analysis, such simple programs are not anymore usable and advanced programming becomes necessary.
[R] is the language of choice to perform complex tasks on data and has long been adopted as the main language for BioStatistics, often under the name of BioConductor. [R] deserves several dedicated training courses for in-depth learning which is not the object of this session. You can obtain the [R] programming software and a lot of training material from the official web-site
## RStudio is our preferred R-GUI as it is in continuous development and allows documenting the R-code and produce beautiful reports
RStudio is available as client and server free versions
## A number of pre-installed R packages are not listed here.
## More specific packages used in tis training and dedicated to RNASeq analysis include:
* EdgeR <ref>http://www.bioconductor.org/packages/release/bioc/html/edgeR.html</ref>
* DESeq <ref>http://www.bioconductor.org/packages/release/bioc/html/DESeq.html</ref>
* DESeq2 <ref>http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html</ref>
* ...
Version number reported for each tool correspond to the lowest version used to prepare this training and may differ from the current version. Older versions may as well do!
Additional tools were installed from downloads or from source and are referenced to web links in the respective pages
Complementary Learning: Recommended and Related Readings
Get HELP from the community
There are naive questions, tedious questions, ill-phrased questions, questions put after inadequate self-criticism. But every question is a cry to understand the world. There is no such thing as a dumb question.
- SeqAnswers[27] for all what relates to NGS
- BioStar[28] for questions about biocomputing and scripting for biologists
- stackoverflow[29] for questions related to coding.
- OMICSTools is a great place to find software and tools dedicated to a given Omic field.
new NGS-CLI players in town
The Tophat team just published an improved RNASeq splice mapper called HISAT and its new version HISAT2 that should replace Tophat while keeping similar features. Although still in its Beta version, you may find useful to test this new mapper that requires much less RAM to hold the reference genome index (28GB for STAR and only 4.3GB for HISAT which will match many middle-end desktops and high-end laptops). The speed of mapping is reported equal or slightly higher than for STAR which is today the fastest mapper. The accuracy is claimed as good as the bests. For more info, please read their paper in Nature Methods of March (doi:10.1038/nmeth.3317[30]).
A quite interesting NextGenSeek page is dedicated to recent tools that will likely complement or replace the historical Tophat workflow ([31]).
Other tools are arising. Have a look fro instance to Kallisto which makes use of pseudoalignments to further speed up the mapping of reads to transcripts.
NGS-tools and GUIs
A non-exhaustive list is provided here to propose alternatives to the CLI protocol presented today. Feel free to try these at home and maybe prefer them to the power of command-line.
- The CLC Genomic workbench, as well as comparable commercial software, offers a reasonably complete, performant, and well documented suite of tools dedicated to NGS data analysis (DNA, RNA, ChIP-Seq, ...). The CLC-GWB is available within VIB with a set of dynamic licenses and additional licenses can be acquired by research groups with need for such platforms (please contact us). Other similar commercial solutions are most likely attractive but not supported by BITS-VIB and left to the evaluation by motivated users (one such GUI platform is Partek which we tried in the past with success but comes at a significant price).
- UGene offers a large number of tools integrated under a common GUI. It also allows building workflows by linking tools similarly to what Galaxy offers. A demo webcast is available on youtube.
- RobiNA has now been available for several years and is still efficient in analyzing data from reads or at a later point from the count files obtained from HTSeq. You can use it to analyze counts generated in this session and applying DESeq (v1) or EdgeR packages without having to code in R. (RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics [32])
- SeqMonk [33] has been around for quite some time and extends to other sequencing applications.
- PRADA: is a recently published python pipeline for RNA sequencing data analysis that offers many functionalities[34]
- RNASeqGUI: is an integrated GUI to several R-packages that can be used for RNA-Seq data analysis and reduces the amount of required R code[35]
Functional analysis of the DE gene lists
Obtaining gene counts is only the first step for Biologists who need biological interpretation of the transcriptomics data. Functional analysis is the most frequently used step to answer this need and can be done using a variety of software tools. A recent review paper: 'Ten years of pathway analysis: current approaches and outstanding challenges' [36] provides links to many such tools. Other solutions are listed below for user convenience.
Core applications
- IPA - Ingenuity Pathway Analysis is the best solution for VIB scientists (commercial).
- GSEA [37] offers advantages over other methods but the Broad GUI is not very easy to use.
Web-based servers
Some of the following tools offer many alternative analyses context and will prove very useful to get a first insight in your data.
- lrpath uses a new algoritm very recently published [38] that takes both logFC and pValues into account without fixing limits [39]
- Enrichr is very flexible and quite complete [40]
- WebGestalt WEB-based GEne SeT AnaLysis Toolkit is also quite attractive [41]
- GeneMania was recently published [42] and deserves attention.
- Biomart now includes a very easy to use enrichment tool to which ENSG (gene-IDs) can be fed to identify enriched classes (give it a try!).
- DAVID [43] remains the classical reference for enrichment analysis but shows a quite outdated interface
[R] packages for functional annotation and enrichment analysis
Again, [R] / Bioconductor offers many solutions. A few have been introduced in previous trainings including:
- piano [44]
- other packages are found by searching the Bioconductor site
Reviews about RNASeq analyses methods
We provide below a non-exhausting list of reviews and comparative papers. Some of these are already outdated due to the ongoing revolution in the field. Others present strengths and weaknesses of the different tools and approaches. You are welcome to indicate unlisted papers that you consider relevant to extend this list.
- A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis [45]. Also read from the rna-seqblog page [46] dedicated to this heavily discussed topic.
- Systematic evaluation of spliced alignment programs for RNA-seq data [47]
- Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data [48]
- The EBI online training pages propose training sessions with slides and exercises.
Conclusion
This hands-on training session was aimed at presenting a simplified yet complete workflow for NGS RNASeq DE analysis using REAL and FULL sized data (all reads from a human shorter chromosome) and to train users in taking advantage of the many command-line interface resources available to the research community.
The number of research groups generating NGS data is increasing upon time and the demand for data analysis is getting greater than what core facilities can provide and often requires biological background knowledge not available to the facility team. For these reason, more and more students and staff scientists feel the 'urgent need' to perform their own NGS data analysis.
Although a more decent level of computer infrastructure is preferred in order to perform this analysis in ideal conditions, today's main limiting factors remain the choice of adequate tools and be trained to use them properly. This session hopefully demonstrated that a minimum of training is sufficient to reproduce a standard NGS analysis workflow on a broadly available laptop computer. For higher throughput and more complex analysis, additional learning and more resources will be required but a lot can be initiated using the many available online resources and community support.
Your feedback to this two-days general command-line based NGS data analysis session is very important to us and will be used to improve this content for later sessions. If you need more training of the kind, please contact us and we will organize additional hands-on based sessions on your requests. More advance sessions will depend on the availability of expert users within VIB that will accept to prepare specified material.
PS: We tried to cite resources and people that made this training possible. If you feel that some citations are missing, please inform us and we will correct this omission
References:
- ↑ http://www.ncbi.nlm.nih.gov/sra?term=SRP012167
- ↑ http://wiki.bits.vib.be/index.php/Introduction_to_Linux_for_bioinformatics
- ↑ http://wiki.bits.vib.be/index.php/NGS_data_analysis
- ↑ https://github.com/BITS-VIB/RNAseq-training
- ↑ http://www.bioconductor.org/help/workflows/rnaseqGene/
- ↑ http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
- ↑ http://mapman.gabipd.org/web/guest/robin
- ↑ http://www.gnu.org/software/coreutils/manual/coreutils.pdf
- ↑ http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- ↑ https://code.google.com/p/cutadapt/wiki/documentation, http://journal.embnet.org/index.php/embnetjournal/article/view/200
- ↑
Robert Schmieder, Robert Edwards
Quality control and preprocessing of metagenomic datasets.
Bioinformatics: 2011, 27(6);863-4
[PubMed:21278185] ##WORLDCAT## [DOI] (I p) - ↑ http://prinseq.sourceforge.net
- ↑
Daniel Blankenberg, Assaf Gordon, Gregory Von Kuster, Nathan Coraor, James Taylor, Anton Nekrutenko, Galaxy Team
Manipulation of FASTQ data with Galaxy.
Bioinformatics: 2010, 26(14);1783-5
[PubMed:20562416] ##WORLDCAT## [DOI] (I p) - ↑ https://code.google.com/p/ea-utils/
- ↑ http://ccb.jhu.edu/software/tophat/index.shtml
- ↑ http://samtools.sourceforge.net
- ↑ http://picard.sourceforge.net
- ↑ http://rseqc.sourceforge.net
- ↑ http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
- ↑
Fernando García-Alcalde, Konstantin Okonechnikov, José Carbonell, Luis M Cruz, Stefan Götz, Sonia Tarazona, Joaquín Dopazo, Thomas F Meyer, Ana Conesa
Qualimap: evaluating next-generation sequencing alignment data.
Bioinformatics: 2012, 28(20);2678-9
[PubMed:22914218] ##WORLDCAT## [DOI] (I p) - ↑ http://qualimap.bioinfo.cipf.es
- ↑ https://deeptools.github.io/
- ↑ http://deeptools.ie-freiburg.mpg.de/
- ↑ http://cran.r-project.org
- ↑ http://www.rstudio.com
- ↑ http://www.bioconductor.org
- ↑ http://seqanswers.com SeqAnswers
- ↑ http://www.biostars.org BioStar
- ↑ http://stackoverflow.com stackoverflow
- ↑ http://www.nature.com/doifinder/10.1038/nmeth.3317
- ↑ http://nextgenseek.com/2015/03/three-papers-on-new-rna-seq-methods-offer-a-new-way-to-do-rna-seq-analysis/
- ↑
Marc Lohse, Anthony M Bolger, Axel Nagel, Alisdair R Fernie, John E Lunn, Mark Stitt, Björn Usadel
RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics.
Nucleic Acids Res: 2012, 40(Web Server issue);W622-7
[PubMed:22684630] ##WORLDCAT## [DOI] (I p) - ↑ http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/
- ↑
Wandaliz Torres-García, Siyuan Zheng, Andrey Sivachenko, Rahulsimham Vegesna, Qianghu Wang, Rong Yao, Michael F Berger, John N Weinstein, Gad Getz, Roel G W Verhaak
PRADA: pipeline for RNA sequencing data analysis.
Bioinformatics: 2014, 30(15);2224-6
[PubMed:24695405] ##WORLDCAT## [DOI] (I p) - ↑
Francesco Russo, Claudia Angelini
RNASeqGUI: a GUI for analysing RNA-Seq data.
Bioinformatics: 2014, 30(17);2514-6
[PubMed:24812338] ##WORLDCAT## [DOI] (I p) - ↑
Purvesh Khatri, Marina Sirota, Atul J Butte
Ten years of pathway analysis: current approaches and outstanding challenges.
PLoS Comput Biol: 2012, 8(2);e1002375
[PubMed:22383865] ##WORLDCAT## [DOI] (I p) - ↑ http://www.broadinstitute.org/gsea/
- ↑
Chee Lee, Snehal Patil, Maureen A Sartor
RNA-Enrich: a cut-off free functional enrichment testing method for RNA-seq with improved detection power.
Bioinformatics: 2016, 32(7);1100-2
[PubMed:26607492] ##WORLDCAT## [DOI] (I p) - ↑ http://lrpath.ncibi.org/
- ↑ http://amp.pharm.mssm.edu/Enrichr/
- ↑ http://bioinfo.vanderbilt.edu/webgestalt/
- ↑
Franck Rapaport, Raya Khanin, Yupu Liang, Mono Pirun, Azra Krek, Paul Zumbo, Christopher E Mason, Nicholas D Socci, Doron Betel
Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data.
Genome Biol: 2013, 14(9);R95
[PubMed:24020486] ##WORLDCAT## [DOI] (I p)Sara Mostafavi, Debajyoti Ray, David Warde-Farley, Chris Grouios, Quaid Morris
GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function.
Genome Biol: 2008, 9 Suppl 1(Suppl 1);S4
[PubMed:18613948] ##WORLDCAT## [DOI] (I p) - ↑ http://david.abcc.ncifcrf.gov
- ↑ http://www.bioconductor.org/packages/release/bioc/html/piano.html
- ↑
Marie-Agnès Dillies, Andrea Rau, Julie Aubert, Christelle Hennequet-Antier, Marine Jeanmougin, Nicolas Servant, Céline Keime, Guillemette Marot, David Castel, Jordi Estelle, Gregory Guernec, Bernd Jagla, Luc Jouneau, Denis Laloë, Caroline Le Gall, Brigitte Schaëffer, Stéphane Le Crom, Mickaël Guedj, Florence Jaffrézic, French StatOmique Consortium
A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis.
Brief Bioinform: 2013, 14(6);671-83
[PubMed:22988256] ##WORLDCAT## [DOI] (I p) - ↑ http://www.rna-seqblog.com/which-method-should-you-use-for-normalization-of-rna-seq-data/
- ↑
Pär G Engström, Tamara Steijger, Botond Sipos, Gregory R Grant, André Kahles, Gunnar Rätsch, Nick Goldman, Tim J Hubbard, Jennifer Harrow, Roderic Guigó, Paul Bertone, RGASP Consortium
Systematic evaluation of spliced alignment programs for RNA-seq data.
Nat Methods: 2013, 10(12);1185-91
[PubMed:24185836] ##WORLDCAT## [DOI] (I p) - ↑
Franck Rapaport, Raya Khanin, Yupu Liang, Mono Pirun, Azra Krek, Paul Zumbo, Christopher E Mason, Nicholas D Socci, Doron Betel
Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data.
Genome Biol: 2013, 14(9);R95
[PubMed:24020486] ##WORLDCAT## [DOI] (I p)
[ Main_Page | NGS_data_analysis ]