Hands-on introduction to NGS RNASeq DE analysis

From BITS wiki

# last edit, April 24 2015

NGS_RNASeqDE-training2015_cover.png

Hands-on introduction to NGS RNASeq DE analysis-laptop-file-list
PDF handout of these pages (might contain outdated information!)

Aims of the NGS RNASeq differential expression analysis 2-day session

Using publicly available Illumina RNASeq samples to:

  • Walk through a complete analysis workflow including read QC, read mapping to the human hg19 genome, and counting the reads mapping to each human transcript.
  • Perform differential expression analysis using several popular R/Bioconductor tools and obtain gene-level data ready for functional analysis and biological interpretation.


full_wf-2015.png

The skills acquired during this session should allow you to reproduce the workflow on a reasonably sized personal computer running Unix (including Mac OS X).

Technical.png All scripts, as well as the most important result files, have been put online so that trainees and web readers can practice offline.

Not covered during this training

This basic RNASeq training fits the needs of scientists who wish to compare experimental conditions pairwise in order to identify differentially transcribed genes and relate their results to a pre-defined phenotype. Other high-throughput sequencing applications, including the analysis of gene regulation at the chromatin level or through transcription modulation, and of transcript regulation by ncRNAs or miRNAs, are not covered here.

All CLI applications used during the session are free for research use and can be installed on a personal computer with a minimum of hassle; we will be happy to support you in installing the necessary tools upon request.

Technical.png This session does not use the popular 'Galaxy' system because of the absence of a dedicated server and the time overhead associated with data input and output under this system. We nevertheless recommend using Galaxy if your aim is to practice NGS tools at small scale on a desktop computer shared by you and a few colleagues.

  • This training is restricted to contrasts between pairwise experimental conditions. Multi-group ANOVA or other global statistical methods will not be used here.
  • This protocol is not valid for analyzing time-series experiments
  • This protocol is not valid for dose-response analysis
  • This protocol is not valid for analyzing survival effects associated with gene expression
  • This protocol does not assemble reads de novo but instead maps reads to a known transcriptome model (it is therefore not directly applicable to organisms for which no transcriptome is available)
  • Although this workflow uses transcript information during mapping, it does not consider splice variants and maps all results to gene models for further functional analysis. Splice-variant analysis is still in its infancy: current read lengths do not yet allow full transcripts to be sequenced in one piece, and very little is known about the function of specific transcript isoforms at the protein level.

For each of these additional applications, specific software and approaches are required that fall far beyond the scope of this introductory training session.

  • The training stops with the definition of differentially expressed genes and does not include downstream biological interpretation of the results at the functional level. For the latter, please refer to follow-up sessions covering the use of Ingenuity Pathway Analysis or related tools.

More BITS Training Info

Summary

This training gives an introduction to the use of popular NGS analysis software packages at the command line, aimed at more advanced users or users whose needs exceed the capacity of the defunct BITS Galaxy platform. All code fragments used in the training pages can easily be recycled and adapted, and will form the basis of your own workflows. This training does not cover all currently available methods. It does not aim at bringing users to the level of a professional NGS analyst, but provides enough information for motivated biologists to understand what DNA sequencing practically is, to read and reuse command-line code, and, when necessary, to communicate knowingly with NGS experts for more in-depth needs.

Technical.png The sequencing data used in this session was obtained from the SRA under the reference SRP012167 [1]. The dedicated page Download read information and FASTQ data from the SRA details how to download and reformat data from the SRA (ENA) and can be used as a guide to obtain other published datasets.
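
As an illustration only, here is a minimal sketch of how one run of such a study could be fetched with the NCBI sra-toolkit; the SRR accession below is a placeholder, real run accessions are listed under SRP012167 and on the dedicated download page:

## SRRxxxxxxx is a placeholder run accession
## download one paired-end run and split it into compressed _1/_2 FASTQ files
fastq-dump --split-files --gzip SRRxxxxxxx
## produces SRRxxxxxxx_1.fastq.gz and SRRxxxxxxx_2.fastq.gz in the current folder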

Prerequisites

Skills required to follow this training:

  • Linux command line basic skills are absolutely required to understand the code used throughout the CLI part of this session
  • Basic knowledge of human genome structure, transcriptome structure, and nomenclature is necessary to evaluate and understand the training steps
  • Basic knowledge of Illumina NGS read structure is also required for the same reason

People who have no experience with the Linux command line should first follow the 'Linux for bioinformatics' training ([2]); otherwise they will slow down the full class. All attendees who lack knowledge about Illumina reads should follow the preliminary one-day 'Introduction to the analysis of NGS data' training ([3]). Self-training on Galaxy is advised to take full advantage of this session (please refer to the BITS Galaxy pages).

Hands-On Training

Housekeeping rules

  • All Command Line Interface exercises were performed by the trainer on a similar laptop and the results moved to a new subfolder called Final_Results. Your own results will be stored in similarly named folders created directly under your working folder and should not overwrite the 'documented' version stored in Final_Results.
  • Additionally, key results were copied to the BITS server so that you (and our web visitors) can repeat the sessions at a later point in time on your own computer. These measures ensure that you can take up any exercise and will find the necessary input data in the corresponding folder, even if you did not perform the previous exercises.
  • Finally, key result files were copied to the BITS file server and can be downloaded from links present at the bottom of each exercise page.

 

The Exercises

Exercise pages below are split into two groups: i) exercises performed by the trainer to prepare today's data; these steps do not need to be repeated during our sessions and are quite lengthy, but they will be necessary for visitors who want to reproduce this training on their own machine; ii) exercises repeated during the session, which run faster and give you a feel for the real job. Note that we do not detail the installation of all tools, which in itself represents a small challenge for the apprentice but can be achieved with some effort after carefully reading the instructions associated with each package.

Code developed in these exercises

Most scripts included in the exercises below are reproduced and maintained on our dedicated GitHub page https://github.com/BITS-VIB/RNAseq-training [4] (please give us feedback about any issue with this page).

Required to prepare data for this session but not covered

In the exercises above, the original read data was downloaded, cleaned, and mapped to the reference genome. The mapping coordinates in the resulting BAM files were then used to extract all reads aligning to chr22. This small subset of the original experiment was prepared as input material for the remainder of the hands-on session detailed below.
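
As a minimal sketch with placeholder file names (the real commands are documented in the preparation exercise pages), such a chr22 subset can be extracted from a coordinate-sorted BAM file with samtools:

## full_mapping.bam is a placeholder file name
samtools index full_mapping.bam
samtools view -b full_mapping.bam chr22 > chr22_mapping.bam
samtools index chr22_mapping.bam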

Hands-On Session exercises

Differential expression analysis in R/Bioconductor and using GUI apps

 

Training Material

 

Handicon.png Most of the content produced during today's session has been added to the BITS platform (see links at the bottom of each exercise page)

Today's main tools

In addition to the many built-in Unix commands, a good part of which originate from the GNU project (please read the GNU coreutils manual for more [8]), we will cite or use several dedicated toolboxes, listed below in order of appearance. Version numbers are indicative and may have progressed since the last edit of this page.
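
As a reminder of the kind of one-liners built from these standard tools, the following hedged sketch (with a placeholder file name) counts the reads in a compressed FASTQ file:

## reads.fastq.gz is a placeholder file name
## a FASTQ record spans 4 lines, so the read count is the line count divided by 4
zcat reads.fastq.gz | wc -l | awk '{print $1/4}'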

• fastQC (v0.11.2) [9], 'the' tool for read QC

FastQC v0.11.2
splaisan@stelap:/work/TUTORIALS/NGS_RNASeqDE-training2015/final_test$ fastQC --help

            FastQC - A high throughput sequence QC analysis tool

SYNOPSIS

        fastqc seqfile1 seqfile2 .. seqfileN

    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
           [-c contaminant file] seqfile1 .. seqfileN

DESCRIPTION

    FastQC reads a set of sequence files and produces from each one a quality
    control report consisting of a number of different modules, each one of
    which will help to identify a different potential type of problem in your
    data.
   
    If no files to process are specified on the command line then the program
    will start as an interactive graphical application.  If files are provided
    on the command line then the program will run with no user interaction
    required.  In this mode it is suitable for inclusion into a standardised
    analysis pipeline.
   
    The options for the program as as follows:
   
    -h --help       Print this help file and exit
   
    -v --version    Print the version of the program and exit
   
    -o --outdir     Create all output files in the specified output directory.
                    Please note that this directory must exist as the program
                    will not create it.  If this option is not set then the
                    output file for each sequence file is created in the same
                    directory as the sequence file which was processed.
                   
    --casava        Files come from raw casava output. Files in the same sample
                    group (differing only by the group number) will be analysed
                    as a set rather than individually. Sequences with the filter
                    flag set in the header will be excluded from the analysis.
                    Files must have the same names given to them by casava
                    (including being gzipped and ending with .gz) otherwise they
                    won't be grouped together correctly.
                   
    --nofilter      If running with --casava then don't remove read flagged by
                    casava as poor quality when performing the QC analysis.
                   
    --extract       If set then the zipped output file will be uncompressed in
                    the same directory after it has been created.  By default
                    this option will be set if fastqc is run in non-interactive
                    mode.
                   
    -j --java       Provides the full path to the java binary you want to use to
                    launch fastqc. If not supplied then java is assumed to be in
                    your path.
                   
    --noextract     Do not uncompress the output file after creating it.  You
                    should set this option if you do not wish to uncompress
                    the output when running in non-interactive mode.
                   
    --nogroup       Disable grouping of bases for reads >50bp. All reports will
                    show data for every base in the read.  WARNING: Using this
                    option will cause fastqc to crash and burn if you use it on
                    really long reads, and your plots may end up a ridiculous size.
                    You have been warned!
                   
    -f --format     Bypasses the normal sequence file format detection and
                    forces the program to use the specified format.  Valid
                    formats are bam,sam,bam_mapped,sam_mapped and fastq
                   
    -t --threads    Specifies the number of files which can be processed
                    simultaneously.  Each thread will be allocated 250MB of
                    memory so you shouldn't run more threads than your
                    available memory will cope with, and not more than
                    6 threads on a 32 bit machine
                 
    -c              Specifies a non-default file which contains the list of
    --contaminants  contaminants to screen overrepresented sequences against.
                    The file must contain sets of named contaminants in the
                    form name[tab]sequence.  Lines prefixed with a hash will
                    be ignored.

    -a              Specifies a non-default file which contains the list of
    --adapters      adapter sequences which will be explicity searched against
                    the library. The file must contain sets of named adapters
                    in the form name[tab]sequence.  Lines prefixed with a hash
                    will be ignored.
                   
    -l              Specifies a non-default file which contains a set of criteria
    --limits        which will be used to determine the warn/error limits for the
                    various modules.  This file can also be used to selectively
                    remove some modules from the output all together.  The format
                    needs to mirror the default limits.txt file found in the
                    Configuration folder.
                   
   -k --kmers       Specifies the length of Kmer to look for in the Kmer content
                    module. Specified Kmer length must be between 2 and 10. Default
                    length is 7 if not specified.
                   
   -q --quiet       Supress all progress messages on stdout and only report errors.
   
   -d --dir         Selects a directory to be used for temporary files written when
                    generating report images. Defaults to system temp directory if
                    not specified.
                   
BUGS

    Any bugs in fastqc should be reported either to simon.andrews@babraham.ac.uk
    or in www.bioinformatics.babraham.ac.uk/bugzilla/
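
A typical invocation, shown here as a hedged sketch with placeholder file names, runs FastQC on two gzipped FASTQ files with two threads and writes the reports to an existing output folder (as noted above, -o does not create the directory):

## sample_1.fastq.gz and sample_2.fastq.gz are placeholder file names
mkdir -p fastqc_results
fastqc -o fastqc_results -t 2 sample_1.fastq.gz sample_2.fastq.gz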

• cutadapt (v1.7.1) [10], removes adapter sequences from DNA high-throughput sequencing data.

cutadapt version 1.7.1
Copyright (C) 2010-2014 Marcel Martin <marcel.martin@scilifelab.se>

cutadapt removes adapter sequences from high-throughput sequencing reads.

Usage:
    cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard
characters are supported. The reverse complement is *not* automatically
searched. All reads from input.fastq will be written to output.fastq with the
adapter sequence removed. Adapter matching is error-tolerant. Multiple adapter
sequences can be given (use further -a options), but only the best-matching
adapter will be removed.

Input may also be in FASTA format. Compressed input and output is supported and
auto-detected from the file name (.gz or .bz2). Use the file name '-' for
standard input/output. Without the -o option, output is sent to standard output.

Some other available features are:
  * Various other adapter types (5' adapters, "mixed" 5'/3' adapters etc.)
  * Trimming a fixed number of bases
  * Quality trimming
  * Trimming paired-end reads
  * Trimming colorspace reads
  * Filtering reads by various criteria

Use "cutadapt --help" to see all command-line options.
See http://cutadapt.readthedocs.org/ for the full documentation.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -f FORMAT, --format=FORMAT
                        Input file format; can be either 'fasta', 'fastq' or
                        'sra-fastq'. Ignored when reading csfasta/qual files
                        (default: auto-detect from file name extension).

  Options that influence how the adapters are found:
    Each of the following three parameters (-a, -b, -g) can be used
    multiple times and in any combination to search for an entire set of
    adapters of possibly different types. Only the best matching adapter
    is trimmed from each read (but see the --times option). Instead of
    giving an adapter directly, you can also write file:FILE and the
    adapter sequences will be read from the given FILE (which must be in
    FASTA format).

    -a ADAPTER, --adapter=ADAPTER
                        Sequence of an adapter that was ligated to the 3' end.
                        The adapter itself and anything that follows is
                        trimmed. If the adapter sequence ends with the '$'
                        character, the adapter is anchored to the end of the
                        read and only found if it is a suffix of the read.
    -g ADAPTER, --front=ADAPTER
                        Sequence of an adapter that was ligated to the 5' end.
                        If the adapter sequence starts with the character '^',
                        the adapter is 'anchored'. An anchored adapter must
                        appear in its entirety at the 5' end of the read (it
                        is a prefix of the read). A non-anchored adapter may
                        appear partially at the 5' end, or it may occur within
                        the read. If it is found within a read, the sequence
                        preceding the adapter is also trimmed. In all cases,
                        the adapter itself is trimmed.
    -b ADAPTER, --anywhere=ADAPTER
                        Sequence of an adapter that was ligated to the 5' or
                        3' end. If the adapter is found within the read or
                        overlapping the 3' end of the read, the behavior is
                        the same as for the -a option. If the adapter overlaps
                        the 5' end (beginning of the read), the initial
                        portion of the read matching the adapter is trimmed,
                        but anything that follows is kept.
    -e ERROR_RATE, --error-rate=ERROR_RATE
                        Maximum allowed error rate (no. of errors divided by
                        the length of the matching region) (default: 0.1)
    --no-indels         Do not allow indels in the alignments (allow only
                        mismatches). Currently only supported for anchored
                        adapters. (default: allow both mismatches and indels)
    -n COUNT, --times=COUNT
                        Try to remove adapters at most COUNT times. Useful
                        when an adapter gets appended multiple times (default:
                        1).
    -O LENGTH, --overlap=LENGTH
                        Minimum overlap length. If the overlap between the
                        read and the adapter is shorter than LENGTH, the read
                        is not modified. This reduces the no. of bases trimmed
                        purely due to short random adapter matches (default:
                        3).
    --match-read-wildcards
                        Allow IUPAC wildcards in reads (default: False).
    -N, --no-match-adapter-wildcards
                        Do not interpret IUPAC wildcards in adapters.

  Options for filtering of processed reads:
    --discard-trimmed, --discard
                        Discard reads that contain the adapter instead of
                        trimming them. Also use -O in order to avoid throwing
                        away too many randomly matching reads!
    --discard-untrimmed, --trimmed-only
                        Discard reads that do not contain the adapter.
    -m LENGTH, --minimum-length=LENGTH
                        Discard trimmed reads that are shorter than LENGTH.
                        Reads that are too short even before adapter removal
                        are also discarded. In colorspace, an initial primer
                        is not counted (default: 0).
    -M LENGTH, --maximum-length=LENGTH
                        Discard trimmed reads that are longer than LENGTH.
                        Reads that are too long even before adapter removal
                        are also discarded. In colorspace, an initial primer
                        is not counted (default: no limit).
    --no-trim           Match and redirect reads to output/untrimmed-output as
                        usual, but do not remove adapters.
    --mask-adapter      Mask adapters with 'N' characters instead of trimming
                        them.

  Options that influence what gets output to where:
    --quiet             Do not print a report at the end.
    -o FILE, --output=FILE
                        Write modified reads to FILE. FASTQ or FASTA format is
                        chosen depending on input. The summary report is sent
                        to standard output. Use '{name}' in FILE to
                        demultiplex reads into multiple files. (default:
                        trimmed reads are written to standard output)
    -p FILE, --paired-output=FILE
                        Write reads from the paired-end input to FILE.
    --info-file=FILE    Write information about each read and its adapter
                        matches into FILE. See the documentation for the file
                        format.
    -r FILE, --rest-file=FILE
                        When the adapter matches in the middle of a read,
                        write the rest (after the adapter) into FILE.
    --wildcard-file=FILE
                        When the adapter has wildcard bases ('N's), write
                        adapter bases matching wildcard positions to FILE.
                        When there are indels in the alignment, this will
                        often not be accurate.
    --too-short-output=FILE
                        Write reads that are too short (according to length
                        specified by -m) to FILE. (default: discard reads)
    --too-long-output=FILE
                        Write reads that are too long (according to length
                        specified by -M) to FILE. (default: discard reads)
    --untrimmed-output=FILE
                        Write reads that do not contain the adapter to FILE.
                        (default: output to same file as trimmed reads)
    --untrimmed-paired-output=FILE
                        Write the second read in a pair to this FILE when no
                        adapter was found in the first read. Use this option
                        together with --untrimmed-output when trimming paired-
                        end reads. (Default: output to same file as trimmed
                        reads.)

  Additional modifications to the reads:
    -u LENGTH, --cut=LENGTH
                        Remove bases from the beginning or end of each read.
                        If LENGTH is positive, the bases are removed from the
                        beginning of each read. If LENGTH is negative, the
                        bases are removed from the end of each read. This
                        option can be specified twice if the LENGTHs have
                        different signs.
    -q CUTOFF, --quality-cutoff=CUTOFF
                        Trim low-quality bases from 3' ends of reads before
                        adapter removal. The algorithm is the same as the one
                        used by BWA (Subtract CUTOFF from all qualities;
                        compute partial sums from all indices to the end of
                        the sequence; cut sequence at the index at which the
                        sum is minimal) (default: 0)
    --quality-base=QUALITY_BASE
                        Assume that quality values are encoded as
                        ascii(quality + QUALITY_BASE). The default (33) is
                        usually correct, except for reads produced by some
                        versions of the Illumina pipeline, where this should
                        be set to 64. (Default: 33)
    -x PREFIX, --prefix=PREFIX
                        Add this prefix to read names
    -y SUFFIX, --suffix=SUFFIX
                        Add this suffix to read names
    --strip-suffix=STRIP_SUFFIX
                        Remove this suffix from read names if present. Can be
                        given multiple times.
    -c, --colorspace    Colorspace mode: Also trim the color that is adjacent
                        to the found adapter.
    -d, --double-encode
                        When in colorspace, double-encode colors (map
                        0,1,2,3,4 to A,C,G,T,N).
    -t, --trim-primer   When in colorspace, trim primer base and the first
                        color (which is the transition to the first
                        nucleotide)
    --strip-f3          For colorspace: Strip the _F3 suffix of read names
    --maq, --bwa        MAQ- and BWA-compatible colorspace output. This
                        enables -c, -d, -t, --strip-f3 and -y '/1'.
    --length-tag=TAG    Search for TAG followed by a decimal number in the
                        name of the read (description/comment field of the
                        FASTA or FASTQ file). Replace the decimal number with
                        the correct length of the trimmed read. For example,
                        use --length-tag 'length=' to correct fields like
                        'length=123'.
    --no-zero-cap       Do not change negative quality values to zero.
                        Colorspace quality values of -1 would appear as spaces
                        in the output FASTQ file. Since many tools have
                        problems with that, negative qualities are converted
                        to zero when trimming colorspace data. Use this option
                        to keep negative qualities.
    -z, --zero-cap      Change negative quality values to zero. This is
                        enabled by default when -c/--colorspace is also
                        enabled. Use the above option to disable it.
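
For orientation, here is a hedged single-end sketch with placeholder file names; the adapter shown is only the common Illumina TruSeq adapter prefix and should be replaced by the adapter of your own library. It trims the 3' adapter, quality-trims the 3' ends at Q20, and discards reads shorter than 20 bases after trimming:

## input.fastq.gz and trimmed.fastq.gz are placeholder file names
cutadapt -a AGATCGGAAGAGC -q 20 -m 20 -o trimmed.fastq.gz input.fastq.gz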

• PRINSEQ (PRINSEQ-lite 0.20.4) [11][12], PReprocessing and INformation of SEQuence data (a great Perl tool replacing Galaxy's Groomer, FastQ quality trimmer, and more; many additional options are available)

Usage:
    perl prinseq-lite.pl [-h] [-help] [-version] [-man] [-verbose] [-fastq
    input_fastq_file] [-fasta input_fasta_file] [-fastq2 input_fastq_file_2]
    [-fasta2 input_fasta_file_2] [-qual input_quality_file] [-min_len
    int_value] [-max_len int_value] [-range_len ranges] [-min_gc int_value]
    [-max_gc int_value] [-range_gc ranges] [-min_qual_score int_value]
    [-max_qual_score int_value] [-min_qual_mean int_value] [-max_qual_mean
    int_value] [-ns_max_p int_value] [-ns_max_n int_value] [-noniupac]
    [-seq_num int_value] [-derep int_value] [-derep_min int_value]
    [-lc_method method_name] [-lc_threshold int_value] [-trim_to_len
    int_value] [-trim_left int_value] [-trim_right int_value] [-trim_left_p
    int_value] [-trim_right_p int_value] [-trim_ns_left int_value]
    [-trim_ns_right int_value] [-trim_tail_left int_value] [-trim_tail_right
    int_value] [-trim_qual_left int_value] [-trim_qual_right int_value]
    [-trim_qual_type type] [-trim_qual_rule rule] [-trim_qual_window
    int_value] [-trim_qual_step int_value] [-seq_case case] [-dna_rna type]
    [-line_width int_value] [-rm_header] [-seq_id id_string] [-out_format
    int_value] [-out_good filename_prefix] [-out_bad filename_prefix]
    [-phred64] [-stats_info] [-stats_len] [-stats_dinuc] [-stats_tag]
    [-stats_dupl] [-stats_ns] [-stats_assembly] [-stats_all] [-aa]
    [-graph_data file] [-graph_stats string] [-qual_noscale]
    [-no_qual_header] [-exact_only] [-log file] [-custom_params string]
    [-params file] [-seq_id_mappings file]

Options:
    -help | -h
            Print the help message; ignore other arguments.

    -man    Print the full documentation; ignore other arguments.

    -version
            Print program version; ignore other arguments.

    -verbose
            Prints status and info messages during processing.

    ***** INPUT OPTIONS *****
    -fastq <file>
            Input file in FASTQ format that contains the sequence and
            quality data. Use stdin instead of a file name to read from
            STDIN (-fasta stdin). This can be useful to process compressed
            files using Unix pipes.

    -fasta <file>
            Input file in FASTA format that contains the sequence data. Use
            stdin instead of a file name to read from STDIN (-fastq stdin).
            This can be useful to process compressed files using Unix pipes.

    -qual <file>
            Input file in QUAL format that contains the quality data.

    -fastq2 <file>
            For paired-end data only. Input file in FASTQ format that
            contains the sequence and quality data. The sequence identifiers
            for two matching paired-end sequences in separate files can be
            marked by /1 and /2, or _L and _R, or _left and _right, or must
            have the exact same identifier in both input files. The input
            sequences must be sorted by their sequence identifiers.
            Singletons are allowed in the input files.

    -fasta2 <file>
            For paired-end data only. Input file in FASTA format that
            contains the sequence data. The sequence identifiers for two
            matching paired-end sequences in separate files can be marked by
            /1 and /2, or _L and _R, or _left and _right, or must have the
            exact same identifier in both input files. The input sequences
            must be sorted by their sequence identifiers. Singletons are
            allowed in the input files.

    -params <file>
            Input file in text format that contains PRINSEQ parameters. Each
            parameter should be specified on a new line and arguments should
            be separated by spaces or tabs. Comments can be specified on
            lines starting with the # sign. Can be combined with command
            line parameters. Parameters specified on the command line will
            overwrite the arguments in the file (if any).

    -si13   This option was replaced by option -phred64.

    -phred64
            Quality data in FASTQ file is in Phred+64 format
            (http://en.wikipedia.org/wiki/FASTQ_format#Encoding). Not
            required for Illumina 1.8+, Sanger, Roche/454, Ion Torrent,
            PacBio data.

    -aa     Input is amino acid (protein) sequences instead of nucleic acid
            (DNA or RNA) sequences. Allowed amino acid characters:
            ABCDEFGHIKLMNOPQRSTUVWYZXabcdefghiklmmopqrstuvwyzx*- and allowed
            nucleic acid characters: ACGTURYKMSWBDHVNXacgturykmswbdhvnx-

            The following options are ignored for -aa:
            stats_dinuc,stats_tag,stats_ns,dna_rna

    ***** OUTPUT OPTIONS *****
    -out_format <integer>
            To change the output format, use one of the following options.
            If not defined, the output format will be the same as the input
            format.

            1 (FASTA only), 2 (FASTA and QUAL), 3 (FASTQ), 4 (FASTQ and
            FASTA), or 5 (FASTQ, FASTA and QUAL)

    -out_good <string>
            By default, the output files are created in the same directory
            as the input file containing the sequence data with an
            additional "_prinseq_good_XXXX" in their name (where XXXX is
            replaced by random characters to prevent overwriting previous
            files). To change the output filename and location, specify the
            filename using this option. The file extension will be added
            automatically (either .fasta, .qual, or .fastq). For paired-end
            data, filenames contain additionally "_1", "_1_singletons",
            "_2", and "_2_singletons" before the file extension. Use
            "-out_good null" to prevent the program from generating the
            output file(s) for data passing all filters. Use "-out_good
            stdout" to write data passing all filters to STDOUT (only for
            FASTA or FASTQ output files).

            Example: use "file_passed" to generate the output file
            file_passed.fasta in the current directory

    -out_bad <string>
            By default, the output files are created in the same directory
            as the input file containing the sequence data with an
            additional "_prinseq_bad_XXXX" in their name (where XXXX is
            replaced by random characters to prevent overwriting previous
            files). To change the output filename and location, specify the
            filename using this option. The file extension will be added
            automatically (either .fasta, .qual, or .fastq). For paired-end
            data, filenames contain additionally "_1" and "_2" before the
            file extension. Use "-out_bad null" to prevent the program from
            generating the output file(s) for data not passing any filter.
            Use "-out_bad stdout" to write data not passing any filter to
            STDOUT (only for FASTA or FASTQ output files).

            Example: use "file_filtered" to generate the output file
            file_filtered.fasta in the current directory

            Example: "-out_good stdout -out_bad null" will write data
            passing filters to STDOUT and data not passing any filter will
            be ignored

    -log <file>
            Log file to keep track of parameters, errors, etc. The log file
            name is optional. If no file name is given, the log file name
            will be "inputname.log". If the log file already exists, new
            content will be added to the file.

    -graph_data <file>
            File that contains the necessary information to generate the
            graphs similar to the ones in the web version. The file name is
            optional. If no file name is given, the file name will be
            "inputname.gd". If the file already exists, new content will
            overwrite the file. Use "-out_good null -out_bad null" to
            prevent generating any additional outputs. (See below for more
            options related to the graph data.)

            The graph data can be used as input for the prinseq-graphs.pl
            file to generate the PNG graph files or an HTML report file. If
            you have trouble installing the required prinseq-graphs.pl
            modules or want to see an output example report, upload the
            graph data file at: http://edwards.sdsu.edu/prinseq/ -> Choose
            "Get Report"

    -graph_stats <string>
            Use this option to select what statistics should be calculated
            and included in the graph_data file. This is useful if you e.g.
            do not need sequence complexity information, which requires a
            lot of computation. Requires to have graph_data specified.
            Default is all selected.

            Allowed option are (separate multiple by comma with no spaces):
            ld (Length distribution), gc (GC content distribution), qd (Base
            quality distribution), ns (Occurence of N), pt (Poly-A/T tails),
            ts (Tag sequence check), aq (Assembly quality measure), de
            (Sequence duplication - exact only), da (Sequence duplication -
            exact + 5'/3'), sc (Sequence complexity), dn (Dinucleotide odds
            ratios, includes the PCA plots)

            Example use: -graph_stats ld,gc,qd,de

    -qual_noscale
            Use this option if all your sequences are shorter than 100bp as
            they do not require to scale quality data to 100 data points in
            the graph. By default, quality scores of sequences shorter than
            100bp or longer than 100bp are fit to 100 data points. (To
            retrieve this information and calculate the graph data would
            otherwise require to parse the data two times or store all the
            quality data in memory.)

    -no_qual_header
            In order to reduce the file size, this option will generate an
            empty header line for the quality data in FASTQ files. Instead
            of +header, only the + sign will be output. The header of the
            sequence data will be left unchanged. This option applies to
            FASTQ output files only.

    -exact_only
            Use this option to check for exact (forward and reverse)
            duplicates only when generating the graph data. This allows to
            keep the memory requirements low for large input files and is
            faster. This option will automatically be applied when using
            -derep options 1 and/or 4 only. Specify option -derep 1 or
            -derep 4 if you do not want to apply both at the same time.

    -seq_id_mappings <file>
            Text file containing the old and new (specified with -seq_id)
            identifiers for later reference. This option is useful if e.g. a
            renamed sequence has to be identified based on the new sequence
            identifier. The file name is optional. If no file name is given,
            the file name will be "inputname_prinseq_good.ids" (only good
            sequences are renamed). If a file with the same name already
            exists, new content will overwrite the old file. The text file
            contains one sequence identifier pair per line, separated by
            tabs (old-tab-new). Requires option -seq_id.

    ***** FILTER OPTIONS *****
    -min_len <integer>
            Filter sequence shorter than min_len.

    -max_len <integer>
            Filter sequence longer than max_len.

    -range_len <string>
            Filter sequence by length range. Multiple range values should be
            separated by comma without spaces.

            Example: -range_len 50-100,250-300

    -min_gc <integer>
            Filter sequence with GC content below min_gc.

    -max_gc <integer>
            Filter sequence with GC content above max_gc.

    -range_gc <string>
            Filter sequence by GC content range. Multiple range values
            should be separated by comma without spaces.

            Example: -range_gc 50-60,75-90

    -min_qual_score <integer>
            Filter sequence with at least one quality score below
            min_qual_score.

    -max_qual_score <integer>
            Filter sequence with at least one quality score above
            max_qual_score.

    -min_qual_mean <integer>
            Filter sequence with quality score mean below min_qual_mean.

    -max_qual_mean <integer>
            Filter sequence with quality score mean above max_qual_mean.

    -ns_max_p <integer>
            Filter sequence with more than ns_max_p percentage of Ns.

    -ns_max_n <integer>
            Filter sequence with more than ns_max_n Ns.

    -noniupac
            Filter sequence with characters other than A, C, G, T or N.

    -seq_num <integer>
            Only keep the first seq_num number of sequences (that pass all
            other filters).

    -derep <integer>
            Type of duplicates to filter. Allowed values are 1, 2, 3, 4 and
            5. Use integers for multiple selections (e.g. 124 to use type 1,
            2 and 4). The order does not matter. Option 2 and 3 will set 1
            and option 5 will set 4 as these are subsets of the other
            option.

            1 (exact duplicate), 2 (5' duplicate), 3 (3' duplicate), 4
            (reverse complement exact duplicate), 5 (reverse complement
            5'/3' duplicate)

    -derep_min <integer>
            This option specifies the number of allowed duplicates. If you
            want to remove sequence duplicates that occur more than x times,
            then you would specify x+1 as the -derep_min values. For
            examples, to remove sequences that occur more than 5 times, you
            would specify -derep_min 6. This option can only be used in
            combination with -derep 1 and/or 4 (forward and/or reverse exact
            duplicates). [default : 2]

    -lc_method <string>
            Method to filter low complexity sequences. The current options
            are "dust" and "entropy". Use "-lc_method dust" to calculate the
            complexity using the dust method.

    -lc_threshold <integer>
            The threshold value (between 0 and 100) used to filter sequences
            by sequence complexity. The dust method uses this as maximum
            allowed score and the entropy method as minimum allowed value.

    -custom_params <string>
            Can be used to specify additional filters. The current set of
            possible rules is limited and has to follow the specifications
            below. The custom parameters have to be specified within quotes
            (either ' or ").

            Please separate parameter values with a space and separate new
            parameter sets with semicolon (;). Parameters are defined by two
            values: (1) the pattern (any combination of the letters
            "ACGTN"), (2) the number of repeats or percentage of occurence
            Percentage values are defined by a number followed by the %-sign
            (without space). If no %-sign is given, it is assumed that the
            given number specifies the number of repeats of the pattern.

            Examples: "AAT 10" (filters out sequences containing
            AATAATAATAATAATAATAATAATAATAAT anywhere in the sequence), "T
            70%" (filters out sequences with more than 70% Ts in the
            sequence), "A 15" (filters out sequences containing
            AAAAAAAAAAAAAAA anywhere in the sequence), "AAT 10;T 70%;A 15"
            (apply all three filters)

    ***** TRIM OPTIONS *****
    -trim_to_len <integer>
            Trim all sequence from the 3'-end to result in sequence with
            this length.

    -trim_left <integer>
            Trim sequence at the 5'-end by trim_left positions.

    -trim_right <integer>
            Trim sequence at the 3'-end by trim_right positions.

    -trim_left_p <integer>
            Trim sequence at the 5'-end by trim_left_p percentage of read
            length. The trim length is rounded towards the lower integer
            (e.g. 143.6 is rounded to 143 positions). Use an integer between
            1 and 100 for the percentage value.

    -trim_right_p <integer>
            Trim sequence at the 3'-end by trim_right_p percentage of read
            length. The trim length is rounded towards the lower integer
            (e.g. 143.6 is rounded to 143 positions). Use an integer between
            1 and 100 for the percentage value.

    -trim_tail_left <integer>
            Trim poly-A/T tail with a minimum length of trim_tail_left at
            the 5'-end.

    -trim_tail_right <integer>
            Trim poly-A/T tail with a minimum length of trim_tail_right at
            the 3'-end.

    -trim_ns_left <integer>
            Trim poly-N tail with a minimum length of trim_ns_left at the
            5'-end.

    -trim_ns_right <integer>
            Trim poly-N tail with a minimum length of trim_ns_right at the
            3'-end.

    -trim_qual_left <integer>
            Trim sequence by quality score from the 5'-end with this
            threshold score.

    -trim_qual_right <integer>
            Trim sequence by quality score from the 3'-end with this
            threshold score.

    -trim_qual_type <string>
            Type of quality score calculation to use. Allowed options are
            min, mean, max and sum. [default: min]

    -trim_qual_rule <string>
            Rule to use to compare quality score to calculated value.
            Allowed options are lt (less than), gt (greater than) and et
            (equal to). [default: lt]

    -trim_qual_window <integer>
            The sliding window size used to calculate quality score by type.
            To stop at the first base that fails the rule defined, use a
            window size of 1. [default: 1]

    -trim_qual_step <integer>
            Step size used to move the sliding window. To move the window
            over all quality scores without missing any, the step size
            should be less or equal to the window size. [default: 1]

    ***** REFORMAT OPTIONS *****
    -seq_case <string>
            Changes sequence character case to upper or lower case. Allowed
            options are "upper" and "lower". Use this option to remove
            soft-masking from your sequences.

    -dna_rna <string>
            Convert sequence between DNA and RNA. Allowed options are "dna"
            (convert from RNA to DNA) and "rna" (convert from DNA to RNA).

    -line_width <integer>
            Sequence characters per line. Use 0 if you want each sequence in
            a single line. Use 80 for line breaks every 80 characters. Note
            that this option only applies to FASTA output files, since FASTQ
            files store sequences without additional line breaks. [default:
            60]

    -rm_header
            Remove the sequence header. This includes everything after the
            sequence identifier (which is kept unchanged).

    -seq_id <string>
            Rename the sequence identifier. A counter is added to each
            identifier to assure its uniqueness. Use option -seq_id_mappings
            to generate a file containing the old and new identifiers for
            later reference.

            Example: "mySeq_10" will generate the IDs (in FASTA format)
            >mySeq_101, >mySeq_102, >mySeq_103, ...

    ***** SUMMARY STATISTIC OPTIONS *****
            The summary statistic values are written to STDOUT in the form:
            "parameter_name statistic_name value" (without the quotes). For
            example, "stats_info reads 10000" or "stats_len max 500". Only
            one statistic is written per line and values are separated by
            tabs.

            If you specify any statistic option, no other ouput will be
            generated. To preprocess data, do not specify a statistics
            option.

    -stats_info
            Outputs basic information such as number of reads (reads) and
            total bases (bases).

    -stats_len
            Outputs minimum (min), maximum (max), range (range), mean
            (mean), standard deviation (stddev), mode (mode) and mode value
            (modeval), and median (median) for read length.

    -stats_dinuc
            Outputs the dinucleotide odds ratio for AA/TT (aatt), AC/GT
            (acgt), AG/CT (agct), AT (at), CA/TG (catg), CC/GG (ccgg), CG
            (cg), GA/TC (gatc), GC (gc) and TA (ta).

    -stats_tag
            Outputs the probability of a tag sequence at the 5'-end (prob5)
            and 3'-end (prob3) in percentage (0..100). Provides the number
            of predefined MIDs (midnum) and the MID sequences (midseq,
            separated by comma, only provided if midnum > 0) that occur in
            more than 34/100 (approx. 3%) of the reads.

    -stats_dupl
            Outputs the number of exact duplicates (exact), 5' duplicates
            (5), 3' duplicates (3), exact duplicates with reverse
            complements (exactrevcom) and 5'/3' duplicates with reverse
            complements (revcomp), and total number of duplicates (total).
            The maximum number of duplicates is given under the value name
            with an additional "maxd" (e.g. exactmaxd or 5maxd).

    -stats_ns
            Outputs the number of reads with ambiguous base N (seqswithn),
            the maximum number of Ns per read (maxn) and the maximum
            percentage of Ns per read (maxp). The maxn and maxp value are
            not necessary from the same sequence.

    -stats_assembly
            Outputs the N50, N90, etc contig sizes. The Nxx contig size is a
            weighted median that is defined as the length of the smallest
            contig C in the sorted list of all contigs where the cumulative
            length from the largest contig to contig C is at least xx% of
            the total length (sum of contig lengths).

    -stats_all
            Outputs all available summary statistics.

    ***** ORDER OF PROCESSING *****
            The available options are processed in the following order:

            seq_num, trim_left, trim_right, trim_left_p, trim_right_p,
            trim_qual_left, trim_qual_right, trim_tail_left,
            trim_tail_right, trim_ns_left, trim_ns_right, trim_to_len,
            min_len, max_len, range_len, min_qual_score, max_qual_score,
            min_qual_mean, max_qual_mean, min_gc, max_gc, range_gc,
            ns_max_p, ns_max_n, noniupac, lc_method, derep, seq_id,
            seq_case, dna_rna, out_format
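
A hedged example with placeholder file names, quality-trimming both ends of paired-end reads and dropping short reads (all options as listed above):

## sample_1.fastq and sample_2.fastq are placeholder file names
## trim both ends at Q20, keep reads of at least 30 bases,
## write passing reads to sample_filtered_* and discard the rest
perl prinseq-lite.pl -fastq sample_1.fastq -fastq2 sample_2.fastq \
    -trim_qual_left 20 -trim_qual_right 20 -min_len 30 \
    -out_good sample_filtered -out_bad null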

• Groomer [13], a Python script developed for and distributed with Galaxy, used to convert between FASTQ formats

## If you really insist on using Groomer, the code was copied from the Galaxy environment so it can be used standalone at the CLI.
## Otherwise, please consider 'prinseq-lite.pl', which allows many more filtering steps from a single command.
# fastq_groomer.py
# expected parameters from the code header. The 'advanced' galaxy defaults are shown in every second line
$1  input_filename = sys.argv[1] (fastq file)
                File to groom: <from your history>
$2  input_type = sys.argv[2] (phred solexa, sanger, illumina, cssanger ...)
                Input FASTQ quality scores type: 'Sanger'
$3  output_filename = sys.argv[3] (fastq file)
                <not shown => new history object>
$4  output_type = sys.argv[4] (phred solexa, sanger, illumina, cssanger ...)
                Output FASTQ quality scores type: 'Sanger (recommended)'
$5  force_quality_encoding = sys.argv[5] | (None, 'ascii', 'decimal')
                Force Quality Score encoding: 'ASCII'
$6  summarize_input = sys.argv[6] == 'summarize_input' (summarize_input, 'None' or any other text for none)
                Summarize input data: 'Summarized Input'

## example command
### => finds 'fastq_groomer.py' in the PATH to avoid having to give its full location
python2.7 $(which fastq_groomer.py) \
    some-illumina-data.fastq \
    illumina \
    groomed.fastq \
    sanger \
    None \
    summarize_input

• fastq-mcf (v1.04.676), a FASTQ filtering toolbox, part of the Google Code package ea-utils [14]

This program is part of the Google Code package ea-utils. You can download the package and build it on Unix systems with g++ and 'libgsl0-dev|gsl-devel' (the GNU Scientific Library on Ubuntu|CentOS) installed.

The **fastq-mcf** command expects the following input and output parameters

Usage: fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...]
Version: 1.04.676

Detects levels of adapter presence, computes likelihoods and
locations (start, end) of the adapters.   Removes the adapter
sequences from the fastq file(s).

Stats go to stderr, unless -o is specified.

Specify -0 to turn off all default settings

If you specify multiple 'paired-end' inputs, then a -o option is
required for each.  IE: -o read1.clip.q -o read2.clip.fq

Options:
    -h       This help
    -o FIL   Output file (stats to stdout)
    -s N.N   Log scale for adapter minimum-length-match (2.2)
    -t N     % occurance threshold before adapter clipping (0.25)
    -m N     Minimum clip length, overrides scaled auto (1)
    -p N     Maximum adapter difference percentage (10)
    -l N     Minimum remaining sequence length (19)
    -L N     Maximum remaining sequence length (none)
    -D N     Remove duplicate reads : Read_1 has an identical N bases (0)
    -k N     sKew percentage-less-than causing cycle removal (2)
    -x N     'N' (Bad read) percentage causing cycle removal (20)
    -q N     quality threshold causing base removal (10)
    -w N     window-size for quality trimming (1)
    -H       remove >95% homopolymer reads (no)
    -X       remove low complexity reads (no)
    -0       Set all default parameters to zero/do nothing
    -U|u     Force disable/enable Illumina PF filtering (auto)
    -P N     Phred-scale (auto)
    -R       Don't remove N's from the fronts/ends of reads
    -n       Don't clip, just output what would be done
    -C N     Number of reads to use for subsampling (300k)
    -S       Save all discarded reads to '.skip' files
    -d       Output lots of random debugging stuff

Quality adjustment options:
    --cycle-adjust      CYC,AMT   Adjust cycle CYC (negative = offset from end) by amount AMT
    --phred-adjust      SCORE,AMT Adjust score SCORE by amount AMT
    --phred-adjust-max  SCORE     Adjust scores > SCORE to SCOTE

Filtering options*:
    --[mate-]qual-mean  NUM       Minimum mean quality score
    --[mate-]qual-gt    NUM,THR   At least NUM quals > THR
    --[mate-]max-ns     NUM       Maxmium N-calls in a read (can be a %)
    --[mate-]min-len    NUM       Minimum remaining length (same as -l)
    --homopolymer-pct   PCT       Homopolymer filter percent (95)
    --lowcomplex-pct    PCT       Complexity filter percent (95)

If mate- prefix is used, then applies to second non-barcode read only

Adapter files are 'fasta' formatted:

Specify n/a to turn off adapter clipping, and just use filters

Increasing the scale makes recognition-lengths longer, a scale
of 100 will force full-length recognition of adapters.

Adapter sequences with _5p in their label will match 'end's,
and sequences with _3p in their label will match 'start's,
otherwise the 'end' is auto-determined.

Skew is when one cycle is poor, 'skewed' toward a particular base.
If any nucleotide is less than the skew percentage, then the
whole cycle is removed.  Disable for methyl-seq, etc.

Set the skew (-k) or N-pct (-x) to 0 to turn it off (should be done
for miRNA, amplicon and other low-complexity situations!)

Duplicate read filtering is appropriate for assembly tasks, and
never when read length < expected coverage.  -D 50 will use
4.5GB RAM on 100m DNA reads - be careful. Great for RNA assembly.

*Quality filters are evaluated after clipping/trimming

Homopolymer filtering is a subset of low-complexity, but will not
be separately tracked unless both are turned on.
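
As a hedged illustration with placeholder file names: clip the adapters listed in adapters.fa from a pair of FASTQ files, quality-trim at Q20, and keep only reads with at least 25 remaining bases:

## adapters.fa and sample_*.fastq are placeholder file names
## one -o per paired-end input; stats then go to stdout
fastq-mcf -q 20 -l 25 \
    -o sample_1.clip.fastq -o sample_2.clip.fastq \
    adapters.fa sample_1.fastq sample_2.fastq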

• tophat (TopHat 2.0.13, released 10/2/2014) [15], a tool for mapping split (spliced) reads to a reference genome

TopHat maps short sequences from spliced transcripts to whole genomes.

Usage:
    tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]] \
                                    [quals1,[quals2,...]] [quals1[,quals2,...]]

Options:
    -v/--version
    -o/--output-dir                <string>    [ default: ./tophat_out         ]
    --bowtie1                                  [ default: bowtie2              ]
    -N/--read-mismatches           <int>       [ default: 2                    ]
    --read-gap-length              <int>       [ default: 2                    ]
    --read-edit-dist               <int>       [ default: 2                    ]
    --read-realign-edit-dist       <int>       [ default: "read-edit-dist" + 1 ]
    -a/--min-anchor                <int>       [ default: 8                    ]
    -m/--splice-mismatches         <0-2>       [ default: 0                    ]
    -i/--min-intron-length         <int>       [ default: 50                   ]
    -I/--max-intron-length         <int>       [ default: 500000               ]
    -g/--max-multihits             <int>       [ default: 20                   ]
    --suppress-hits
    -x/--transcriptome-max-hits    <int>       [ default: 60                   ]
    -M/--prefilter-multihits                   ( for -G/--GTF option, enable
                                                 an initial bowtie search
                                                 against the genome )
    --max-insertion-length         <int>       [ default: 3                    ]
    --max-deletion-length          <int>       [ default: 3                    ]
    --solexa-quals
    --solexa1.3-quals                          (same as phred64-quals)
    --phred64-quals                            (same as solexa1.3-quals)
    -Q/--quals
    --integer-quals
    -C/--color                                 (Solid - color space)
    --color-out
    --library-type                 <string>    (fr-unstranded, fr-firststrand,
                                                fr-secondstrand)
    -p/--num-threads               <int>       [ default: 1                   ]
    -R/--resume                    <out_dir>   ( try to resume execution )
    -G/--GTF                       <filename>  (GTF/GFF with known transcripts)
    --transcriptome-index          <bwtidx>    (transcriptome bowtie index)
    -T/--transcriptome-only                    (map only to the transcriptome)
    -j/--raw-juncs                 <filename>
    --insertions                   <filename>
    --deletions                    <filename>
    -r/--mate-inner-dist           <int>       [ default: 50                  ]
    --mate-std-dev                 <int>       [ default: 20                  ]
    --no-novel-juncs
    --no-novel-indels
    --no-gtf-juncs
    --no-coverage-search
    --coverage-search
    --microexon-search
    --keep-tmp
    --tmp-dir                      <dirname>   [ default: <output_dir>/tmp ]
    -z/--zpacker                   <program>   [ default: gzip             ]
    -X/--unmapped-fifo                         [use mkfifo to compress more temporary
                                                 files for color space reads]

Advanced Options:
    --report-secondary-alignments
    --no-discordant
    --no-mixed

    --segment-mismatches           <int>       [ default: 2                ]
    --segment-length               <int>       [ default: 25               ]

    --bowtie-n                                 [ default: bowtie -v        ]
    --min-coverage-intron          <int>       [ default: 50               ]
    --max-coverage-intron          <int>       [ default: 20000            ]
    --min-segment-intron           <int>       [ default: 50               ]
    --max-segment-intron           <int>       [ default: 500000           ]
    --no-sort-bam                              (Output BAM is not coordinate-sorted)
    --no-convert-bam                           (Do not output bam format.
                                                Output is <output_dir>/accepted_hit.sam)
    --keep-fasta-order
    --allow-partial-mapping

Bowtie2 related options:
  Preset options in --end-to-end mode (local alignment is not used in TopHat2)
    --b2-very-fast
    --b2-fast
    --b2-sensitive
    --b2-very-sensitive

  Alignment options
    --b2-N                         <int>       [ default: 0                ]
    --b2-L                         <int>       [ default: 20               ]
    --b2-i                         <func>      [ default: S,1,1.25         ]
    --b2-n-ceil                    <func>      [ default: L,0,0.15         ]
    --b2-gbar                      <int>       [ default: 4                ]

  Scoring options
    --b2-mp                        <int>,<int> [ default: 6,2              ]
    --b2-np                        <int>       [ default: 1                ]
    --b2-rdg                       <int>,<int> [ default: 5,3              ]
    --b2-rfg                       <int>,<int> [ default: 5,3              ]
    --b2-score-min                 <func>      [ default: L,-0.6,-0.6      ]

  Effort options
    --b2-D                         <int>       [ default: 15               ]
    --b2-R                         <int>       [ default: 2                ]

Fusion related options:
    --fusion-search
    --fusion-anchor-length         <int>       [ default: 20               ]
    --fusion-min-dist              <int>       [ default: 10000000         ]
    --fusion-read-mismatches       <int>       [ default: 2                ]
    --fusion-multireads            <int>       [ default: 2                ]
    --fusion-multipairs            <int>       [ default: 2                ]
    --fusion-ignore-chromosomes    <list>      [ e.g, <chrM,chrX>          ]

    --fusion-do-not-resolve-conflicts          [this is for test purposes  ]

SAM Header Options (for embedding sequencing run metadata in output):
    --rg-id                        <string>    (read group ID)
    --rg-sample                    <string>    (sample ID)
    --rg-library                   <string>    (library ID)
    --rg-description               <string>    (descriptive string, no tabs allowed)
    --rg-platform-unit             <string>    (e.g Illumina lane ID)
    --rg-center                    <string>    (sequencing center name)
    --rg-date                      <string>    (ISO 8601 date of the sequencing run)
    --rg-platform                  <string>    (Sequencing platform descriptor)

    for detailed help see http://tophat.cbcb.umd.edu/manual.html
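
For reference only (the index, GTF, and read file names are placeholders), a typical paired-end TopHat2 run using a subset of the options above could be:

## splice-aware mapping of a paired-end sample against a bowtie2 index, guided by known transcripts
tophat -p 4 \
       -G Homo_sapiens.GRCh37.gtf \
       --library-type fr-unstranded \
       -o tophat_sample1 \
       hg19_bowtie2_index sample_1.clipped.fastq.gz sample_2.clipped.fastq.gz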

• samtools (v1.2, using htslib 1.2.1) [16], THE canonical SAM/BAM manipulation suite

Program: samtools (Tools for alignments in the SAM format)
Version: 1.2 (using htslib 1.2.1)

Usage:   samtools <command> [options]

Commands:
  -- indexing
         faidx       index/extract FASTA
         index       index alignment
  -- editing
         calmd       recalculate MD/NM tags and '=' bases
         fixmate     fix mate information
         reheader    replace BAM header
         rmdup       remove PCR duplicates
         targetcut   cut fosmid regions (for fosmid pool only)
  -- file operations
         bamshuf     shuffle and group alignments by name
         cat         concatenate BAMs
         merge       merge sorted alignments
         mpileup     multi-way pileup
         sort        sort alignment file
         split       splits a file by read group
         bam2fq      converts a BAM to a FASTQ
  -- stats
         bedcov      read depth per BED region
         depth       compute the depth
         flagstat    simple stats
         idxstats    BAM index stats
         phase       phase heterozygotes
         stats       generate stats (former bamcheck)
  -- viewing
         flags       explain BAM flags
         tview       text alignment viewer
         view        SAM<->BAM<->CRAM conversion
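
A few samtools commands typically chained after mapping (BAM file names are placeholders):

## coordinate-sort and index a TopHat result, then print simple mapping statistics
samtools sort -@ 4 -T tmp_sample1 -o accepted_hits.sorted.bam accepted_hits.bam
samtools index accepted_hits.sorted.bam
samtools flagstat accepted_hits.sorted.bam

## peek at the first alignments in SAM format
samtools view accepted_hits.sorted.bam | head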

• Picard (1.119) [17], the 'broadly' used Java toolkit that complements samtools with added functionalities

## The current list of Picard commands includes:
AddCommentsToBam.jar, AddOrReplaceReadGroups.jar, BamIndexStats.jar, BamToBfq.jar, BuildBamIndex.jar,
CalculateHsMetrics.jar, CheckIlluminaDirectory.jar, CleanSam.jar, CollectAlignmentSummaryMetrics.jar,
CollectGcBiasMetrics.jar, CollectInsertSizeMetrics.jar, CollectMultipleMetrics.jar, CollectRnaSeqMetrics.jar,
CollectTargetedPcrMetrics.jar, CollectWgsMetrics.jar, CompareSAMs.jar, CreateSequenceDictionary.jar,
DownsampleSam.jar, EstimateLibraryComplexity.jar, ExtractIlluminaBarcodes.jar, ExtractSequences.jar,
FastqToSam.jar, FilterSamReads.jar, FixMateInformation.jar, GatherBamFiles.jar, IlluminaBasecallsToFastq.jar,
IlluminaBasecallsToSam.jar, IntervalListTools.jar, MakeSitesOnlyVcf.jar, MarkDuplicates.jar, MarkIlluminaAdapters.jar,
MeanQualityByCycle.jar, MergeBamAlignment.jar, MergeSamFiles.jar, MergeVcfs.jar, NormalizeFasta.jar,
QualityScoreDistribution.jar, ReorderSam.jar, ReplaceSamHeader.jar, RevertOriginalBaseQualitiesAndAddMateCigar.jar,
RevertSam.jar, SamFormatConverter.jar, SamToFastq.jar, SortSam.jar, SplitVcfs.jar, ValidateSamFile.jar,
VcfFormatConverter.jar, ViewSam.jar

## get help in Picard by typing (assuming $PICARD points to the folder holding the jar files)
java -jar $PICARD/<command.jar> -h
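
As an example (still assuming $PICARD points to the folder holding the 1.119 jar files; file names are placeholders), duplicate marking could be run as:

## mark (without removing) duplicate read pairs and write a metrics file for QC
java -jar $PICARD/MarkDuplicates.jar \
     INPUT=accepted_hits.sorted.bam \
     OUTPUT=accepted_hits.dupmarked.bam \
     METRICS_FILE=sample1_dup_metrics.txt \
     REMOVE_DUPLICATES=false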

• The RSeQC (v2.6.1) [18] package provides a number of useful modules that can comprehensively evaluate high-throughput sequence data, especially RNA-seq data.

## the current list of RSeQC python tools includes
bam2wig.py
bam2fq.py
bam_stat.py
clipping_profile.py
geneBody_coverage.py
geneBody_coverage2.py
infer_experiment.py
inner_distance.py
junction_annotation.py
junction_saturation.py
overlay_bigwig.py
normalize_bigwig.py
read_distribution.py
read_duplication.py
read_GC.py
read_NVC.py
read_quality.py
read_hexamer.py
RPKM_count.py
RPKM_saturation.py
split_bam.py

## help for a module is generally obtained by typing its name followed by 'enter', e.g. 'bam2wig.py'
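
Two RSeQC calls often applied to RNASeq BAM files are sketched below; the BED gene model is a placeholder (pre-made BED files for hg19 can be downloaded from the RSeQC website):

## guess the strandedness of the library from a sample of the alignments
infer_experiment.py -i accepted_hits.sorted.bam -r hg19_RefSeq.bed

## basic mapping statistics from the BAM file
bam_stat.py -i accepted_hits.sorted.bam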

• HTSeq (v0.6.1) [19] is a Python package that provides infrastructure to process data from high-throughput sequencing assays (used here to summarize read counts per gene)

## Besides the python library that can be used by advanced users, HTSeq also provides two integrated commands.

## Quality Assessment with 'htseq-qa'
Given a FASTQ or SAM file, this script produces a PDF file with plots depicting the base calls and base-call qualities by position in the read. This is useful to assess the technical quality of a sequencing run.

Usage: htseq-qa [options] read_file

This script take a file with high-throughput sequencing reads (supported
formats: SAM, Solexa _export.txt, FASTQ, Solexa _sequence.txt) and performs a
simply quality assessment by producing plots showing the distribution of
called bases and base-call quality scores by position within the reads. The
plots are output as a PDF file.

Options:
  -h, --help            show this help message and exit
  -t TYPE, --type=TYPE  type of read_file (one of: sam [default], bam, solexa-
                        export, fastq, solexa-fastq)
  -o OUTFILE, --outfile=OUTFILE
                        output filename (default is <read_file>.pdf)
  -r READLEN, --readlength=READLEN
                        the maximum read length (when not specified, the
                        script guesses from the file
  -g GAMMA, --gamma=GAMMA
                        the gamma factor for the contrast adjustment of the
                        quality score plot
  -n, --nosplit         do not split reads in unaligned and aligned ones
  -m MAXQUAL, --maxqual=MAXQUAL
                        the maximum quality score that appears in the data
                        (default: 41)

Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology
Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General
Public License v3. Part of the 'HTSeq' framework, version 0.6.1p1.
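
A minimal htseq-qa call on a mapped BAM file, writing the plots to a custom PDF (both names are placeholders), would be:

## per-cycle base composition and quality plots for a mapped sample
htseq-qa -t bam -o sample1_qa.pdf accepted_hits.bam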

## Counting reads in features with 'htseq-count' (++ USED IN THIS TRAINING)
Given a SAM file with alignments and a GFF file with genomic features, this script counts how many reads map to each feature.

Usage: htseq-count [options] alignment_file gff_file

This script takes an alignment file in SAM/BAM format and a feature file in
GFF format and calculates for each feature the number of reads mapping to it.
See http://www-huber.embl.de/users/anders/HTSeq/doc/count.html for details.

Options:
  -h, --help            show this help message and exit
  -f SAMTYPE, --format=SAMTYPE
                        type of <alignment_file> data, either 'sam' or 'bam'
                        (default: sam)
  -r ORDER, --order=ORDER
                        'pos' or 'name'. Sorting order of <alignment_file>
                        (default: name). Paired-end sequencing data must be
                        sorted either by position or by read name, and the
                        sorting order must be specified. Ignored for single-
                        end data.
  -s STRANDED, --stranded=STRANDED
                        whether the data is from a strand-specific assay.
                        Specify 'yes', 'no', or 'reverse' (default: yes).
                        'reverse' means 'yes' with reversed strand
                        interpretation
  -a MINAQUAL, --minaqual=MINAQUAL
                        skip all reads with alignment quality lower than the
                        given minimum value (default: 10)
  -t FEATURETYPE, --type=FEATURETYPE
                        feature type (3rd column in GFF file) to be used, all
                        features of other type are ignored (default, suitable
                        for Ensembl GTF files: exon)
  -i IDATTR, --idattr=IDATTR
                        GFF attribute to be used as feature ID (default,
                        suitable for Ensembl GTF files: gene_id)
  -m MODE, --mode=MODE  mode to handle reads overlapping more than one feature
                        (choices: union, intersection-strict, intersection-
                        nonempty; default: union)
  -o SAMOUT, --samout=SAMOUT
                        write out all SAM alignment records into an output SAM
                        file called SAMOUT, annotating each line with its
                        feature assignment (as an optional field with tag
                        'XF')
  -q, --quiet           suppress progress report

Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology
Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General
Public License v3. Part of the 'HTSeq' framework, version 0.6.1p1.
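
As a sketch (file names are placeholders and the -s setting must match the strandedness of your library), counting reads per Ensembl gene on a name-sorted BAM could look like:

## count reads per gene_id from a name-sorted BAM against an Ensembl GTF ('-s no' assumes an unstranded library)
## (a name-sorted BAM can be produced with 'samtools sort -n')
htseq-count -f bam -r name -s no -t exon -i gene_id \
    accepted_hits.namesorted.bam Homo_sapiens.GRCh37.gtf \
    > sample1_gene_counts.txt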

• Qualimap (v.2.0.2)[20][21], a Java suite for QC of mapping data.

QualiMap v.2.0.2
Built on 2015-02-10 11:39

usage: qualimap <tool> [options]

To launch GUI leave <tool> empty.

Available tools:

    bamqc            Evaluate NGS mapping to a reference genome
    rnaseq           Evaluate RNA-seq alignment data
    counts           Counts data analysis (further RNA-seq data evaluation)
    multi-bamqc      Compare QC reports from multiple NGS mappings
    clustering       Cluster epigenomic signals
    comp-counts      Compute feature counts

Special arguments:

    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
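
For instance, the RNA-seq specific QC module could be run on a mapped sample as follows (BAM and GTF names are placeholders):

## RNA-seq oriented QC report; raise the Java heap for large BAM files
qualimap rnaseq -bam accepted_hits.sorted.bam \
                -gtf Homo_sapiens.GRCh37.gtf \
                -outdir qualimap_sample1 \
                --java-mem-size=4G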

• deepTools [22], a Python toolbox to manipulate and compare mapping data (also available as a live Galaxy server [23])

deepTools is a suite of user-friendly tools for the visualization, quality control and normalization of data from deep-sequencing experiments.

deepTools offers multiple methods for highly-customizable data visualization that immensely aid hypothesis generation and data interpretation. It also offers all the tools needed to create coverage files in standard bedGraph and bigWig file formats allowing various normalization procedures and comparisons between two files (for example, treatment and control).
Installation page: https://github.com/fidelram/deepTools/wiki/Installing-deepTools
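
As a hedged example (exact flag names vary slightly between deepTools releases, check 'bamCoverage --help' on your installation; file names are placeholders), a coverage track could be produced with:

## convert a sorted, indexed BAM into a bigWig coverage track for genome browsers
bamCoverage --bam accepted_hits.sorted.bam \
            --outFileName sample1_coverage.bw \
            --outFileFormat bigwig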

• R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"[24] - RStudio (Version 0.98.1102)[25] - and Bioconductor (v3.0)[26] packages, THE statistical Biocomputing tool-suite

## The R programming language
Statistical analysis requires knowledge and tools for which simple programs exist (e.g. Prism, provided by VIB). However, when it comes to high-throughput data analysis, such simple programs are no longer sufficient and more advanced programming becomes necessary.

[R] is the language of choice for performing complex tasks on data and has long been adopted as the main language for biostatistics, notably through the Bioconductor project. [R] deserves several dedicated training courses for in-depth learning, which is not the object of this session. You can obtain the [R] software and a lot of training material from the official web-site.

## RStudio is our preferred R GUI as it is under continuous development and allows documenting R code and producing well-formatted reports
RStudio is available in free desktop (client) and server versions

## A number of pre-installed R packages are not listed here.
## More specific packages used in this training and dedicated to RNASeq analysis include (see the install sketch after this list):
* edgeR (http://www.bioconductor.org/packages/release/bioc/html/edgeR.html)
* DESeq (http://www.bioconductor.org/packages/release/bioc/html/DESeq.html)
* DESeq2 (http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html)
* ...
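
A minimal install sketch (assuming an internet connection and the biocLite installer used in the Bioconductor 3.0 era) to obtain the packages listed above from a terminal:

## install the DE analysis packages from the command line through an R here-document
R --no-save <<'EOF'
source("http://bioconductor.org/biocLite.R")
biocLite(c("edgeR", "DESeq", "DESeq2"))
EOF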

 

Technical.png The version numbers reported for each tool correspond to the lowest version used to prepare this training and may differ from the current versions. Older versions may work as well!

Technical.png Additional tools were installed from downloads or from source and are referenced with web links on the respective pages

 

Complementary Learning: Recommended and Related Readings

Get HELP from the community

Handicon.png There are naive questions, tedious questions, ill-phrased questions, questions put after inadequate self-criticism. But every question is a cry to understand the world. There is no such thing as a dumb question.

   - Carl Sagan, The Demon-Haunted World: Science as a Candle in the Dark

 

new NGS-CLI players in town

The Tophat team just published an improved RNASeq splice mapper called HISAT, and its new version HISAT2, which should replace Tophat while keeping similar features. Although still in its beta version, you may find it useful to test this new mapper, which requires much less RAM to hold the reference genome index (28GB for STAR versus only 4.3GB for HISAT), a footprint that fits many middle-end desktops and high-end laptops. The mapping speed is reported to be equal to or slightly higher than that of STAR, which is today the fastest mapper, and the accuracy is claimed to be as good as the best. For more info, please read their paper in Nature Methods of March 2015 (doi:10.1038/nmeth.3317 [30]).
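
If you wish to give it a try, a HISAT2 paired-end run (index and file names are placeholders; the command line closely mirrors bowtie2) could look like:

## splice-aware mapping with HISAT2, writing SAM output for downstream sorting with samtools
hisat2 -p 4 -x grch37_hisat2_index \
       -1 sample_1.fastq.gz -2 sample_2.fastq.gz \
       -S sample1_hisat2.sam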

A quite interesting NextGenSeek page is dedicated to recent tools that will likely complement or replace the historical Tophat workflow ([31]).

Other tools are arising. Have a look, for instance, at Kallisto, which makes use of pseudo-alignments to further speed up the assignment of reads to transcripts.

 

NGS-tools and GUIs

A non-exhaustive list is provided here to propose alternatives to the CLI protocol presented today. Feel free to try these at home; you may even end up preferring them to the power of the command line.

  • The CLC Genomic Workbench, like comparable commercial software, offers a reasonably complete, performant, and well-documented suite of tools dedicated to NGS data analysis (DNA, RNA, ChIP-Seq, ...). The CLC-GWB is available within VIB through a set of dynamic licenses, and additional licenses can be acquired by research groups that need such a platform (please contact us). Other similar commercial solutions are likely attractive as well but are not supported by BITS-VIB and are left to the evaluation of motivated users (one such GUI platform is Partek, which we tried in the past with success but which comes at a significant price).
  • UGene offers a large number of tools integrated under a common GUI. It also allows building workflows by linking tools similarly to what Galaxy offers. A demo webcast is available on youtube.
  • RobiNA has been available for several years and is still efficient for analyzing data either from raw reads or, at a later stage, from the count files obtained with HTSeq. You can use it to analyze the counts generated in this session by applying the DESeq (v1) or edgeR packages without having to code in R. (RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics [32])
  • SeqMonk [33] has been around for quite some time and extends to other sequencing applications.
  • PRADA is a recently published Python pipeline for RNA sequencing data analysis that offers many functionalities [34]
  • RNASeqGUI is an integrated GUI on top of several R packages used for RNA-Seq data analysis and reduces the amount of R code required [35]

 

Functional analysis of the DE gene lists

  Obtaining gene counts is only the first step for biologists, who need a biological interpretation of their transcriptomics data. Functional analysis is the most frequently used approach to answer this need and can be done using a variety of software tools. A recent review paper, 'Ten years of pathway analysis: current approaches and outstanding challenges' [36], provides links to many such tools. Other solutions are listed below for user convenience.

 

Core applications

 

Web-based servers

Some of the following tools offer many alternative analysis contexts and will prove very useful to get a first insight into your data.

  • lrpath uses a new, very recently published algorithm [38] that takes both logFC and p-values into account without fixed cut-offs [39]
  • Enrichr is very flexible and quite complete [40]
  • WebGestalt WEB-based GEne SeT AnaLysis Toolkit is also quite attractive [41]
  • GeneMania was recently published [42] and deserves attention.
  • Biomart now includes a very easy to use enrichment tool to which ENSG (gene-IDs) can be fed to identify enriched classes (give it a try!).
  • DAVID [43] remains the classical reference for enrichment analysis but shows a quite outdated interface

 

[R] packages for functional annotation and enrichment analysis

Again, [R] / Bioconductor offers many solutions. A few have been introduced in previous trainings including:

 

Reviews about RNASeq analyses methods

We provide below a non-exhaustive list of reviews and comparative papers. Some of these are already outdated due to the ongoing revolution in the field; others present the strengths and weaknesses of the different tools and approaches. You are welcome to point out unlisted papers that you consider relevant, so we can extend this list.

  • A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis [45]. Also read the rna-seqblog page [46] dedicated to this heavily discussed topic.
  • Systematic evaluation of spliced alignment programs for RNA-seq data [47]
  • Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data [48]
  • The EBI online training pages offer training sessions with slides and exercises.

 

Conclusion

This hands-on training session aimed at presenting a simplified yet complete workflow for NGS RNASeq DE analysis using REAL and FULL-sized data (all reads from one of the shorter human chromosomes) and at training users to take advantage of the many command-line resources available to the research community.

The number of research groups generating NGS data is increasing over time, and the demand for data analysis is becoming greater than what core facilities can provide; it also often requires biological background knowledge that is not available to the facility team. For these reasons, more and more students and staff scientists feel the 'urgent need' to perform their own NGS data analysis.

Although a more substantial computer infrastructure is preferable to perform this analysis in ideal conditions, today's main limiting factors remain choosing adequate tools and being trained to use them properly. This session hopefully demonstrated that a minimum of training is sufficient to reproduce a standard NGS analysis workflow on a broadly available laptop computer. For higher throughput and more complex analyses, additional learning and more resources will be required, but a lot can be initiated using the many available online resources and community support.

Your feedback on this two-day, general, command-line based NGS data analysis session is very important to us and will be used to improve this content for later sessions. If you need more training of this kind, please contact us and we will organize additional hands-on sessions on request. More advanced sessions will depend on the availability of expert users within VIB who are willing to prepare dedicated material.

Handicon.png PS: We tried to cite resources and people that made this training possible. If you feel that some citations are missing, please inform us and we will correct this omission

contact us


References:
  1. http://www.ncbi.nlm.nih.gov/sra?term=SRP012167
  2. http://wiki.bits.vib.be/index.php/Introduction_to_Linux_for_bioinformatics
  3. http://wiki.bits.vib.be/index.php/NGS_data_analysis
  4. https://github.com/BITS-VIB/RNAseq-training
  5. http://www.bioconductor.org/help/workflows/rnaseqGene/
  6. http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
  7. http://mapman.gabipd.org/web/guest/robin
  8. http://www.gnu.org/software/coreutils/manual/coreutils.pdf
  9. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  10. https://code.google.com/p/cutadapt/wiki/documentation, http://journal.embnet.org/index.php/embnetjournal/article/view/200
  11. Robert Schmieder, Robert Edwards
    Quality control and preprocessing of metagenomic datasets.
    Bioinformatics: 2011, 27(6);863-4
    [PubMed:21278185]

  12. http://prinseq.sourceforge.net
  13. Daniel Blankenberg, Assaf Gordon, Gregory Von Kuster, Nathan Coraor, James Taylor, Anton Nekrutenko, Galaxy Team
    Manipulation of FASTQ data with Galaxy.
    Bioinformatics: 2010, 26(14);1783-5
    [PubMed:20562416]

  14. https://code.google.com/p/ea-utils/
  15. http://ccb.jhu.edu/software/tophat/index.shtml
  16. http://samtools.sourceforge.net
  17. http://picard.sourceforge.net
  18. http://rseqc.sourceforge.net
  19. http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
  20. Fernando García-Alcalde, Konstantin Okonechnikov, José Carbonell, Luis M Cruz, Stefan Götz, Sonia Tarazona, Joaquín Dopazo, Thomas F Meyer, Ana Conesa
    Qualimap: evaluating next-generation sequencing alignment data.
    Bioinformatics: 2012, 28(20);2678-9
    [PubMed:22914218]

  21. http://qualimap.bioinfo.cipf.es
  22. https://deeptools.github.io/
  23. http://deeptools.ie-freiburg.mpg.de/
  24. http://cran.r-project.org
  25. http://www.rstudio.com
  26. http://www.bioconductor.org
  27. http://seqanswers.com SeqAnswers
  28. http://www.biostars.org BioStar
  29. http://stackoverflow.com stackoverflow
  30. http://www.nature.com/doifinder/10.1038/nmeth.3317
  31. http://nextgenseek.com/2015/03/three-papers-on-new-rna-seq-methods-offer-a-new-way-to-do-rna-seq-analysis/
  32. Marc Lohse, Anthony M Bolger, Axel Nagel, Alisdair R Fernie, John E Lunn, Mark Stitt, Björn Usadel
    RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics.
    Nucleic Acids Res: 2012, 40(Web Server issue);W622-7
    [PubMed:22684630]

  33. http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/
  34. Wandaliz Torres-García, Siyuan Zheng, Andrey Sivachenko, Rahulsimham Vegesna, Qianghu Wang, Rong Yao, Michael F Berger, John N Weinstein, Gad Getz, Roel G W Verhaak
    PRADA: pipeline for RNA sequencing data analysis.
    Bioinformatics: 2014, 30(15);2224-6
    [PubMed:24695405]

  35. Francesco Russo, Claudia Angelini
    RNASeqGUI: a GUI for analysing RNA-Seq data.
    Bioinformatics: 2014, 30(17);2514-6
    [PubMed:24812338]

  36. Purvesh Khatri, Marina Sirota, Atul J Butte
    Ten years of pathway analysis: current approaches and outstanding challenges.
    PLoS Comput Biol: 2012, 8(2);e1002375
    [PubMed:22383865]

  37. http://www.broadinstitute.org/gsea/
  38. Chee Lee, Snehal Patil, Maureen A Sartor
    RNA-Enrich: a cut-off free functional enrichment testing method for RNA-seq with improved detection power.
    Bioinformatics: 2016, 32(7);1100-2
    [PubMed:26607492]

  39. http://lrpath.ncibi.org/
  40. http://amp.pharm.mssm.edu/Enrichr/
  41. http://bioinfo.vanderbilt.edu/webgestalt/
  42. Sara Mostafavi, Debajyoti Ray, David Warde-Farley, Chris Grouios, Quaid Morris
    GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function.
    Genome Biol: 2008, 9 Suppl 1(Suppl 1);S4
    [PubMed:18613948]

  43. http://david.abcc.ncifcrf.gov
  44. http://www.bioconductor.org/packages/release/bioc/html/piano.html
  45. Marie-Agnès Dillies, Andrea Rau, Julie Aubert, Christelle Hennequet-Antier, Marine Jeanmougin, Nicolas Servant, Céline Keime, Guillemette Marot, David Castel, Jordi Estelle, Gregory Guernec, Bernd Jagla, Luc Jouneau, Denis Laloë, Caroline Le Gall, Brigitte Schaëffer, Stéphane Le Crom, Mickaël Guedj, Florence Jaffrézic, French StatOmique Consortium
    A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis.
    Brief Bioinform: 2013, 14(6);671-83
    [PubMed:22988256]

  46. http://www.rna-seqblog.com/which-method-should-you-use-for-normalization-of-rna-seq-data/
  47. Pär G Engström, Tamara Steijger, Botond Sipos, Gregory R Grant, André Kahles, Gunnar Rätsch, Nick Goldman, Tim J Hubbard, Jennifer Harrow, Roderic Guigó, Paul Bertone, RGASP Consortium
    Systematic evaluation of spliced alignment programs for RNA-seq data.
    Nat Methods: 2013, 10(12);1185-91
    [PubMed:24185836]

  48. Franck Rapaport, Raya Khanin, Yupu Liang, Mono Pirun, Azra Krek, Paul Zumbo, Christopher E Mason, Nicholas D Socci, Doron Betel
    Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data.
    Genome Biol: 2013, 14(9);R95
    [PubMed:24020486]


[ Main_Page | NGS_data_analysis ]