RNASeq analysis for differential expression in GenePattern

From BITS wiki
Jump to: navigation, search

last edit, September 13 2016

[ Main_Page | NGS_data_analysis ]

BITS will install GenePattern servers in Leuven and Gent to support VIB scientists with the analysis of their RNA-Seq data. The GenePattern server in Leuven is running on the cluster of the VSC (Vlaams Supercomputer Centrum).

This wiki page therefore contains a list of how-tos describing different actions on the GenePattern server and different steps in the RNA-Seq analysis workflow.


Obtain a VSC account

Access the GenePattern server

Upload a file from your computer to our GenePattern server

Create a subfolder in the uploads folder

Find a tool in GenePattern

Tools in GenePattern are called modules.

Run a tool in GenePattern

Parameters of the tools

Parameters of Groomer

Groomer cleans messy fastq files and transforms them into standard fastq format.
To obtain a description of its parameters and their default values click the Documentation link at the top of the page.


One of the parameters you need to define is Input format: the encoding of the fastq file you want to clean. The encoding is important because it determines the offset of the quality scores (ASCII offset 33 or ASCII offset 64). If you're not sure you can check the encoding of your file in the FastQC report (take into account that FastQC sporadically makes the wrong guess).


Parameters of FastQC

FastQC generates a report describing the quality of the reads of a fastq file.
To obtain a description of its parameters and their default values click the Documentation link at the top of the page.


Parameters of Trimmomatic

Parameters of TopHat

Parameters of STAR

Parameters of Kallisto

  • Strand specificity: used for counting, none counts on both strands, forward runs kallisto in strand specific mode, only fragments where the first read in the pair pseudoaligns to the forward strand of a transcript are counted. If a fragment pseudoaligns to multiple transcripts, only the transcripts that are consistent with the first read are kept, reverse same as forward but the first read maps to the reverse strand of a transcript. The default for strand specificity is none.
  • Average fragment length: In the case of single-end reads, this parameter must be used to specify the average fragment length. Typical Illumina libraries produce fragment lengths ranging from 180–200 bp but it’s best to determine this from a library quantification with an instrument such as an Agilent Bioanalyzer. For paired-end reads, the average fragment length can be directly estimated from the reads and the program will do so.

Kallisto produces the following output file:
abundances.tsv is a plaintext file of the abundance estimates. It is not in the right format to be read by sleuth. It is therefore better to analyze it further with EdgeR or DESeq2. The first line contains a header for each column, including estimated counts (to be used by DESeq2), TPM (Transcripts per million: do not use this for DE analysis with DESeq2), effective length.

Parameters of RSEQC

Parameters of Picard

Parameters of HTSeq-count

  • Input format: you have to specify the format of your input file: bam or sam.
  • Strandedness: none counts on both strands (= total count), forward only counts reads that map to in the same strand as the gene, reverse only counts reads that map to the opposite strand as the gene. The default for strandedness is yes. If your RNA-Seq data has not been made with a strand-specific protocol, this causes half of the reads to be lost. Hence, make sure to set this to none unless you have strand-specific data!
  • Min qual: minimum mapping quality (phred score) to count a read.
  • Mode: specifies how stringent the counting should be with intersec_strict as the most stringent option and union as the least stringent. See slides and documentation for a detailed explanation.

Interpretation of the output:
Output is a table with counts for each gene. At the bottom you find the special counters: which count reads that were not counted for various reasons. The names of the special counters all start with a double underscore. The special counters are:

  • __no_feature: reads (or read pairs) which could not be assigned to any gene.
  • __ambiguous: reads (or read pairs) which could have been assigned to more than one gene and hence were not counted for any of these.
  • __too_low_aQual: reads (or read pairs) which were skipped due to the min qual setting.
  • __not_aligned: reads (or read pairs) in the bam file without alignment.
  • __alignment_not_unique: reads (or read pairs) with more than one reported alignment.

Define the input file of a tool in GenePattern

Load a file from your Files tab

Load a file from a shared folder

We have created a shared folder on the server where we can put files that can be accessed by everyone. This folder appears as a subfolder of the Uploads folder.

Upload the input file from your compüter

Click the Upload File button. However, since we work with large input files it's better to upload the files before you start the analysis (see section on uploading files for details) and load them from the Files tab. That's way faster than having to upload the file each time you want to use it in a tool.

Upload a file from the internet

Processing the output of a tool

View the content of a results file

You can view the results of a tool in your browser

Store the output of a tool on the Files tab

You can store the output of a tool in your uploads folder so that you can use as input for other tools later on.

Save the results file to your computer

Identification of DE genes in R

See this page for the workflow in R.

The training

Aims of the RNA-Seq analysis training

Use publicly available Illumina RNA-Seq data to:

  • Walk through a complete analysis workflow including read QC, read mapping to the human hg19 genome, and counting the reads mapping to each human transcript.
  • Perform differential expression analysis using popular Bioconductor tools and obtain data ready for functional analysis and biological interpretation.

The skills acquired during this session should allow reproduction of the workflow.

Not covered during this training

This RNA-Seq training will fit the needs of scientists who wish to compare experimental conditions to identify differentially expressed genes. Other RNA-Seq applications including the analysis of gene regulation at the chromatin level, by ncRNA or miRNA are not covered here.

  • This protocol is not valid to analyze time series experiments
  • This protocol is not valid for dose response analysis
  • This protocol is not valid to analyze survival effects associated with gene expression
  • This protocol does not de-novo assemble reads but instead maps reads to a known reference (it is therefore not applicable to organisms for which no transcriptome is available)
  • Although this workflow uses transcript information during the mapping, this protocol does not consider splice variants and maps all reads to gene models
  • The training stops with the definition of differentially expressed genes and does not include downstream processing of the results at the functional level. For this, please refer to follow-up sessions covering the use of Ingenuity Pathway Analysis or related tools.


Skills required to follow this training:

  • Basic knowledge of human genome and transcriptome structure
  • Basic knowledge of Illumina NGS read structure

All attendees who lack knowledge about Illumina reads should follow the preliminary one day 'Introduction to the analysis of NGS data' training ([1]).

More BITS training info

BITS website: https://www.bits.vib.be/index.php/training/
  1. http://wiki.bits.vib.be/index.php/NGS_data_analysis