RNASeq analysis for differential expression in GenePattern

last edit, September 13 2016

BITS will install GenePattern servers in Leuven and Gent to support VIB scientists with the analysis of their RNA-Seq data. The GenePattern server in Leuven is running on the cluster of the VSC (Vlaams Supercomputer Centrum).

This wiki page therefore contains a list of how-tos describing different actions on the GenePattern server and different steps in the RNA-Seq analysis workflow.

How-tos

Obtain a VSC account

Access the GenePattern server

Upload a file from your computer to our GenePattern server

How to upload files from your local hard drive to GenePattern ?
* Go to the Files tab Drag and drop the files into the area appropriately called Drag files here Select the folder to store them in, typically you choose the Uploads folder GenePattern shows the progress of the upload

Create a subfolder in the uploads folder

How to create a subfolder in the uploads folder on the Files tab ?
* Click uploads. A window will appear where you can choose to create a subfolder, upload files or upload a folder. Click Create subdirectory Type a name for the subfolder Click Create

Find a tool in GenePattern

Tools in GenePattern are called modules.

How to find a tool ?
You can find a module by typing its name into the search box on the Modules tab:

Run a tool in GenePattern

How to run a tool ?
Searching a tool makes its name appear in the main window. Clicking its name will open its parameter form in the main window. Fill in the parameter form and click Run to start the analysis. Check the section on how to define the input of a tool for more info. Here you find a list of tools with an overview of their main parameters. As long as the tool is running you see an arched arrow in the top right corner: When the tool has finished the arched arrow is replaced by a checkmark and the file(s) containing the results appear at the bottom: Note that apart from the file containing the results (= the groomed file), a file stdout.txt is generated containing the error log of the tool.

Parameters of the tools

Parameters of Groomer

Groomer cleans messy fastq files and transforms them into standard fastq format.
To obtain a description of its parameters and their default values click the Documentation link at the top of the page.

One of the parameters you need to define is Input format: the encoding of the fastq file you want to clean. The encoding is important because it determines the offset of the quality scores (ASCII offset 33 or ASCII offset 64). If you're not sure you can check the encoding of your file in the FastQC report (take into account that FastQC sporadically makes the wrong guess).

Parameters of FastQC

FastQC generates a report describing the quality of the reads of a fastq file.
To obtain a description of its parameters and their default values click the Documentation link at the top of the page.

Parameters of Trimmomatic

Parameters of TopHat

Parameters of STAR

Parameters of Kallisto

Strand specificity: used for counting, none counts on both strands, forward runs kallisto in strand specific mode, only fragments where the first read in the pair pseudoaligns to the forward strand of a transcript are counted. If a fragment pseudoaligns to multiple transcripts, only the transcripts that are consistent with the first read are kept, reverse same as forward but the first read maps to the reverse strand of a transcript. The default for strand specificity is none.
Average fragment length: In the case of single-end reads, this parameter must be used to specify the average fragment length. Typical Illumina libraries produce fragment lengths ranging from 180–200 bp but it’s best to determine this from a library quantification with an instrument such as an Agilent Bioanalyzer. For paired-end reads, the average fragment length can be directly estimated from the reads and the program will do so.

Kallisto produces the following output file:
abundances.tsv is a plaintext file of the abundance estimates. It is not in the right format to be read by sleuth. It is therefore better to analyze it further with EdgeR or DESeq2. The first line contains a header for each column, including estimated counts (to be used by DESeq2), TPM (Transcripts per million: do not use this for DE analysis with DESeq2), effective length.

Parameters of RSEQC

Parameters of Picard

Parameters of HTSeq-count

Input format: you have to specify the format of your input file: bam or sam.
Strandedness: none counts on both strands (= total count), forward only counts reads that map to in the same strand as the gene, reverse only counts reads that map to the opposite strand as the gene. The default for strandedness is yes. If your RNA-Seq data has not been made with a strand-specific protocol, this causes half of the reads to be lost. Hence, make sure to set this to none unless you have strand-specific data!
Min qual: minimum mapping quality (phred score) to count a read.
Mode: specifies how stringent the counting should be with intersec_strict as the most stringent option and union as the least stringent. See slides and documentation for a detailed explanation.

Interpretation of the output:
Output is a table with counts for each gene. At the bottom you find the special counters: which count reads that were not counted for various reasons. The names of the special counters all start with a double underscore. The special counters are:

__no_feature: reads (or read pairs) which could not be assigned to any gene.
__ambiguous: reads (or read pairs) which could have been assigned to more than one gene and hence were not counted for any of these.
__too_low_aQual: reads (or read pairs) which were skipped due to the min qual setting.
__not_aligned: reads (or read pairs) in the bam file without alignment.
__alignment_not_unique: reads (or read pairs) with more than one reported alignment.

Define the input file of a tool in GenePattern

Load a file from your Files tab

How to use a file from your Files tab as input of a tool ?
Just drag and drop the file from the Files tab in the area that is labeled Drag Files Here

Load a file from a shared folder

We have created a shared folder on the server where we can put files that can be accessed by everyone. This folder appears as a subfolder of the Uploads folder.

Upload the input file from your compüter

Click the Upload File button. However, since we work with large input files it's better to upload the files before you start the analysis (see section on uploading files for details) and load them from the Files tab. That's way faster than having to upload the file each time you want to use it in a tool.

Upload a file from the internet

How to use a file on the internet as input of a tool ?
Click the Add Path or URL button and provide the URL of the file

Processing the output of a tool

View the content of a results file

You can view the results of a tool in your browser

How to view the results of a tool ?
When a tool has finished output files are generated at the bottom of the page. Click the name of the output file. Output files that are not saved in the uploads folder are stored 7 days on the server and are visible via the Jobs tab. So if you want to store the output of a tool that ran a few days ago, go to the Jobs tab and click the name of the output file. Select Open Link

Store the output of a tool on the Files tab

You can store the output of a tool in your uploads folder so that you can use as input for other tools later on.

How to store the output of a file in the uploads folder on the Files tab ?
When a tool has finished output files are generated at the bottom of the page. Click the name of the output file. Output files that are not saved in the uploads folder are stored 7 days on the server and are visible via the Jobs tab. So if you want to store the output of a tool that ran a few days ago, go to the Jobs tab and click the name of the output file. Select Copy to Files Tab

Save the results file to your computer

How to store the output of a file on your computer ?
When a tool has finished output files are generated at the bottom of the page. Click the name of the output file. Output files that are not saved in the uploads folder are stored 7 days on the server and are visible via the Jobs tab. So if you want to store the output of a tool that ran a few days ago, go to the Jobs tab and click the name of the output file. Select Save File

Identification of DE genes in R

See this page for the workflow in R.

The training

Aims of the RNA-Seq analysis training

Use publicly available Illumina RNA-Seq data to:

Walk through a complete analysis workflow including read QC, read mapping to the human hg19 genome, and counting the reads mapping to each human transcript.
Perform differential expression analysis using popular Bioconductor tools and obtain data ready for functional analysis and biological interpretation.

The skills acquired during this session should allow reproduction of the workflow.

Not covered during this training

This RNA-Seq training will fit the needs of scientists who wish to compare experimental conditions to identify differentially expressed genes. Other RNA-Seq applications including the analysis of gene regulation at the chromatin level, by ncRNA or miRNA are not covered here.

This protocol is not valid to analyze time series experiments
This protocol is not valid for dose response analysis
This protocol is not valid to analyze survival effects associated with gene expression
This protocol does not de-novo assemble reads but instead maps reads to a known reference (it is therefore not applicable to organisms for which no transcriptome is available)
Although this workflow uses transcript information during the mapping, this protocol does not consider splice variants and maps all reads to gene models
The training stops with the definition of differentially expressed genes and does not include downstream processing of the results at the functional level. For this, please refer to follow-up sessions covering the use of Ingenuity Pathway Analysis or related tools.

Prerequisites

Skills required to follow this training:

Basic knowledge of human genome and transcriptome structure
Basic knowledge of Illumina NGS read structure

All attendees who lack knowledge about Illumina reads should follow the preliminary one day 'Introduction to the analysis of NGS data' training (^[1]).

More BITS training info

BITS website: https://www.bits.vib.be/index.php/training/

↑ http://wiki.bits.vib.be/index.php/NGS_data_analysis

[1] ttp://wiki.bits.vib.be/index.php/NGS_data_analysis

[1]