Exercises on Gene Expression

From BITS wiki

*Coexpression networks

  • Go to the COXPRESdb page.
  • Search for TLN2 (hint: this is a Gene alias).
  • Go to the list of coexpressed genes of human talin 2.
  • Click TLN2 next to Hsa2 genes to visualize the coexpression network.

*Tissue specific expression in Ensembl

Ensembl offers baseline expression data from EBI's Gene Expression Atlas.

The Baseline Atlas consists of highly curated RNA-seq experiments under normal conditions from ArrayExpress. It focuses on quantifying mRNA but also contains expression values of long coding and non-coding RNA. Normal conditions include different tissues and cell types, but also developmental stages for some organisms. Data is available for many organisms, such as human, mouse, rat, rice, Arabidopsis, worm, ...

The selected experiments are re-analyzed using EBI's RNA-seq processing pipeline, resulting in a table of normalized gene expression levels expressed in TPM (transcripts per million). TPM is calculated as follows:

  • Normalization for gene length: divide the read counts of each gene by the length of that gene in kilobases. This gives reads per kilobase (RPK).
  • Normalization for library size: add up all RPK values in a sample and divide this number by 1,000,000. This is the “per million” scaling factor. Divide each RPK value by the “per million” scaling factor. This gives TPM.

When you use TPM, the sum of all TPMs is the same in each sample (one million). This makes it easier to compare the proportion of reads that mapped to a gene in each sample. If the TPM for gene A is 3.33 in sample 1 and 3.33 in sample 2, then you know that, after normalizing for gene length, the same proportion of total reads mapped to gene A in both samples.
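The two normalization steps can be sketched in a few lines of Python. This is only an illustration of the calculation described above, not the EBI pipeline itself: the function name `tpm`, the gene names, the gene lengths and the read counts are all made up for the example.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts of one sample to TPM.

    counts     -- dict mapping gene name to read count (hypothetical data)
    lengths_bp -- dict mapping gene name to gene length in base pairs
    """
    # Step 1: normalize for gene length -> reads per kilobase (RPK).
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    # Step 2: normalize for library size with the "per million" scaling factor.
    scaling_factor = sum(rpk.values()) / 1_000_000
    return {gene: value / scaling_factor for gene, value in rpk.items()}


# Made-up example: sample 2 was sequenced three times deeper than sample 1.
lengths = {"geneA": 2000, "geneB": 4000, "geneC": 1000}
sample1 = {"geneA": 10, "geneB": 20, "geneC": 5}
sample2 = {"geneA": 30, "geneB": 60, "geneC": 15}

tpm1 = tpm(sample1, lengths)
tpm2 = tpm(sample2, lengths)

for gene in lengths:
    print(gene, round(tpm1[gene], 2), round(tpm2[gene], 2))
print("sum per sample:", round(sum(tpm1.values())), round(sum(tpm2.values())))
```

Although sample 2 contains three times as many reads, each gene gets the same TPM in both samples, and the TPM values of each sample add up to one million. The sketch does not handle a sample in which all counts are zero.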

qPCR

*Finding stably expressed reference genes

First use RefGenes, a tool that is present in both the free and the commercial version of the GeneVestigator platform, to select the 8 best candidate reference genes.

VIB scientists can use the commercial version of GeneVestigator for free. See installation instructions on the BITS website.


Next, perform a qPCR experiment that measures the expression of these 8 candidate reference genes in a representative set of your samples, and analyze the data using geNorm, a tool that is integrated in qbase+.

Qbase+ is commercial software that VIB offers for free to VIB scientists. See installation instructions on the BITS website.


Designing your qPCR experiment

Tips and considerations for designing accurate and consistent qPCR experiments

Analyzing qPCR data using qbase+ (VIB only)

Learn how to analyze your qPCR data in qbase+

Qbase+ is commercial software that VIB offers for free to VIB scientists. See installation instructions on the BITS website.


Microarrays

*Finding relevant public microarray data

Exercise on searching GEO to find available microarray data sets that are relevant for your research.
The GEO database is a public repository that stores thousands of high-throughput gene expression studies submitted by the scientific community. The studies represent a large diversity of experimental types and designs, and contain data that are processed and normalized using a wide variety of methods.
GEO offers microarray data in two formats:

  • a set of CEL files containing the raw data
  • a single Series Matrix (.txt) file containing the normalized data for all samples. Many software packages for analyzing microarray data, like GEO2R and the CLC Main Workbench, start from these files.

    It's important to realize that these data are supposed to be already normalized and log-transformed. However, this is not always the case: some submitters provide raw data in these files, some designs do not allow for regular normalization...
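If you want to inspect a Series Matrix file yourself, a minimal Python sketch such as the one below can read the expression table. It assumes the usual layout of these files: tab-separated text in which metadata lines start with "!", and the data table sits between the lines `!series_matrix_table_begin` and `!series_matrix_table_end`, with an `ID_REF` column followed by one column per sample. The function name `read_series_matrix` and the example content are made up for illustration.

```python
import csv
import io


def to_float(text):
    """Convert a table cell to float; empty or non-numeric cells become None."""
    try:
        return float(text)
    except ValueError:
        return None


def read_series_matrix(handle):
    """Return (sample_ids, {probe_id: [values]}) from an open Series Matrix file."""
    in_table = False
    sample_ids = None
    values = {}
    for record in csv.reader(handle, delimiter="\t"):
        if not record:
            continue  # skip blank lines
        first = record[0]
        if first == "!series_matrix_table_begin":
            in_table = True
        elif first == "!series_matrix_table_end":
            break
        elif in_table and sample_ids is None:
            sample_ids = record[1:]  # header row: ID_REF, then the sample IDs
        elif in_table:
            values[first] = [to_float(cell) for cell in record[1:]]
        # anything else is a metadata line and is ignored here
    return sample_ids, values


# Tiny made-up file content; for a real file you would pass open("your_file.txt").
example = (
    '!Series_title\t"Hypothetical study"\n'
    "!series_matrix_table_begin\n"
    '"ID_REF"\t"GSM1"\t"GSM2"\n'
    '"probe_1"\t7.25\t7.5\n'
    '"probe_2"\t\t3.0\n'
    "!series_matrix_table_end\n"
)
samples, table = read_series_matrix(io.StringIO(example))
print(samples)
print(table)
```

Looking at the range of the values read this way is a quick first check of whether the file really contains log-transformed data.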


Analyzing public microarray data

*Finding differentially expressed genes using GEO2R

Analysis in GEO2R consists of:

  • generating a box plot to compare the data distributions of the different samples. This allows you to check whether the Series Matrix file indeed contains normalized data.
  • log transforming the data, if necessary
  • identifying the 250 most differentially expressed (DE) genes, ranked by p-value, between 2 or more groups of samples
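The first two steps can be imitated outside GEO2R with a small Python sketch. It is not GEO2R's own code: it computes the numbers behind a box plot for each sample and applies a simple rule of thumb to decide whether a log2 transformation is still needed. The function names, the sample values and the threshold of 100 are assumptions chosen for illustration (a log2 intensity above 100 would correspond to a raw value above 2^100, which is implausible).

```python
import math
import statistics


def box_stats(values):
    """Return (min, lower quartile, median, upper quartile, max) of one sample."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


def looks_unlogged(values, threshold=100.0):
    """Rule of thumb (assumed threshold): very large values suggest raw-scale data."""
    return max(values) > threshold


def log2_transform(values):
    """Log2 transform; values that are not positive become NaN."""
    return [math.log2(v) if v > 0 else float("nan") for v in values]


# Made-up intensities: sample_1 is already on a log2 scale, sample_2 is not.
samples = {
    "sample_1": [6.1, 7.4, 8.0, 9.3, 11.2],
    "sample_2": [70.0, 170.0, 256.0, 640.0, 2350.0],
}

for name, values in samples.items():
    if looks_unlogged(values):
        values = log2_transform(values)
        samples[name] = values
    print(name, [round(x, 2) for x in box_stats(values)])
```

If the samples are normalized, their box statistics should be roughly aligned. Samples whose boxes sit at clearly different levels indicate that the file does not contain normalized data.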


*Clustering and generating heat maps using the GEO DataSet Browser

Exercise on using GEO DataSet Browser for clustering

Analyzing Affymetrix data from GEO using Affymetrix software

Exercise on using Affymetrix TAC + Expression Console (Windows only!)

Using CLC Main Workbench for analysis of Affymetrix data

Exercise using the CLC Main Workbench (for VIB users and CLC license owners)

Combining multiple public microarray experiments in GeneVestigator

Exercises on analyzing public microarray data in GeneVestigator (for VIB users and GeneVestigator license owners)

Analyzing your own microarray data

You can perform an analysis using Affymetrix software or CLC Main Workbench as described above. Note that you have to normalize your data before you can load them into the Workbench: see our wiki section on how to get your data into the correct format for the Workbench.

If you are really up for a challenge, you might want to consider doing the analysis in R: check out our tutorial on analyzing microarray data in R/Bioconductor.

RNA-Seq analysis

Downloading public RNA-Seq data

Check out the Downloading NGS data from the internet page for a tutorial on obtaining public NGS data.

RNA-Seq analysis

If you are a newbie to the field, check out the introduction page for a simple step-by-step tutorial on RNA-Seq analysis using our GenePattern server (mail bits@vib.be to obtain an account). This tutorial covers all steps from raw reads to the generation of a count table, whereas this page covers the identification of DE genes in R/Bioconductor.

If you are a more experienced Linux and R user, you can check out the RNA-Seq analysis page for an advanced tutorial on how to identify differentially expressed genes based on RNA-Seq data using Linux command line tools. This tutorial covers the complete RNA-Seq analysis workflow and makes use of more complicated bash and R scripts. Not for the faint-hearted!