PubMA Exercise.6


Full analysis workflow using CLC main workbench




pubma2014_CLCMain-workflow.png

Note: The following content was taken directly from the CLC documentation (Feb 17, 2014) and applies only to people with access to the CLC Main Workbench.

CLC Tutorial material

Users from VIB labs or people who have a license for the CLC Main Workbench can proceed with this exercise during the afternoon open session or later from home. The final CLC project folder can be downloaded from the BITS server ('Heart vs Diaphragm.zip') and imported into your own CLC project manager.

Obtaining public microarray data for the Workbench

The Workbench supports analysis of one-color expression arrays. These may be imported from GEO. The Workbench supports the following formats:

  • GEO SOFT sample-files: simple line-based, plain text files that contain all the data and the descriptive information of a microarray experiment (example SOFT-file)
  • GEO series-file: txt-files containing the definitions of a group of related samples. They contain tables describing extracted data, summary conclusions and analyses. Each Series-file is assigned a unique GEO accession number (GSExxx).

See our tutorial on downloading data from GEO for a detailed description on how to obtain the required files for import into the Workbench. If the file is compressed, unzip it before you import it in the Workbench.
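Before importing, it can help to peek at the data table inside a series file. The sketch below is a minimal, hedged example: it assumes the tab-delimited GEO series matrix layout, in which metadata lines start with `!` and the expression table sits between the `!series_matrix_table_begin` and `!series_matrix_table_end` markers. The function name `read_series_matrix` and the sample names in the toy input are made up for illustration.

```python
import csv
import io

def read_series_matrix(handle):
    """Return (sample_ids, {probe_set_id: [signals]}) from a series matrix file.

    Assumes every table cell holds a number; real files may contain
    empty cells that need extra handling.
    """
    in_table = False
    header = None
    values = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        if row[0] == "!series_matrix_table_begin":
            in_table = True
        elif row[0] == "!series_matrix_table_end":
            break
        elif in_table and header is None:
            header = row[1:]          # first table row: ID_REF + sample names
        elif in_table:
            values[row[0]] = [float(x) for x in row[1:]]
    return header, values

# Toy input with made-up sample names and signals
example = io.StringIO(
    '!Series_title\t"toy example"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"SAMPLE_A"\t"SAMPLE_B"\n'
    '"200007_at"\t7.5\t8.25\n'
    '!series_matrix_table_end\n'
)
sample_ids, signals = read_series_matrix(example)
print(sample_ids)  # ['SAMPLE_A', 'SAMPLE_B']
print(signals)     # {'200007_at': [7.5, 8.25]}
```

For a real file, pass an open file handle instead of the `io.StringIO` object.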

Getting your own Affymetrix microarray data into the correct format for the Workbench

The Workbench assumes that expression values are given at the gene (probe set) level, so probe-level analysis of Affymetrix arrays and import of Affymetrix CEL and CDF files are not supported. However, you can import your own Affymetrix data in two ways:

  • as .CHP files generated by Affymetrix Expression Console containing normalized Affymetrix data
    See the section on how to convert .CEL files to .CHP files using the Expression Console for a detailed discussion on how to do this. Use RMA for the normalization!
  • as .txt files exported from R containing normalized Affymetrix data


Loading data into the workbench

Open the Workbench, create a New Folder called Microarrays in the Navigation Area and Import the unzipped file into the Microarrays folder. Expand the Microarrays folder and you will see 12 arrays in your Navigation Area.


ImportArrays.jpg

Double clicking a file will open it in the Main Area. The first column of the microarray files contains probe set IDs like 200007_at, 200011_s_at, 200012_x_at. The second column contains signals.
Always check the signals! They should be small, all below 15, which suggests that the file contains log transformed and normalized data. If this is not the case, open the file in a text editor such as Notepad and check the headers of the file: they contain a description of the processing steps that were performed on the data. You will find descriptions like:

  • "gene expression was quantified by robust multi-array analysis (RMA) using the Genomic Suite software from Partek"
  • "Robust Multichip Average (RMA) method was used to do background correction, quantile normalization, and expression level summarization"

If you are sure that the data has been normalized and log transformed, you can immediately perform statistical analysis on the data. Otherwise, you can do the log transformation in CLC.
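The "all below 15" rule of thumb can also be checked outside the Workbench. The following is a minimal sketch, not part of the CLC software; the function name `looks_log_transformed` and the example signals are made up for illustration.

```python
def looks_log_transformed(signals, max_log_value=15.0):
    """Return True when every signal is below the threshold used in the text."""
    return all(value < max_log_value for value in signals)

# Made-up signals, for illustration only
log_scale = [7.2, 10.9, 4.1, 13.6]
raw_scale = [148.3, 1902.7, 17.4, 12450.0]

print(looks_log_transformed(log_scale))  # True
print(looks_log_transformed(raw_scale))  # False
```

A passing check only suggests that the values are on a log scale; it says nothing about whether normalization was performed.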

In the file that we are using, no description of the data processing is given. From the values we see that log transformation has not been done but we do not know if normalization has been performed or not. Some form of processing must have been performed, otherwise we would have expression values per probe (as in raw Affymetrix data) instead of per probe set (= gene) as is the case here.


Setting up a microarray experiment in the Workbench

To analyze differential expression, you need to tell the Workbench how the samples are related. This is done by setting up an Experiment. An Experiment is the central data type when analyzing expression data in the Workbench. It includes a set of samples and information about how the samples are related (which groups they belong to).


Log transformation of the data

If the data has not yet been log transformed as is the case in our example, it is mandatory that you do so before you start generating plots or analyze the data.

If you work with your own data, you will not need to do this step because you have normalized your data by RMA prior to import into the Workbench. The RMA normalization algorithm performs a log transformation on the data. If, for any reason, you used the MAS algorithm for normalization, log transformation will not have been done and you need to do it now.

If you work with public data, normalization should have been performed because the Workbench only accepts SOFT or Series files (= normalized data). Since most people use RMA for normalization the data should be log transformed. However, this is not always the case because of mistakes made during submission, the usage of normalization algorithms that do not perform a log transformation or taking the antilogs after normalization.

Warning:
Although the Workbench gives you the option, never use square root transformation on microarray data. Always choose a log transformation.
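What a log transformation does to raw signals can be sketched in a few lines. This is an illustration only: base 2 is assumed here because it is the base commonly used for microarray data, and the `floor` parameter (which keeps the logarithm defined for values below 1) is a choice made for this sketch, not a Workbench setting.

```python
import math

def log2_transform(signals, floor=1.0):
    """Log2-transform raw signals.

    Values below `floor` are raised to `floor` first so that the
    logarithm is defined and never negative (an assumption of this sketch).
    """
    return [math.log2(max(value, floor)) for value in signals]

# Made-up raw signals
raw = [16.0, 1024.0, 0.5, 32768.0]
print(log2_transform(raw))  # [4.0, 10.0, 0.0, 15.0]
```

After the transformation, a raw range of several orders of magnitude is compressed into the small values (below 15 or so) expected for processed microarray data.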


Quality control of the data

MA plots

In most cases, the data you start with in the Workbench is already normalized, so you cannot make the typical before and after normalization MA plots. But even if you can only generate MA plots after normalization, you can still check whether the non-linear effect was more or less removed by the normalization (see slides).


MAplot3.jpg

The symmetric and even spread of the data points around the M=0 line indicates that the non-linear effect has been removed by the normalization.
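The two axes of an MA plot are simple to compute from log-transformed signals: M is the difference between two arrays and A is their average, per probe set. The sketch below assumes log2 signals and uses made-up values; the function name `ma_values` is hypothetical.

```python
from statistics import median

def ma_values(array_a, array_b):
    """Return (M, A) lists for two arrays of log2 signals with matching probe sets."""
    m = [a - b for a, b in zip(array_a, array_b)]          # log ratio
    a_mean = [(a + b) / 2 for a, b in zip(array_a, array_b)]  # mean log intensity
    return m, a_mean

# Made-up log2 signals for four probe sets on two arrays
heart = [8.0, 10.5, 6.0, 12.0]
diaphragm = [7.5, 10.5, 6.5, 12.0]

m, a = ma_values(heart, diaphragm)
print(m)          # [0.5, 0.0, -0.5, 0.0]
print(a)          # [7.75, 10.5, 6.25, 12.0]
print(median(m))  # 0.0
```

A median M close to 0 across the whole range of A is the numeric counterpart of points spread evenly around the M=0 line.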

Box plot

Again, you will not be able to make the typical before and after normalization box plots. But even if you can only generate after normalization box plots you can still check whether the data distributions of the samples are more or less equalized by the normalization (see slides).


BoxPlot.jpg

As expected after normalization, the box plot of our data looks very good because none of the samples stands out from the rest. The different arrays have the same (or at least a very comparable) median expression level. Also the scales of the boxes are almost the same indicating that the spread of the data on the different arrays is comparable.

You do see small differences, e.g. on the second diaphragm array the spread of the data is a bit larger than on the other arrays. That is normal: after equalizing the probe-level data distributions (quantile normalization), RMA performs an additional step that summarizes the probes of each probe set into a single expression value. This step reintroduces some variation between the data distributions of different samples.

If you see large differences between the boxes, you can equalize the data distributions by performing a normalization in the Workbench (Toolbox -> Transformation and Normalization -> Normalize).
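To make "equalizing the data distributions" concrete, here is a minimal sketch of quantile normalization, one common way of doing this. It is an illustration using NumPy, not the Workbench's implementation, and it ignores ties for simplicity. The function name and the input matrix are made up.

```python
import numpy as np

def quantile_normalize(matrix):
    """Quantile-normalize a matrix with probe sets in rows and arrays in columns.

    Each value is replaced by the mean of the values that share its rank
    across arrays. Ties are not treated specially in this sketch.
    """
    order = np.argsort(matrix, axis=0)
    ranks = np.argsort(order, axis=0)                  # rank of each value within its column
    reference = np.sort(matrix, axis=0).mean(axis=1)   # mean of the k-th smallest values
    return reference[ranks]

# Made-up signals: 4 probe sets on 3 arrays
raw = np.array([
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 3.0, 6.0],
    [4.0, 2.0, 8.0],
])
normalized = quantile_normalize(raw)
print(normalized)
print(np.sort(normalized, axis=0))  # every column now holds the same set of values
```

After this step all arrays have identical distributions, so their boxes in a box plot coincide, while the ranking of probe sets within each array is preserved.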


PCA plot


PCA1.png

PCA performs a transformation of the data onto principal components (PCs). PCs are directions of the largest variation in the data set.

The two PCs are directions (= axes in space) so you can project each sample on these axes. The spread of the projected samples will be the maximum among all possible choices of axes. Projecting the samples on these PCs generates the plot that you see above.
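The projection described above can be sketched with a singular value decomposition. This is a hedged illustration using NumPy, not the Workbench's own code; the function name `pca_project` and the tiny data matrix (4 samples, 3 probe sets) are made up.

```python
import numpy as np

def pca_project(matrix, n_components=2):
    """Project samples (rows) onto the first principal components.

    Columns are probe sets. Returns the projected coordinates and the
    fraction of variance explained by each returned component.
    """
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return scores, explained[:n_components]

# Made-up log2 signals: two "heart" and two "diaphragm" samples
samples = np.array([
    [8.0, 5.0, 9.1],   # heart 1
    [8.2, 5.1, 9.0],   # heart 2
    [4.0, 9.0, 9.2],   # diaphragm 1
    [4.1, 9.2, 9.1],   # diaphragm 2
])
scores, explained = pca_project(samples)
print(scores[:, 0])  # PC1 coordinates: the two groups fall on opposite sides of 0
print(explained)     # fraction of variance captured by PC1 and PC2
```

The sign of a principal component is arbitrary, so only the relative positions of the samples matter, not which group ends up on the positive side.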

When you have two groups in your data you should see a clear distinction between the two groups as is the case in our example. In the plot, the dots are colored according to the groups. To see which sample corresponds to a dot, place the mouse cursor on the dot for a second, and you will see the name of the sample.

Additional info: clear explanation of PCA

To complement the PCA, we will also do a hierarchical clustering of the samples to see if the samples cluster in the groups we expect.


Cluster.png

The clustering generates a heat map showing the cluster tree at the bottom. The two overall groups formed by the clustering are identical to the grouping in the experiment, which is what we want. You can double-check by placing your mouse on the name of the sample - that will show which group it belongs to.
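The idea behind the cluster tree can be illustrated with a small agglomerative clustering. The sketch below assumes Euclidean distance and average linkage, which may differ from the settings chosen in the Workbench; the function names and the sample values are made up.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two samples given as lists of signals."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(samples, n_clusters=2):
    """Merge the two closest clusters until n_clusters remain.

    samples: dict mapping sample name -> list of log2 signals.
    Returns a list of sets of sample names.
    """
    clusters = [{name} for name in samples]
    while len(clusters) > n_clusters:
        def linkage(pair):
            left, right = pair
            distances = [euclidean(samples[i], samples[j]) for i in left for j in right]
            return sum(distances) / len(distances)
        left, right = min(combinations(clusters, 2), key=linkage)
        clusters = [c for c in clusters if c is not left and c is not right]
        clusters.append(left | right)
    return clusters

# Made-up log2 signals for three probe sets per sample
samples = {
    "heart_1": [8.0, 5.0, 9.1],
    "heart_2": [8.2, 5.1, 9.0],
    "diaphragm_1": [4.0, 9.0, 9.2],
    "diaphragm_2": [4.1, 9.3, 9.1],
}
clusters = average_linkage(samples)
print(sorted(sorted(c) for c in clusters))
# [['diaphragm_1', 'diaphragm_2'], ['heart_1', 'heart_2']]
```

When the two top-level clusters match the experimental groups, as here, the clustering supports the grouping defined in the experiment.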

Since both the PCA and the clustering confirm the grouping of the samples, we have no reason to be sceptical about the quality of the samples.


Identification of DE genes

Warning:
The standard t-test that the Workbench uses to find DE genes is known to have poor sensitivity in microarray data analysis (see slides), and a moderated version of the t-test is normally used instead (cf. limma). It is therefore possible that the Workbench produces over-optimistic results. Keep this in mind when you use the lists of DE genes for further analysis.

In an experiment with 3 or more groups, you normally do an ANOVA first to check if there is a difference between the groups, followed by pairwise tests that determine exactly which groups differ from each other. You can mimic this in the Workbench by first doing an ANOVA and then pairwise t-tests. You can choose to do t-tests for all pairs of groups by clicking the All pairs button, or to have a t-test produced for each group compared to a specific reference group by clicking the Against reference button. In the latter case you must specify which group you use as reference.

Warning:
Normally you use Tukey's test as a follow-up to ANOVA. Tukey's test is similar but not identical to a t-test and it does correct for making multiple comparisons. Doing pairwise t-tests will not correct for making multiple comparisons!
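A small calculation shows why uncorrected pairwise tests are a problem. The sketch below counts the pairwise comparisons for a made-up experiment with four groups and computes the chance of at least one false positive, under the simplifying assumption that the tests are independent (pairwise comparisons that share a group are not, so this is only an approximation).

```python
from itertools import combinations

def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive among n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests

# Hypothetical experiment with four groups
groups = ["A", "B", "C", "D"]
pairs = list(combinations(groups, 2))

print(len(pairs))                                    # 6
print(round(familywise_error_rate(len(pairs)), 3))   # 0.265
```

With six comparisons at a significance level of 0.05 each, the chance of at least one spurious difference is already about 26%, which is the inflation that a procedure such as Tukey's test is designed to control.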


Main results and specific settings

Only key steps are reproduced here to provide information to interpret the figures. All other steps and parameters will be explained in the tutorial PDF files linked above.

After computing group-wise differential expression, a filtering step is applied to the full table to retain only DE genes with at least a 2-fold change in expression, an adjusted p-value of at most 5 x 10^-3, and expression data (present calls) in at least 4 of the 6 replicates.
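The three filter criteria can be written down as a simple predicate. This is a sketch only: the field names (`difference`, `adjusted_p`, `present_calls`), the probe set names, and all values are hypothetical and do not come from the exercise data. On log2 data, a 2-fold change corresponds to an absolute difference of at least 1.

```python
def is_de(gene, min_abs_log2_difference=1.0, max_adjusted_p=5e-3, min_present=4):
    """Apply the three filter criteria to one row of a results table.

    gene: dict with a log2 difference between group means, an adjusted
    p-value, and a count of present calls (hypothetical field names).
    """
    return (
        abs(gene["difference"]) >= min_abs_log2_difference
        and gene["adjusted_p"] <= max_adjusted_p
        and gene["present_calls"] >= min_present
    )

# Made-up rows, for illustration only
genes = [
    {"probe_set": "probe_A", "difference": 1.8, "adjusted_p": 1e-4, "present_calls": 6},
    {"probe_set": "probe_B", "difference": -1.2, "adjusted_p": 2e-3, "present_calls": 4},
    {"probe_set": "probe_C", "difference": 0.4, "adjusted_p": 1e-5, "present_calls": 6},   # change too small
    {"probe_set": "probe_D", "difference": 2.5, "adjusted_p": 0.02, "present_calls": 6},   # p-value too high
    {"probe_set": "probe_E", "difference": -3.0, "adjusted_p": 1e-6, "present_calls": 2},  # too few present calls
]
selected = [gene["probe_set"] for gene in genes if is_de(gene)]
print(selected)  # ['probe_A', 'probe_B']
```

Note that the absolute value keeps both up- and down-regulated genes.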


DElist-filtering.png

The classical volcano plot shows in red the 142 DE genes selected during filtering. This subset will be used as the 'test' set against all other genes in the data table in the hypergeometric enrichment analyses detailed below.


Heart_Diaph-volcano-plot_142.png

Note: Because the data have been log transformed, the 'Difference' column should be used instead of 'Fold-Change' to represent the differential expression.
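On log-transformed data, a difference between group means is itself a log ratio. Assuming log2 was used, a difference d corresponds to a fold change of 2^|d|. The sketch below converts one into the other; the signed convention (negative values for down-regulation) is an assumption made for this illustration.

```python
def fold_change_from_difference(log2_difference):
    """Convert a difference of log2 group means into a signed fold change."""
    ratio = 2.0 ** abs(log2_difference)
    return ratio if log2_difference >= 0 else -ratio

print(fold_change_from_difference(1.0))   # 2.0
print(fold_change_from_difference(-2.0))  # -4.0
print(fold_change_from_difference(0.0))   # 1.0
```

This is why a 2-fold change threshold translates into an absolute difference of 1 on log2 data.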

Enrichment analysis in CLC Main

Hypergeometric test

The following figure shows the data samples used in the hypergeometric test.


HyperGeo-tTest.png
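The principle of the hypergeometric test can be sketched with the standard library alone. The function below computes the probability of seeing at least a given number of annotated genes in the selected subset when genes are drawn at random without replacement. It illustrates the test itself, not the Workbench's implementation, and the numbers in the example are made up.

```python
from math import comb

def hypergeometric_p_value(n_total, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) for a hypergeometric distribution.

    n_total:     genes in the full table
    n_annotated: genes in the full table that carry the annotation
    n_selected:  genes in the test set (e.g. the filtered DE genes)
    n_overlap:   annotated genes found in the test set
    """
    upper = min(n_annotated, n_selected)
    total = comb(n_total, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_total - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy numbers: 20 genes, 5 annotated, 4 selected, 3 of them annotated
p = hypergeometric_p_value(n_total=20, n_annotated=5, n_selected=4, n_overlap=3)
print(round(p, 4))  # 0.032
```

A small p-value means the annotation is found in the test set more often than expected by chance, i.e. the category is enriched.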

A window allows selecting the annotation type to be used in the test and the action to take with duplicate probes (here: 'merged by gene symbol to the highest IQR').


select-annotation-duplo.png


HG-tTest-BP-settings.png

The next figure details the Top-30 hypergeometric results for the contrast Heart-vs-diaphragm


HypGeo_top30-BP.png

Another page presents details about the parameters used in the different tests

  • results of the hypergeometric test for GO-BP
HG-tTest-MF-settings.png
  • results of the hypergeometric test for GO-MF
HG-tTest-MF-settings.png
  • results of the hypergeometric test for Pathways
HG-tTest-PW-settings.png

GSEA

Unlike the hypergeometric test, the GSEA method does not require partitioning the data. It takes the full table and considers the relative ranking of gene list members in relation to the individual gene expression levels in the data.
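How a ranking-based method uses the full table can be illustrated with the classical running-sum enrichment score: walk down the ranked gene list, step up at every member of the gene set and down otherwise, and keep the largest deviation from zero. This is a simplified, unweighted sketch; the statistic computed by the Workbench may be defined differently. Gene names and the ranking are made up.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum statistic over a ranked gene list.

    Returns the running-sum value with the largest absolute deviation
    from zero: positive when set members cluster at the top of the
    ranking, negative when they cluster at the bottom.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running = 0.0
    best = 0.0
    for hit in hits:
        running += 1 / n_hits if hit else -1 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Made-up ranking, from most increased in group 1 to most increased in group 2
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
print(enrichment_score(ranked, {"g1", "g2"}))  # 1.0
print(enrichment_score(ranked, {"g5", "g6"}))  # -1.0
```

No cutoff on fold change or p-value is needed: every gene contributes through its position in the ranking.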

  • Settings for the GSEA test for GO-BP
GSEA-BP-settings.png
  • Settings for the GSEA test for GO-MF
GSEA-MF-settings.png
  • Settings for the GSEA test for Pathways
GSEA-PW-settings.png

Top-10 GSEA results for BP increased in Heart

GSEA-BP-top10-heart.png

Top-10 GSEA results for BP increased in Diaphragm

GSEA-BP-top10-Diaphragm.png

Additional Info

Tutorials on microarray analysis from CLC:

Published results

The original publication ends with functional enrichment results identifying key differences between 'heart' and 'diaphragm' tissues in rats. We link here to the results published by van Lunteren E, Spiegler S, Moyer M [1]. Full details about this dataset can be found on the GEO page: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=+GSE6943

download exercise files

Download exercise files here.

Use the right application to open the files present in ex6-files

References:
  1. Erik van Lunteren, Sarah Spiegler, Michelle Moyer
    Contrast between cardiac left ventricle and diaphragm muscle in expression of genes involved in carbohydrate and lipid metabolism.
    Respir Physiol Neurobiol: 2008, 161(1);41-53
    [PubMed:18207466]

    http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=+GSE6943

