Analyze your own microarray data in R/Bioconductor

From BITS wiki
Jump to: navigation, search

On this page you can find the R-code to identify DE genes based on Affymetrix microarray data



Install R (and RStudio)

Check out our R introduction tutorial to learn how to install R and RStudio.

Install the required R packages

To analyze microarray data, you need a specific R package, called Bioconductor.
However, Bioconductor uses functions and object from various other R packages, so you need to install these R packages too:

  • Matrix
  • lattice
  • fdrtool
  • rpart
So Bioconductor will only work after you have installed these packages.

Additionally, you will need an R-package for making graphs of the data, called ggplot2

Check out our R introduction tutorial to learn how to install these packages.

Install Bioconductor packages from Bioconductor repository

You need the following Bioconductor packages for Affymetrix array analysis:

  • affy, affyPLM and simpleaffy if you want with older arrays (3' arrays)
  • oligo if you work with newer arrays (HTA, Gene ST...)
  • It is important to realize that it is best to pick one of these two choices. If you load multiple packages with similar functionality, e.g. both affy and oligo or both simpleaffy and oligo, R will become confused. The affy-based packages and oligo contain methods with the same name (e.g. intensity(), MAplot(), rma()) but with slightly different code. So R will not know if it has to use the intensity() method from the affy packages or from oligo. By default it chooses affy over oligo. So you always have to specify the packagename for the oligo methods (see section on how to specify the package of a method) and even then it does not always work well.

  • affydata if you need example data sets
  • ArrayExpress if you want to obtain data from ArrayExpress
  • limma
  • Biobase
  • Biostrings
  • genefilter
  • Bioconductor packages for the annotation of the spots on the arrays

How to install Bioconductor packages in R

Installing Bioconductor packages from source

Some Bioconductor packages will not install by running the biocLite() command. In that case you can always install them from source. In this case you go to the Bioconductor page of the package you wish to instal, as an example we take the Biostrings package (allthough it installs fine using the biocLite() command.

Updating installed Bioconductor packages

After some time, Bioconductor packages might become outdated. Bioconductor packages are updated fairly regularly.

Loading packages in R

Before you can do any programming in RStudio, you need to create a new project and a new script.
To use the installed R and BioConductor packages in R, you have to load them first.
Check out our R introduction tutorial to learn how to load these packages.

Using a method from a specific package

If you have loaded multiple packages with similar functionality, e.g. both affy and oligo or both affyPLM and oligo, R might become confused. Affy or affyPLM and oligo contain methods with the same name e.g. intensity(), MAplot(), rma()... but with slightly different code. If you now simply use the command


R will not know if it has to use the intensity() method from affy or from oligo. By default it chooses affy over oligo. To tackle this problem you have to specify the packagename in front of the name of the method for oligo, e.g.


and R will know that it has to use the intensity() method of the oligo package.

Getting help / documentation in R

Check out our R introduction tutorial to learn how to consult the R documentation.

Bioconductor is object-oriented R. It means that a package consists of classes. The classes define the behaviour and characteristics of a set of similar objects that belong to the class. The characteristics that objects of a class can have are called slots while the behaviour of the objects (the actions they can do) is described by the methods of a class.
For instance, there is a class AffyBatch, consisting of containers that hold microarray data in a structured way. You can create an AffyBatch object to hold your data. AffyBatches have many slots, one for the intensities, one for the sample annotation, one for the probe set annotation... There are numerous methods that operate on AffyBatches: you can retrieve perfect match intensities, you can create a boxplot, you can normalize the data...

The documentation of a class shows:

  • a short description of the class
  • how to create objects of the class
  • a list of slots of the class
  • a list of methods that can operate on objects of the class
  • some examples

Open the CEL-files in R

If you use Affymetrix chips your microarray data will consist of a series of CEL files containing raw intensities for each probe on the array. Save all CEL files you want to analyze in a single directory. They may be gzipped (*.gz) - you do not need to gunzip them.

Open CEL files from 3' Affymetrix Arrays (older ones) using affy

Data from older Affymetrix arrays should be analysed using the affy package.

If you are using a custom cdf file you need an additional line of code telling R that you are not using the standard Affymetrix cdf but a custom one.

Open CEL files from newer Affymetrix Arrays (HTA, Gene ST...) using oligo

For these arrays the old affy package doesn't work - instead you'll have to use the oligo package

Looking at the data

  • To view the full content of a simple object (vector, list, matrix or data frame): type the name of the object in the console.
  • To view specific items of a simple object: refer to their location using [ ] e.g.
    object[1,2] will retrieve the item on the first row and second column
    object[1:2,] will retrieve all columns of the first two rows
  • To view an entire column of a data frame: use $ e.g. dataframe$columnname
  • To access slots of complex objects: use object@ followed by the name of the slot.

Retrieving intensities

Retrieving intensities using affy

How to retrieve intensities using affy

Retrieving intensities using oligo

All code that is described above will work in the oligo package except for the code that use probeset IDs to retrieve their corresponding intensities.

However, if you have loaded oligo together with an affy-based package, affy, simpleaffy, affyPLM you will have to specify oligo:: in front of the intensity() and the pm() method (see section on specifying the package name of a method).

Retrieving annotation

Retrieving sample annotation using affy

Apart from the expression data itself, microarray data sets need to include information about the samples that were hybridized to the arrays, e.g. sample labels. AffyBatch objects have several slots (characteristics). One of them is called phenoData. It contains labels for the samples. Despite the name, there is no implication that the labels should be phenotypic, in fact they often indicate genotypes such as wild-type or knockout. They can also contain other types of information about the samples e.g. which samples are replicates, which samples were scanned by the same scanner, the amount of RNA that was hybridized to the arrays…
However, for most data sets the phenoData has not been defined.

If phenoData contains no information: CEL-files are given an index 1-n (n = total number of files). You can give the samples more accurate names so these can be used in the plots that we are going to create later on. We will see how to do this when we create the plots.

When you import CEL-files from GEO or ArrayExpress the phenoData should normally already contain informative names but many submitters skip this step so many data sets are not well annotated.

Retrieving probe annotation using affy

Microarray data sets should include information on the probes. AffyBatches have a slot called featureData, a data frame that contains labels for the probes. For most data sets (also public data coming from GEO or ArrayExpress) the featureData has not been defined.

Retrieving experiment annotation using affy

Microarray data sets should also include information on the experiment. AffyBatches have a slot for this called experimentData. For most data sets (also public data coming from GEO or ArrayExpress) the featureData has not been defined.

Retrieving annotation using oligo

All commands described above also work for the oligo package.

Retrieving other types of information

On Affymetrix arrays you should see an average of 10-20 probes per gene.

Quality control of microarray data

Giving the samples informative names

The code in this section works for both affy and oligo

If the phenoData object, that was created in the step where we retrieved the sample annotation, does not contain any information, Bioconductor will just give the CEL-files an index 1-6. However, the phenoData will be used as labels in plots. Give the samples more accurate names so they can be used in the plots that we are going to create.

You have to adjust this code according the number of groups and the number of replicates you have and change the sample names to names that are relevant for you. The order of the samples in the AffyBatch is determined by the CEL-file name of the sample (the CEL files are alphabetically ordered).

Create plots to assess the quality of the data

Instead of printing these plots in RStudio or the R editor, we will save the plots to our hard drive.

Microarray pictures

Microarray pictures can show large inconsistencies on individual arrays.

How to create microarray pictures


Chip pseudo-images

How to create chip pseudo-images



Another quality control check is to plot the distribution of log base 2 intensities (log2(PMij) for array i and probe j) of perfect match probes for comparison of probe intensity behavior between different arrays. If you see differences in shape or center of the distributions, it means that normalization is required.

How to create histograms of microarray data


Box plots

Boxplots and histograms show the same differences in probe intensity behavior between arrays. In order to perform meaningful statistical analysis and inferences from the data, you need to ensure that all the samples are comparable. To examine and compare the overall distribution of log transformed PM intensities between the samples you can use a histogram but you will get a clearer view with a box plot. Box plots show:

  • the median: center value, half of the intensities are lower than this value, half of the intensities are higher (= line in the box)
  • the upper quartile: a quarter of the values are higher than this quartile (= upper border of the box)
  • the lower quartile: a quarter of the values are lower than this quartile (= lower border of the box)
  • the range: minimum and maximum value (= borders of the whiskers)
  • individual extreme values (= points outside the whiskers)

How to create boxplots of microarray data


MA plots

Originally these MA plots were developed for two-color arrays to detect differences between the two color labels on the same array, and for these arrays they became hugely popular. This is why more and more people are now also using them for Affymetrix arrays but on Affymetrix only use a single color label. So people started using them to compare each Affymetrix array to a pseudo-array. The pseudo array consists of the median intensity of each probe over all arrays.

The MA plot shows to what extent the variability in expression depends on the expression level (more variation on high expression values?). In an MA-plot, A is plotted versus M:

  • M is the difference between the intensity of a probe on the array and the median intensity of that probe over all arrays
    M = logPMInt_array – logPMInt_medianarray
  • A is the average of the intensity of a probe on that array and the median intesity of that probe over all arrays
    A = (logPMInt_array + logPMInt_medianarray)/2

How to create MA plot of microarray data


Ideally, the cloud of data points in the MA-plot should be centered around M=0 (blue line). This is because we assume that the majority of the genes is not DE and that the number of upregulated genes is similar to the number of downregulated genes. Additionally, the variability of the M values should be similar across different array-medianarray combinations. You see that the spread of the point cloud increases with the average intensity: the loess curve (red line) moves further and further away from M=0 when A increases. To remove (some of this) dependency, we will normalize the data.

Calculate quality measures to assess the quality of the data

Calculate quality measures in affy

Apart from images, you can also calculate simple quality metrics to assess the quality of the arrays.

Calculate quality measures in oligo

The qc() method is implemented in the simpleaffy package but not in the oligo package. This means that you either have to

  • load both simpleaffy and oligo but then you always have the problem that you have to explicitly write oligo:: in front of every method you want to use from the oligo package (see section on using a method from a specific package)
  • first load simpleaffy, perform the qc() analysis, restart R, load oligo and perform the rest of the analysis

Normalization of microarray data

There are many sources of noise in microarray experiments:

  • different amounts of RNA used for labeling and hybridization
  • imperfections on the array surface
  • imperfect synthesis of the probes
  • differences in hybridization conditions
Systematic differences between the samples that are due to noise rather than true biological variability should be removed in order to make biologically meaningfull conclusions about the data.

Normalization using RMA

The code in this section works for both affy and oligo

The standard method for normalization is RMA. RMA is one of the few normalization methods that only uses the PM probes:

  • Background correction to correct for spatial variation within individual arrays: a background-corrected intensity is calculated for each PM probe in such a way that all background corrected intensities are positive
  • Log transformation to improve the distribution of the data: the base-2 logarithm of the background corrected intensity is calculated for each probe. The log transformation will make the data less skewed and more normally distributed and provide an equal spread of up- and downregulated expression ratios
  • Quantile normalization to correct for variation between the arrays: equalizes the data distributions of the arrays and make the samples completely comparable
  • Probe normalization to correct for variation within probe sets: equalizes the behavior of the probes between the arrays and combines normalized data values of probes from a probe set into a single value for the whole probe set

Check out this excellent overview of RMA.

Normalization using GCRMA

GCRMA in affy

GCRMA is based on RMA, having all the good sides of RMA. The difference lies in the background correction, all other steps are the same. GCRMA corrects for non-specific binding to the probes in contrast to RMA which completely ignores the issue of non-specific binding.

GCRMA uses probe sequence information to estimate probe affinity to non-specific binding. Each nucleotide of the probe contributes to the affinity of the probe. The contributions of each nucleotide on each position of the probes were estimated in a pilot experiment with artificial DNAs that mismatch the probes, done by the people who developed the algorithm. In this experiment there was no specific binding so the only thing was measured was non-specific binding. The data of this experiment allowed to estimate the affinities of the probes.

GCRMA allows you to use the affinities calculated based on this reference experiment or you can let GCRMA compute probe affinities based on the signals of the negative control probes of your own microarray experiment.

The gcrma package contains all available methods to perform GCRMA normalization.

GCRMA in oligo

GCRMA makes use of the IndexProbes() method a lot, which is a method that works on AffyBatches and is not implemented for FeatureSets. As a result, there is no easy way to do GCRMA normalization in the oligo package.

Checking the effect of the normalization

Comparing raw and background-corrected data

To see the effect of the background correction you can create a plot of raw versus background corrected data.

How to compare raw and background-corrected microarray data

Comparing raw and normalized data

Comparing raw and normalized data in affy

After normalization you can compare raw and normalized data. You can do this for individual genes or for all genes.
We'll start by making the comparison for individual genes.

Comparing raw and normalized data in oligo

After normalization oligo does use probe set IDs as row names in the data.matrix object so you can retrieve normalized data for a specific probe set e.g. for HTA 2.0 arrays:


However, retrieving raw data by specifying a probe set ID in the pm() method does not work in oligo.

Comparing raw and normalized data using boxplots

How to create boxplots of microarray data



After normalization, none of the samples should stand out from the rest. The different arrays should have a very comparable median expression level. Also the scale of the boxes should be very comparable indicating that the spread of the intensity values on the different arrays is equalized.

Comparing raw and normalized data using MA plots

How to create MA plot of microarray data


When you compare this plot to the one created for the raw intensities on the same array, you see a much more symmetric and even spread of the data points indicating that the dependence of the variability on the expression level is not as strong as it was before normalization.

PCA plot

To check whether the overall variability of the samples reflects their grouping, you can perform a Principal Component Analysis. In other words, you can check if replicates are homogenous and distinguishable from samples of (the) other group(s).

How to create a PCA plot of microarray data


When you have groups in your data this should lead to a clear distinction between the groups.

Identification of DE genes

Identification of DE genes is not done by the affy nor the oligo package but by the limma package. Limma uses the output of the rma() method (data.rma) as input. Since the output of the rma() method is the same in the affy and in the oligo package, limma works well with both packages. This means that all following code is valid for all normalized Affymetrix data regardless of the package that was used for normalization.

Two groups of samples

The limma package contains functions for using a t-test or an ANOVA to identify differential expression in microarray data. These functions can be used for all array platforms and work even for microarray data with complex designs of multiple samples. The central idea is to fit a linear model to the expression data of each gene. The expression data can be log-ratios or log-intensities. Computational methods are used to borrow information across genes making the analyses stable even for experiments with a small number of arrays. Limma is designed to be used in conjunction with the affy package.

In the above example we compared samples from two groups but in many microarray experiments, there are more than two groups. In this case we have to perform a moderated one-way ANOVA for each gene.

Three groups of samples

Paired data

Up to now we have always assumed that the groups are independent, meaning that samples of one group have no relation whatsoever with samples from another group. In independent data sets you have two groups of subjects (patients, mice, plants,....). You use one group as controls and you give the other group a treatment.
Paired or dependent data means that there exists a relationship between the samples of different groups. The typical example is having a group of subjects and measuring each of them before and after treatment.
It is important to tell limma if your data is paired or not since you need to use a different type of statistical test on paired compared to independent data:

  • paired t-test instead of regular two-samples t-test
  • repeated measures ANOVA instead of regular one-way ANOVA

Two grouping variables

Sometimes you have two grouping variables, e.g.

  • gender and treatment if you think a drug has a different effect on men and women
  • strain and treatment if you think a drug has a different effect on different mice strains

How to compare microarray data grouped according to two variables

Generating a Volcano plot

The best way to decide on the number of DE genes you are going to select is via a Volcano plot.


Volcano plots arrange genes along biological and statistical significance. The X-axis gives the log fold change between the two groups (log: so that up and down regulation appear symmetric), and the Y-axis represents the p-value of a t-test comparing samples (on a negative log scale so smaller p-values appear higher up). The first axis indicates biological impact of the change; the second indicates the statistical evidence of the change.
To decide on the number of DE genes that you’re going to proceed with, you can make Volcano plots highlighting different numbers of genes. As you increase the number of genes highlighted on the plot, you can see at what point you begin to select genes with rather low fold changes. Make a conservative decision about the number of genes you want to use for follow up.

Adjusting for multiple testing and defining DE genes

You are doing a t-test on each gene, meaning that you will be doing more than 20000 t-tests on the data set. You have to adjust the p-values of the t-tests for multiple testing or you will generate too many false positives.

Of course your final aim is to generate a table containing the genes that you consider DE (the genes with the lowest adjusted p-values and the most extreme log fold changes). You can then use this table to search for functional relations between these genes...

How to define a table of DE genes in a microarray experiment

Creating lists of probe set IDs of DE genes for functional analysis

We go back to our simple comparison of mutant and wild-type samples. We have created a list of upregulated genes called topups and a list of downregulated genes (topdowns).

What if you used decideTests() to identify DE genes ? This method generates a matrix containing labels (0, 1 or -1) for each gene in each contrast.

You'll probably want to write the list of IDs to your computer so that you can use it in tools like DAVID.

Creating a heat map of the DE genes


Creating a Venn diagram to compare results of multiple comparisons

This can be easily done on the output of the decideTests() method.