Analyze your own microarray data in R/Bioconductor

On this page you can find the R-code to identify DE genes based on Affymetrix microarray data

1 Installation
2 Loading packages in R
3 Using a method from a specific package
4 Getting help / documentation in R
5 Open the CEL-files in R
- 5.1 Open CEL files from 3' Affymetrix Arrays (older ones) using affy
- 5.2 Open CEL files from newer Affymetrix Arrays (HTA, Gene ST...) using oligo
6 Looking at the data
7 Quality control of microarray data
8 Normalization of microarray data
9 Identification of DE genes

Installation

Install R (and RStudio)

Check out our R introduction tutorial to learn how to install R and RStudio.

Install the required R packages

To analyze microarray data, you need a specific R package, called Bioconductor.
However, Bioconductor uses functions and object from various other R packages, so you need to install these R packages too:

Matrix
lattice
fdrtool
rpart

So Bioconductor will only work after you have installed these packages.

Additionally, you will need an R-package for making graphs of the data, called ggplot2

Check out our R introduction tutorial to learn how to install these packages.

Install Bioconductor packages from Bioconductor repository

You need the following Bioconductor packages for Affymetrix array analysis:

affy, affyPLM and simpleaffy if you want with older arrays (3' arrays)
oligo if you work with newer arrays (HTA, Gene ST...)

It is important to realize that it is best to pick one of these two choices. If you load multiple packages with similar functionality, e.g. both affy and oligo or both simpleaffy and oligo, R will become confused. The affy-based packages and oligo contain methods with the same name (e.g. intensity(), MAplot(), rma()) but with slightly different code. So R will not know if it has to use the intensity() method from the affy packages or from oligo. By default it chooses affy over oligo. So you always have to specify the packagename for the oligo methods (see section on how to specify the package of a method) and even then it does not always work well.

affydata if you need example data sets
ArrayExpress if you want to obtain data from ArrayExpress
limma
Biobase
Biostrings
genefilter
Bioconductor packages for the annotation of the spots on the arrays

How to install Bioconductor packages in R

Installing Bioconductor packages from source

Some Bioconductor packages will not install by running the biocLite() command. In that case you can always install them from source. In this case you go to the Bioconductor page of the package you wish to instal, as an example we take the Biostrings package (allthough it installs fine using the biocLite() command.

How to install this package from source ?
Scroll down the Bioconductor page to the Package Archives section Select the file that corresponds to your operating system, e.g. for windows the zip file Download the file to your computer and use the following command to install it: install("C:\\path_to_the_file\\Biostrings_2.36.0.zip", repos = NULL, type="source")

Updating installed Bioconductor packages

After some time, Bioconductor packages might become outdated. Bioconductor packages are updated fairly regularly.

How to update BioConductor packages in R ?
To identify packages that can be updated within your version of Bioconductor, start a new session of R and enter the following commands: source("http://www.bioconductor.org/biocLite.R") biocLite()

Loading packages in R

Before you can do any programming in RStudio, you need to create a new project and a new script.
To use the installed R and BioConductor packages in R, you have to load them first.
Check out our R introduction tutorial to learn how to load these packages.

Using a method from a specific package

If you have loaded multiple packages with similar functionality, e.g. both affy and oligo or both affyPLM and oligo, R might become confused. Affy or affyPLM and oligo contain methods with the same name e.g. intensity(), MAplot(), rma()... but with slightly different code. If you now simply use the command

intensity(data)

R will not know if it has to use the intensity() method from affy or from oligo. By default it chooses affy over oligo. To tackle this problem you have to specify the packagename in front of the name of the method for oligo, e.g.

oligo::intensity(data)

and R will know that it has to use the intensity() method of the oligo package.

Getting help / documentation in R

Check out our R introduction tutorial to learn how to consult the R documentation.

Bioconductor is object-oriented R. It means that a package consists of classes. The classes define the behaviour and characteristics of a set of similar objects that belong to the class. The characteristics that objects of a class can have are called slots while the behaviour of the objects (the actions they can do) is described by the methods of a class.
For instance, there is a class AffyBatch, consisting of containers that hold microarray data in a structured way. You can create an AffyBatch object to hold your data. AffyBatches have many slots, one for the intensities, one for the sample annotation, one for the probe set annotation... There are numerous methods that operate on AffyBatches: you can retrieve perfect match intensities, you can create a boxplot, you can normalize the data...

How to show the documentation of a class ?
You can ask for documentation of classes: help(classname)

The documentation of a class shows:

a short description of the class
how to create objects of the class
a list of slots of the class
a list of methods that can operate on objects of the class
some examples

How to check which slots are available for an object ?
To find out which slots are available for a object, use the following command: slotNames(objectname)

Open the CEL-files in R

If you use Affymetrix chips your microarray data will consist of a series of CEL files containing raw intensities for each probe on the array. Save all CEL files you want to analyze in a single directory. They may be gzipped (*.gz) - you do not need to gunzip them.

Open CEL files from 3' Affymetrix Arrays (older ones) using affy

Data from older Affymetrix arrays should be analysed using the affy package.

How to open CEL files using affy ?
You have to tell R where your CEL files are stored. R cannot find them by itself. So you have to give R the path on your computer of the folder that contains the CEL files. # specify the path on your computer where the folder that contains the CEL-files is located celpath = "D:/R-2.15.2/library/affydata/celfiles/Apum/" Then you can import all the CEL files by a single command using the ReadAffy() method. ReadAffy will read all CEL files in the folder and load them into an AffyBatch object in R. You use the celfile.path argument to specify the location of the folder that contains the CEL files. # import CEL files containing raw probe-level data into an R AffyBatch object data = ReadAffy(celfile.path=celpath) So data is now an AffyBatch object containing the data from your CEL files.

If you are using a custom cdf file you need an additional line of code telling R that you are not using the standard Affymetrix cdf but a custom one.

How to use a custom cdf file from BrainArray ?
If you want to use a custom cdf from the BrainArray website you have to indicate to R you want to use the custom cdf before you run the ReadAffy method. To this end, AffyBatches have a slot called cdfName in which the name of the cdf file that is to be used for the analysis is stored. If you want to use a custom cdf you have to store its name in the cdfName slot of your AffyBatch object. Slots can be accessed using the “@” sign. So you can access a slot by using object@ followed by the name of the slot. # indicate you want to use the custom cdf # If you don't specify the cdfname, BioConductor will use the default Affymetrix cdf. data@cdfName="ATH1121501_At_TAIRT" You can find the name of the custom cdf by going to the package folder: Open the DESCRIPTION file and look in the Description line:

Open CEL files from newer Affymetrix Arrays (HTA, Gene ST...) using oligo

For these arrays the old affy package doesn't work - instead you'll have to use the oligo package

How to open CEL files using oligo ?
You have to tell R where your CEL files are stored. R cannot find them by itself. So you have to give R the path on your computer of the folder that contains the CEL files. # specify the path on your computer where the folder that contains the CEL-files is located celpath = "D:/R-2.15.2/library/affydata/celfiles/HTA/" The list.files() command should be used to obtain the list of CEL files in the folder that was specified by the celpath. You should therefore make sure that this folder only contains the CEL files you want to analyze. Then you can import all the CEL files by a single command using the read.celfiles() method. The oligo package does not save the data in an AffyBatch (as affy does) but uses different containers depending on the type of arrays used, e.g. HTAFeatureSet for HTA arrays, ExonFeatureSet for Exon ST arrays... This approach allows oligo to behave differently depending on the type of array. # import CEL files containing raw probe-level data into an R AffyBatch object list = list.files(celpath,full.names=TRUE) data = read.celfiles(list) So data is now a specific FeatureSet object containing the data from your CEL files.

Looking at the data

To view the full content of a simple object (vector, list, matrix or data frame): type the name of the object in the console.
To view specific items of a simple object: refer to their location using [ ] e.g.
object[1,2] will retrieve the item on the first row and second column
object[1:2,] will retrieve all columns of the first two rows
To view an entire column of a data frame: use $ e.g. dataframe$columnname
To access slots of complex objects: use object@ followed by the name of the slot.

Retrieving intensities

Retrieving intensities using affy

How to retrieve intensities using affy

Retrieving intensities using oligo

All code that is described above will work in the oligo package except for the code that use probeset IDs to retrieve their corresponding intensities.

However, if you have loaded oligo together with an affy-based package, affy, simpleaffy, affyPLM you will have to specify oligo:: in front of the intensity() and the pm() method (see section on specifying the package name of a method).

Retrieving annotation

Retrieving sample annotation using affy

Apart from the expression data itself, microarray data sets need to include information about the samples that were hybridized to the arrays, e.g. sample labels. AffyBatch objects have several slots (characteristics). One of them is called phenoData. It contains labels for the samples. Despite the name, there is no implication that the labels should be phenotypic, in fact they often indicate genotypes such as wild-type or knockout. They can also contain other types of information about the samples e.g. which samples are replicates, which samples were scanned by the same scanner, the amount of RNA that was hybridized to the arrays…
However, for most data sets the phenoData has not been defined.

How to retrieve the sample annotation of the data ?
ph = data@phenoData ph As you can see, ph is a data frame. Remember that you can select a column of a data frame using the $ sign. You find the names of the columns in varLabels: there is one column named sample ph$sample To look at all the data in the data frame ask for the data slot. ph@data Another way of looking at the sample annotation is using the pData() method on the AffyBatch object: pData(data)

If phenoData contains no information: CEL-files are given an index 1-n (n = total number of files). You can give the samples more accurate names so these can be used in the plots that we are going to create later on. We will see how to do this when we create the plots.

When you import CEL-files from GEO or ArrayExpress the phenoData should normally already contain informative names but many submitters skip this step so many data sets are not well annotated.

Retrieving probe annotation using affy

Microarray data sets should include information on the probes. AffyBatches have a slot called featureData, a data frame that contains labels for the probes. For most data sets (also public data coming from GEO or ArrayExpress) the featureData has not been defined.

How to retrieve the probe annotation of the data ?
feat = data@featureData feat feat@data

Retrieving experiment annotation using affy

Microarray data sets should also include information on the experiment. AffyBatches have a slot for this called experimentData. For most data sets (also public data coming from GEO or ArrayExpress) the featureData has not been defined.

How to retrieve the experiment annotation ?
exp = data@experimentData exp

Retrieving annotation using oligo

All commands described above also work for the oligo package.

Retrieving other types of information

How to retrieve the name of the CDF file associated with the arrays ?
Use the cfdName() method on the AffyBatch cdfName(data)

How to retrieve the IDs of the probe sets that are represented on the arrays ?
Use the featureNames() method on the AffyBatch featureNames(data)

How to retrieve the number of probe sets represented on the arrays ?
Use the length() method to count the number of items in a vector length(featureNames(data))

How to retrieve the number of probes represented on the arrays ?
Use the length() method to count the number of items in the vector containing names of all probes length(probeNames(data))

On Affymetrix arrays you should see an average of 10-20 probes per gene.

Quality control of microarray data

Giving the samples informative names

The code in this section works for both affy and oligo

If the phenoData object, that was created in the step where we retrieved the sample annotation, does not contain any information, Bioconductor will just give the CEL-files an index 1-6. However, the phenoData will be used as labels in plots. Give the samples more accurate names so they can be used in the plots that we are going to create.

How to add annotation to phenoData ?
We define that the first column of the data slot in the phenoData corresponds to the vector containing the sample names created by the c() command: ph@data[ ,1] = c("wt1","wt2","wt3","mut1","mut2","mut3") ph This is an example where we have a data set consisting of two groups of 3 replicates, 3 wild type controls and 3 mutants. If you have 3 groups of 3 replicates: ph@data[ ,1] = c("control1","control2","control3","iso1","iso2","iso3","swim1","swim2","swim3") ph@data This is an example where we have a data set consisting of three groups of 3 replicates, 3 control mice, 3 mice that were treated with a drug and 3 mice were treated by performing physical exercises

You have to adjust this code according the number of groups and the number of replicates you have and change the sample names to names that are relevant for you. The order of the samples in the AffyBatch is determined by the CEL-file name of the sample (the CEL files are alphabetically ordered).

Create plots to assess the quality of the data

Instead of printing these plots in RStudio or the R editor, we will save the plots to our hard drive.

Microarray pictures

Microarray pictures can show large inconsistencies on individual arrays.

How to create microarray pictures

Chip pseudo-images

How to create chip pseudo-images

Histograms

Another quality control check is to plot the distribution of log base 2 intensities (log2(PMij) for array i and probe j) of perfect match probes for comparison of probe intensity behavior between different arrays. If you see differences in shape or center of the distributions, it means that normalization is required.

How to create histograms of microarray data

Box plots

Boxplots and histograms show the same differences in probe intensity behavior between arrays. In order to perform meaningful statistical analysis and inferences from the data, you need to ensure that all the samples are comparable. To examine and compare the overall distribution of log transformed PM intensities between the samples you can use a histogram but you will get a clearer view with a box plot. Box plots show:

the median: center value, half of the intensities are lower than this value, half of the intensities are higher (= line in the box)
the upper quartile: a quarter of the values are higher than this quartile (= upper border of the box)
the lower quartile: a quarter of the values are lower than this quartile (= lower border of the box)
the range: minimum and maximum value (= borders of the whiskers)
individual extreme values (= points outside the whiskers)

How to create boxplots of microarray data

MA plots

Originally these MA plots were developed for two-color arrays to detect differences between the two color labels on the same array, and for these arrays they became hugely popular. This is why more and more people are now also using them for Affymetrix arrays but on Affymetrix only use a single color label. So people started using them to compare each Affymetrix array to a pseudo-array. The pseudo array consists of the median intensity of each probe over all arrays.

The MA plot shows to what extent the variability in expression depends on the expression level (more variation on high expression values?). In an MA-plot, A is plotted versus M:

M is the difference between the intensity of a probe on the array and the median intensity of that probe over all arrays
M = logPMInt_array – logPMInt_medianarray
A is the average of the intensity of a probe on that array and the median intesity of that probe over all arrays
A = (logPMInt_array + logPMInt_medianarray)/2

How to create MA plot of microarray data

Ideally, the cloud of data points in the MA-plot should be centered around M=0 (blue line). This is because we assume that the majority of the genes is not DE and that the number of upregulated genes is similar to the number of downregulated genes. Additionally, the variability of the M values should be similar across different array-medianarray combinations. You see that the spread of the point cloud increases with the average intensity: the loess curve (red line) moves further and further away from M=0 when A increases. To remove (some of this) dependency, we will normalize the data.

Calculate quality measures to assess the quality of the data

Calculate quality measures in affy

Apart from images, you can also calculate simple quality metrics to assess the quality of the arrays.

How to calculate the quality measures of each array ?
Quality measures can be calculated by applying the qc() method on an AffyBatch object. data.qc = qc(data) The first quality measure are the average intensities of the background probes on each array. You can retrieve them by using the avbg() method. According to Affymetrix guidelines, the average background values of different arrays should be comparable. avbg(data.qc) The second quality measure are the scale factors: factors used to equalize the mean intensities of the arrays. You can retrieve them by using sfs() method. The scale factors should be within 3-fold of each other. sfs(data.qc) The third quality measure are the percent present calls: the percentage of spots that generate a significant signal (significantly higher than background) according to the Affymetrix detection algorithm. You can retrieve them by using percent.present() method. The precent present calls should be similar especially among replicates. Very low values (<20%) are a possible indicator of a poor quality array. percent.present(data.qc) The last quality measure are the 3'/5' ratios of the quality control probe sets representing housekeeping genes (genes expressed in any tissue in any organism) like actin and GADPH. For these housekeeping genes 3 probe sets are available: one at the 5' end of the gene, one at the middle and one at the 3' end of the gene. You can retrieve them by using ratios() method. Affymetrix suggests that 3'/5' ratios below 3 show acceptable RNA degradation and recommend caution if that value is exceeded for a given array. ratios(data.qc)

Calculate quality measures in oligo

The qc() method is implemented in the simpleaffy package but not in the oligo package. This means that you either have to

load both simpleaffy and oligo but then you always have the problem that you have to explicitly write oligo:: in front of every method you want to use from the oligo package (see section on using a method from a specific package)
first load simpleaffy, perform the qc() analysis, restart R, load oligo and perform the rest of the analysis

Normalization of microarray data

There are many sources of noise in microarray experiments:

different amounts of RNA used for labeling and hybridization
imperfections on the array surface
imperfect synthesis of the probes
differences in hybridization conditions

Systematic differences between the samples that are due to noise rather than true biological variability should be removed in order to make biologically meaningfull conclusions about the data.

Normalization using RMA

The code in this section works for both affy and oligo

The standard method for normalization is RMA. RMA is one of the few normalization methods that only uses the PM probes:

Background correction to correct for spatial variation within individual arrays: a background-corrected intensity is calculated for each PM probe in such a way that all background corrected intensities are positive
Log transformation to improve the distribution of the data: the base-2 logarithm of the background corrected intensity is calculated for each probe. The log transformation will make the data less skewed and more normally distributed and provide an equal spread of up- and downregulated expression ratios
Quantile normalization to correct for variation between the arrays: equalizes the data distributions of the arrays and make the samples completely comparable
Probe normalization to correct for variation within probe sets: equalizes the behavior of the probes between the arrays and combines normalized data values of probes from a probe set into a single value for the whole probe set

How to normalize the data using RMA ?
The rma() method in the affy package produces a data matrix for Affymetrix arrays. The input for rma() is an AffyBatch object while the output from rma() is an exprSet object with the data matrix containing the normalized log-intensities in the exprs slot. data.rma = rma(data) data.matrix = exprs(data.rma) The ExpressionSet class is the superclass of the AffyBatch class meaning that AffyBatches are a special type of ExpressionSets. AffyBatches will therefore have the same characteristics and behaviour as ExpressionSets but AffyBatches will also have a set of specific characteristics and functions that are not shared by ExpressionSets.

Check out this excellent overview of RMA.

Normalization using GCRMA

GCRMA in affy

GCRMA is based on RMA, having all the good sides of RMA. The difference lies in the background correction, all other steps are the same. GCRMA corrects for non-specific binding to the probes in contrast to RMA which completely ignores the issue of non-specific binding.

GCRMA uses probe sequence information to estimate probe affinity to non-specific binding. Each nucleotide of the probe contributes to the affinity of the probe. The contributions of each nucleotide on each position of the probes were estimated in a pilot experiment with artificial DNAs that mismatch the probes, done by the people who developed the algorithm. In this experiment there was no specific binding so the only thing was measured was non-specific binding. The data of this experiment allowed to estimate the affinities of the probes.

GCRMA allows you to use the affinities calculated based on this reference experiment or you can let GCRMA compute probe affinities based on the signals of the negative control probes of your own microarray experiment.

The gcrma package contains all available methods to perform GCRMA normalization.

How to do a GCRMA normalization ?
First of all, you need to load the gcrma package: library(“gcrma”) To perform GCRMA normalization: data.gcrma = gcrma(data) As said before, GCRMA uses the affinity of each probe during background correction. These probe affinities are stored in an AffyBatch object, called affinity.info. By default, affinity.info is not specified in the gcrma command: then affinities are computed by gcrma based on the data of the reference experiment. However, the user can also choose to compute the affinities based on the data of their own experiment and use these affinities during normalization: my.affinity.info = compute.affinities.local(data) data.gcrma = gcrma(data,affinity.info=my.affinity.info) The gcrma command comes with two additional arguments fast: the default value is FALSE. If fast is set to TRUE, specific procedures are used to speed up the background correction. type: has 3 values fullmodel: uses probe sequences to calculate the affinities and combines the affinities with MM probe signals for background extraction affinities: uses probe sequences to calculate affinities and ignores MM probes MM: background correction based on MM intensities alone So to speed up calculations use the following command: data.gcrma = gcrma(data,fadt=TRUE)

GCRMA in oligo

GCRMA makes use of the IndexProbes() method a lot, which is a method that works on AffyBatches and is not implemented for FeatureSets. As a result, there is no easy way to do GCRMA normalization in the oligo package.

Checking the effect of the normalization

Comparing raw and background-corrected data

To see the effect of the background correction you can create a plot of raw versus background corrected data.

How to compare raw and background-corrected microarray data

Comparing raw and normalized data

Comparing raw and normalized data in affy

After normalization you can compare raw and normalized data. You can do this for individual genes or for all genes.
We'll start by making the comparison for individual genes.

How to compare raw and normalized intensities for individual genes ?
The normalized intensities are stored in data.matrix. The probe set IDs are used as row names: data.matrix["256852_at",] Raw intensities are stored in data, you can retrieve the raw PM intensities by using the pm() method. By using the probe set ID as a second argument, you can retrieve the PM intensities of the row with this name: pm(data,"256852_at")

Comparing raw and normalized data in oligo

After normalization oligo does use probe set IDs as row names in the data.matrix object so you can retrieve normalized data for a specific probe set e.g. for HTA 2.0 arrays:

data.matrix["2824546_st",]

However, retrieving raw data by specifying a probe set ID in the pm() method does not work in oligo.

Comparing raw and normalized data using boxplots

How to create boxplots of microarray data

After normalization, none of the samples should stand out from the rest. The different arrays should have a very comparable median expression level. Also the scale of the boxes should be very comparable indicating that the spread of the intensity values on the different arrays is equalized.

Comparing raw and normalized data using MA plots

How to create MA plot of microarray data

When you compare this plot to the one created for the raw intensities on the same array, you see a much more symmetric and even spread of the data points indicating that the dependence of the variability on the expression level is not as strong as it was before normalization.

PCA plot

To check whether the overall variability of the samples reflects their grouping, you can perform a Principal Component Analysis. In other words, you can check if replicates are homogenous and distinguishable from samples of (the) other group(s).

How to create a PCA plot of microarray data

When you have groups in your data this should lead to a clear distinction between the groups.

Identification of DE genes

Identification of DE genes is not done by the affy nor the oligo package but by the limma package. Limma uses the output of the rma() method (data.rma) as input. Since the output of the rma() method is the same in the affy and in the oligo package, limma works well with both packages. This means that all following code is valid for all normalized Affymetrix data regardless of the package that was used for normalization.

Two groups of samples

The limma package contains functions for using a t-test or an ANOVA to identify differential expression in microarray data. These functions can be used for all array platforms and work even for microarray data with complex designs of multiple samples. The central idea is to fit a linear model to the expression data of each gene. The expression data can be log-ratios or log-intensities. Computational methods are used to borrow information across genes making the analyses stable even for experiments with a small number of arrays. Limma is designed to be used in conjunction with the affy package.

How to use limma for comparing two groups of samples ?
First of all you need to tell limma which samples are replicates and which samples belong to different groups by providing this information in the phenoData slot of the AffyBatch/FeatureSet. To this end, we add a second column with sample annotation describing the source of each sample. We will give this new column a name. ph@data[ ,2] = c("control","control","control","mutant","mutant","mutant") colnames(ph@data)[2]="source" ph@data Now you need to tell limma which sample belongs to which group. This info is contained in the second column (the column that we called source) of the PhenoData. groups = ph@data$source The names of the groups have to be transformed into factors. Factors in statistics represent the variable that you control: which sample belongs to which group. You can factorize the grouping variable by using the factor() method f = factor(groups,levels=c("control","mutant")) Then you need to create a design matrix, a matrix of values of the grouping variable. ANOVA needs such a matrix to know which samples belong to which group. Since limma performs an ANOVA or a t-test (which is just a specific case of ANOVA), it needs such a design matrix. You can create it using the model.matrix() method. The argument of the model.matrix method is a model formula. The argument of the model.matrix method is a model formula. The tilde (~) in the argument specifies the right hand side of a model equation. If you define the design matrix with ~0, limma will simply calculate the mean expression level in each group. Provide informative names for each column using the colnames() method. design = model.matrix(~ 0 + f) colnames(design) = c("control","mutant") Essentially, limma will compare the mean expression level in the control samples to the mean expression level in mutant samples and this for each gene. So first of all, limma needs to calculate the mean expression levels using the lmFit() method. This method will fit a linear model (defined in design) to the data to calculate the mean expression level in the control and in the mutant samples: data.fit = lmFit(data.matrix,design) You can view the results of the fit. For instance you can show the estimates of the fit for the first 10 probe sets. data.fit$coefficients[1:10,] The first column contains the mean log expression in control samples. The second column contains the mean log expression in mutant samples. Now you have to tell limma which groups you want to compare. For this you define a contrast matrix defining the contrasts (comparisons) of interest by using the makeContrasts() method. Using the column names of the design matrix you specify the baseline sample (control) you want to compare to and the sample of interest (mutant). contrast.matrix = makeContrasts(mutant-control,levels=design) data.fit.con = contrasts.fit(data.fit,contrast.matrix) Now limma is ready to perform the statistical test to compare the groups. Since we are comparing two groups, it will perform a t-test. Limma does not perform a standard t-test but it will calculate a moderated t-statistic with shrunken standard deviation for each gene (see slides). An empirical Bayes method is used to shrink the variance of each gene towards a common value for all the genes. This is to lower the influence of very low or very high standard deviations on the t-test. Since the number of replicates is very low, the standard deviations will not be very reliable, ordinary t-statistics are not recommended. However, although there are only a few replications for each gene, the total number of measurements is very large. So information is combined across the genes (i.e., genome-wide shrinkage) to improve performance. The moderated t-test is performed by using the eBayes() method. data.fit.eb = eBayes(data.fit.con) The eBayes() method returns a data frame data.fit.eb containing multiple slots. To get an overview of all the slots, use the names() method: names(data.fit.eb) You can retrieve the log fold changes of each gene via the coefficients slot: data.fit.eb$coefficients[1:10,] This slot contains the coefficient of the contrast: it's simply the difference between the mean log expression in mutant samples and the mean log expression in control samples. This difference is usually called a log fold change. The eBayes() method has performed a moderated t-test on each gene. That means it has calculated a t-statistic and a corresponding p-value for each gene. These values can be found in the t and the p.valuje slot. To view the t-statistics and p-values of the moderated t-test for the first 10 probe sets: data.fit.eb$t[1:10,] data.fit.eb$p.value[1:10,] Remarkably, data.fit.eb also contains F-statistics and p-values for this F-statistic in the F and F.p.value slots while an F-statistic is in an ANOVA. Essentially, a t-test is a special case of an ANOVA used for single comparisons. If you have just a single comparison the F-statistic is the square of the t-statistic. For multiple comparisons, this relation is not true.

In the above example we compared samples from two groups but in many microarray experiments, there are more than two groups. In this case we have to perform a moderated one-way ANOVA for each gene.

Three groups of samples

How to compare three groups ?
As an example we will compare drug treated mice and mice treated with physical exercise to a set of control mice. First of all you need to tell limma which samples are replicates and which samples belong to different groups by providing this information in the phenoData slot of the AffyBatch/FeatureSet. To this end, we add a second column with sample annotation describing the source of each sample. We will give this new column a name: ph@data[ ,2] = c("control","control","control","drug","drug","drug","exercise","exercise","exercise") colnames(ph@data)[2]="source" ph@data Since we have 3 groups with 3 replicates each, the factor that determines the grouping will have 3 levels instead of 2, so the code is as follows: f = factor(groups,levels = c("control","drug","exercise")) Then you need to create a design matrix, a matrix of values of the grouping variable. ANOVA needs such a matrix to know which samples belong to which group. Since limma performs an ANOVA, it needs such a design matrix. You can create it using the model.matrix() method. The argument of the model.matrix method is a model formula. The tilde (~) in the argument specifies the right hand side of a model equation. If you define the design matrix with ~0, limma will simply calculate the mean expression level in each group. design = model.matrix(~ 0 + f) colnames(design)=c("control","drug","exercise") design data.fit = lmFit(data.rma,design) Afterwards you can tell limma which groups you want to compare. For this you define a contrast matrix defining the contrasts (comparisons) of interest by using the makeContrasts() method. In this example we make all pairwise comparisons: contrast.matrix = makeContrasts(exercise-control,drug-exercise,drug-control,levels=design) data.fit.con = contrasts.fit(data.fit,contrast.matrix) data.fit.eb = eBayes(data.fit.con) You can view the results of the ANOVA in the slots of the data.fit.eb object. The statistic that is calculated in ANOVA is the F-statistic, you can find the F-statistic and its corresponding p-value for each gene in the F and F.p.value slots. ANOVA will just determine if there is a difference between the 3 groups but it will not tell you which groups differ: is it exercise and drug or drug and control ? You have no idea based on the results of the ANOVA. That's why an ANOVA is always followed by a series of pairwise comparisons. The t-statistics and the resulting p-values of the pairwise comparisons are stored in the t and p.value slots. Log fold changes can be found in the coefficients slot. data.fit.eb$coefficients[1:10,] data.fit.eb$t[1:10,] data.fit.eb$p.value[1:10,] data.fit.eb$F[1:10] data.fit.eb$F.p.value[1:10]

Paired data

Up to now we have always assumed that the groups are independent, meaning that samples of one group have no relation whatsoever with samples from another group. In independent data sets you have two groups of subjects (patients, mice, plants,....). You use one group as controls and you give the other group a treatment.
Paired or dependent data means that there exists a relationship between the samples of different groups. The typical example is having a group of subjects and measuring each of them before and after treatment.
It is important to tell limma if your data is paired or not since you need to use a different type of statistical test on paired compared to independent data:

paired t-test instead of regular two-samples t-test
repeated measures ANOVA instead of regular one-way ANOVA

How to compare two groups of paired data ?
As an example we will compare 2 groups of 3 patients before and after treatment: Treatment is the grouping variable dividing the data set into two groups: before and after treatment. To tell limma that your data is paired you just create a second grouping variable called patient: ph@data[ ,2] = c("before","treated","before","treated","before","treated") colnames(ph@data)[2]="Treatment" ph@data[ ,3] = c("patient1","patient1","patient2","patient2","patient3","patient3") colnames(ph@data)[3]="Patient" ph@data Of course, now you need to factorize both grouping variables: groupsP = ph@data$Patient groupsT = ph@data$Treatment fp = factor(groupsP,levels=c("patient1","patient2","patient3")) ft = factor(groupsT,levels=c("before","treated")) Then you need to create a design matrix, a matrix of values of the grouping variable. ANOVA needs such a matrix to know which samples belong to which group. Since limma performs an ANOVA, it needs such a design matrix. You can create it using the model.matrix() method. The argument of the model.matrix method is a model formula. The tilde (~) in the argument specifies the right hand side of the model equation. The plus (+) is used to combine factors. paired.design = model.matrix(~ fp + ft) colnames(paired.design)=c("Intercept","Patient2vsPatient1","Patient3versusPatient1","AftervsBefore") data.fit = lmFit(data.rma,paired.design) data.fit$coefficients Lmfit() will fit a linear model to the data. It will create a data frame called data.fit. The interesting data is in the coefficients table which contains 4 columns: the first column contains the intercept of the linear fit, in most cases it has no implicit meaning. the second column compares patient 1 and patient 2: it is the difference between the average expression of patient 2 over the two treatments and the average expression level of patient 1 over the two treatments. the third column compares patient 1 and patient 3 in the same way. the fourth column compares before and after treatment: it calculates the average difference in expression level between after and before treatment. This is the measure that is used in the paired t-test and compared to 0. So these are the values we are interested in. The more this measure differs from 0 the more likely it is that the treatment has an effect. Performing a moderated paired t-test is now done using eBayes(). data.fit.eb = eBayes(data.fit) data.fit.eb$p.value[,1:20] Again, the relevant p-values are in the fourth column, called Treatment. These are the p-values generated by the comparison of after treatment and before treatment.

How to compare three groups of paired data ?
As an example we will compare 3 groups of 3 patients before, during and after treatment: Treatment is the grouping variable dividing the data set into two groups: before and after treatment. To tell limma that your data is paired you just create a second grouping variable called patient: ph@data[ ,2] = c("before","during","treated","before","during","treated","before","during","treated") colnames(ph@data)[2]="Treatment" ph@data[ ,3] = c("P1","P1","P1","P2","P2","P2","P3","P3","P3") colnames(ph@data)[3]="Patient" ph@data Of course, now you need to factorize both grouping variables: groupsP = ph@data$Patient groupsT = ph@data$Treatment fp = factor(groupsP,levels=c("P1","P2","P3")) ft = factor(groupsT,levels=c("before","during","treated")) Then you need to create a design matrix, a matrix of values of the grouping variable. ANOVA needs such a matrix to know which samples belong to which group. Since limma performs an ANOVA, it needs such a design matrix. You can create it using the model.matrix() method. The argument of the model.matrix method is a model formula. The tilde (~) in the argument specifies the right hand side of a model equation. The plus (+) is used to combine factors. paired.design = model.matrix(~ fp + ft) colnames(paired.design)=c("Intercept","P2vsP1","P3vsP1","DuringvsBefore","AftervsBefore") data.fit = lmFit(data.rma,paired.design) data.fit$coefficients Lmfit() will fit a linear model to the data. It will create a data frame called data.fit. The interesting data is in the coefficients table which contains 5 columns: the first column contains the intercept of the linear model. In most cases, it has no intrinsic meaning the second column compares patient1 and patient2: it is the difference between the average expression of patient2 over the 3 treatments and the average expression level of patient1 over the 3 treatments. the third column compares patient1 and patient3 in the same way. the fourth column compares during and before treatment: it calculates the average difference in expression level between during and before treatment. This is the measure that is used in the repeated measures ANOVA. So these are values we are interested in. the fifth column compares after and before treatment: it calculates the average difference in expression level between after and before treatment. This is the measure that is used in the repeated measures ANOVA. So these are values we are interested in. Performing the repeated measures ANOVA is now done using eBayes(). data.fit.eb = eBayes(data.fit) data.fit.eb$p.value[,1:20] Again, the relevant p-values are in the fourth and fifth column, called DuringvsBefore and AftervsBefore. These are the p-values generated by the comparison of bfore, during and after treatment. What if you also want to compare after and during treatment ? Then you need to create a contrast matrix to tell limma which comparison to make. To do this you use the makeContrasts() method: cont.matrix = makeContrasts(aftervsduring=after-during,levels=paired.design) data.contr = contrasts.fit(data.fit,cont.matrix) data.contr$coefficients data.fit.eb = eBayes(data.contr)

Two grouping variables

Sometimes you have two grouping variables, e.g.

gender and treatment if you think a drug has a different effect on men and women
strain and treatment if you think a drug has a different effect on different mice strains

How to compare microarray data grouped according to two variables

Generating a Volcano plot

The best way to decide on the number of DE genes you are going to select is via a Volcano plot.

How to generate a Volcano plot of the data ?
A Volcano plot is generated by using the volcanoplot() method on the output of the moderated t-test. The coef parameter specifies the column of data.fit.eb that should be used for the plot. In our example in Arabidopsis coef=2 since the second column of data.fit.eb contains the results of the comparison between mutant and control plants. In our example of paired data data.fit.eb contains four columns: the second containing the results of the comparison between patient2 and patient1 the third containing the results of the comparison between patient3 and patient1 the fourth containing the results of the comparison between before and after the treatment So in this example you set coef=2 or coef=3 if you want to find genes that are DE between patients or coef=4 if you want to find genes that are DE as a result of the treatment. The highlight parameter allows to specify the number of highest scoring genes (on the Y-axis) for which names will be attached on the plot. name = "Volcano.jpg" jpeg(name) volcanoplot(data.fit.eb,coef=2,highlight=10) dev.off()

Volcano plots arrange genes along biological and statistical significance. The X-axis gives the log fold change between the two groups (log: so that up and down regulation appear symmetric), and the Y-axis represents the p-value of a t-test comparing samples (on a negative log scale so smaller p-values appear higher up). The first axis indicates biological impact of the change; the second indicates the statistical evidence of the change.
To decide on the number of DE genes that you’re going to proceed with, you can make Volcano plots highlighting different numbers of genes. As you increase the number of genes highlighted on the plot, you can see at what point you begin to select genes with rather low fold changes. Make a conservative decision about the number of genes you want to use for follow up.

Adjusting for multiple testing and defining DE genes

You are doing a t-test on each gene, meaning that you will be doing more than 20000 t-tests on the data set. You have to adjust the p-values of the t-tests for multiple testing or you will generate too many false positives.

Of course your final aim is to generate a table containing the genes that you consider DE (the genes with the lowest adjusted p-values and the most extreme log fold changes). You can then use this table to search for functional relations between these genes...

How to define a table of DE genes in a microarray experiment

Creating lists of probe set IDs of DE genes for functional analysis

We go back to our simple comparison of mutant and wild-type samples. We have created a list of upregulated genes called topups and a list of downregulated genes (topdowns).

How to create a list of IDs of genes that are DE according to topTable() ?
The tab table that is generated by the topTable() method and all the tables that are derived from it like topups and topdowns contain the probe set IDs of the selected genes as row names: IDs.up = rownames(topups) IDs.down = rownames(topdowns)

What if you used decideTests() to identify DE genes ? This method generates a matrix containing labels (0, 1 or -1) for each gene in each contrast.

How to create a list of IDs of genes that are DE according to decideTests() ?
Assume that we want to obtain the lists of genes that are up/downregulated in the exercise group compared to the control group. The labels for this contrast are stored in the first column of the DEresults matrix that was generated by the decideTests() method. First you select the upregulated genes (genes with value 1 in the first column) via the subset() method. Then you obtain their IDs via the rownames() method. ups = subset(DEresults, DEresults[,1]==1) downs = subset(DEresults, DEresults[,1]==-1) IDs.up = rownames(ups) IDs.down = rownames(downs)

You'll probably want to write the list of IDs to your computer so that you can use it in tools like DAVID.

How to write the probe set IDs of the DE genes to your computer ?
We will create a text file with one ID on each line since this is a format that is accepted by most tools. You can do this by using the write.table() method. The file argument specifies the file that you want to write to. The row.names argument specifies if row names are to be printed (the default is TRUE!). The col.names argument specifies if column names are to be printed (the default is TRUE!). The quote argument specifies if character or factor columns will be surrounded by double quotes (the default is TRUE!). write.table(IDs.up,row.names=FALSE,col.names=FALSE,quote=FALSE,file="d:/trainingen/R/upIDs.txt") write.table(IDs.down,row.names=FALSE,col.names=FALSE,quote=FALSE,file="d:/trainingen/R/downIDs.txt")

Creating a heat map of the DE genes

How to create a heat map of the normalized intensities of the DE genes ?
In this example we will create a heat map of the downregulated genes. Heat maps can be created via the ggplot() method. As always we need to get the data into the correct format for ggplot: a data matrix with normalized log intensities in the first column sample names in the second column probe set IDs in the third column We first have to obtain the normalized intensities of the upregulated genes. The normalized intensities are stored in data.matrix. The downregulated genes are stored in topdowns. We can obtain their probe set IDs via the rownames() method. Then we select their normalized intensites from data.matrix using their probe set IDs: data.matrix.up = data.matrix[(rownames(topdowns)),] We create three vectors: one to store the sample names in, called sampleNames one to store the probe set IDs in, called featureNames one to store the normalized log intensities in, called heatlogs The heatlogs vector will contain the normalized intensities of the upregulated genes in all six samples stacked into a single column. There are 71 downregulated genes in my data set. I can see that by using the dim() method and looking at the first number it returns (dim(topdowns)[1]). This means that the sampleNames vector needs 254 entries of each sample name stacked into a single column. The featureNames vector needs the probe set IDs of these genes repeated 6 times and stacked into a single column. sampleNames = vector() featureNames = vector() heatlogs = vector() for (i in 1:6) { sampleNames = c(sampleNames,rep(ph@data[i,1],dim(topdowns)[1])) featureNames = c(featureNames,rownames(data.matrix.up[1:dim(topdowns)[1],])) heatlogs = c(heatlogs,data.matrix.up[1:dim(topdowns)[1],i]) } Then we combine sample names, probe set IDs and normalized intensities into one data frame: heatData = data.frame(norm_logInt=heatlogs,sampleName=sampleNames,featureName=featureNames) Now we can create the plot. A heat map can be created using the geom_tile() method. Low normalized intensities will be plotted in green and high normalized intensities will be plotted in red. This can be done by using the scale_fill_gradient() method. dataHeat = ggplot(heatData, aes(sampleName,featureName)) dataHeat + geom_tile(aes(fill=norm_logInt)) + scale_fill_gradient(low="green", high="red")

Creating a Venn diagram to compare results of multiple comparisons

This can be easily done on the output of the decideTests() method.

How to create a Venn diagram ?
A Venn diagram can be created using the vennDiagram() method. vennDiagram(DEresults)