Analyze your own microarray data in R/Bioconductor
On this page you can find the R-code to identify DE genes based on Affymetrix microarray data
Contents
- 1 Installation
- 2 Loading packages in R
- 3 Using a method from a specific package
- 4 Getting help / documentation in R
- 5 Open the CEL-files in R
- 6 Looking at the data
- 7 Quality control of microarray data
- 8 Normalization of microarray data
- 9 Identification of DE genes
- 9.1 Two groups of samples
- 9.2 Three groups of samples
- 9.3 Paired data
- 9.4 Two grouping variables
- 9.5 Generating a Volcano plot
- 9.6 Adjusting for multiple testing and defining DE genes
- 9.7 Creating lists of probe set IDs of DE genes for functional analysis
- 9.8 Creating a heat map of the DE genes
- 9.9 Creating a Venn diagram to compare results of multiple comparisons
Installation
Install R (and RStudio)
Check out our R introduction tutorial to learn how to install R and RStudio.
Install the required R packages
To analyze microarray data, you need a specific R package, called Bioconductor.
However, Bioconductor uses functions and object from various other R packages, so you need to install these R packages too:
- Matrix
- lattice
- fdrtool
- rpart
Additionally, you will need an R-package for making graphs of the data, called ggplot2
Check out our R introduction tutorial to learn how to install these packages.
Install Bioconductor packages from Bioconductor repository
You need the following Bioconductor packages for Affymetrix array analysis:
- affy, affyPLM and simpleaffy if you want with older arrays (3' arrays)
- oligo if you work with newer arrays (HTA, Gene ST...)
- affydata if you need example data sets
- ArrayExpress if you want to obtain data from ArrayExpress
- limma
- Biobase
- Biostrings
- genefilter
- Bioconductor packages for the annotation of the spots on the arrays
It is important to realize that it is best to pick one of these two choices. If you load multiple packages with similar functionality, e.g. both affy and oligo or both simpleaffy and oligo, R will become confused. The affy-based packages and oligo contain methods with the same name (e.g. intensity(), MAplot(), rma()) but with slightly different code. So R will not know if it has to use the intensity() method from the affy packages or from oligo. By default it chooses affy over oligo. So you always have to specify the packagename for the oligo methods (see section on how to specify the package of a method) and even then it does not always work well.
How to install Bioconductor packages in R
Installing Bioconductor packages from source
Some Bioconductor packages will not install by running the biocLite() command. In that case you can always install them from source. In this case you go to the Bioconductor page of the package you wish to instal, as an example we take the Biostrings package (allthough it installs fine using the biocLite() command.
How to install this package from source ? |
---|
|
Updating installed Bioconductor packages
After some time, Bioconductor packages might become outdated. Bioconductor packages are updated fairly regularly.
How to update BioConductor packages in R ? |
---|
To identify packages that can be updated within your version of Bioconductor, start a new session of R and enter the following commands:
source("http://www.bioconductor.org/biocLite.R") biocLite() |
Loading packages in R
Before you can do any programming in RStudio, you need to create a new project and a new script.
To use the installed R and BioConductor packages in R, you have to load them first.
Check out our R introduction tutorial to learn how to load these packages.
Using a method from a specific package
If you have loaded multiple packages with similar functionality, e.g. both affy and oligo or both affyPLM and oligo, R might become confused. Affy or affyPLM and oligo contain methods with the same name e.g. intensity(), MAplot(), rma()... but with slightly different code. If you now simply use the command
intensity(data)
R will not know if it has to use the intensity() method from affy or from oligo. By default it chooses affy over oligo. To tackle this problem you have to specify the packagename in front of the name of the method for oligo, e.g.
oligo::intensity(data)
and R will know that it has to use the intensity() method of the oligo package.
Getting help / documentation in R
Check out our R introduction tutorial to learn how to consult the R documentation.
Bioconductor is object-oriented R. It means that a package consists of classes. The classes define the behaviour and characteristics of a set of similar objects that belong to the class. The characteristics that objects of a class can have are called slots while the behaviour of the objects (the actions they can do) is described by the methods of a class.
For instance, there is a class AffyBatch, consisting of containers that hold microarray data in a structured way. You can create an AffyBatch object to hold your data. AffyBatches have many slots, one for the intensities, one for the sample annotation, one for the probe set annotation... There are numerous methods that operate on AffyBatches: you can retrieve perfect match intensities, you can create a boxplot, you can normalize the data...
How to show the documentation of a class ? |
---|
You can ask for documentation of classes:
help(classname) |
The documentation of a class shows:
- a short description of the class
- how to create objects of the class
- a list of slots of the class
- a list of methods that can operate on objects of the class
- some examples
How to check which slots are available for an object ? |
---|
To find out which slots are available for a object, use the following command:
slotNames(objectname) |
Open the CEL-files in R
If you use Affymetrix chips your microarray data will consist of a series of CEL files containing raw intensities for each probe on the array. Save all CEL files you want to analyze in a single directory. They may be gzipped (*.gz) - you do not need to gunzip them.
Open CEL files from 3' Affymetrix Arrays (older ones) using affy
Data from older Affymetrix arrays should be analysed using the affy package.
How to open CEL files using affy ? |
---|
You have to tell R where your CEL files are stored. R cannot find them by itself. So you have to give R the path on your computer of the folder that contains the CEL files.
# specify the path on your computer where the folder that contains the CEL-files is located celpath = "D:/R-2.15.2/library/affydata/celfiles/Apum/" Then you can import all the CEL files by a single command using the ReadAffy() method. ReadAffy will read all CEL files in the folder and load them into an AffyBatch object in R. You use the celfile.path argument to specify the location of the folder that contains the CEL files. # import CEL files containing raw probe-level data into an R AffyBatch object data = ReadAffy(celfile.path=celpath) So data is now an AffyBatch object containing the data from your CEL files. |
If you are using a custom cdf file you need an additional line of code telling R that you are not using the standard Affymetrix cdf but a custom one.
How to use a custom cdf file from BrainArray ? |
---|
If you want to use a custom cdf from the BrainArray website you have to indicate to R you want to use the custom cdf before you run the ReadAffy method. To this end, AffyBatches have a slot called cdfName in which the name of the cdf file that is to be used for the analysis is stored. If you want to use a custom cdf you have to store its name in the cdfName slot of your AffyBatch object. Slots can be accessed using the “@” sign. So you can access a slot by using object@ followed by the name of the slot. # indicate you want to use the custom cdf # If you don't specify the cdfname, BioConductor will use the default Affymetrix cdf. data@cdfName="ATH1121501_At_TAIRT" You can find the name of the custom cdf by going to the package folder:
Open the DESCRIPTION file and look in the Description line:
|
Open CEL files from newer Affymetrix Arrays (HTA, Gene ST...) using oligo
For these arrays the old affy package doesn't work - instead you'll have to use the oligo package
How to open CEL files using oligo ? |
---|
You have to tell R where your CEL files are stored. R cannot find them by itself. So you have to give R the path on your computer of the folder that contains the CEL files.
# specify the path on your computer where the folder that contains the CEL-files is located celpath = "D:/R-2.15.2/library/affydata/celfiles/HTA/" The list.files() command should be used to obtain the list of CEL files in the folder that was specified by the celpath. You should therefore make sure that this folder only contains the CEL files you want to analyze. Then you can import all the CEL files by a single command using the read.celfiles() method. The oligo package does not save the data in an AffyBatch (as affy does) but uses different containers depending on the type of arrays used, e.g. HTAFeatureSet for HTA arrays, ExonFeatureSet for Exon ST arrays... This approach allows oligo to behave differently depending on the type of array. # import CEL files containing raw probe-level data into an R AffyBatch object list = list.files(celpath,full.names=TRUE) data = read.celfiles(list) So data is now a specific FeatureSet object containing the data from your CEL files. |
Looking at the data
- To view the full content of a simple object (vector, list, matrix or data frame): type the name of the object in the console.
- To view specific items of a simple object: refer to their location using [ ] e.g.
object[1,2] will retrieve the item on the first row and second column
object[1:2,] will retrieve all columns of the first two rows - To view an entire column of a data frame: use $ e.g. dataframe$columnname
- To access slots of complex objects: use object@ followed by the name of the slot.
Retrieving intensities
Retrieving intensities using affy
How to retrieve intensities using affy
Retrieving intensities using oligo
All code that is described above will work in the oligo package except for the code that use probeset IDs to retrieve their corresponding intensities.
However, if you have loaded oligo together with an affy-based package, affy, simpleaffy, affyPLM you will have to specify oligo:: in front of the intensity() and the pm() method (see section on specifying the package name of a method).
Retrieving annotation
Retrieving sample annotation using affy
Apart from the expression data itself, microarray data sets need to include information about the samples that were hybridized to the arrays, e.g. sample labels. AffyBatch objects have several slots (characteristics). One of them is called phenoData. It contains labels for the samples. Despite the name, there is no implication that the labels should be phenotypic, in fact they often indicate genotypes such as wild-type or knockout. They can also contain other types of information about the samples e.g. which samples are replicates, which samples were scanned by the same scanner, the amount of RNA that was hybridized to the arrays…
However, for most data sets the phenoData has not been defined.
How to retrieve the sample annotation of the data ? |
---|
ph = data@phenoData ph
As you can see, ph is a data frame. Remember that you can select a column of a data frame using the $ sign. You find the names of the columns in varLabels: there is one column named sample ph$sample To look at all the data in the data frame ask for the data slot. ph@data Another way of looking at the sample annotation is using the pData() method on the AffyBatch object: pData(data) |
If phenoData contains no information: CEL-files are given an index 1-n (n = total number of files). You can give the samples more accurate names so these can be used in the plots that we are going to create later on. We will see how to do this when we create the plots.
When you import CEL-files from GEO or ArrayExpress the phenoData should normally already contain informative names but many submitters skip this step so many data sets are not well annotated.
Retrieving probe annotation using affy
Microarray data sets should include information on the probes. AffyBatches have a slot called featureData, a data frame that contains labels for the probes. For most data sets (also public data coming from GEO or ArrayExpress) the featureData has not been defined.
How to retrieve the probe annotation of the data ? |
---|
feat = data@featureData feat feat@data |
Retrieving experiment annotation using affy
Microarray data sets should also include information on the experiment. AffyBatches have a slot for this called experimentData. For most data sets (also public data coming from GEO or ArrayExpress) the featureData has not been defined.
How to retrieve the experiment annotation ? |
---|
exp = data@experimentData exp |
Retrieving annotation using oligo
All commands described above also work for the oligo package.
Retrieving other types of information
How to retrieve the name of the CDF file associated with the arrays ? |
---|
Use the cfdName() method on the AffyBatch
cdfName(data) |
How to retrieve the IDs of the probe sets that are represented on the arrays ? |
---|
Use the featureNames() method on the AffyBatch
featureNames(data) |
How to retrieve the number of probe sets represented on the arrays ? |
---|
Use the length() method to count the number of items in a vector
length(featureNames(data)) |
How to retrieve the number of probes represented on the arrays ? |
---|
Use the length() method to count the number of items in the vector containing names of all probes
length(probeNames(data)) |
On Affymetrix arrays you should see an average of 10-20 probes per gene.
Quality control of microarray data
Giving the samples informative names
The code in this section works for both affy and oligo
If the phenoData object, that was created in the step where we retrieved the sample annotation, does not contain any information, Bioconductor will just give the CEL-files an index 1-6. However, the phenoData will be used as labels in plots. Give the samples more accurate names so they can be used in the plots that we are going to create.How to add annotation to phenoData ? |
---|
We define that the first column of the data slot in the phenoData corresponds to the vector containing the sample names created by the c() command:
ph@data[ ,1] = c("wt1","wt2","wt3","mut1","mut2","mut3") ph This is an example where we have a data set consisting of two groups of 3 replicates, 3 wild type controls and 3 mutants. ph@data[ ,1] = c("control1","control2","control3","iso1","iso2","iso3","swim1","swim2","swim3") ph@data This is an example where we have a data set consisting of three groups of 3 replicates, 3 control mice, 3 mice that were treated with a drug and 3 mice were treated by performing physical exercises |
You have to adjust this code according the number of groups and the number of replicates you have and change the sample names to names that are relevant for you. The order of the samples in the AffyBatch is determined by the CEL-file name of the sample (the CEL files are alphabetically ordered).
Create plots to assess the quality of the data
Instead of printing these plots in RStudio or the R editor, we will save the plots to our hard drive.
Microarray pictures
Microarray pictures can show large inconsistencies on individual arrays.
How to create microarray pictures
Chip pseudo-images
How to create chip pseudo-images
Histograms
Another quality control check is to plot the distribution of log base 2 intensities (log2(PMij) for array i and probe j) of perfect match probes for comparison of probe intensity behavior between different arrays. If you see differences in shape or center of the distributions, it means that normalization is required.
How to create histograms of microarray data
Box plots
Boxplots and histograms show the same differences in probe intensity behavior between arrays. In order to perform meaningful statistical analysis and inferences from the data, you need to ensure that all the samples are comparable. To examine and compare the overall distribution of log transformed PM intensities between the samples you can use a histogram but you will get a clearer view with a box plot. Box plots show:
- the median: center value, half of the intensities are lower than this value, half of the intensities are higher (= line in the box)
- the upper quartile: a quarter of the values are higher than this quartile (= upper border of the box)
- the lower quartile: a quarter of the values are lower than this quartile (= lower border of the box)
- the range: minimum and maximum value (= borders of the whiskers)
- individual extreme values (= points outside the whiskers)
How to create boxplots of microarray data
MA plots
Originally these MA plots were developed for two-color arrays to detect differences between the two color labels on the same array, and for these arrays they became hugely popular. This is why more and more people are now also using them for Affymetrix arrays but on Affymetrix only use a single color label. So people started using them to compare each Affymetrix array to a pseudo-array. The pseudo array consists of the median intensity of each probe over all arrays.
The MA plot shows to what extent the variability in expression depends on the expression level (more variation on high expression values?). In an MA-plot, A is plotted versus M:
- M is the difference between the intensity of a probe on the array and the median intensity of that probe over all arrays
M = logPMInt_array – logPMInt_medianarray - A is the average of the intensity of a probe on that array and the median intesity of that probe over all arrays
A = (logPMInt_array + logPMInt_medianarray)/2
How to create MA plot of microarray data
Ideally, the cloud of data points in the MA-plot should be centered around M=0 (blue line). This is because we assume that the majority of the genes is not DE and that the number of upregulated genes is similar to the number of downregulated genes. Additionally, the variability of the M values should be similar across different array-medianarray combinations. You see that the spread of the point cloud increases with the average intensity: the loess curve (red line) moves further and further away from M=0 when A increases. To remove (some of this) dependency, we will normalize the data.
Calculate quality measures to assess the quality of the data
Calculate quality measures in affy
Apart from images, you can also calculate simple quality metrics to assess the quality of the arrays.
How to calculate the quality measures of each array ? |
---|
Quality measures can be calculated by applying the qc() method on an AffyBatch object.
data.qc = qc(data) The first quality measure are the average intensities of the background probes on each array. avbg(data.qc) The second quality measure are the scale factors: factors used to equalize the mean intensities of the arrays. sfs(data.qc) The third quality measure are the percent present calls: the percentage of spots that generate a significant signal (significantly higher than background) according to the Affymetrix detection algorithm. percent.present(data.qc) The last quality measure are the 3'/5' ratios of the quality control probe sets representing housekeeping genes (genes expressed in any tissue in any organism) like actin and GADPH. For these housekeeping genes 3 probe sets are available: one at the 5' end of the gene, one at the middle and one at the 3' end of the gene. ratios(data.qc) |
Calculate quality measures in oligo
The qc() method is implemented in the simpleaffy package but not in the oligo package. This means that you either have to
- load both simpleaffy and oligo but then you always have the problem that you have to explicitly write oligo:: in front of every method you want to use from the oligo package (see section on using a method from a specific package)
- first load simpleaffy, perform the qc() analysis, restart R, load oligo and perform the rest of the analysis
Normalization of microarray data
There are many sources of noise in microarray experiments:
- different amounts of RNA used for labeling and hybridization
- imperfections on the array surface
- imperfect synthesis of the probes
- differences in hybridization conditions
Normalization using RMA
The code in this section works for both affy and oligo
The standard method for normalization is RMA. RMA is one of the few normalization methods that only uses the PM probes:
- Background correction to correct for spatial variation within individual arrays: a background-corrected intensity is calculated for each PM probe in such a way that all background corrected intensities are positive
- Log transformation to improve the distribution of the data: the base-2 logarithm of the background corrected intensity is calculated for each probe. The log transformation will make the data less skewed and more normally distributed and provide an equal spread of up- and downregulated expression ratios
- Quantile normalization to correct for variation between the arrays: equalizes the data distributions of the arrays and make the samples completely comparable
- Probe normalization to correct for variation within probe sets: equalizes the behavior of the probes between the arrays and combines normalized data values of probes from a probe set into a single value for the whole probe set
How to normalize the data using RMA ? |
---|
The rma() method in the affy package produces a data matrix for Affymetrix arrays. The input for rma() is an AffyBatch object while the output from rma() is an exprSet object with the data matrix containing the normalized log-intensities in the exprs slot.
data.rma = rma(data) data.matrix = exprs(data.rma) The ExpressionSet class is the superclass of the AffyBatch class meaning that AffyBatches are a special type of ExpressionSets. AffyBatches will therefore have the same characteristics and behaviour as ExpressionSets but AffyBatches will also have a set of specific characteristics and functions that are not shared by ExpressionSets. |
Check out this excellent overview of RMA.
Normalization using GCRMA
GCRMA in affy
GCRMA is based on RMA, having all the good sides of RMA. The difference lies in the background correction, all other steps are the same. GCRMA corrects for non-specific binding to the probes in contrast to RMA which completely ignores the issue of non-specific binding.
GCRMA uses probe sequence information to estimate probe affinity to non-specific binding. Each nucleotide of the probe contributes to the affinity of the probe. The contributions of each nucleotide on each position of the probes were estimated in a pilot experiment with artificial DNAs that mismatch the probes, done by the people who developed the algorithm. In this experiment there was no specific binding so the only thing was measured was non-specific binding. The data of this experiment allowed to estimate the affinities of the probes.
GCRMA allows you to use the affinities calculated based on this reference experiment or you can let GCRMA compute probe affinities based on the signals of the negative control probes of your own microarray experiment.
The gcrma package contains all available methods to perform GCRMA normalization.
How to do a GCRMA normalization ? |
---|
First of all, you need to load the gcrma package:
library(“gcrma”) To perform GCRMA normalization: data.gcrma = gcrma(data) As said before, GCRMA uses the affinity of each probe during background correction. These probe affinities are stored in an AffyBatch object, called affinity.info. my.affinity.info = compute.affinities.local(data) data.gcrma = gcrma(data,affinity.info=my.affinity.info) The gcrma command comes with two additional arguments
So to speed up calculations use the following command: data.gcrma = gcrma(data,fadt=TRUE) |
GCRMA in oligo
GCRMA makes use of the IndexProbes() method a lot, which is a method that works on AffyBatches and is not implemented for FeatureSets. As a result, there is no easy way to do GCRMA normalization in the oligo package.
Checking the effect of the normalization
Comparing raw and background-corrected data
To see the effect of the background correction you can create a plot of raw versus background corrected data.
How to compare raw and background-corrected microarray data
Comparing raw and normalized data
Comparing raw and normalized data in affy
After normalization you can compare raw and normalized data. You can do this for individual genes or for all genes.
We'll start by making the comparison for individual genes.
How to compare raw and normalized intensities for individual genes ? |
---|
The normalized intensities are stored in data.matrix. The probe set IDs are used as row names:
data.matrix["256852_at",] Raw intensities are stored in data, you can retrieve the raw PM intensities by using the pm() method. By using the probe set ID as a second argument, you can retrieve the PM intensities of the row with this name: pm(data,"256852_at")
|
Comparing raw and normalized data in oligo
After normalization oligo does use probe set IDs as row names in the data.matrix object so you can retrieve normalized data for a specific probe set e.g. for HTA 2.0 arrays:
data.matrix["2824546_st",]
However, retrieving raw data by specifying a probe set ID in the pm() method does not work in oligo.
Comparing raw and normalized data using boxplots
How to create boxplots of microarray data
After normalization, none of the samples should stand out from the rest. The different arrays should have a very comparable median expression level. Also the scale of the boxes should be very comparable indicating that the spread of the intensity values on the different arrays is equalized.
Comparing raw and normalized data using MA plots
How to create MA plot of microarray data
When you compare this plot to the one created for the raw intensities on the same array, you see a much more symmetric and even spread of the data points indicating that the dependence of the variability on the expression level is not as strong as it was before normalization.
PCA plot
To check whether the overall variability of the samples reflects their grouping, you can perform a Principal Component Analysis. In other words, you can check if replicates are homogenous and distinguishable from samples of (the) other group(s).
How to create a PCA plot of microarray data
When you have groups in your data this should lead to a clear distinction between the groups.
Identification of DE genes
Identification of DE genes is not done by the affy nor the oligo package but by the limma package. Limma uses the output of the rma() method (data.rma) as input. Since the output of the rma() method is the same in the affy and in the oligo package, limma works well with both packages. This means that all following code is valid for all normalized Affymetrix data regardless of the package that was used for normalization.
Two groups of samples
The limma package contains functions for using a t-test or an ANOVA to identify differential expression in microarray data. These functions can be used for all array platforms and work even for microarray data with complex designs of multiple samples. The central idea is to fit a linear model to the expression data of each gene. The expression data can be log-ratios or log-intensities. Computational methods are used to borrow information across genes making the analyses stable even for experiments with a small number of arrays. Limma is designed to be used in conjunction with the affy package.
How to use limma for comparing two groups of samples ? |
---|
First of all you need to tell limma which samples are replicates and which samples belong to different groups by providing this information in the phenoData slot of the AffyBatch/FeatureSet. To this end, we add a second column with sample annotation describing the source of each sample. We will give this new column a name.
ph@data[ ,2] = c("control","control","control","mutant","mutant","mutant") colnames(ph@data)[2]="source" ph@data Now you need to tell limma which sample belongs to which group. This info is contained in the second column (the column that we called source) of the PhenoData. groups = ph@data$source The names of the groups have to be transformed into factors. Factors in statistics represent the variable that you control: which sample belongs to which group. You can factorize the grouping variable by using the factor() method f = factor(groups,levels=c("control","mutant")) Then you need to create a design matrix, a matrix of values of the grouping variable. ANOVA needs such a matrix to know which samples belong to which group. Since limma performs an ANOVA or a t-test (which is just a specific case of ANOVA), it needs such a design matrix. You can create it using the model.matrix() method. design = model.matrix(~ 0 + f) colnames(design) = c("control","mutant") Essentially, limma will compare the mean expression level in the control samples to the mean expression level in mutant samples and this for each gene. So first of all, limma needs to calculate the mean expression levels using the lmFit() method. This method will fit a linear model (defined in design) to the data to calculate the mean expression level in the control and in the mutant samples: data.fit = lmFit(data.matrix,design) You can view the results of the fit. For instance you can show the estimates of the fit for the first 10 probe sets. data.fit$coefficients[1:10,]
The first column contains the mean log expression in control samples. Now you have to tell limma which groups you want to compare. For this you define a contrast matrix defining the contrasts (comparisons) of interest by using the makeContrasts() method. Using the column names of the design matrix you specify the baseline sample (control) you want to compare to and the sample of interest (mutant). contrast.matrix = makeContrasts(mutant-control,levels=design) data.fit.con = contrasts.fit(data.fit,contrast.matrix)
data.fit.eb = eBayes(data.fit.con) The eBayes() method returns a data frame data.fit.eb containing multiple slots. To get an overview of all the slots, use the names() method: names(data.fit.eb) You can retrieve the log fold changes of each gene via the coefficients slot: data.fit.eb$coefficients[1:10,] This slot contains the coefficient of the contrast: it's simply the difference between the mean log expression in mutant samples and the mean log expression in control samples. This difference is usually called a log fold change.
The eBayes() method has performed a moderated t-test on each gene. That means it has calculated a t-statistic and a corresponding p-value for each gene. These values can be found in the t and the p.valuje slot. To view the t-statistics and p-values of the moderated t-test for the first 10 probe sets: data.fit.eb$t[1:10,] data.fit.eb$p.value[1:10,]
Remarkably, data.fit.eb also contains F-statistics and p-values for this F-statistic in the F and F.p.value slots while an F-statistic is in an ANOVA. Essentially, a t-test is a special case of an ANOVA used for single comparisons. If you have just a single comparison the F-statistic is the square of the t-statistic. For multiple comparisons, this relation is not true.
|
In the above example we compared samples from two groups but in many microarray experiments, there are more than two groups. In this case we have to perform a moderated one-way ANOVA for each gene.
Three groups of samples
How to compare three groups ? |
---|
As an example we will compare drug treated mice and mice treated with physical exercise to a set of control mice. First of all you need to tell limma which samples are replicates and which samples belong to different groups by providing this information in the phenoData slot of the AffyBatch/FeatureSet. To this end, we add a second column with sample annotation describing the source of each sample. We will give this new column a name: ph@data[ ,2] = c("control","control","control","drug","drug","drug","exercise","exercise","exercise") colnames(ph@data)[2]="source" ph@data Since we have 3 groups with 3 replicates each, the factor that determines the grouping will have 3 levels instead of 2, so the code is as follows: f = factor(groups,levels = c("control","drug","exercise")) Then you need to create a design matrix, a matrix of values of the grouping variable. ANOVA needs such a matrix to know which samples belong to which group. Since limma performs an ANOVA, it needs such a design matrix. You can create it using the model.matrix() method. design = model.matrix(~ 0 + f) colnames(design)=c("control","drug","exercise") design data.fit = lmFit(data.rma,design)
Afterwards you can tell limma which groups you want to compare. For this you define a contrast matrix defining the contrasts (comparisons) of interest by using the makeContrasts() method. In this example we make all pairwise comparisons: contrast.matrix = makeContrasts(exercise-control,drug-exercise,drug-control,levels=design) data.fit.con = contrasts.fit(data.fit,contrast.matrix) data.fit.eb = eBayes(data.fit.con) You can view the results of the ANOVA in the slots of the data.fit.eb object. The statistic that is calculated in ANOVA is the F-statistic, you can find the F-statistic and its corresponding p-value for each gene in the F and F.p.value slots. data.fit.eb$coefficients[1:10,] data.fit.eb$t[1:10,] data.fit.eb$p.value[1:10,] data.fit.eb$F[1:10] data.fit.eb$F.p.value[1:10] |
Paired data
Up to now we have always assumed that the groups are independent, meaning that samples of one group have no relation whatsoever with samples from another group. In independent data sets you have two groups of subjects (patients, mice, plants,....). You use one group as controls and you give the other group a treatment.
Paired or dependent data means that there exists a relationship between the samples of different groups. The typical example is having a group of subjects and measuring each of them before and after treatment.
It is important to tell limma if your data is paired or not since you need to use a different type of statistical test on paired compared to independent data:
- paired t-test instead of regular two-samples t-test
- repeated measures ANOVA instead of regular one-way ANOVA
How to compare two groups of paired data ? |
---|
As an example we will compare 2 groups of 3 patients before and after treatment:
Treatment is the grouping variable dividing the data set into two groups: before and after treatment. To tell limma that your data is paired you just create a second grouping variable called patient: ph@data[ ,2] = c("before","treated","before","treated","before","treated") colnames(ph@data)[2]="Treatment" ph@data[ ,3] = c("patient1","patient1","patient2","patient2","patient3","patient3") colnames(ph@data)[3]="Patient" ph@data Of course, now you need to factorize both grouping variables: groupsP = ph@data$Patient groupsT = ph@data$Treatment fp = factor(groupsP,levels=c("patient1","patient2","patient3")) ft = factor(groupsT,levels=c("before","treated")) Then you need to create a design matrix, a matrix of values of the grouping variable. ANOVA needs such a matrix to know which samples belong to which group. Since limma performs an ANOVA, it needs such a design matrix. You can create it using the model.matrix() method. paired.design = model.matrix(~ fp + ft) colnames(paired.design)=c("Intercept","Patient2vsPatient1","Patient3versusPatient1","AftervsBefore") data.fit = lmFit(data.rma,paired.design) data.fit$coefficients Lmfit() will fit a linear model to the data. It will create a data frame called data.fit. The interesting data is in the coefficients table which contains 4 columns:
Performing a moderated paired t-test is now done using eBayes(). data.fit.eb = eBayes(data.fit) data.fit.eb$p.value[,1:20] Again, the relevant p-values are in the fourth column, called Treatment. These are the p-values generated by the comparison of after treatment and before treatment. |
How to compare three groups of paired data ? |
---|
As an example we will compare 3 groups of 3 patients before, during and after treatment:
Treatment is the grouping variable dividing the data set into two groups: before and after treatment. To tell limma that your data is paired you just create a second grouping variable called patient: ph@data[ ,2] = c("before","during","treated","before","during","treated","before","during","treated") colnames(ph@data)[2]="Treatment" ph@data[ ,3] = c("P1","P1","P1","P2","P2","P2","P3","P3","P3") colnames(ph@data)[3]="Patient" ph@data Of course, now you need to factorize both grouping variables: groupsP = ph@data$Patient groupsT = ph@data$Treatment fp = factor(groupsP,levels=c("P1","P2","P3")) ft = factor(groupsT,levels=c("before","during","treated")) Then you need to create a design matrix, a matrix of values of the grouping variable. ANOVA needs such a matrix to know which samples belong to which group. Since limma performs an ANOVA, it needs such a design matrix. You can create it using the model.matrix() method. paired.design = model.matrix(~ fp + ft) colnames(paired.design)=c("Intercept","P2vsP1","P3vsP1","DuringvsBefore","AftervsBefore") data.fit = lmFit(data.rma,paired.design) data.fit$coefficients Lmfit() will fit a linear model to the data. It will create a data frame called data.fit. The interesting data is in the coefficients table which contains 5 columns:
Performing the repeated measures ANOVA is now done using eBayes(). data.fit.eb = eBayes(data.fit) data.fit.eb$p.value[,1:20] Again, the relevant p-values are in the fourth and fifth column, called DuringvsBefore and AftervsBefore. These are the p-values generated by the comparison of bfore, during and after treatment. What if you also want to compare after and during treatment ? cont.matrix = makeContrasts(aftervsduring=after-during,levels=paired.design) data.contr = contrasts.fit(data.fit,cont.matrix) data.contr$coefficients data.fit.eb = eBayes(data.contr) |
Two grouping variables
Sometimes you have two grouping variables, e.g.
- gender and treatment if you think a drug has a different effect on men and women
- strain and treatment if you think a drug has a different effect on different mice strains
How to compare microarray data grouped according to two variables
Generating a Volcano plot
The best way to decide on the number of DE genes you are going to select is via a Volcano plot.
How to generate a Volcano plot of the data ? |
---|
A Volcano plot is generated by using the volcanoplot() method on the output of the moderated t-test. The coef parameter specifies the column of data.fit.eb that should be used for the plot. In our example in Arabidopsis coef=2 since the second column of data.fit.eb contains the results of the comparison between mutant and control plants.
So in this example you set coef=2 or coef=3 if you want to find genes that are DE between patients or coef=4 if you want to find genes that are DE as a result of the treatment. The highlight parameter allows to specify the number of highest scoring genes (on the Y-axis) for which names will be attached on the plot. name = "Volcano.jpg" jpeg(name) volcanoplot(data.fit.eb,coef=2,highlight=10) dev.off() |
Volcano plots arrange genes along biological and statistical significance. The X-axis gives the log fold change between the two groups (log: so that up and down regulation appear symmetric), and the Y-axis represents the p-value of a t-test comparing samples (on a negative log scale so smaller p-values appear higher up). The first axis indicates biological impact of the change; the second indicates the statistical evidence of the change.
To decide on the number of DE genes that you’re going to proceed with, you can make Volcano plots highlighting different numbers of genes. As you increase the number of genes highlighted on the plot, you can see at what point you begin to select genes with rather low fold changes. Make a conservative decision about the number of genes you want to use for follow up.
Adjusting for multiple testing and defining DE genes
You are doing a t-test on each gene, meaning that you will be doing more than 20000 t-tests on the data set. You have to adjust the p-values of the t-tests for multiple testing or you will generate too many false positives.
Of course your final aim is to generate a table containing the genes that you consider DE (the genes with the lowest adjusted p-values and the most extreme log fold changes). You can then use this table to search for functional relations between these genes...
How to define a table of DE genes in a microarray experiment
Creating lists of probe set IDs of DE genes for functional analysis
We go back to our simple comparison of mutant and wild-type samples. We have created a list of upregulated genes called topups and a list of downregulated genes (topdowns).
How to create a list of IDs of genes that are DE according to topTable() ? |
---|
The tab table that is generated by the topTable() method and all the tables that are derived from it like topups and topdowns contain the probe set IDs of the selected genes as row names:
IDs.up = rownames(topups) IDs.down = rownames(topdowns) |
What if you used decideTests() to identify DE genes ? This method generates a matrix containing labels (0, 1 or -1) for each gene in each contrast.
How to create a list of IDs of genes that are DE according to decideTests() ? |
---|
Assume that we want to obtain the lists of genes that are up/downregulated in the exercise group compared to the control group. The labels for this contrast are stored in the first column of the DEresults matrix that was generated by the decideTests() method. First you select the upregulated genes (genes with value 1 in the first column) via the subset() method. Then you obtain their IDs via the rownames() method. ups = subset(DEresults, DEresults[,1]==1) downs = subset(DEresults, DEresults[,1]==-1) IDs.up = rownames(ups) IDs.down = rownames(downs) |
You'll probably want to write the list of IDs to your computer so that you can use it in tools like DAVID.
How to write the probe set IDs of the DE genes to your computer ? |
---|
We will create a text file with one ID on each line since this is a format that is accepted by most tools. You can do this by using the write.table() method.
write.table(IDs.up,row.names=FALSE,col.names=FALSE,quote=FALSE,file="d:/trainingen/R/upIDs.txt") write.table(IDs.down,row.names=FALSE,col.names=FALSE,quote=FALSE,file="d:/trainingen/R/downIDs.txt") |
Creating a heat map of the DE genes
How to create a heat map of the normalized intensities of the DE genes ? |
---|
In this example we will create a heat map of the downregulated genes. Heat maps can be created via the ggplot() method. As always we need to get the data into the correct format for ggplot: a data matrix with
We first have to obtain the normalized intensities of the upregulated genes. The normalized intensities are stored in data.matrix. The downregulated genes are stored in topdowns. We can obtain their probe set IDs via the rownames() method. Then we select their normalized intensites from data.matrix using their probe set IDs: data.matrix.up = data.matrix[(rownames(topdowns)),] We create three vectors:
The heatlogs vector will contain the normalized intensities of the upregulated genes in all six samples stacked into a single column. There are 71 downregulated genes in my data set. I can see that by using the dim() method and looking at the first number it returns (dim(topdowns)[1]). This means that the sampleNames vector needs 254 entries of each sample name stacked into a single column. The featureNames vector needs the probe set IDs of these genes repeated 6 times and stacked into a single column. sampleNames = vector() featureNames = vector() heatlogs = vector() for (i in 1:6) { sampleNames = c(sampleNames,rep(ph@data[i,1],dim(topdowns)[1])) featureNames = c(featureNames,rownames(data.matrix.up[1:dim(topdowns)[1],])) heatlogs = c(heatlogs,data.matrix.up[1:dim(topdowns)[1],i]) } Then we combine sample names, probe set IDs and normalized intensities into one data frame: heatData = data.frame(norm_logInt=heatlogs,sampleName=sampleNames,featureName=featureNames) Now we can create the plot. A heat map can be created using the geom_tile() method. Low normalized intensities will be plotted in green and high normalized intensities will be plotted in red. This can be done by using the scale_fill_gradient() method. dataHeat = ggplot(heatData, aes(sampleName,featureName)) dataHeat + geom_tile(aes(fill=norm_logInt)) + scale_fill_gradient(low="green", high="red") |
Creating a Venn diagram to compare results of multiple comparisons
This can be easily done on the output of the decideTests() method.
How to create a Venn diagram ? |
---|
A Venn diagram can be created using the vennDiagram() method.
vennDiagram(DEresults) |