Tutorial: Basic pre-processing
Pre-processing is a generic term for technology-specific data processing steps that are necessary before the "real" analysis. What needs to be done depends on the exact details of the measurement technology and the manufacturing details, whereas the subsequent so-called "high level" analysis can be quite unspecific and similar between different technologies or applications.
For the specific example of Affymetrix oligonucleotide chips, pre-processing describes the steps that lead from raw probe-level data to properly normalized probeset-level data. These steps will usually include some kind of background correction, some kind of summary calculation to combine multiple probe measurements into one expression value representative of the probeset, as well as some kind of normalization to eliminate purely technical differences between chips (scanner settings etc.). A subsequent cluster analysis generally does not care whether the expression values come from an Affymetrix or Illumina chip, or from mass spectrometry.
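To make the three steps concrete, here is a minimal base-R sketch on simulated probe intensities. It is a toy illustration only: the background estimate (a low quantile per chip), the probe-to-probeset mapping and all object names are assumptions made up for this example, and none of the steps reproduces the actual MAS5 or RMA algorithms.

```r
# Toy illustration only: these are NOT the MAS5 or RMA algorithms.
set.seed(1)

n_probesets <- 4   # hypothetical number of probesets
n_probes    <- 5   # hypothetical number of probes per probeset
n_chips     <- 3

# Simulated raw probe intensities (rows = probes, columns = chips).
# Chip 3 is made artificially brighter, as if scanned at a higher setting.
raw <- matrix(rlnorm(n_probesets * n_probes * n_chips, meanlog = 7, sdlog = 1),
              ncol = n_chips,
              dimnames = list(NULL, paste0("chip", 1:n_chips)))
raw[, 3] <- raw[, 3] * 2
probeset <- rep(paste0("ps", 1:n_probesets), each = n_probes)

# 1. Background correction: subtract a crude per-chip background estimate
#    (here the 2% quantile) and keep all values at or above 1.
bg <- apply(raw, 2, quantile, probs = 0.02)
corrected <- pmax(sweep(raw, 2, bg, "-"), 1)

# 2. Normalization: quantile normalization gives every chip the same
#    intensity distribution, which removes the brightness difference.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}
normalized <- quantile_normalize(corrected)

# 3. Summarization: one value per probeset and chip, here the median
#    of the log2 probe intensities.
expr <- apply(log2(normalized), 2, function(col) tapply(col, probeset, median))
expr
```

The result is a small matrix with one row per probeset and one column per chip, which is the shape of data that the subsequent high-level analysis expects.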
In high-throughput biology as everywhere else, there is a software gap: when new hardware hits the market, the pre-processing methodology and software are generally only just being figured out. If the technology is popular, a whole range of new, sometimes even better methods will crop up, leading to a bewildering variety of methods for each individual pre-processing step. Bioconductor is not much help here, in that it offers you (almost) everything and the kitchen sink.
Let's look at what we have. The basic affy package offers utility functions rma and justRMA for RMA and mas5 for MAS5. The package gcrma has the function gcrma which calculates, exactly, GCRMA expression values. The package simpleaffy contributes the utility functions justMAS5 and justGCRMA.
This is the output from running MAS5 on the estrogen data:
> estroMAS5=mas5(estroraw)
background correction: mas
PM/MM correction : mas
expression values: mas
background correcting...done.
12625 ids to be processed
|                    |
|####################|
This is the output from running RMA on the estrogen data:
> estroRMA=rma(estroraw)
Background correcting
Normalizing
Calculating Expression
In both cases, we see similar steps, such as background correction and the calculation of expression values, though the steps are not all the same and do not necessarily run in the same order. Package affy has the extremely powerful function expresso that takes this idea to its logical conclusion: the user can freely combine e.g. the normalization method of RMA with the background correction from MAS5 and the summary calculation of Li-Wong (dChip). As a matter of fact, the function mas5 is just a call to expresso with all options set to mas:
function (object, normalize = TRUE, sc = 500, analysis = "absolute", ...)
{
    res <- expresso(object, bgcorrect.method = "mas", pmcorrect.method = "mas",
        normalize = FALSE, summary.method = "mas", ...)
    if (normalize)
        res <- affy.scalevalue.exprSet(res, sc = sc, analysis = analysis)
    return(res)
}
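The mix-and-match idea can be sketched as a small wrapper around expresso. The wrapper name mixed_preprocess is made up for this example, and the method names ("mas", "quantiles", "pmonly", "liwong") are the ones we would expect affy to accept for the combination described above; check them against the expresso help page of your installed version before relying on them. The block only defines the function, so the affy package is needed only when it is actually called.

```r
# Sketch only: calling this function needs the Bioconductor package affy
# and an AffyBatch object such as estroraw.
# mixed_preprocess is a hypothetical helper name, not part of affy.
mixed_preprocess <- function(abatch) {
  affy::expresso(abatch,
                 bgcorrect.method = "mas",        # background correction as in MAS5
                 normalize.method = "quantiles",  # normalization as in RMA
                 pmcorrect.method = "pmonly",     # use PM probes only, as RMA does
                 summary.method   = "liwong")     # Li-Wong (dChip) summary
}

# Usage, assuming estroraw has been read in as before:
# estroMixed <- mixed_preprocess(estroraw)
```

Whether such a hybrid is any better than a standard method is exactly the question that would have to be evaluated for every new combination.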
In practice, this flexibility has relatively little relevance: we do not want to consider (and evaluate) countless possible combinations every single time we want to run a simple comparison such as three wild-type against three knock-out samples. Instead, people focus on a number of standard combinations:
- RMA: rarely wrong
- GCRMA: may offer slight advantages, but computationally quite expensive
- MAS5: do not use unless you at least log-transform (or otherwise transform) the expression values, and preferably normalize the values, too.
- PLIER (Bioconductor package plier): be careful with this one, the Affymetrix proposal for replacing MAS5 (see exercise).
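The advice on transforming and normalizing MAS5 values can be illustrated with a few made-up numbers in base R. The matrix below is invented for this example (the second chip is simply twice as bright as the first), and shifting each chip to a common median is just one very simple normalization chosen for illustration, not the scaling that mas5 itself applies.

```r
# Made-up MAS5-like values on the raw scale (rows = probesets, columns = chips).
# chipB is exactly twice as bright as chipA, a purely technical difference.
mas5_like <- matrix(c( 50,  200,  800, 12000,
                      100,  400, 1600, 24000),
                    nrow = 4,
                    dimnames = list(paste0("ps", 1:4), c("chipA", "chipB")))

# Step 1: log2 transform, so that a doubling is one unit anywhere on the scale.
log_expr <- log2(mas5_like)

# Step 2: a very simple normalization, shifting every chip to a common median.
chip_medians <- apply(log_expr, 2, median)
normalized <- sweep(log_expr, 2, chip_medians - mean(chip_medians), "-")

log_expr
normalized
```

On the raw scale the difference between the two chips grows with the intensity, whereas on the log scale it is a constant offset that the median shift removes.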