Tutorial: Basic pre-processing

From BITS wiki

Pre-processing is a generic term for the technology-specific data processing steps that are necessary before the "real" analysis. What needs to be done depends on the measurement technology and on how the platform is manufactured, whereas the subsequent so-called "high-level" analysis can be quite unspecific and similar across technologies and applications.

For the specific example of Affymetrix oligonucleotide chips, pre-processing describes the steps that lead from raw probe-level data to properly normalized probeset-level data. These steps usually include some kind of background correction, some kind of summary calculation that combines multiple probe measurements into one expression value representative of the probeset, and some kind of normalization to eliminate purely technical differences between chips (scanner settings etc.). A subsequent cluster analysis generally does not care whether the expression values come from an Affymetrix or Illumina chip, or from mass spectrometry.
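The three steps can be sketched on a toy data set using base R only. This is a rough illustration loosely in the spirit of RMA, not the algorithm implemented in any Bioconductor package: the simulated data, the crude background correction, and the helper names quantile_normalize and summarize_probeset are all assumptions made for this example.

```r
# Toy illustration in base R; NOT the algorithms used by affy or gcrma.
set.seed(1)

# Hypothetical raw data: 3 probesets with 4 probes each, measured on 3 chips
n_probesets <- 3
probes_per_set <- 4
n_chips <- 3
probeset <- rep(paste0("ps", seq_len(n_probesets)), each = probes_per_set)
raw <- matrix(rlnorm(n_probesets * probes_per_set * n_chips, meanlog = 7),
              ncol = n_chips,
              dimnames = list(NULL, paste0("chip", seq_len(n_chips))))
# Mimic a purely technical difference: chip3 was scanned twice as bright
raw[, "chip3"] <- raw[, "chip3"] * 2

# Step 1: background correction (crude stand-in: subtract 90% of each
# chip's minimum, so that all values stay positive)
bg <- sweep(raw, 2, 0.9 * apply(raw, 2, min), FUN = "-")

# Step 2: quantile normalization, forcing every chip onto the same
# intensity distribution (the mean of the sorted columns)
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}
norm <- quantile_normalize(bg)

# Step 3: summarization, one value per probeset and chip, taken from a
# median polish fit on the log2 scale (overall level + chip effect)
log_norm <- log2(norm)
summarize_probeset <- function(rows) {
  fit <- medpolish(log_norm[rows, , drop = FALSE], trace.iter = FALSE)
  fit$overall + fit$col
}
expr <- t(sapply(split(seq_len(nrow(log_norm)), probeset), summarize_probeset))

print(expr)  # probesets in rows, chips in columns
```

After step 2 all chips share the same set of intensity values, so the artificial factor of two on chip3 no longer shows up as a global shift.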

In high-throughput biology, as everywhere else, there is a software gap: when new hardware hits the market, the pre-processing methods and software are generally only just being figured out. If the technology is popular, a whole range of new, sometimes even better methods will crop up, leading to a bewildering variety of methods for each individual pre-processing step. Bioconductor is not much help here, in that it offers you (almost) everything and the kitchen sink.

Let's look at what we have. The basic affy package offers the utility functions rma and justRMA for RMA, and mas5 for MAS5. The package gcrma has the function gcrma which, as the name suggests, calculates GCRMA expression values. The package simpleaffy contributes the utility functions justMAS5 and justGCRMA.

This is the output from running MAS5 on the estrogen data:

> estroMAS5=mas5(estroraw)
background correction: mas 
PM/MM correction : mas 
expression values: mas 
background correcting...done.
12625 ids to be processed
|                    |
|####################|

This is the output from running RMA on the estrogen data:

> estroRMA=rma(estroraw)
Background correcting
Normalizing
Calculating Expression

In both cases, we see similar steps, such as background correction and calculation of expression values, though not necessarily all the same steps or in the same order. Package affy has the extremely powerful function expresso that takes this idea to its logical conclusion: the user can freely combine e.g. the normalization method of RMA with the background correction from MAS5 and the summary calculation of Li-Wong (dChip). As a matter of fact, the function mas5 is just a call to expresso with the background correction, PM correction and summary methods all set to mas, followed by an optional scaling step:

function (object, normalize = TRUE, sc = 500, analysis = "absolute", 
    ...) 
{
    res <- expresso(object, bgcorrect.method = "mas", pmcorrect.method = "mas", 
        normalize = FALSE, summary.method = "mas", ...)
    if (normalize) 
        res <- affy.scalevalue.exprSet(res, sc = sc, analysis = analysis)
    return(res)
}
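The mix-and-match idea described above could look roughly like the following sketch. It assumes that the Bioconductor package affy is installed and that the argument is an AffyBatch such as estroraw; the wrapper name mixed_expresso is hypothetical and not part of affy. The method names passed to expresso are the ones affy uses for MAS5 background correction, quantile normalization, PM-only correction and the Li-Wong summary.

```r
# Sketch only. Assumes the Bioconductor package affy is installed and that
# `raw` is an AffyBatch such as the estroraw object used above.
# mixed_expresso is a hypothetical helper name, not part of affy.
mixed_expresso <- function(raw) {
  affy::expresso(
    raw,
    bgcorrect.method = "mas",        # background correction as in MAS5
    normalize.method = "quantiles",  # quantile normalization as used by RMA
    pmcorrect.method = "pmonly",     # use PM probes only, ignore MM probes
    summary.method   = "liwong"      # Li-Wong (dChip) model-based summary
  )
}

# Usage, left commented out because it needs affy and the raw data:
# estroMixed <- mixed_expresso(estroraw)
```

Whether such a combination is sensible is a separate question; the sketch only shows that expresso lets each step be chosen independently.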

In practice, this has relatively little relevance: we do not want to consider (and evaluate) countless possible combinations every single time we want to compare, say, 3 wild-type against 3 knock-out samples. Instead, people focus on a number of standard combinations:

  • RMA: rarely wrong
  • GCRMA: may offer slight advantages, but computationally quite expensive
  • MAS5: don't use it unless you at least log (or otherwise transform) the expression values, and preferably normalize the values, too.
  • PLIER: careful with this one, the Affymetrix proposal for replacing MAS5 (see exercise).
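To see why MAS5 values should at least be log-transformed, consider the following small sketch in base R. The numbers are invented for illustration and are not taken from the estrogen data: both hypothetical genes double their signal between wild-type and knock-out, but on the raw scale the highly expressed gene appears to change fifty times more.

```r
# Hypothetical MAS5-like values on the raw intensity scale (2 genes x 4 chips)
mas5_like <- matrix(c( 100,  120,   200,   240,
                      5000, 6000, 10000, 12000),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("low_gene", "high_gene"),
                                    c("wt1", "wt2", "ko1", "ko2")))

# Difference of group means on the raw scale: dominated by the bright gene
raw_diff <- rowMeans(mas5_like[, c("ko1", "ko2")]) -
            rowMeans(mas5_like[, c("wt1", "wt2")])

# Same comparison on the log2 scale: both genes show the same two-fold change
log_vals <- log2(mas5_like)
log_diff <- rowMeans(log_vals[, c("ko1", "ko2")]) -
            rowMeans(log_vals[, c("wt1", "wt2")])

print(raw_diff)
print(log_diff)
```

With real data, and assuming the Biobase accessor exprs is available, the corresponding transformation would be along the lines of log2(exprs(estroMAS5)).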