Tutorial: Reading and storing expression data

From BITS wiki
Jump to: navigation, search
Go to parent Introduction to R/Bioconductor for analysis of microarray data#Training Units

Cel files

CEL files contain one intensity value per feature (probe). We will use a convenient set of eight files that are available as package estrogen from Bioconductor.

As raw data (AffyBatch)

Generally, we want to read the information in the CEL files as it is, so we can perform quality control on the individual probe level. The command for reading a list of files is ReadAffy from package affy.

First, we get our list of file names. The first command is a bit of system magic to get the name of the directory, regardless of where R is installed; for your own data, you can simply specify your own directory name.

> estrodir = system.file("extdata", package="estrogen")
> estrodir
[1] "/home/ploner/R/i486-pc-linux-gnu-library/2.11/estrogen/extdata"

Let's have a look at the available cel files in the estrogen directory:

> library(affy)
> list.celfiles(estrodir)
[1] "bad.cel"      "high10-1.cel" "high10-2.cel" "high48-1.cel" "high48-2.cel"
[6] "low10-1.cel"  "low10-2.cel"  "low48-1.cel"  "low48-2.cel"

The first file is not part of the experiment and only included for didactic purposes. There are different ways of getting only the relevant CEL files, but this is the simplest one:

> estrofiles = list.celfiles(estrodir)[2:9]
> estrofiles
[1] "high10-1.cel" "high10-2.cel" "high48-1.cel" "high48-2.cel" "low10-1.cel" 
[6] "low10-2.cel"  "low48-1.cel"  "low48-2.cel"

Now that we know which files to read and where, we can use the command ReadAffy:

> estroraw = ReadAffy(filenames=estrofiles, celfile.path=estrodir)
> estroraw
AffyBatch object
size of arrays=640x640 features (11 kb)
cdf=HG_U95Av2 (12625 affyids)
number of samples=8
number of genes=12625
annotation=hgu95av2
notes=

This generates an object of class AffyBatch with eight samples and 12625 probesets. Note that ReadAffy has identified the chip type as HG_U95Av2.

As processed data (ExpressionSet)

Sometimes we do not want to store the probe-level data in R, but instead read in the raw information from the CEL files, pre-process it on the fly and store only the probeset-level information. This makes sense e.g. if we just replicate a previous analysis, or we have done our quality control before and want to save memory. In this case, we can use the function justRMA with the same parameters as ReadAffy:

> estroRMA = justRMA(filenames=estrofiles, celfile.path=estrodir)
> estroRMA
ExpressionSet (storageMode: lockedEnvironment)
assayData: 12625 features, 8 samples 
  element names: exprs, se.exprs 
protocolData
  sampleNames: high10-1.cel, high10-2.cel, ..., low48-2.cel  (8 total)
  varLabels and varMetadata description:
    ScanDate: NA
phenoData
  sampleNames: high10-1.cel, high10-2.cel, ..., low48-2.cel  (8 total)
  varLabels and varMetadata description:
    sample: arbitrary numbering
featureData: none
experimentData: use 'experimentData(object)'
Annotation: hgu95av2

This generates an object of class ExpressionSet, which is the standard container for a pre-processed expression data set ready for high-level analysis. Note that a lot of the components of this object are the same as for the AffyBatch-object that we created before, however, internally there is no probe-level information available.

Adding phenotypic information

We get some idea from the file names about which sample received which treatment, however, for a formal analysis, we have to provide this phenotypic data in a clearer format. Generally, this means a data frame, which is the R name for a data matrix: a rectangular arrangement of measurements, with one row per subject/sample and one column per measurement/variable. This phenotypic information can and should be stored as part of the corresponding AffyBatch or ExpressionSet.

Let's see for the case of the estrogen experiment. We have several files in data directory that could be relevant:

> list.files(estrodir)
 [1] "bad.cel"       "estrogen.txt"  "high10-1.cel"  "high10-2.cel" 
 [5] "high48-1.cel"  "high48-2.cel"  "low10-1.cel"   "low10-2.cel"  
 [9] "low48-1.cel"   "low48-2.cel"   "phenoData.txt" "targLimma.txt"
[13] "test.html" 

A closer look at estrogen.txt shows us that this is just we need:

> file.show("estrogen.txt")
# try replacing the previous command by:
> file.show("C://PROGRA~1//R//R-211~1.1//library//estrogen//extdata//estrogen.txt")
filename     estrogen    time.h
low10-1.cel  absent      10
low10-2.cel  absent      10
high10-1.cel present     10
high10-2.cel present     10
low48-1.cel  absent      48
low48-2.cel  absent      48
high48-1.cel present     48
high48-2.cel present     48

The command file.show needs the document in the working directory, which you can set by using setwd().
We can read in the annotation in this format using the utility function read.AnnotatedDataFrame:

> estrodat = read.AnnotatedDataFrame("estrogen.txt", path=estrodir, sep="") 
 # add ',fill=TRUE' at the end of the line to correct for header problem
> estrodat
An object of class "AnnotatedDataFrame"
  rowNames: low10-1.cel, low10-2.cel, ..., high48-2.cel  (8 total)
  varLabels and varMetadata description:
    estrogen: 
    time.h: 
> pData(estrodat)
             estrogen time.h
low10-1.cel    absent     10
low10-2.cel    absent     10
high10-1.cel  present     10
high10-2.cel  present     10
low48-1.cel    absent     48
low48-2.cel    absent     48
high48-1.cel  present     48
high48-2.cel  present     48

We store the phenodata, and more specifically the sample names (sampleNames()) as part of the raw expression data. We just have to make sure that they have the same order:

phenoData(estroraw) = estrodat[sampleNames(estroraw),]

Repositories

We demonstrate the process of downloading directly from a repository for the GEO data set (GDS) with the reference GDS2729. This is done via package GEOquery:

library(GEOquery)

This package offers a range of different methods for downloading expression data. The easiest one is to get the pre-processed matrix of expression value that was uploaded by the owners:

> gds = getGEO("GDS2729")
> eset = GDS2eSet(gds, do.log2 = TRUE)
> eset

Note how we also log2-transform the expression values. (Note: eSet stands for Expression-Set, a set of expression values on probeset (gene) level.)

The original gds object comes with a lot of meta-information, but no chip annotation. Instead, it refers to a GEO platform reference:

> Meta(gds)

You can find the GPL code linked to the used platform. We can easily download the annotation data of the platform, which gets updated regularly:

gpl = getGEO("GPL8300") 
Table(gpl)[1:5, 1:5]

Alternatively, we can use the Bioconductor annotation package for this chip, which happens to be the Affymetrix HG-U95Av2 chip. We do this by setting the annotation of the expression set:

annotation(eset) = "hgu95av2"

The actual information is stored in the package hgu95av2.db, which is loaded automatically if required.

Some contributors also upload the raw data, i.e. the CEL files. These are available as supplementary data from the GEO series associated with the data set:

> Meta(gds)$reference_series
[1] "GSE7114"
> rawlist = getGEOSuppFiles("GSE7114", makeDirectory = TRUE, baseDir = getwd())
Go to parent Introduction to R/Bioconductor for analysis of microarray data#Training Units