Tutorial: Reading and storing expression data
Go to parent Introduction to R/Bioconductor for analysis of microarray data#Training Units
Contents
Cel files
CEL files contain one intensity value per feature (probe). We will use a convenient set of eight files that are available as package estrogen
from Bioconductor.
As raw data (AffyBatch)
Generally, we want to read the information in the CEL files as it is, so we can perform quality control on the individual probe level. The command for reading a list of files is ReadAffy from package affy
.
First, we get our list of file names. The first command is a bit of system magic to get the name of the directory, regardless of where R is installed; for your own data, you can simply specify your own directory name.
> estrodir = system.file("extdata", package="estrogen") > estrodir [1] "/home/ploner/R/i486-pc-linux-gnu-library/2.11/estrogen/extdata"
Let's have a look at the available cel files in the estrogen directory:
> library(affy) > list.celfiles(estrodir) [1] "bad.cel" "high10-1.cel" "high10-2.cel" "high48-1.cel" "high48-2.cel" [6] "low10-1.cel" "low10-2.cel" "low48-1.cel" "low48-2.cel"
The first file is not part of the experiment and only included for didactic purposes. There are different ways of getting only the relevant CEL files, but this is the simplest one:
> estrofiles = list.celfiles(estrodir)[2:9] > estrofiles [1] "high10-1.cel" "high10-2.cel" "high48-1.cel" "high48-2.cel" "low10-1.cel" [6] "low10-2.cel" "low48-1.cel" "low48-2.cel"
Now that we know which files to read and where, we can use the command ReadAffy
:
> estroraw = ReadAffy(filenames=estrofiles, celfile.path=estrodir) > estroraw AffyBatch object size of arrays=640x640 features (11 kb) cdf=HG_U95Av2 (12625 affyids) number of samples=8 number of genes=12625 annotation=hgu95av2 notes=
This generates an object of class AffyBatch
with eight samples and 12625 probesets. Note that ReadAffy
has identified the chip type as HG_U95Av2.
As processed data (ExpressionSet)
Sometimes we do not want to store the probe-level data in R, but instead read in the raw information from the CEL files, pre-process it on the fly and store only the probeset-level information. This makes sense e.g. if we just replicate a previous analysis, or we have done our quality control before and want to save memory. In this case, we can use the function justRMA
with the same parameters as ReadAffy
:
> estroRMA = justRMA(filenames=estrofiles, celfile.path=estrodir) > estroRMA ExpressionSet (storageMode: lockedEnvironment) assayData: 12625 features, 8 samples element names: exprs, se.exprs protocolData sampleNames: high10-1.cel, high10-2.cel, ..., low48-2.cel (8 total) varLabels and varMetadata description: ScanDate: NA phenoData sampleNames: high10-1.cel, high10-2.cel, ..., low48-2.cel (8 total) varLabels and varMetadata description: sample: arbitrary numbering featureData: none experimentData: use 'experimentData(object)' Annotation: hgu95av2
This generates an object of class ExpressionSet
, which is the standard container for a pre-processed expression data set ready for high-level analysis. Note that a lot of the components of this object are the same as for the AffyBatch
-object that we created before, however, internally there is no probe-level information available.
Adding phenotypic information
We get some idea from the file names about which sample received which treatment, however, for a formal analysis, we have to provide this phenotypic data in a clearer format. Generally, this means a data frame, which is the R name for a data matrix: a rectangular arrangement of measurements, with one row per subject/sample and one column per measurement/variable. This phenotypic information can and should be stored as part of the corresponding AffyBatch
or ExpressionSet
.
Let's see for the case of the estrogen experiment. We have several files in data directory that could be relevant:
> list.files(estrodir) [1] "bad.cel" "estrogen.txt" "high10-1.cel" "high10-2.cel" [5] "high48-1.cel" "high48-2.cel" "low10-1.cel" "low10-2.cel" [9] "low48-1.cel" "low48-2.cel" "phenoData.txt" "targLimma.txt" [13] "test.html"
A closer look at estrogen.txt
shows us that this is just we need:
> file.show("estrogen.txt") # try replacing the previous command by: > file.show("C://PROGRA~1//R//R-211~1.1//library//estrogen//extdata//estrogen.txt") filename estrogen time.h low10-1.cel absent 10 low10-2.cel absent 10 high10-1.cel present 10 high10-2.cel present 10 low48-1.cel absent 48 low48-2.cel absent 48 high48-1.cel present 48 high48-2.cel present 48
The command file.show
needs the document in the working directory, which you can set by using setwd()
.
We can read in the annotation in this format using the utility function read.AnnotatedDataFrame
:
> estrodat = read.AnnotatedDataFrame("estrogen.txt", path=estrodir, sep="") # add ',fill=TRUE' at the end of the line to correct for header problem > estrodat An object of class "AnnotatedDataFrame" rowNames: low10-1.cel, low10-2.cel, ..., high48-2.cel (8 total) varLabels and varMetadata description: estrogen: time.h: > pData(estrodat) estrogen time.h low10-1.cel absent 10 low10-2.cel absent 10 high10-1.cel present 10 high10-2.cel present 10 low48-1.cel absent 48 low48-2.cel absent 48 high48-1.cel present 48 high48-2.cel present 48
We store the phenodata, and more specifically the sample names (sampleNames()
) as part of the raw expression data. We just have to make sure that they have the same order:
phenoData(estroraw) = estrodat[sampleNames(estroraw),]
Repositories
We demonstrate the process of downloading directly from a repository for the GEO data set (GDS) with the reference GDS2729
. This is done via package GEOquery
:
library(GEOquery)
This package offers a range of different methods for downloading expression data. The easiest one is to get the pre-processed matrix of expression value that was uploaded by the owners:
> gds = getGEO("GDS2729") > eset = GDS2eSet(gds, do.log2 = TRUE) > eset
Note how we also log2-transform the expression values. (Note: eSet stands for Expression-Set, a set of expression values on probeset (gene) level.)
The original gds
object comes with a lot of meta-information, but no chip annotation. Instead, it refers to a GEO platform reference:
> Meta(gds)
You can find the GPL code linked to the used platform. We can easily download the annotation data of the platform, which gets updated regularly:
gpl = getGEO("GPL8300") Table(gpl)[1:5, 1:5]
Alternatively, we can use the Bioconductor annotation package for this chip, which happens to be the Affymetrix HG-U95Av2 chip. We do this by setting the annotation of the expression set:
annotation(eset) = "hgu95av2"
The actual information is stored in the package hgu95av2.db
, which is loaded automatically if required.
Some contributors also upload the raw data, i.e. the CEL files. These are available as supplementary data from the GEO series associated with the data set:
> Meta(gds)$reference_series [1] "GSE7114" > rawlist = getGEOSuppFiles("GSE7114", makeDirectory = TRUE, baseDir = getwd())
Go to parent Introduction to R/Bioconductor for analysis of microarray data#Training Units