Tutorial: Organizing workflows I

From BITS wiki
Jump to: navigation, search
Go to parent Introduction to R/Bioconductor for analysis of microarray data#Tutorials

Collect commands

So far, we have split up the individual steps of the workflow into rather small pieces, but really, it would be good to have everything in one place. Below you see a draft for the analysis flow of the estrogen data, collected from the individual tutorials and generously explaied (everything after # is comment):

## This R script shows a workflow for loading and processing the 
##   Bioconductor estrogen data set (8 chips from a 2x2 factorial design)
##
## Main effects: high vs low estrogen only
##
## Alexander.Ploner@ki.se 2010-10-16

## Set workdir
WORKDIR = "~/bitsCourse/Data/Estrogen"
setwd(WORKDIR)

## The location of the cel files
estrodir = system.file("extdata", package="estrogen")

## Discard one bad chip
library(affy)
estrofiles = list.celfiles(estrodir)[2:9]

## Read probe-level data
estroraw = ReadAffy(filenames=estrofiles, celfile.path=estrodir)
## Read phenotypic data
estrodat = read.AnnotatedDataFrame("estrogen.txt", path=estrodir, sep="")
## Make sure that expressions and phenodata have same order
phenoData(estroraw) = estrodat[sampleNames(estroraw),]

## Compute RMA expression values
estroRMA = rma(estroraw)

## Compare distributions of probes/probesets before/after
boxplot(estroraw)
boxplot(exprs(estroRMA))

## Compare estrogen vs no estrogen
library(limma)  
design = model.matrix(~estrogen, data=pData(estroRMA))
fit    = lmFit(estroRMA, design)  
efit   = eBayes(fit)   
estroAll = topTable(efit, coef=2, num=nrow(estroRMA))
estroTop = topTable(efit, coef=2, num=nrow(estroRMA), p.value=0.05)

## MA plot
plot(logFC ~ AveExpr, data=estroAll, pch=".")
points(logFC ~ AveExpr, data=estroTop, pch=19, col="red")

## Annotate and export the results
library(annaffy)
annCols = aaf.handler()[c(1:3, 7:9, 12)]
annTab  = aafTableAnn(estroTop$ID, "hgu95av2.db", annCols)
estroAnnTab  = aafTable(logFC=round(estroTop$logFC, 2), tstat=round(estroTop$t, 1), adjPval=signif(estroTop$adj.P.Val, 1))
estroAnnTab = merge(estroAnnTab, annTab)
saveHTML(estroAnnTab, "Estrogen_Pres-Abs.html", title = "Estrogen Present-Absent")

This I can save in a text file called, say, wfEstrogen.R and store with the data. This is a nice way of keeping the commands together and being able to re-run a specific analysis some time down the road

Writing a script

However, with the code arranged as above, we would still need to copy & paste the commands from the text file to the R console, otherwise each plot would just be overwritten by the next one, and we would only see the last one.

An easy way of avoiding this is to write the graphs directly to a file. R offers a wide variety of different file formats, both bitmap (jpeg, png) and vector-based (postscript, pdf, wmf). Personally, I like pdfs:

## Volcano plot
pdf("Volcanoplot.pdf")
plot(-log10(adj.P.Val) ~ logFC, data=estroAll, pch=".")
points(-log10(adj.P.Val) ~ logFC, data=estroTop, pch=19, col="red")
dev.off()

Important: Once you're done with plotting to a specific file, you have to release it by closing the device through dev.off(). Otherwise, R will still be waiting for further plotting commands.

Another point to make here is that we have the same or very similar sequences of commands over and over and over again. It would be great if we could easily adapt these workflows for different data sets and problems. One way of doing this is to put all the stuff that needs to be changed in one place, or at least close together, or at least highlight it properly in the text. Also aesthetically, it would be nice if the variable names were a bit more neutral and less estrogen-biased.

By putting this together, we could get something like this:

## This R script shows a workflow for loading and processing the 
##   Bioconductor estrogen data set (8 chips from a 2x2 factorial design)
##
## Main effects: high vs low estrogen only
##
## Alexander.Ploner@ki.se 2010-10-16

## Set up of parameters: working dir, cutoff for limma
WORKDIR  = "~/bitsCourse/Data/Estrogen"               ## Where results go  
maxPval  = 0.05                                       ## cutoff for DE
DATADIR  = system.file("extdata", package="estrogen") ## dir with cel, phenodat
PHENOFIL = "estrogen.txt"                             ## phenodat filename
ANNDB    = "hgu95av2.db"
CANDLIST_TITLE = "Estrogen: Absent vs Present"        ## title of html output


## Start
setwd(WORKDIR)

## Read the CEL files
library(affy)
## CHANGE ME: we discard one bad chip
celfiles = list.celfiles(DATADIR)[2:9]

## Read probe-level data
rawdata = ReadAffy(filenames=celfiles, celfile.path=DATADIR)
## Read phenotypic data
phenodat = read.AnnotatedDataFrame(PHENOFIL, path=DATADIR, sep="")
## Make sure that expressions and phenodata have same order
phenoData(rawdata) = phenodat[sampleNames(rawdata),]

## Compute RMA expression values
esetRMA = rma(rawdata)

## Compare distributions of probes/probesets before/after
pdf("boxplotProbes.pdf")
boxplot(rawdata)
dev.off()
pdf("boxplotRMA.pdf")
boxplot(exprs(esetRMA))
dev.off()

## CHANGE ME: model matrix for limma
## Compare estrogen vs no estrogen
design = model.matrix(~estrogen, data=pData(esetRMA))

## Fit the models
library(limma)  
fit    = lmFit(esetRMA, design)  
efit   = eBayes(fit)   
glAll = topTable(efit, coef=2, num=Inf)
glTop = topTable(efit, coef=2, num=Inf, p.value=maxPval)

## MA plot
pdf("MAplot.pdf")
plot(logFC ~ AveExpr, data=glAll, pch=".")
points(logFC ~ AveExpr, data=glTop, pch=19, col="red")
dev.off()

## Annotate and export the results
library(annaffy)
annCols = aaf.handler()[c(1:3, 7:9, 12)]
annTab  = aafTableAnn(glTop$ID, ANNDB, annCols)
glAnnTab  = aafTable(logFC=round(glTop$logFC, 2), tstat=round(glTop$t, 1), adjPval=signif(glTop$adj.P.Val, 1))
glAnnTab = merge(glAnnTab, annTab)
saveHTML(glAnnTab, "candidate_probesets.html", title = CANDLIST_TITLE)

Advantages in using scripts

  1. (Potentially) well-documented record of analysis
  2. Easily modifiable
  3. Runs directly from the R command line as
> source("~/bitsCourse/Workflows/wfEstrogen.R")
  1. Runs directly from the OS command lines as
R CMD BATCH wfEstrogen.R

(provided the path variable is set correctly and wfEstrogen.R is in the current directory).