Tutorial: Organizing workflows I
Go to parent Introduction to R/Bioconductor for analysis of microarray data#Tutorials
Collect commands
So far, we have split up the individual steps of the workflow into rather small pieces, but really, it would be good to have everything in one place. Below you see a draft for the analysis flow of the estrogen data, collected from the individual tutorials and generously explaied (everything after #
is comment):
## This R script shows a workflow for loading and processing the ## Bioconductor estrogen data set (8 chips from a 2x2 factorial design) ## ## Main effects: high vs low estrogen only ## ## Alexander.Ploner@ki.se 2010-10-16 ## Set workdir WORKDIR = "~/bitsCourse/Data/Estrogen" setwd(WORKDIR) ## The location of the cel files estrodir = system.file("extdata", package="estrogen") ## Discard one bad chip library(affy) estrofiles = list.celfiles(estrodir)[2:9] ## Read probe-level data estroraw = ReadAffy(filenames=estrofiles, celfile.path=estrodir) ## Read phenotypic data estrodat = read.AnnotatedDataFrame("estrogen.txt", path=estrodir, sep="") ## Make sure that expressions and phenodata have same order phenoData(estroraw) = estrodat[sampleNames(estroraw),] ## Compute RMA expression values estroRMA = rma(estroraw) ## Compare distributions of probes/probesets before/after boxplot(estroraw) boxplot(exprs(estroRMA)) ## Compare estrogen vs no estrogen library(limma) design = model.matrix(~estrogen, data=pData(estroRMA)) fit = lmFit(estroRMA, design) efit = eBayes(fit) estroAll = topTable(efit, coef=2, num=nrow(estroRMA)) estroTop = topTable(efit, coef=2, num=nrow(estroRMA), p.value=0.05) ## MA plot plot(logFC ~ AveExpr, data=estroAll, pch=".") points(logFC ~ AveExpr, data=estroTop, pch=19, col="red") ## Annotate and export the results library(annaffy) annCols = aaf.handler()[c(1:3, 7:9, 12)] annTab = aafTableAnn(estroTop$ID, "hgu95av2.db", annCols) estroAnnTab = aafTable(logFC=round(estroTop$logFC, 2), tstat=round(estroTop$t, 1), adjPval=signif(estroTop$adj.P.Val, 1)) estroAnnTab = merge(estroAnnTab, annTab) saveHTML(estroAnnTab, "Estrogen_Pres-Abs.html", title = "Estrogen Present-Absent")
This I can save in a text file called, say, wfEstrogen.R
and store with the data. This is a nice way of keeping the commands together and being able to re-run a specific analysis some time down the road
Writing a script
However, with the code arranged as above, we would still need to copy & paste the commands from the text file to the R console, otherwise each plot would just be overwritten by the next one, and we would only see the last one.
An easy way of avoiding this is to write the graphs directly to a file. R
offers a wide variety of different file formats, both bitmap (jpeg, png) and vector-based (postscript, pdf, wmf). Personally, I like pdfs:
## Volcano plot pdf("Volcanoplot.pdf") plot(-log10(adj.P.Val) ~ logFC, data=estroAll, pch=".") points(-log10(adj.P.Val) ~ logFC, data=estroTop, pch=19, col="red") dev.off()
Important: Once you're done with plotting to a specific file, you have to release it by closing the device through dev.off()
. Otherwise, R
will still be waiting for further plotting commands.
Another point to make here is that we have the same or very similar sequences of commands over and over and over again. It would be great if we could easily adapt these workflows for different data sets and problems. One way of doing this is to put all the stuff that needs to be changed in one place, or at least close together, or at least highlight it properly in the text. Also aesthetically, it would be nice if the variable names were a bit more neutral and less estrogen-biased.
By putting this together, we could get something like this:
## This R script shows a workflow for loading and processing the ## Bioconductor estrogen data set (8 chips from a 2x2 factorial design) ## ## Main effects: high vs low estrogen only ## ## Alexander.Ploner@ki.se 2010-10-16 ## Set up of parameters: working dir, cutoff for limma WORKDIR = "~/bitsCourse/Data/Estrogen" ## Where results go maxPval = 0.05 ## cutoff for DE DATADIR = system.file("extdata", package="estrogen") ## dir with cel, phenodat PHENOFIL = "estrogen.txt" ## phenodat filename ANNDB = "hgu95av2.db" CANDLIST_TITLE = "Estrogen: Absent vs Present" ## title of html output ## Start setwd(WORKDIR) ## Read the CEL files library(affy) ## CHANGE ME: we discard one bad chip celfiles = list.celfiles(DATADIR)[2:9] ## Read probe-level data rawdata = ReadAffy(filenames=celfiles, celfile.path=DATADIR) ## Read phenotypic data phenodat = read.AnnotatedDataFrame(PHENOFIL, path=DATADIR, sep="") ## Make sure that expressions and phenodata have same order phenoData(rawdata) = phenodat[sampleNames(rawdata),] ## Compute RMA expression values esetRMA = rma(rawdata) ## Compare distributions of probes/probesets before/after pdf("boxplotProbes.pdf") boxplot(rawdata) dev.off() pdf("boxplotRMA.pdf") boxplot(exprs(esetRMA)) dev.off() ## CHANGE ME: model matrix for limma ## Compare estrogen vs no estrogen design = model.matrix(~estrogen, data=pData(esetRMA)) ## Fit the models library(limma) fit = lmFit(esetRMA, design) efit = eBayes(fit) glAll = topTable(efit, coef=2, num=Inf) glTop = topTable(efit, coef=2, num=Inf, p.value=maxPval) ## MA plot pdf("MAplot.pdf") plot(logFC ~ AveExpr, data=glAll, pch=".") points(logFC ~ AveExpr, data=glTop, pch=19, col="red") dev.off() ## Annotate and export the results library(annaffy) annCols = aaf.handler()[c(1:3, 7:9, 12)] annTab = aafTableAnn(glTop$ID, ANNDB, annCols) glAnnTab = aafTable(logFC=round(glTop$logFC, 2), tstat=round(glTop$t, 1), adjPval=signif(glTop$adj.P.Val, 1)) glAnnTab = merge(glAnnTab, annTab) saveHTML(glAnnTab, "candidate_probesets.html", title = CANDLIST_TITLE)
Advantages in using scripts
- (Potentially) well-documented record of analysis
- Easily modifiable
- Runs directly from the
R
command line as
> source("~/bitsCourse/Workflows/wfEstrogen.R")
- Runs directly from the OS command lines as
R CMD BATCH wfEstrogen.R
(provided the path variable is set correctly and wfEstrogen.R
is in the current directory).