PubMA Exercise.2b

From BITS wiki
Jump to: navigation, search

Follow-up analysis demo in RStudio

RStudio.2x.png

[ Main_Page | PubMA_Exercise.1 | PubMA_Exercise.2 |
| PubMA_Exercise.3 ]


Introduction

The former exercise produced a full table with differential expression between Heart and Diaphragm samples that can be used directly with Functional enrichment tools or IPA. The R-script used by GEO2R is quite standard and basic and only provides BoxPlots for signal distribution. This is a minimum when doing MA data analysis and users often appreciate to have some more QC done on the data to control for biases or inconsistencies associated with artifacts or with variability in the experiment.

Expert analyst makes use of the [R]-Bioconductor toolbox to further analyze their data. Some examples of standard R functions are shown below to show you the power of this programing language. The following exercise is provided as an appetizer and is far not exhaustive, you will find many more methods and tools by exploring the CRAN documentation (http://cran.r-project.org[1])

installing on a non-BITS laptop

required for non-BITS laptops

# If running a unix OS computer please install 'libxml2-devel and libcurl-devel' required to build dependencies
 
# install minimal bioconductor
source("http://bioconductor.org/biocLite.R")
biocLite()
 
# Add required packages for today's session (will add RCurl, XML, ... dependencies)
## biocLite(c("Biobase", "GEOquery", "limma",  "affy"))
 
# some other packages are not required here but are required for RobiNA in other exercises
# these required packages were identified by parsing all R scripts present in the windows build of Robina
## biocLite (c("affy", "affyPLM",  "RankProd", "limma", "gcrma", "statmod", "marray", "plier", "makecdfenv", "Biostrings", 
    "edgeR", "DESeq", "EDASeq"))

 

Extend the GEO2R analysis in R with RStudio

RStudio is the current best graphical environment to program i the [R] language and offers many facilities to learn and develop your R skills. The program is available free of charge from http://www.rstudio.com[2].

adapt the GEO2R script in RStudio

Original GEO2R code

The starting script is first reproduced here

initial GEO2R code

# Version info: R 2.14.1, Biobase 2.15.3, GEOquery 2.23.2, limma 3.10.1
# R scripts generated  Tue Aug 12 05:30:54 EDT 2014
 
################################################################
#   Differential expression analysis with limma
library(Biobase)
library(GEOquery)
library(limma)
 
# load series and platform data from GEO
## gset <- getGEO("GSE6943", GSEMatrix =TRUE)
gset <- getGEO("GSE6943", GSEMatrix =TRUE)
if (length(gset) > 1) idx <- grep("GPL341", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
 
# make proper column names to match toptable 
fvarLabels(gset) <- make.names(fvarLabels(gset))
 
# group names for all samples
sml <- c("G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1");
 
# log2 transform
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
  (qx[6]-qx[1] > 50 && qx[2] > 0) ||
  (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
            exprs(gset) <- log2(ex) }
 
# set up the data and proceed with analysis
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)
 
# load NCBI platform annotation
gpl <- annotation(gset)
## platf <- getGEO(gpl, AnnotGPL=TRUE)
platf <- getGEO(gpl, AnnotGPL=TRUE)
ncbifd <- data.frame(attr(dataTable(platf), "table"))
 
# replace original platform annotation
tT <- tT[setdiff(colnames(tT), setdiff(fvarLabels(gset), "ID"))]
tT <- merge(tT, ncbifd, by="ID")
tT <- tT[order(tT$P.Value), ]  # restore correct order
 
tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","Gene.symbol","Gene.title"))
write.table(tT, file=stdout(), row.names=F, sep="\t")
 
################################################################
#   Boxplot for selected GEO samples
library(Biobase)
library(GEOquery)
 
# load series and platform data from GEO
 
## gset <- getGEO("GSE6943", GSEMatrix=TRUE)
gset <- getGEO("GSE6943", GSEMatrix=TRUE)
if (length(gset) > 1) idx <- grep("GPL341", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
 
# group names for all samples in a series
sml <- c("G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1")
 
# order samples by group
ex <- exprs(gset)[ , order(sml)]
sml <- sml[order(sml)]
fl <- as.factor(sml)
labels <- c("Diaphragm","Heart")
 
# set parameters and draw the plot
palette(c("#dfeaf4","#f4dfdf", "#AABBCC"))
dev.new(width=4+dim(gset)[[2]]/5, height=6)
par(mar=c(2+round(max(nchar(sampleNames(gset)))/2),4,2,1))
title <- paste ("GSE6943", '/', annotation(gset), " selected samples", sep ='')
boxplot(ex, boxwex=0.6, notch=T, main=title, outline=FALSE, las=2, col=fl)
legend("topleft", labels, fill=palette(), bty="n")

Improved code

The next script was adapted to keep local files and to save results to file instead of sanding them to a new browser window. Edits are shown where original lines are commented out by '##'

  • 'destdir = base' saves the downloads to the local folder instead of to the temp directory
  • 'number=nrow(fit2)' generates the full table instead of the top250

We now show the edited script where lines starting with ## are modified in the next line(s)

## added configuration
base <- "~/PUBMA2014/ex2b-files", sep="")
setwd(base)
 
# Version info: R 2.14.1, Biobase 2.15.3, GEOquery 2.23.2, limma 3.10.1
# R scripts generated  Tue Aug 12 05:30:54 EDT 2014
 
################################################################
#   Differential expression analysis with limma
library(Biobase)
library(GEOquery)
library(limma)
 
# load series and platform data from GEO
## gset <- getGEO("GSE6943", GSEMatrix =TRUE)
gset <- getGEO("GSE6943", GSEMatrix =TRUE, destdir = base)
if (length(gset) > 1) idx <- grep("GPL341", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
 
# make proper column names to match toptable 
fvarLabels(gset) <- make.names(fvarLabels(gset))
 
# group names for all samples
sml <- c("G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1");
 
# log2 transform
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
          (qx[6]-qx[1] > 50 && qx[2] > 0) ||
          (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
  exprs(gset) <- log2(ex) }
 
# set up the data and proceed with analysis
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
 
## tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)
tT <-  topTable(fit2, adjust="fdr", sort.by="B", number=nrow(fit2))
 
# load NCBI platform annotation
gpl <- annotation(gset)
## platf <- getGEO(gpl, AnnotGPL=TRUE)
platf <- getGEO(gpl, AnnotGPL=TRUE, destdir = base)
ncbifd <- data.frame(attr(dataTable(platf), "table"))
 
# replace original platform annotation
tT <- tT[setdiff(colnames(tT), setdiff(fvarLabels(gset), "ID"))]
tT <- merge(tT, ncbifd, by="ID")
tT <- tT[order(tT$P.Value), ]  # restore correct order
 
## tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","Gene.symbol","Gene.title"))
tT.final <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","Gene.symbol","Gene.title"))
## write.table(tT.final, file=stdout(), row.names=F, sep="\t")
write.table(tT.final, file="GSE6943_DE.txt", row.names=F, sep="\t", quote=FALSE)
################################################################
#   Boxplot for selected GEO samples
library(Biobase)
library(GEOquery)
 
# load series and platform data from GEO
 
## gset <- getGEO("GSE6943", GSEMatrix=TRUE)
gset <- getGEO("GSE6943", GSEMatrix=TRUE, destdir = base)
if (length(gset) > 1) idx <- grep("GPL341", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
 
# group names for all samples in a series
sml <- c("G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1")
 
# order samples by group
ex <- exprs(gset)[ , order(sml)]
sml <- sml[order(sml)]
fl <- as.factor(sml)
labels <- c("Diaphragm","Heart")
 
# set parameters and draw the plot
## save to file
filename <- paste(base, "GSE6943-boxplot.pdf", sep="/")
pdf(file = filename, bg = "white")
palette(c("#dfeaf4", "#f4dfdf", "#AABBCC"))
## dev.new(width=4+dim(gset)[[2]]/5, height=6)
par(mar=c(2+round(max(nchar(sampleNames(gset)))/2),4,2,1))
title <- paste ("GSE6943", '/', annotation(gset), " sample signal distribution", sep ='')
boxplot(ex, boxwex=0.6, notch=T, main=title, outline=FALSE, las=2, col=fl)
legend("topleft", labels, fill=palette(), bty="n")
dev.off()
## save R workspace for reuse
outfile <- paste(base, "Workspace.RData", sep="/")
save.image(file = outfile)
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] GEOquery_2.31.1     Biobase_2.25.0      BiocGenerics_0.11.4 limma_3.21.12      

loaded via a namespace (and not attached):
[1] RCurl_1.95-4.3 tools_3.1.1    XML_3.98-1.1

download exercise files

Download exercise files here.

Use the right application to open the files present in ex2b-files

References:
  1. http://cran.r-project.org
  2. http://www.rstudio.com

[ Main_Page | PubMA_Exercise.1 | PubMA_Exercise.2 |
| PubMA_Exercise.3 ]