Recapitulation of the Atxn1 example
Go to parent Introduction to R/Bioconductor for analysis of microarray data#Training Units
Contents
Background
The experiment with reference E-MEXP-886 compares global gene expression in five mice that are ko for Atxn1 with five wt mice.
Reading in the data
We do this the hard way, to show what this step would look like for self-generated data. Otherwise, there are much easier ways to download data from ArrayExpress (see below).
Organizing the files
We download expression data and sample description from [1]:
-
E-MEXP-886.raw.1.zip
(celfiles, need to be unzipped) -
E-MEXP-886.sdrf.txt
(sample description)
It is highly preferable to store all the data (celfiles and sample description) in a separate directory. In my case, this would be ~/bitsCourse/Data/Atxn1/
.
Once everything is in place, we start R
and change the working directory to the new data directory:
> setwd("~/bitsCourse/Data/Atxn1/")
Important: if you specify directory paths for Windows, you need to double the backslashes for R to understand what you are saying, .i.e C:\\temp
instead of C:\temp
.
Reading expression data
Simply using ReadAffy
without arguments will read all celfiles in the current working directory:
> library(affy) > rawdata = ReadAffy()
Notice that the samples in the new object rawdata
are in the same order as the files on the harddisk. This may be different for different operating systems.
> sampleNames(rawdata) [1] "1307.cel" "1364.cel" "1749r.CEL" "1750.cel" "1751.cel" "1753.cel" [7] "1756.cel" "1863.cel" "1869.cel" "1919.cel"
Reading phenotypic data
Generally, when we have sample description data (phenotypic data) available as tab-delimited text file or Excel or similar, we want to use either read.AnnotatedDataFram
or simple read.table
to read in the file. In this case however, the sample description file is not very practical: large (more than 120 columns) and the rows in different order than the files.
Given that we are olny interested in the genotype of the samples, it is actually easiest to specify the sample information at the command line as below. The important thing (where we need to rely on the sample description) is to get the order of the samples/genotypes as for the raw data:
> genotype = c("ko", "wt", "ko", "ko", "ko", "wt", "wt", "wt", "wt","ko")
Now we can wrap this into a data frame and add the information to the raw data:
> phenodata = data.frame(genotype) > rownames(phenodata) = sampleNames(rawdata) > colnames(phenodata) = "Genotype" > pData(rawdata) = phenodata
Pre-processing
We calculate RMA expression values. This is now easy:
> rmaexp = rma(rawdata)
Identify differentially expressed genes
Specify the grouping variable
First we have to state explicitly which is the grouping of samples that we want to use for comparison:
> design = model.matrix( ~ Genotype, data=pData(rmaexp))
Fit the gene-wise models
We load the package limma
and perform the two separate fitting steps:
> library(limma) > fit = lmFit(rmaexp, design) > efit = eBayes(fit)
Extract a list of candidate genes
From the fitted model, we can now extract the list of top-regulated genes. By default, just the top ten probesets are reported, but we can specify more via the argument number
to the topTable
function.
> topGenes = topTable(efit, coef=2) > topGenes ID logFC AveExpr t P.Value adj.P.Val 17150 1449038_at -0.6478137 10.824582 -6.203841 0.0000405525 0.9201363 741 1416410_at -0.6669091 7.786242 -5.332340 0.0001632216 0.9997648 1396 1417065_at 0.5419626 9.357026 4.531648 0.0006436661 0.9997648 22354 1460330_at -0.3793695 7.131633 -4.379032 0.0008440473 0.9997648 1284 1416953_at -0.4074768 7.201906 -4.217294 0.0011283071 0.9997648 14355 1435386_at -0.4686320 6.385330 -4.058502 0.0015046330 0.9997648 13864 1434195_at -0.3602817 5.362057 -3.937945 0.0018754074 0.9997648 6174 1421979_at -0.4553935 4.647804 -3.899021 0.0020142525 0.9997648 20468 1452405_x_at -0.2901493 7.552847 -3.890803 0.0020448945 0.9997648 9746 1425551_at 0.3204928 7.201298 3.708317 0.0028637143 0.9997648
While we get apparently reasonably small p-values for the top genes, these all but vanish after the adjustment for multiple testing (last column here). Even if we have rather small sample sizes here, this is somewhat disappointing.
Export an annotated gene list
We still decide to have a closer look at the top-regulated genes - maybe some interesting biology comes up? The package annaffy
provides functionality for building annotated gene lists for export to HTML or tex files.
> library(annaffy)
annaffy
offers a wide range of so-called handlers for adding different kinds of annotation. Here we decide on the following subset:
> annCols = aaf.handler()[c(1:3, 7:9, 12)] > annCols [1] "Probe" "Symbol" "Description" "Gene" "Cytoband" [6] "UniGene" "Pathway
We use the Affymterix probeset IDs from the table of top-regulated genes to extract this information from the annotation package for this specific chip (moe430a):
> annTab = aafTableAnn(topGenes$ID, "moe430a.db", annCols)
We build a corresponding table of statistics from our analysis, including the log fold change, the t-statistic, the raw p-value and the adjusted p-value (which happens to be the false discovery rate or FDR).
> annTab2 = aafTable(logFC=round(topGenes$logFC, 2), tstat=round(topGenes$t, 1), Pval=signif(topGenes$P.Value, 1), FDR=signif(topGenes$adj.P.Val, 2))
Note that we round the numerical results so we don't drown in digits.
Finally, we merge the two tables and export them to a HTML file with a nice name and title:
> annTab = merge(annTab2, annTab) > saveHTML(annTab, "Genelist Wt-Ko.html", title = "Ataxin1: Wt-Ko")
You can see the resulting file here.
Saving results
Depending on the next steps, we may want to save some or all of our results to the harddisk. We can e.g. save only the raw and processed expression data:
save(rawdata, rmaexp, file="Atxn1Expr.RData")
Or we decide to save everything:
> save.image("Atxn1Results.RData")
I generally do not recommend simply saving the whole workspace to the default .RData
file, as can get difficult to manage many anonymous files at the same time (YMMV).
As a script file
The problem now is that with the super-didactic approach taken on these pages, it appears that the whole problem is broken down into a million little pieces, each with its own commands and problems. In practice, life becomes easier through using and re-using commented pieces of code that perform a specific task. Once established and understood, it is significantly easier to adpat these templates (or script files, or wokflows) for different data sets or different kinds of analysis.
As an example, I have wrapped the commands on this page into a such a workflow here. This a simple text file that can be modified in any text editor (e.g. Notepad).
Comments
Common problems
Most problems occurr when reading in the data. If you run into trouble, you may want to check the following frequent problems:
- Wrong order of samples
- No double backslashes in the path name
Using ArrayExpress
All the reading in can simly be done in one step by using the package ArrayExpress
, which also incorporates all the meta information correctly:
> library(ArrayExpress) > Atxn1Data = ArrayExpress("E-MEXP-886") >sampleNames(Atxn1Data) [1] "1863.CEL" "1919.CEL" "1869.CEL" "1749r.CEL" "1753.CEL" "1307.CEL" [7] "1750.CEL" "1364.CEL" "1751.CEL" "1756.CEL" > pData(Atxn1Data)$Characteristics..Genotype. [1] "wild_type" "ataxin 1 -/-" "wild_type" "ataxin 1 -/-" "wild_type" [6] "ataxin 1 -/-" "ataxin 1 -/-" "wild_type" "ataxin 1 -/-" "wild_type"