Recapitulation of the Atxn1 example

From BITS wiki
Jump to: navigation, search
Go to parent Introduction to R/Bioconductor for analysis of microarray data#Training Units

Background

The experiment with reference E-MEXP-886 compares global gene expression in five mice that are ko for Atxn1 with five wt mice.

Reading in the data

We do this the hard way, to show what this step would look like for self-generated data. Otherwise, there are much easier ways to download data from ArrayExpress (see below).

Organizing the files

We download expression data and sample description from [1]:

  • E-MEXP-886.raw.1.zip (celfiles, need to be unzipped)
  • E-MEXP-886.sdrf.txt (sample description)

It is highly preferable to store all the data (celfiles and sample description) in a separate directory. In my case, this would be ~/bitsCourse/Data/Atxn1/.

Once everything is in place, we start R and change the working directory to the new data directory:

> setwd("~/bitsCourse/Data/Atxn1/")

Important: if you specify directory paths for Windows, you need to double the backslashes for R to understand what you are saying, .i.e C:\\temp instead of C:\temp.

Reading expression data

Simply using ReadAffy without arguments will read all celfiles in the current working directory:

> library(affy)
> rawdata = ReadAffy()

Notice that the samples in the new object rawdata are in the same order as the files on the harddisk. This may be different for different operating systems.

> sampleNames(rawdata)
[1] "1307.cel"  "1364.cel"  "1749r.CEL" "1750.cel"  "1751.cel"  "1753.cel" 
[7] "1756.cel"  "1863.cel"  "1869.cel"  "1919.cel" 

Reading phenotypic data

Generally, when we have sample description data (phenotypic data) available as tab-delimited text file or Excel or similar, we want to use either read.AnnotatedDataFram or simple read.table to read in the file. In this case however, the sample description file is not very practical: large (more than 120 columns) and the rows in different order than the files.

Given that we are olny interested in the genotype of the samples, it is actually easiest to specify the sample information at the command line as below. The important thing (where we need to rely on the sample description) is to get the order of the samples/genotypes as for the raw data:

> genotype = c("ko", "wt", "ko", "ko", "ko", "wt", "wt", "wt", "wt","ko")

Now we can wrap this into a data frame and add the information to the raw data:

> phenodata = data.frame(genotype)
> rownames(phenodata) = sampleNames(rawdata)
> colnames(phenodata) = "Genotype"
> pData(rawdata) = phenodata

Pre-processing

We calculate RMA expression values. This is now easy:

> rmaexp = rma(rawdata)

Identify differentially expressed genes

Specify the grouping variable

First we have to state explicitly which is the grouping of samples that we want to use for comparison:

> design = model.matrix( ~ Genotype, data=pData(rmaexp))

Fit the gene-wise models

We load the package limma and perform the two separate fitting steps:

> library(limma)
> fit  = lmFit(rmaexp, design)
> efit = eBayes(fit)

Extract a list of candidate genes

From the fitted model, we can now extract the list of top-regulated genes. By default, just the top ten probesets are reported, but we can specify more via the argument number to the topTable function.

> topGenes = topTable(efit, coef=2)
> topGenes
                ID      logFC   AveExpr         t      P.Value adj.P.Val  
17150   1449038_at -0.6478137 10.824582 -6.203841 0.0000405525 0.9201363 
741     1416410_at -0.6669091  7.786242 -5.332340 0.0001632216 0.9997648 
1396    1417065_at  0.5419626  9.357026  4.531648 0.0006436661 0.9997648 
22354   1460330_at -0.3793695  7.131633 -4.379032 0.0008440473 0.9997648 
1284    1416953_at -0.4074768  7.201906 -4.217294 0.0011283071 0.9997648 
14355   1435386_at -0.4686320  6.385330 -4.058502 0.0015046330 0.9997648 
13864   1434195_at -0.3602817  5.362057 -3.937945 0.0018754074 0.9997648 
6174    1421979_at -0.4553935  4.647804 -3.899021 0.0020142525 0.9997648 
20468 1452405_x_at -0.2901493  7.552847 -3.890803 0.0020448945 0.9997648 
9746    1425551_at  0.3204928  7.201298  3.708317 0.0028637143 0.9997648 

While we get apparently reasonably small p-values for the top genes, these all but vanish after the adjustment for multiple testing (last column here). Even if we have rather small sample sizes here, this is somewhat disappointing.

Export an annotated gene list

We still decide to have a closer look at the top-regulated genes - maybe some interesting biology comes up? The package annaffy provides functionality for building annotated gene lists for export to HTML or tex files.

> library(annaffy)

annaffy offers a wide range of so-called handlers for adding different kinds of annotation. Here we decide on the following subset:

> annCols = aaf.handler()[c(1:3, 7:9, 12)]
> annCols
[1] "Probe"       "Symbol"      "Description" "Gene"        "Cytoband"   
[6] "UniGene"     "Pathway

We use the Affymterix probeset IDs from the table of top-regulated genes to extract this information from the annotation package for this specific chip (moe430a):

> annTab  = aafTableAnn(topGenes$ID, "moe430a.db", annCols)

We build a corresponding table of statistics from our analysis, including the log fold change, the t-statistic, the raw p-value and the adjusted p-value (which happens to be the false discovery rate or FDR).

> annTab2  = aafTable(logFC=round(topGenes$logFC, 2), tstat=round(topGenes$t, 1), Pval=signif(topGenes$P.Value, 1), FDR=signif(topGenes$adj.P.Val, 2))

Note that we round the numerical results so we don't drown in digits.

Finally, we merge the two tables and export them to a HTML file with a nice name and title:

> annTab = merge(annTab2, annTab)
> saveHTML(annTab, "Genelist Wt-Ko.html", title = "Ataxin1: Wt-Ko")

You can see the resulting file here.

Saving results

Depending on the next steps, we may want to save some or all of our results to the harddisk. We can e.g. save only the raw and processed expression data:

save(rawdata, rmaexp, file="Atxn1Expr.RData")

Or we decide to save everything:

> save.image("Atxn1Results.RData")

I generally do not recommend simply saving the whole workspace to the default .RData file, as can get difficult to manage many anonymous files at the same time (YMMV).

As a script file

The problem now is that with the super-didactic approach taken on these pages, it appears that the whole problem is broken down into a million little pieces, each with its own commands and problems. In practice, life becomes easier through using and re-using commented pieces of code that perform a specific task. Once established and understood, it is significantly easier to adpat these templates (or script files, or wokflows) for different data sets or different kinds of analysis.

As an example, I have wrapped the commands on this page into a such a workflow here. This a simple text file that can be modified in any text editor (e.g. Notepad).

Comments

Common problems

Most problems occurr when reading in the data. If you run into trouble, you may want to check the following frequent problems:

  1. Wrong order of samples
  2. No double backslashes in the path name

Using ArrayExpress

All the reading in can simly be done in one step by using the package ArrayExpress, which also incorporates all the meta information correctly:

> library(ArrayExpress)
> Atxn1Data = ArrayExpress("E-MEXP-886")
>sampleNames(Atxn1Data)
 [1] "1863.CEL"  "1919.CEL"  "1869.CEL"  "1749r.CEL" "1753.CEL"  "1307.CEL" 
 [7] "1750.CEL"  "1364.CEL"  "1751.CEL"  "1756.CEL" 
> pData(Atxn1Data)$Characteristics..Genotype.
 [1] "wild_type"    "ataxin 1 -/-" "wild_type"    "ataxin 1 -/-" "wild_type"   
 [6] "ataxin 1 -/-" "ataxin 1 -/-" "wild_type"    "ataxin 1 -/-" "wild_type"