PubMA Exercise.4

From BITS wiki
Jump to: navigation, search

RobiNA analysis

RobiNA.png

[ Main_Page | Hands-on Analysis of public microarray datasets | PubMA_Exercise.3 | PubMA_Exercise.4 |
| PubMA_Exercise.5 ]


Introduction

" Several commercial programs such as GeneSpring featuring a rich and simple user interface to statistically analyze high-throughput "omics" data are available. However, these are usually very expensive and might even require an annual subscription. On the other hand, there are free, open source tools such as BioConductor which offer great statistical options and support for free, but this power often comes at the price of usability, since these tools often feature either only a command line interface or solely very basic user interaction.

Finally, there are tools such as RMA Express which offer a rich user interface but only a very limited set of options.

RobiNA tries to bridge this gap by providing a flexible user friendly graphical interface to unleash the power of R/BioConductor for the individual biologist. RobiNA comes as a convenient all-in-one installation package, that automatically installs the application itself plus all required external tools (i.e. the R and BioConductor frameworks and bowtie). "[1]

Technical.png Although RobiNA can handle several data type including two color microarray and even RNASeq data, we here provide a simple tutorial to perform QC and differential analysis of Affymetrix microarray data comparable to what can be achieved using the CLC Main workbench

 

Please see the RobiNA quick guide and Robin and RobiNA user's manual which contain step-by-step walk through and detailed information. RobinA can be downloaded from http://mapman.gabipd.org/web/guest/robin

 

Required before starting RobiNA

RobiNA working great ... or NOT

Technical.png RobiNA should work under Windows, Mac, and Unix as it uses Java which is a universal language. However, we have found that some computers and operating systems are refractory to RobiNA and even with 6GB RAM may have issues running RobiNA with as few as 10 CEL files. RobiNA requires a recent version of Java JDK (you can obtain JDK from: [1]). The RobiNA developers do not actively support the software right now and you will need to try things by yourself if you have issues

 

The CDF annotation file

RobiNA needs a CDF file to work. CDF files are complex text tables allowing the identification of the probes and genes associated with the chip spots, such files are sometimes difficult to find. The RobinA software was developed by a plant groups and has plant as main focus and does not include non-plant annotations. For those who want to analyze other living organisms, the microarray annotation file corresponding to the used chip (CDF) should be obtained before starting RobiNA. A place where to find Affymetrix CDF files should be the company website but it is often difficult to locate the CDF among the long lists of available Affymetrix resources (http://www.affymetrix.com/estore/ | free registration required)[2].

The easiest way to obtain the correct Affymetrix CDF file is probably by using the Affymetrix Expression Console.

Technical.png the Affymetrix Expression Console software is only available for Windows and downloading/using it requires a free user registration

After installing and starting the Affymetrix Expression Console program, Use the File menu to access the 'library' download manager

get_cdf_1_4.png

Log into the Affy support area using your credentials

get_cdf_2_4.png

Search and check the chip corresponding to your data, and let the manager download the associated files

get_cdf_3_4.png

In the example above we highlight the rat array that was already downloaded and was used to generate the demo data in the CLC Bio Main Workbench (measurements of gene expression in tissues from cardiac left ventricle and diaphragm muscle of rats; Lunteren et al., 2008).

The Affymetrix Expression Console will download the CDF file to the folder that you have specified in the library path (typically, the folder where your CEL files are stored). The file shown here as example is for Arabidopsis

get_cdf_4_4.png

You can change this folder via Edit in the top menu. Click Set library path.

get_cdf_4_5.png

 

Step by Step RobiNA analysis workflow

You can use RobiNA for...

  • quality assessment of your data
  • normalization of your microarray data
  • detection of differentially expressed genes
  • preparation of the data for an import into MapMan and/or excel
  • generation of informative plots on your experiment

Pre-requisites are the CEL files containing each hybridization data (obtained from the GEO page as a '.tar' archive and decompressed to individual CEL files) and a CDF file listing all probes present on the chip used in the experiment. Today's CDF file can be obtained from the Affymetrix site [2] after registering (free) and searching for the library file corresponding to the platform reported in the GEO pages (in our case: Rat Expression Set 230, aka RAE230A).

The decompressed archive contains the required CDF file under 'CD_RAE230_rev04/Full/RAE230A/LibFiles/RAE230A.CDF' that can be copied in the RobiNA project area.

 

start RobiNA and create a new project for results

01-startup-screen.png


02-new-project.png

 

Importing CDF & CEL files

Technical.png While preparing this training, we discovered that GEO had a damaged file (GSM160097) for one of the sample, we will therefore do this training with only 5-replicate in the Diaphragm group while the CLC analysis was done with the full data

You are now ready to import the CEL files and the matching CDF annotation database.


03-import_CDF_CEL.png

 

Performing QC on each CEL file

When QC finishes, a number of single plots can be reviewed by clicking on each section. These figures report issues or demonstrate good quality of the data and are also saved to disk for later inclusion in your reports.

Overview of all QC-Plots
04-QC-choose.png
04-QC-results1.png
04-QC-results2.png
04-QC-results3.png
04-QC-results4.png
04-QC-results5.png
04-QC-results6.png

 

Evaluating QC results

One plot of each kind is reproduced below and shows that RobiNA generates classical QC plots showing how 'good' the data is and how well it divides according to the defined groups.

Details for each QC-Plot type
robin_boxplot.png
robin_maplot1.png
# one for each sample
robin_rna.png
robin_hist.png
robin_scat12.png
<- one for each pairwise sample comparison (n^2)
robin_pca.png
# distribution of the variance across principal components and PCA plot
robin_hclust.png

 

Define design

In this part, we define contrasts and start the differential analysis. Please note that complex design are possible here by defining 'metagroups' next to the sample groups (eg. mouse background, stimulant, ...). This is not the aim of this very simplistic design comparing two tissues. Please refer to the software documentation for more detailed examples.


05-configure-groups-groups.png


06-set-design.png

 

Computing differential expression

07-load_CDF_and-compute.png

Technical.png RobiNA was developed for plants and can use annotation files from the website to add gene descriptions to the differential expression table. For non-plant data, users will need to add these annotations using external tools


08-select_annotations.png

We choose 'Skip' as we do not have annotation files for rat. The differential expression gets computed. When done, we select 'Exit' from the front window.


09_exit.png

 

Reviewing DE results

RobiNA saves a number of files to the disk in its folder structure. Text tables can be found in the 'detailed results' folder and are partly reproduced below. A short summary lists the parameters applied to perform the analysis while two tables report the full results and the top-100 results after sorting lines in decreasing adj.P.Val order. Three PNG pictures (below) report the Up-regulated gene count, DE-gene count and Down-regulated gene-count respectively.

Final plots summarize key facets of the analysis as done in classical MA analysis. The left plot shows in red probesets that are differentially expressed with the selected statistics; the right plot confirms that the samples segregate as defined in the grouping. The bottom simple venn diagrams report counts of DE probesets (Total, Up, Down).

Plots
MAplot_Heart-Diaphragm.png
PCAplot.png
vennDiagram_total.png
vennDiagram_up.png
vennDiagram_down.png

In addition to the plots, several text tables are saved in 'detailed_results' and sampled below.

Analysis summary

###################################
# Robin affymetrix data analysis summary
# RobiNA-results
#  8/18/2014; 16:35:28
###################################

# Input files:

# Normalization settings for quality control
normalization method: rma
P-value correction method: BH
analysis strategy: Limma

# Normalization settings for main analysis
normalization method: rma
P-value correction method: BH
Multiple testing strategy: nestedF
P-value cut-off value for significant differential expression: 0.05
Genes that showed a log2-fold change smaller than two ignored: yes

# The analysis produced the following warnings
none

Head of 'full_table_Heart - Diaphragm.txt'

ID          logFC              AveExpr           t                  P.Value               adj.P.Val             B
1371247_at  8.8690486752098    9.2066061992338   138.180384786338   7.72673860255015e-23  1.23032858768406e-18  40.3028038935592
1370412_at  8.00700839973324   8.66636177255196  112.671243745562   1.24362501567467e-21  9.90112056229389e-18  38.5147635203019
1387787_at  8.51384348690296   10.0136513602682  107.791218026326   2.27212516655303e-21  1.2059683009008e-17   38.0893208171574
1372195_at  9.44110458586408   9.1061642897453   100.528072562373   5.87151716284703e-21  2.01180346646277e-17  37.3935367235781
1370971_at  8.88483556476391   8.44782914943697  99.9889507276205   6.31728778013807e-21  2.01180346646277e-17  37.3386444895643
1367896_at  8.17319442370653   8.32878048386055  97.7761545656647   8.56617301567765e-21  2.27331954881059e-17  37.1083132290366
1374248_at  7.92028193041167   9.06490481778715  95.9358621573261   1.10936083087999e-20  2.52347893001459e-17  36.9103955684498
1368093_at  -8.61265104186292  8.23846334693981  -81.9875192287634  9.40341381711377e-20  1.87163197762378e-16  35.195893580546
1388139_at  8.60278366808529   8.70065810953839  80.0047250421511   1.31181945592154e-19  2.32090013295985e-16  34.9170369337463

Head of 'mean_rma_normalized_expression_values.txt'

Identifier  Heart             Diaphragm
1367452_at  9.5293566110383   9.5889443412077
1367453_at  10.1758428259477  10.1726921462891
1367454_at  8.87231561088045  8.86593395245015
1367455_at  10.4751392501094  10.3325900097593
1367456_at  10.7698324122072  10.286304116399
1367457_at  8.75165515455833  8.73159768841635
1367458_at  7.54000953569876  7.71781966096578
1367459_at  11.0420305904582  11.1519497142209
1367460_at  9.90893064678104  9.5235636384544

Head of 'raw_rma_normalized_expression_values.txt'

Identifier  GSM160089.CEL     GSM160090.CEL     GSM160091.CEL     GSM160092.CEL     GSM160093.CEL     GSM160094.CEL     GSM160095.CEL     GSM160096.CEL     GSM160098.CEL     GSM160099.CEL     GSM160100.CEL
1367452_at  9.64295883656058  9.65178912787566  9.32863584112902  9.48475646698994  9.4716451160078   9.59635427766681  9.65872553515732  9.63503499785589  9.50781431044761  9.58763738444463  9.55550947813306
1367453_at  10.1553644909771  10.2499020063438  10.1861859768652  10.0965629262007  10.159434523851   10.2076070314483  10.2269807954293  10.195253118062   10.1861859768652  10.2883928784824  9.96664796260638
1367454_at  8.73905780465407  9.08252980064849  9.07161018095027  8.68762900735943  8.81499836330383  8.83806850836661  8.88654295440762  9.05310137910709  8.68462564392101  8.8673312764484   8.83806850836661
1367455_at  10.4763615418993  10.4471239467834  10.5710504733759  10.3264804870339  10.4745349426514  10.5552841089126  10.5959387062089  10.5139881779964  10.2210244191821  10.3038610413289  10.0281377040801
1367456_at  10.9984593902943  10.8441477983143  10.673570257498   10.6700249981427  10.6750331606405  10.7577588683534  10.1986166073672  10.3974448991471  10.2291761637936  10.3422547513151  10.2640281603721
1367457_at  8.78904682538038  8.72229122394877  8.67050724938203  8.9448960477874   8.72229122394877  8.66089835690261  8.76340465790531  8.76513510798621  8.68486622829271  8.72229122394877  8.72229122394877
1367458_at  7.53098410601309  7.35718220143744  7.71709223686958  7.65558449687549  7.45098048336888  7.52823368962808  7.7461437721208   7.40572408979464  7.74545079274697  7.7485040416615   7.94327560850499
1367459_at  11.1105760949228  11.2159762048115  10.9101244239086  11.149432063699   10.8953374468102  10.970737308597   11.2355512291063  11.2441035772632  11.0435353491247  11.219515240616   11.0170431749943
1367460_at  9.96660388228079  9.84511446842093  9.94477628710098  9.91078389672863  9.84447947912025  9.94182586703467  9.87080073408934  9.70116063385167  9.49343722267783  9.27274366865815  9.279675932995

 

Adding annotations to the RobiNA data

Because RobiNA does not support non-plant organisms, it saves the data in a quite anonymous way with only probe IDs. The code below is borrowed from the GEO2R script and adds gene symbols and additional annotations to the RobiNA table.

R-code to annotate the RobiNA data

# Add annotations to the RobiNA full table
# the code below is borrowed to the GEO2R code and adapted
library(GEOquery)
 
# make sure that the surrent directory is set to folder enclosing the 'RobiNA-results' folder
base <- getwd()
 
# load RobiNA full table in a data-frame
robina.folder <- paste(base, "RobiNA-results", "detailed_results", sep="/")
robina.data <- paste(robina.folder, "full_table_Heart - Diaphragm.txt", sep="/")
robina.full <- read.delim(robina.data, as.is = TRUE)
 
# load NCBI platform annotation
gpl <- "GPL341"
platf <- getGEO(gpl, AnnotGPL=TRUE, destdir = base)
ncbifd <- data.frame(attr(dataTable(platf), "table"))
 
# replace original platform annotation
data <- merge(robina.full, ncbifd, by="ID")
data <- data[order(data$P.Value), ]  # restore correct order
 
# preview first 10 columns
head(data)[,c(1:10)]
> head(data)[,c(1:10)]
              ID    logFC   AveExpr         t      P.Value    adj.P.Val        B
3796  1371247_at 8.869049  9.206606 138.18038 7.726739e-23 1.230329e-18 40.30280
2961  1370412_at 8.007008  8.666362 112.67124 1.243625e-21 9.901121e-18 38.51476
11906 1387787_at 8.513843 10.013651 107.79122 2.272125e-21 1.205968e-17 38.08932
4744  1372195_at 9.441105  9.106164 100.52807 5.871517e-21 2.011803e-17 37.39354
3520  1370971_at 8.884836  8.447829  99.98895 6.317288e-21 2.011803e-17 37.33864
445   1367896_at 8.173194  8.328780  97.77615 8.566173e-21 2.273320e-17 37.10831
                                                                                         Gene.title Gene.symbol
3796                                                             troponin T type 3 (skeletal, fast)       Tnnt3
2961                                                             troponin T type 1 (skeletal, slow)       Tnnt1
11906                                    myosin light chain, phosphorylatable, fast skeletal muscle       Mylpf
4744                                                                       troponin C type 2 (fast)       Tnnc2
3520  myosin, heavy chain 2, skeletal muscle, adult///myosin, heavy chain 1, skeletal muscle, adult Myh2///Myh1
445                                                                            carbonic anhydrase 3        Car3
              Gene.ID
3796            24838
2961           171409
11906           24584
4744           296369
3520  691644///287408
445             54232
# save to new file
outfile <- paste(robina.folder, "GSE6943_DE-Robina.txt", sep="/")
write.table(data, file=outfile, row.names=F, sep="\t", quote=FALSE)
 
colnames(data)

The resulting file has gained 20 columns in addition to the first 7 created by RobiNA

 [1] "ID"                    "logFC"                 "AveExpr"               "t"                    
 [5] "P.Value"               "adj.P.Val"             "B"                     "Gene.title"          
 [9] "Gene.symbol"           "Gene.ID"               "UniGene.title"         "UniGene.symbol"      
[13] "UniGene.ID"            "Nucleotide.Title"      "GI"                    "GenBank.Accession"    
[17] "Platform_CLONEID"      "Platform_ORF"          "Platform_SPOTID"       "Chromosome.location"  
[21] "Chromosome.annotation" "GO.Function"           "GO.Process"            "GO.Component"        
[25] "GO.Function.ID"        "GO.Process.ID"         "GO.Component.ID"

 

Conclusion

RobiNA is a wrapper for R-code, developed and standardized by the authors to run reproducibly and do as they are expected. Although simple in layout, RobiNA appears as a quite performant alternative top more top-notch analysis methods reported in the literature. A computer with sufficient resources is wished to perform QC steps in a reasonable time but current strong laptops and desktops should do the job.

For those who cannot afford the more expensive CLC-Main workbench and do not wish to learn [R] and Bioconductor, RobiNA seems a good choice when in need of MA analysis (and it can also do standard RNASeq analysis - as will be described in a separate hands-on)

The complete file is accessible here (use right-click download linked file as or navigate from the page bottom-link)

 

download exercise files

Download exercise files here.

Use the right application to open the files present in ex4-files

References:
  1. Marc Lohse, Adriano Nunes-Nesi, Peter Krüger, Axel Nagel, Jan Hannemann, Federico M Giorgi, Liam Childs, Sonia Osorio, Dirk Walther, Joachim Selbig, Nese Sreenivasulu, Mark Stitt, Alisdair R Fernie, Björn Usadel
    Robin: an intuitive wizard application for R-based expression microarray quality assessment and analysis.
    Plant Physiol: 2010, 153(2);642-51
    [PubMed:20388663] ##WORLDCAT## [DOI] (I p)

    http://mapman.gabipd.org/web/guest/robin
    
  2. 2.0 2.1 http://www.affymetrix.com/estore/

[ Main_Page | Hands-on Analysis of public microarray datasets | PubMA_Exercise.3 | PubMA_Exercise.4 |
| PubMA_Exercise.5 ]