How to create boxplots of microarray data

Go to parent Analyze your own microarray data in R/Bioconductor

The code on this page works for both affy and oligo

How to plot the box plots of the raw data of each array ?
Box plots are created via the boxplot() method which can be applied directly on an AffyBatch/FeatureSet object. name = "boxplot.jpg" jpeg(name) boxplot(data,which='pm',col='red',names=ph@data$sample) dev.off() The which argument allows you to specify if you want to use: perfect match probes only: which='pm' mismatch probes only: which='mm' both: which='both' Since the use of MM probes is highly controversial, we will work with PM probes only. The col argument specifies the color of the boxes, the names argument the labels on the X-axis.

Box plots allow you to assess if the scale and distribution of the data on different arrays is comparable. Differences in shape or center of the boxes indicate that normalization of the data is required.

You can also create box plots using ggplot.

How to create the box plot of the raw data using ggplot ?
First we need to get the data in the correct format for ggplot: a data frame with log intensities in one column and sample names in the second column. We'll only use the PM intensities pmexp = pm(data) We create two empty vectors that will serve as the two columns of the data frame: one to store the sample names in, called sampleNames one to store the log intensities in , called logs The initialization of these vectors is essential. If you run into problems when creating these vectors and you have to repeat the plotting, always reinitialize these vectors (recreate them so that they are empty) The dim() method returns the number of rows (probes) and columns (samples) of the matrix containing the PM intensities. The number of rows corresponds to the number of PM intensities (probes) that is available for each sample (in this example that's 251078). The sampleNames vector needs the same number of rows consisting of sample names. The logs vector will contain the log PM intensities of all six samples stacked into a single column. Up to this point we have extracted raw PM intensities, they are not yet log-transformed! You can do the log transformation using the log2() method. sampleNames = vector() logs = vector() for (i in 1:6) { sampleNames = c(sampleNames,rep(ph@data[i,1],dim(pmexp)[1])) logs = c(logs,log2(pmexp[,i])) } If you have 3 groups of 3 replicates, the code is as follows: sampleNames = vector() logs = vector() for (i in 1:9) { sampleNames = c(sampleNames,rep(ph@data[i,1],dim(pmexp)[1])) logs = c(logs,log2(pmexp[,i])) } After we have filled the vectors with the data we need, we combine sample names and log intensities into a single data frame: logData = data.frame(logInt=logs,sampleName=sampleNames) Now we can create the plot: dataBox = ggplot(logData,aes(sampleName,logInt)) dataBox + geom_boxplot()

How to create a boxplot of normalized intensities ?
It's the same code as for raw intensities but this you use the normalized intensities as input (data.matrix) instead of the raw data: name = "boxplotnorm.jpg" jpeg(name) boxplot(data.matrix,col='red',names=ph@data$sample) dev.off()

How to create a box plot of normalized intensities using ggplot ?
First we need to get the data in the correct format for ggplot: a data frame with normalized log intensities in one column and sample names in the second column. We create two vectors: one to store the sample names in, called sampleNames one to store the normalized log intensities in , called normlogs The dim() method returns the number of rows (probe sets) and columns (samples) of the matrix containing the normalized intensities. The number of rows corresponds to the number of normalized intensities (probe sets) in each sample (in this example it's 22810). The sampleNames vector needs the same number of entries of each sample name stacked into a single column. The normlogs vector will contain the normalized intensities of all six samples stacked into a single column. sampleNames = vector() normlogs = vector() for (i in 1:6) { sampleNames = c(sampleNames,rep(ph@data[i,1],dim(data.matrix)[1])) normlogs = c(normlogs,data.matrix[,i]) } If you have 3 groups of 3 samples, the code is as follows: sampleNames = vector() normlogs = vector() for (i in 1:9) { sampleNames = c(sampleNames,rep(ph@data[i,1],dim(data.matrix)[1])) normlogs = c(normlogs,data.matrix[,i]) } After having filled the vectors with data, we combine sample names and normalized intensities into a single data frame: normData = data.frame(norm_logInt=normlogs,sampleName=sampleNames) Now we can create the plot. To compare this plot to the boxplot we made for raw log intensities we have to set them to the same Y-scale. The scale for the normalized data is much lower than these for the raw data so we have to make the box plot setting the Y-scale from 2 to 16. To set the Y-scale we can use the ylim() method. To add a title to the plot we use the ggtitle() method. dataBox = ggplot(normData, aes(sampleName,norm_logInt)) dataBox + geom_boxplot() + ylim(2,16) + ggtitle("after normalization") Additionally, we have to make the box plot of the raw data again this time setting the Y-scale from 2 to 16. dataBox = ggplot(logData,aes(sampleName,logInt)) dataBox + geom_boxplot() + ylim(2,16) + ggtitle("before normalization")

After normalization, none of the samples should stand out from the rest. The different arrays should have the same (or at least a very comparable) median expression level. Also the scale of the boxes should be almost the same indicating that also the spread of the intensity values on the different arrays is comparable.

How to create boxplots of microarray data

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox