How to create boxplots of microarray data
Go to parent Analyze your own microarray data in R/Bioconductor
The code on this page works for both affy and oligo
How to plot the box plots of the raw data of each array ? |
---|
Box plots are created via the boxplot() method which can be applied directly on an AffyBatch/FeatureSet object.
name = "boxplot.jpg" jpeg(name) boxplot(data,which='pm',col='red',names=ph@data$sample) dev.off() The which argument allows you to specify if you want to use:
Since the use of MM probes is highly controversial, we will work with PM probes only. |
Box plots allow you to assess if the scale and distribution of the data on different arrays is comparable. Differences in shape or center of the boxes indicate that normalization of the data is required.
You can also create box plots using ggplot.
How to create the box plot of the raw data using ggplot ? |
---|
First we need to get the data in the correct format for ggplot: a data frame with log intensities in one column and sample names in the second column.
We'll only use the PM intensities pmexp = pm(data) We create two empty vectors that will serve as the two columns of the data frame:
The dim() method returns the number of rows (probes) and columns (samples) of the matrix containing the PM intensities. The number of rows corresponds to the number of PM intensities (probes) that is available for each sample (in this example that's 251078). The sampleNames vector needs the same number of rows consisting of sample names. The logs vector will contain the log PM intensities of all six samples stacked into a single column. Up to this point we have extracted raw PM intensities, they are not yet log-transformed! You can do the log transformation using the log2() method. sampleNames = vector() logs = vector() for (i in 1:6) { sampleNames = c(sampleNames,rep(ph@data[i,1],dim(pmexp)[1])) logs = c(logs,log2(pmexp[,i])) } If you have 3 groups of 3 replicates, the code is as follows: sampleNames = vector() logs = vector() for (i in 1:9) { sampleNames = c(sampleNames,rep(ph@data[i,1],dim(pmexp)[1])) logs = c(logs,log2(pmexp[,i])) } After we have filled the vectors with the data we need, we combine sample names and log intensities into a single data frame: logData = data.frame(logInt=logs,sampleName=sampleNames) Now we can create the plot: dataBox = ggplot(logData,aes(sampleName,logInt)) dataBox + geom_boxplot() |
How to create a boxplot of normalized intensities ? |
---|
It's the same code as for raw intensities but this you use the normalized intensities as input (data.matrix) instead of the raw data:
name = "boxplotnorm.jpg" jpeg(name) boxplot(data.matrix,col='red',names=ph@data$sample) dev.off() |
How to create a box plot of normalized intensities using ggplot ? |
---|
First we need to get the data in the correct format for ggplot: a data frame with normalized log intensities in one column and sample names in the second column.
We create two vectors:
The dim() method returns the number of rows (probe sets) and columns (samples) of the matrix containing the normalized intensities. The number of rows corresponds to the number of normalized intensities (probe sets) in each sample (in this example it's 22810). The sampleNames vector needs the same number of entries of each sample name stacked into a single column. The normlogs vector will contain the normalized intensities of all six samples stacked into a single column. sampleNames = vector() normlogs = vector() for (i in 1:6) { sampleNames = c(sampleNames,rep(ph@data[i,1],dim(data.matrix)[1])) normlogs = c(normlogs,data.matrix[,i]) } If you have 3 groups of 3 samples, the code is as follows: sampleNames = vector() normlogs = vector() for (i in 1:9) { sampleNames = c(sampleNames,rep(ph@data[i,1],dim(data.matrix)[1])) normlogs = c(normlogs,data.matrix[,i]) } After having filled the vectors with data, we combine sample names and normalized intensities into a single data frame: normData = data.frame(norm_logInt=normlogs,sampleName=sampleNames) Now we can create the plot. To compare this plot to the boxplot we made for raw log intensities we have to set them to the same Y-scale. The scale for the normalized data is much lower than these for the raw data so we have to make the box plot setting the Y-scale from 2 to 16. To set the Y-scale we can use the ylim() method. dataBox = ggplot(normData, aes(sampleName,norm_logInt)) dataBox + geom_boxplot() + ylim(2,16) + ggtitle("after normalization") Additionally, we have to make the box plot of the raw data again this time setting the Y-scale from 2 to 16. dataBox = ggplot(logData,aes(sampleName,logInt)) dataBox + geom_boxplot() + ylim(2,16) + ggtitle("before normalization") |
After normalization, none of the samples should stand out from the rest. The different arrays should have the same (or at least a very comparable) median expression level. Also the scale of the boxes should be almost the same indicating that also the spread of the intensity values on the different arrays is comparable.