Tutorial: Handling R
Go to parent Introduction to R/Bioconductor for analysis of microarray data#Training Units
Getting started
Command line and data types
Interacting with R at the command line usually runs in a loop:
- Type in expression
- Hit enter
- R evaluates the expression, which may include
- performing calculations,
- creating a plot,
- reading data from a file,
- other,
and returns a value, which is usually printed to the console. Unless the value of an expression is explicitly stored, it is generally displayed and discarded. Example:
> 17 + 25
[1] 42
>
Elementary data types:
- Numeric:
> 42
> 1.7 + 1.5
> (-3.2*7 + 185/7)*10 + 1.714286
- Character:
> "a"
> "a longer text"
> "Special characters \n \t \\ "
- Logical:
> TRUE
> FALSE
> 5 < 7
> (5 < 7) & ("a" == "b")
Applying functions to data
Function calls have the following format:
FunctionXyz(value_parameter1, key2=value2, key3=value3)
Functions can take any kind of expressions as arguments:
> sqrt(24)
[1] 4.898979
> exp(3)
[1] 20.08554
> nchar("abcd")
[1] 4
Functions can be easily used in expressions:
> nchar("abcde") + 14 > sqrt(360)
Help on functions is easily available:
> help("sqrt")
> ?sqrt
Functions can take several arguments:
> ?substr
We learn that substr takes three arguments: x, start and stop.
We can match formal arguments and data by position:
> substr("abdce", 1, 3)
[1] "abd"
Alternatively, we can match by name, which can be abbreviated. These are all equivalent:
> substr(x="abdce", start=1, stop=3)
> substr("abdce", start=1, stop=3)
> substr("abdce", sto=3, sta=1)
> substr(sto=3, sta=1, x="abdce")
> substr(3, sta=1, x="abdce")
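One way to check that positional, named and abbreviated matching really give the same result is to compare the return values with identical. This is only a small sketch; the object names by_position, by_name and by_abbrev are made up for illustration.

```r
# Positional matching: arguments are taken in the order x, start, stop
by_position <- substr("abdce", 1, 3)

# Matching by full name: the order of the arguments no longer matters
by_name <- substr(stop = 3, start = 1, x = "abdce")

# Matching by abbreviated name (partial matching of argument names)
by_abbrev <- substr("abdce", sto = 3, sta = 1)

# All three calls return the same value
identical(by_position, by_name)
identical(by_position, by_abbrev)
```

Abbreviated names save typing at the prompt; in saved scripts, full names are usually easier to read.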
Some arguments have a reasonable default value; these arguments do not need to be specified. Example:
> ?log
We learn that log takes two arguments: x and base.
The default value for base is Euler's number e:
> exp(1)
[1] 2.718282
Calling log without specifying base therefore returns the natural logarithm (to base e):
> log(100)
[1] 4.60517
Alternatively, we can specify any other base:
> log(100, 10)
[1] 2
> log(base=2, x=32)
[1] 5
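The effect of the default value can be made explicit by writing it out and comparing the results. The sketch below also compares against log10 and log2, which are standard R functions not mentioned above; the object names are chosen for illustration only.

```r
# Natural logarithm: base is not given, so the default exp(1) is used
natural <- log(100)

# The same call with the default value written out explicitly
explicit <- log(100, base = exp(1))

# Other bases via the base argument
log_10 <- log(100, base = 10)
log_2 <- log(32, base = 2)

# Compare with a numerical tolerance rather than exact equality
all.equal(natural, explicit)
all.equal(log_10, log10(100))
all.equal(log_2, log2(32))
```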
Storing values as objects
Expressions can be assigned to freely named objects:
> x <- 13*2 + 7
> text.1 <- "Treatment A"
> condition_2 <- TRUE
The objects evaluate to the expressions assigned to them:
> x
[1] 33
> text.1
[1] "Treatment A"
> condition_2
[1] TRUE
Objects can be used in expressions in the expected manner:
> abs(x - 43)
[1] 10
> substr(text.1, 1, 5)
[1] "Treat"
> (x > 15) & condition_2
[1] TRUE
Each assignment creates an object, which is stored in R's workspace.
The function ls lists all currently defined objects in the workspace:
> ls()
[1] "condition_2" "text.1" "x"
The function rm removes (deletes) objects from the workspace:
> rm(condition_2)
> ls()
[1] "text.1" "x"
> condition_2
Error: object "condition_2" not found
Complex data
Vectors
The simplest way of keeping track of multiple data of the same type is a vector. A generic way of creating vectors is by combining the data explicitly:
> c(1, 17.2, -6, 123)
> c("first","middle","last")
> c(TRUE, FALSE, TRUE, TRUE, FALSE)
Like everything else, vectors can be stored as objects:
> weights = c(75, 72, 84, 53, 67, 62, 85, 107)
> weights
[1] 75 72 84 53 67 62 85 107
Vectors have a specified length; individual elements are usually addressed through their position:
> length(weights)
[1] 8
> weights[3]
[1] 84
A vector is the elementary data structure. By applying functions, we can do useful stuff:
> sum(weights)
[1] 605
> sum(weights)/length(weights)
[1] 75.625
Useful statistical functions:
> mean(weights)
[1] 75.625
> sd(weights)
[1] 16.59550
> range(weights)
[1] 53 107
> summary(weights)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  53.00   65.75   73.50   75.62   84.25  107.00
Useful shorthands for creating vectors - consecutive integers:
> 1:5
[1] 1 2 3 4 5
Any regular sequence of numbers:
> seq(1, 10, by=2)
[1] 1 3 5 7 9
> seq(1, 100, length=10)
[1] 1 12 23 34 45 56 67 78 89 100
Extracting parts of a vector: using the [ operator with position index:
> weights[1]
[1] 75
> weights[length(weights)]
[1] 107
Negative index removes elements:
> weights[-1]
[1] 72 84 53 67 62 85 107
Vector of positions extracts/drops sub-vector:
> weights[1:4]
[1] 75 72 84 53
> weights[-c(1,8)]
[1] 72 84 53 67 62 85
Using a logical vector to select elements:
> ndx = weights > 80
> ndx
[1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
> weights[ndx]
[1] 84 85 107
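Logical selection becomes more useful once conditions are combined. The following sketch reuses the weights vector from above; which and the object names mid_range, heavy_pos and n_heavy are additions for illustration and do not appear in the original text.

```r
weights <- c(75, 72, 84, 53, 67, 62, 85, 107)

# Conditions can be combined elementwise with & (and) and | (or)
mid_range <- weights > 60 & weights < 85
weights[mid_range]

# which() converts a logical vector into the positions that are TRUE
heavy_pos <- which(weights > 80)
heavy_pos

# sum() applied to a logical vector counts the TRUE values
n_heavy <- sum(weights > 80)
n_heavy
```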
Changing elements of a vector: assign a new value to the position given between []:
> x = 1:10
> x[1] = -1
> x
[1] -1 2 3 4 5 6 7 8 9 10
This also works for several elements at once:
> x[1:5] = -x[1:5]
> odd = 1:10 %% 2
> odd
[1] 1 0 1 0 1 0 1 0 1 0
> x[odd==1] = x[odd==1] * 100
Functions are applied elementwise; vectors are combined elementwise:
> heights = c(1.72, 1.77, 1.82, 1.62, 1.65, 1.71, 1.89, 1.92)
> bmi = weights/heights^2
> round(bmi, 1)
[1] 25.4 23.0 25.4 20.2 24.6 21.2 23.8 29.0
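To see what "elementwise" means in practice, the sketch below recomputes the first BMI value by hand and compares it with the first element of the vector result. The names heights_cm and bmi_first are invented for this example.

```r
weights <- c(75, 72, 84, 53, 67, 62, 85, 107)
heights <- c(1.72, 1.77, 1.82, 1.62, 1.65, 1.71, 1.89, 1.92)

# A single number is applied to every element of the vector
heights_cm <- heights * 100

# Two vectors of equal length are combined position by position
bmi <- weights / heights^2

# The first element of bmi depends only on the first elements of the inputs
bmi_first <- weights[1] / heights[1]^2
all.equal(bmi[1], bmi_first)
```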
Factors
Factors are the standard way to store vectors of categorical data. Categorical data could be coded as numeric or character:
> sex = rep(1:2, c(5,5))
> treat = rep(c("Control","Treatment"), c(5,5))
However, it is generally much preferable to convert them to factors:
> sex = factor(sex, levels=c(1,2), labels=c("m","f"))
> sex
[1] m m m m m f f f f f
Levels: m f
Even simpler for character vectors:
> treat = factor(treat)
> treat
[1] Control   Control   Control   Control   Control
[6] Treatment Treatment Treatment Treatment Treatment
Levels: Control Treatment
Useful things to do with factors - tabulation:
> table(sex)
sex
m f
5 5
Groupwise computation with tapply:
> income = c(35, 32, 24, 17, 23, 22, 33, 28, 25, 20)
> tapply(income, sex, mean)
   m    f
26.2 25.6
> tapply(income, sex, sd)
       m        f
7.259477 5.128353
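What tapply does can be spelled out in two steps: split the data by the levels of the factor, then apply the function to each piece. The sketch below uses split, which is a standard R function not mentioned above, and the made-up names groups, manual_means and group_means.

```r
sex <- factor(rep(1:2, c(5, 5)), levels = c(1, 2), labels = c("m", "f"))
income <- c(35, 32, 24, 17, 23, 22, 33, 28, 25, 20)

# split() makes the grouping explicit: one vector per factor level
groups <- split(income, sex)
groups

# Computing the mean of each piece by hand ...
manual_means <- c(m = mean(groups$m), f = mean(groups$f))

# ... gives the same numbers that tapply() returns in one step
group_means <- tapply(income, sex, mean)
all.equal(as.vector(group_means), as.vector(manual_means))
```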
Matrices
A matrix is a rectangular arrangement of values that all have the same basic data type.
> matrix(1:9, nrow=3, ncol=3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
Only one of nrow or ncol needs to be specified:
> matrix(c("a","b","c","d"), nrow=2, byrow=TRUE)
     [,1] [,2]
[1,] "a"  "b"
[2,] "c"  "d"
We can build a matrix from individual columns:
> mat1 = cbind(heights, weights, bmi)
> mat1
     heights weights      bmi
[1,]    1.72      75 25.35154
[2,]    1.77      72 22.98190
[3,]    1.82      84 25.35926
[4,]    1.62      53 20.19509
[5,]    1.65      67 24.60973
[6,]    1.71      62 21.20311
[7,]    1.89      85 23.79553
[8,]    1.92     107 29.02561
Individual elements can be addressed using [ with two indices:
> mat1[1,1]
[1] 1.72
Rows and columns by dropping one index:
> mat1[1,]
 heights  weights      bmi
 1.72000 75.00000 25.35154
> mat1[,2]
[1] 75 72 84 53 67 62 85 107
Where defined, we can use row and column names:
> mat1[,"heights"]
[1] 1.72 1.77 1.82 1.62 1.65 1.71 1.89 1.92
Matrix arithmetic: all basic operations are elementwise:
> mat2 = matrix(1:9, nrow=3)
> mat3 = matrix(1, nrow=3, ncol=3)
> mat2 + mat3
> mat2 + 10
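The sketch below repeats these operations and contrasts the elementwise * with the matrix product %*% from linear algebra. The operator %*% is standard R but is not covered in the text above; the names elementwise and product are chosen for illustration.

```r
mat2 <- matrix(1:9, nrow = 3)
mat3 <- matrix(1, nrow = 3, ncol = 3)

# Elementwise sum: every entry of mat2 is increased by 1
mat2 + mat3

# A single number is applied to every element
mat2 + 10

# * is elementwise as well; mat3 contains only ones, so nothing changes
elementwise <- mat2 * mat3

# The matrix product uses %*% instead; here every column holds the row sums of mat2
product <- mat2 %*% mat3
product
```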
Data frames
We can combine vectors of any data type into a rectangular arrangement:
> overw = bmi > 25
> smoker = factor(c(1,2,1,2,3,3,1,1), levels=1:3, labels=c("yes","no","former"))
> df1 = data.frame(weights, heights, overw, smoker)
> df1[1:5,]
  weights heights overw smoker
1      75    1.72  TRUE    yes
2      72    1.77 FALSE     no
3      84    1.82  TRUE    yes
4      53    1.62 FALSE     no
5      67    1.65 FALSE former
This is the default data structure for any even moderately complex analysis.
Indexing data frames:
- Every indexing scheme that works for matrices works for data frames, too:
- integer vectors,
- logical vectors,
- row- and column names.
- Variables/columns can be accessed via $ and their name:
> df1$weights
[1] 75 72 84 53 67 62 85 107
Unambiguous abbreviations of the name work, too:
> df1$w
[1] 75 72 84 53 67 62 85 107
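The indexing schemes listed above can be tried side by side. This sketch rebuilds a reduced version of df1 (without the smoker column, to keep it short); the name overweight_rows is made up for the example.

```r
weights <- c(75, 72, 84, 53, 67, 62, 85, 107)
heights <- c(1.72, 1.77, 1.82, 1.62, 1.65, 1.71, 1.89, 1.92)
overw <- weights / heights^2 > 25
df1 <- data.frame(weights, heights, overw)

# Integer vectors: rows 1 to 3, columns 1 and 2
df1[1:3, 1:2]

# Logical vector: only the rows where overw is TRUE
overweight_rows <- df1[df1$overw, ]
overweight_rows

# Column names: select variables by name
df1[, c("weights", "heights")]

# $ returns the column as a plain vector, ready for further computation
mean(df1$weights)
```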
Classes, methods, functions
Functions are objects, too. To see a function's definition, enter its name without parentheses (the exact output depends on the R version):
> log
function (x, base = exp(1))
if (missing(base)) .Internal(log(x)) else .Internal(log(x, base))
<environment: namespace:base>
Functions can be easily written:
> MyMean = function(x) sum(x)/length(x)
> MyMean(weights)
[1] 75.625
> MyMean(heights)
[1] 1.7625
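Self-written functions can have default values for their arguments, just like log. The following is a hypothetical extension of MyMean, here called MyMean2; the argument name na.rm follows the convention of R's own mean, and the handling of missing values (NA) goes slightly beyond the text above.

```r
# na.rm has a default value, so it may be omitted in the call
MyMean2 <- function(x, na.rm = FALSE) {
  # Optionally drop missing values before computing
  if (na.rm) x <- x[!is.na(x)]
  sum(x) / length(x)
}

weights <- c(75, 72, 84, 53, 67, 62, 85, 107)
MyMean2(weights)

# With a missing value the default call returns NA ...
with_gap <- c(weights, NA)
MyMean2(with_gap)

# ... unless the default is overridden by name
MyMean2(with_gap, na.rm = TRUE)
```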
The same function can have different effects depending on its argument:
> summary(heights)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.620   1.695   1.745   1.763   1.838   1.920
> summary(smoker)
   yes     no former
     4      2      2
This depends on the class of the object:
> class(heights)
[1] "numeric"
> class(smoker)
[1] "factor"
summary is a generic function that has different methods for different classes of objects:
> summary
function (object, ...)
UseMethod("summary")
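The same mechanism is available for our own functions. The sketch below defines a hypothetical generic called describe (not part of R) in the same style as summary, with one method per class and a default method as fallback.

```r
# A hypothetical generic function, built the same way as summary
describe <- function(object, ...) UseMethod("describe")

# Method used for objects of class "numeric"
describe.numeric <- function(object, ...) {
  paste("numeric vector with mean", round(mean(object), 2))
}

# Method used for objects of class "factor"
describe.factor <- function(object, ...) {
  paste("factor with", nlevels(object), "levels")
}

# Fallback for every other class
describe.default <- function(object, ...) {
  paste("object of class", class(object)[1])
}

heights <- c(1.72, 1.77, 1.82, 1.62, 1.65, 1.71, 1.89, 1.92)
smoker <- factor(c(1, 2, 1, 2, 3, 3, 1, 1), levels = 1:3,
                 labels = c("yes", "no", "former"))

describe(heights)
describe(smoker)
describe("some text")
```

R picks the method by appending the class name to the name of the generic, which is why the methods are called describe.numeric and describe.factor.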
Interacting with the OS
Saving and restoring the workspace
Save all objects in the current workspace to the file .RData in the current working directory:
> save.image()
Optionally, a different filename can be specified.
Load all objects from a saved workspace file into the current workspace:
> load(".RData")
Display the current working directory and its contents:
> getwd()
> dir(getwd())
Change the current working directory to the home directory:
> setwd("~")
> getwd()
Loading data into R
- Reading in text files:
  - Flexible interface: read.table
  - Fast for large data sets: scan
- Reading in Excel files:
  - Save as tab-delimited/csv text files
  - Package xlsReadWrite (Windows)
  - Package gdata (requires Perl -- easy under Unix)
- Files from other statistics packages:
  - Package foreign (SAS, Stata, Octave, SPSS, ...)
- Extracting from database systems: abstract interface package RDBI with user packages ROracle, RODBC, RMySQL etc.
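As a minimal sketch of the two text-file readers named in the list, the example below first writes a small tab-delimited file to a temporary location and then reads it back. The file content, the column names and the object names example_file, dat and values are invented for this example.

```r
# Write a small tab-delimited example file to a temporary location
example_file <- tempfile(fileext = ".txt")
writeLines(c("id\tweight\tgroup",
             "1\t75\tControl",
             "2\t72\tTreatment",
             "3\t84\tControl"),
           example_file)

# read.table: header = TRUE takes the column names from the first line,
# sep = "\t" declares the tab character as field separator
dat <- read.table(example_file, header = TRUE, sep = "\t")
dat
str(dat)

# scan: reads a plain sequence of values, here from a text string
values <- scan(text = "75 72 84 53", quiet = TRUE)
values

# Remove the temporary file again
unlink(example_file)
```

read.table returns a data frame, so everything said about indexing data frames applies to dat directly.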
Saving results
- Cut & paste: for small problems
- save and load to save specific objects:
> save(df1, file="df1.RData")
> load("df1.RData")
- Redirect numerical output to file:
> sink("Results.txt")
> summary(weights)
> sink()
> file.show("Results.txt")
- Redirect graphical output to file (win.metafile is available under Windows only; pdf and png work on all platforms):
> win.metafile("Results.wmf")
> hist(weights)
> dev.off()
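The save/load and sink steps from the list can be tried without touching the working directory by writing to temporary files. This is a sketch under that assumption; the names data_file, text_file and result_lines are made up.

```r
weights <- c(75, 72, 84, 53, 67, 62, 85, 107)

# Temporary file names, so that nothing in the working directory is changed
data_file <- tempfile(fileext = ".RData")
text_file <- tempfile(fileext = ".txt")

# save() writes the named objects; load() restores them under the same names
save(weights, file = data_file)
rm(weights)
load(data_file)
weights

# sink() diverts console output to a file until sink() is called again.
# print() is needed when the code is run from a script instead of being
# typed at the prompt.
sink(text_file)
print(summary(weights))
sink()
result_lines <- readLines(text_file)
result_lines

# Remove the temporary files again
unlink(c(data_file, text_file))
```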
Saving R code
We can display the most recent commands using history and save them via savehistory:
> history()
> savehistory()
> file.show(".Rhistory")
The history file can then be renamed and edited to taste. The commands in the modified file can be executed using the function source.
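A short sketch of source in action: two commands are written to a temporary script file, which is then executed. The file and the object names script_file and mean_weight are invented for the example; in practice the file would be the edited history file.

```r
# Write two commands to a temporary script file
script_file <- tempfile(fileext = ".R")
writeLines(c("weights <- c(75, 72, 84, 53, 67, 62, 85, 107)",
             "mean_weight <- mean(weights)"),
           script_file)

# source() runs the commands in the file as if they had been typed in;
# echo = TRUE additionally prints each command as it is executed
source(script_file, echo = TRUE)

# Objects created by the script are now available in the workspace
mean_weight

# Remove the temporary file again
unlink(script_file)
```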