Tutorial: Handling R

Getting started

Command line and data types

Interacting with R at the command line usually runs in a loop:

  • Type in expression
  • Hit enter
  • R evaluates the expression, which may include
    • performing calculations,
    • creating a plot,
    • reading data from a file,
    • other,

and returns a value, which is usually printed to the console. Unless the value of an expression is explicitly stored, it is generally displayed and discarded. Example:

> 17 + 25
[1] 42
>

Elementary data types:

  • Numeric:
> 42
> 1.7 + 1.5
> (-3.2*7 + 185/7)*10 + 1.714286
  • Character:
> "a"
> "a longer text"
> "Special characters \n \t \\ "
  • Logical:
> TRUE
> FALSE
> 5 < 7
> (5 < 7) & ("a" == "b")
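To check which of the three elementary types a value belongs to, the class function (used again under Classes, methods, functions) can be applied to any expression. A minimal sketch; the object names are only for illustration:

```r
# class() reports the type of the value an expression evaluates to
num_type  <- class(1.7 + 1.5)      # "numeric"
char_type <- class("a longer text") # "character"
lgl_type  <- class(5 < 7)           # "logical"

num_type
char_type
lgl_type
```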

Applying functions to data

Functions are called in the following format: FunctionXyz(value_parameter1, key2=value2, key3=value3). Any kind of expression can be passed as an argument:

> sqrt(24)
[1] 4.898979
> exp(3)
[1] 20.08554
> nchar("abcd")
[1] 4

Functions can be easily used in expressions:

> nchar("abcde") + 14 > sqrt(360)

Help on functions is easily available:

> help("sqrt")
> ?sqrt

Functions can take several arguments:

> ?substr

We learn that substr takes three arguments: x, start and stop.

We can match formal arguments and data by position:

> substr("abdce", 1, 3)
[1] "abd"

Alternatively, we can match by name, which can be abbreviated. These are all equivalent:

> substr(x="abdce", start=1, stop=3)
> substr("abdce", start=1, stop=3)
> substr("abdce", sto=3, sta=1)
> substr(sto=3, sta=1, x="abdce")
> substr(3, sta=1, x="abdce")
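That the five calls above really are equivalent can be checked by collecting their results. A small sketch; the name results is an illustrative choice:

```r
# Named arguments are matched first (abbreviations allowed),
# the remaining arguments are then matched by position.
results <- c(
  substr(x = "abdce", start = 1, stop = 3),
  substr("abdce", start = 1, stop = 3),
  substr("abdce", sto = 3, sta = 1),
  substr(sto = 3, sta = 1, x = "abdce"),
  substr(3, sta = 1, x = "abdce")
)
unique(results)   # "abd"
```

Abbreviated argument names are convenient at the console; in saved scripts, full names are usually easier to read.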

Some arguments have a reasonable default value; these arguments do not need to be specified. Example:

> ?log

We learn that log takes two arguments: x and base.

The default value for base is Euler's number e:

> exp(1)
[1] 2.718282

Calling log without specifying base therefore returns the natural logarithm (to base e):

> log(100)
[1] 4.60517

Alternatively, we can specify any other base:

> log(100, 10)
[1] 2
> log(base=2, x=32)
[1] 5
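The base argument can be cross-checked against the change-of-base rule log_b(x) = log(x) / log(b). A short sketch, using all.equal to compare floating-point results with a small tolerance:

```r
# Default base: natural logarithm, so log(exp(1)) is 1
natural <- log(exp(1))

# Explicit base versus the change-of-base rule
by_argument <- log(32, base = 2)
by_rule     <- log(32) / log(2)

natural
all.equal(by_argument, by_rule)   # TRUE
```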

Storing values as objects

Expressions can be assigned to freely named objects:

> x <- 13*2 + 7
> text.1 <- "Treatment A"
> condition_2 <- TRUE

The objects evaluate to the expressions assigned to them:

> x
[1] 33
> text.1
[1] "Treatment A"
> condition_2
[1] TRUE

Objects can be used in expressions in the expected manner:

> abs(x - 43)
[1] 10
> substr(text.1, 1, 5)
[1] "Treat"
> (x > 15) & condition_2
[1] TRUE


Each assignment creates an object, which is stored in R's workspace.

The function ls lists all currently defined objects in the workspace:

> ls()
[1] "condition_2" "text.1"      "x"

The function rm removes (deletes) objects from the workspace:

> rm(condition_2)
> ls()
[1] "text.1" "x"     
> condition_2
Error: object "condition_2" not found
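To test for an object without triggering an error, the function exists can be used. A minimal sketch; tmp_value is a hypothetical object name chosen for this example:

```r
tmp_value <- 3.14
before <- exists("tmp_value")   # TRUE: the object is defined

rm(tmp_value)
after <- exists("tmp_value")    # FALSE: the object has been removed

before
after
```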

Complex data

Vectors

The simplest way of keeping track of multiple data of the same type is a vector. A generic way of creating vectors is by combining the data explicitly:

> c(1, 17.2, -6, 123)
> c("first","middle","last")
> c(TRUE, FALSE, TRUE, TRUE, FALSE)

Like everything else, vectors can be stored as objects:

> weights = c(75, 72, 84, 53, 67, 62, 85, 107)
> weights
[1]  75  72  84  53  67  62  85 107

Vectors have a specified length; individual elements are usually addressed through their position:

> length(weights)
[1] 8
> weights[3]
[1] 84


The vector is R's elementary data structure. Applying functions to a vector lets us summarise its contents:

> sum(weights)
[1] 605
> sum(weights)/length(weights)
[1] 75.625

Useful statistical functions:

> mean(weights)
[1] 75.625
> sd(weights)
[1] 16.59550
> range(weights)
[1]  53 107
> summary(weights)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  53.00   65.75   73.50   75.62   84.25  107.00

Useful shorthands for creating vectors - consecutive integers:

> 1:5
[1] 1 2 3 4 5

Any regular sequence of numbers:

> seq(1, 10, by=2)
[1] 1 3 5 7 9
> seq(1, 100, length=10)
[1]   1  12  23  34  45  56  67  78  89 100

Extracting parts of a vector: using the [ operator with position index:

> weights[1]
[1] 75
> weights[length(weights)]
[1] 107

Negative index removes elements:

> weights[-1]
[1]  72  84  53  67  62  85 107

Vector of positions extracts/drops sub-vector:

> weights[1:4]
[1] 75 72 84 53
> weights[-c(1,8)]
[1] 72 84 53 67 62 85

Using a logical vector to select elements:

> ndx = weights > 80
> ndx
[1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
> weights[ndx]
[1]  84  85 107
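Logical conditions can also be written directly inside [ ] and combined with &. The sketch below repeats the weights vector so that it runs on its own; the names heavy_pos, mid_range and n_heavy are illustrative:

```r
weights <- c(75, 72, 84, 53, 67, 62, 85, 107)

# Positions of the elements that satisfy a condition
heavy_pos <- which(weights > 80)               # 3 7 8

# Two conditions combined elementwise with &
mid_range <- weights[weights > 60 & weights < 80]   # 75 72 67 62

# TRUE counts as 1, so sum() counts the matching elements
n_heavy <- sum(weights > 80)                   # 3

heavy_pos
mid_range
n_heavy
```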

Changing elements of a vector: just assign a new value to the position given between [ ]:

> x = 1:10
> x[1] = -1
> x
[1] -1  2  3  4  5  6  7  8  9 10

This also works for several elements at once:

> x[1:5] = -x[1:5]
> odd = 1:10 %% 2
> odd
[1] 1 0 1 0 1 0 1 0 1 0
> x[odd==1] = x[odd==1] * 100
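For reference, the sequence of assignments above can be traced step by step. The sketch below replays them with <- and shows the intermediate states as comments:

```r
x <- 1:10
x[1] <- -1                 # -1  2  3  4  5  6  7  8  9 10
x[1:5] <- -x[1:5]          #  1 -2 -3 -4 -5  6  7  8  9 10

odd <- 1:10 %% 2           # 1 at odd positions, 0 at even positions
x[odd == 1] <- x[odd == 1] * 100

x   # 100 -2 -300 -4 -500 6 700 8 900 10
```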

Functions are applied elementwise; vectors are combined elementwise:

> heights = c(1.72, 1.77, 1.82, 1.62, 1.65, 1.71, 1.89, 1.92)
> bmi = weights/heights^2
> round(bmi, 1)
[1] 25.4 23.0 25.4 20.2 24.6 21.2 23.8 29.0
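The same elementwise behaviour can be seen on smaller vectors. In this sketch a and b are made-up example vectors; a single number, as in a * 2, is applied to every element:

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

vec_sum   <- a + b              # 11 22 33: combined position by position
doubled   <- a * 2              # 2 4 6: the single value 2 is used for each element
roots     <- sqrt(c(4, 9, 16))  # 2 3 4: the function is applied to each element

vec_sum
doubled
roots
```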

Factors

Factors are the standard way to store vectors of categorical data. Categorical data could be coded as numeric or character:

> sex = rep(1:2, c(5,5))
> treat = rep(c("Control","Treatment"), c(5,5)) 

However, it is generally much preferable to convert them to factors:

> sex = factor(sex, levels=c(1,2), labels=c("m","f"))
> sex
[1] m m m m m f f f f f
Levels: m f

Even simpler for character vectors:

> treat = factor(treat)
> treat
[1] Control   Control   Control   Control   Control   
[6] Treatment Treatment Treatment Treatment Treatment
Levels: Control Treatment

Useful things to do with factors - tabulation:

> table(sex)
sex
m f 
5 5 

Groupwise computation with tapply:

> income = c(35, 32, 24, 17, 23, 22, 33, 28, 25, 20)
> tapply(income, sex, mean)
   m    f 
26.2 25.6 
> tapply(income, sex, sd)
       m        f 
7.259477 5.128353
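Internally, a factor stores integer codes together with a set of level labels. A brief sketch that rebuilds sex and income from above so it runs on its own; the names codes and group_means are illustrative:

```r
sex <- factor(rep(1:2, c(5, 5)), levels = c(1, 2), labels = c("m", "f"))
income <- c(35, 32, 24, 17, 23, 22, 33, 28, 25, 20)

levels(sex)                # "m" "f"
codes <- as.integer(sex)   # 1 1 1 1 1 2 2 2 2 2

# tapply() splits income by the levels of sex and applies mean to each group
group_means <- tapply(income, sex, mean)

codes
group_means
```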

Matrices

A matrix is a rectangular arrangement of data that all share the same basic data type.

> matrix(1:9, nrow=3, ncol=3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Only one of nrow or ncol needs to be specified:

> matrix(c("a","b","c","d"), nrow=2, byrow=TRUE)
     [,1] [,2]
[1,] "a"  "b" 
[2,] "c"  "d"

We can build a matrix from individual columns:

> mat1 = cbind(heights, weights, bmi)
> mat1
     heights weights      bmi
[1,]    1.72      75 25.35154
[2,]    1.77      72 22.98190
[3,]    1.82      84 25.35926
[4,]    1.62      53 20.19509
[5,]    1.65      67 24.60973
[6,]    1.71      62 21.20311
[7,]    1.89      85 23.79553
[8,]    1.92     107 29.02561

Individual elements can be addressed using [ with two indices:

> mat1[1,1]
heights 
   1.72 

Rows and columns by dropping one index:

> mat1[1,]
 heights  weights      bmi 
 1.72000 75.00000 25.35154 
> mat1[,2]
[1]  75  72  84  53  67  62  85 107

Where defined, we can use row and column names:

> mat1[,"heights"]
[1] 1.72 1.77 1.82 1.62 1.65 1.71 1.89 1.92

Matrix arithmetic: all basic operations are elementwise:

> mat2 = matrix(1:9, nrow=3)
> mat3 = matrix(1, nrow=3, ncol=3)
> mat2 + mat3
> mat2 + 10
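The results of these operations are not shown above, so here is a small sketch that inspects them. It also contrasts the elementwise * with the operator %*%, which is not covered in this tutorial and performs matrix multiplication in the linear-algebra sense:

```r
mat2 <- matrix(1:9, nrow = 3)
mat3 <- matrix(1, nrow = 3, ncol = 3)

added   <- mat2 + mat3    # every element increased by 1
shifted <- mat2 + 10      # the single value 10 is added to every element
squared <- mat2 * mat2    # elementwise product: each element squared
product <- mat2 %*% mat3  # matrix product: each column holds the row sums of mat2

added[1, ]      # 2 5 8
shifted[3, 3]   # 19
squared[3, 3]   # 81
product[, 1]    # 12 15 18
```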

Data frames

We can combine vectors of any data type into a rectangular arrangement:

> overw  = bmi > 25
> smoker = factor(c(1,2,1,2,3,3,1,1), levels=1:3, labels=c("yes","no","former"))
> df1 = data.frame(weights, heights, overw, smoker)
> df1[1:5,]
  weights heights overw smoker
1      75    1.72  TRUE    yes
2      72    1.77 FALSE     no
3      84    1.82  TRUE    yes
4      53    1.62 FALSE     no
5      67    1.65 FALSE former

This is the default data structure for any even moderately complex analysis.

Indexing data frames:

  • Every indexing scheme that works for matrices works for data frames, too:
    • integer vectors,
    • logical vectors,
    • row- and column names.
  • Variables/columns can be accessed via $ and their name:
> df1$weights
[1]  75  72  84  53  67  62  85 107

Unambiguous abbreviations of the column name also work with $:

> df1$w
[1]  75  72  84  53  67  62  85 107
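The indexing schemes listed above can be combined freely. The following sketch rebuilds df1 from the vectors defined earlier so that it runs on its own; the names one_value, overweight_w and n_smokers are illustrative:

```r
weights <- c(75, 72, 84, 53, 67, 62, 85, 107)
heights <- c(1.72, 1.77, 1.82, 1.62, 1.65, 1.71, 1.89, 1.92)
overw   <- weights / heights^2 > 25
smoker  <- factor(c(1, 2, 1, 2, 3, 3, 1, 1), levels = 1:3,
                  labels = c("yes", "no", "former"))
df1 <- data.frame(weights, heights, overw, smoker)

# Integer row index combined with a column name
one_value <- df1[2, "heights"]             # 1.77

# Logical row index: weights of all overweight subjects
overweight_w <- df1[df1$overw, "weights"]  # 75 84 107

# Rows selected by a condition on a factor column
n_smokers <- nrow(df1[df1$smoker == "yes", ])   # 4

one_value
overweight_w
n_smokers
```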

Classes, methods, functions

Functions are objects, too. See function definition by entering the function name:

> log
function (x, base = exp(1)) 
if (missing(base)) 
.Internal(log(x)) else .Internal(log(x, base))
<environment: namespace:base>

Functions can be easily written:

> MyMean = function(x) sum(x)/length(x)
> MyMean(weights)
[1] 75.625
> MyMean(heights)
[1] 1.7625

The same function can behave differently depending on the data it is given:

> summary(heights)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.620   1.695   1.745   1.763   1.838   1.920 
> summary(smoker)
   yes     no former 
     4      2      2 

This depends on the class of the object:

> class(heights)
[1] "numeric"
> class(smoker)
[1] "factor"

summary is a generic function that has different methods for different classes of objects:

> summary
function (object, ...) 
UseMethod("summary")
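The same mechanism is available for our own functions. The sketch below defines a hypothetical generic called describe (not part of R) with one method per class; UseMethod picks the method whose suffix matches the class of the argument and falls back to the default method otherwise:

```r
# Generic: only dispatches, does no work itself
describe <- function(object, ...) UseMethod("describe")

# Fallback for classes without a dedicated method
describe.default <- function(object, ...) {
  paste("object of class", class(object)[1])
}

# Method for numeric vectors
describe.numeric <- function(object, ...) {
  paste("numeric vector of length", length(object))
}

# Method for factors
describe.factor <- function(object, ...) {
  paste("factor with", nlevels(object), "levels")
}

describe(c(1.72, 1.77))            # "numeric vector of length 2"
describe(factor(c("yes", "no")))   # "factor with 2 levels"
describe("some text")              # "object of class character"
```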

Interacting with the OS

Saving and restoring the workspace

Save all objects in the current workspace to file .RData in the current working directory:

> save.image()

Optionally, a different filename can be specified.

Load all objects from a saved workspace file into the current workspace:

> load(".RData")

Display the current working directory and list its contents:

> getwd()
> dir(getwd())

Change the current working directory to the home directory:

> setwd("~")
> getwd()
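When changing the working directory only temporarily, it helps that setwd returns the previous directory, so it can be restored afterwards. A small sketch that uses the session's temporary directory as the target; old_dir and in_temp are illustrative names:

```r
# setwd() invisibly returns the directory that was current before the call
old_dir <- setwd(tempdir())

# We are now inside the temporary directory of this R session
in_temp <- basename(getwd()) == basename(tempdir())
in_temp

# Return to where we started
setwd(old_dir)
getwd()
```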

Loading data into R

  • Reading in text files:
    • Flexible interface: read.table
    • Fast for large data sets: scan
  • Reading in Excel files:
    • Save as tab-delimited/csv text files
    • Package xlsReadWrite (Windows)
    • Package gdata (requires Perl -- easy under Unix)
  • Files from other statistics packages:
    • Package foreign (SAS, Stata, Octave, SPSS, ...)
  • Extracting from database systems: abstract interface package DBI with backend packages such as ROracle and RMySQL; package RODBC for ODBC connections.
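As an illustration of read.table, the following sketch first writes a tiny tab-delimited file to a temporary location and then reads it back. The file contents and the names tmp_file and dat are made up for this example:

```r
# Create a small tab-delimited text file with a header line
tmp_file <- tempfile(fileext = ".txt")
writeLines(c("weight\theight",
             "75\t1.72",
             "72\t1.77"), tmp_file)

# header = TRUE: first line holds the column names
# sep = "\t":    columns are separated by tabs
dat <- read.table(tmp_file, header = TRUE, sep = "\t")

dat
str(dat)   # a data frame with 2 observations of 2 variables
```

For comma-separated files, read.csv offers the same interface with defaults suited to that format.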

Saving results

  • Cut & paste: for small problems
  • save and load to save specific objects:
> save(df1, file="df1.RData")
> load("df1.RData")
  • Redirect numerical output to file:
> sink("Results.txt")
> summary(weights)
> sink()
> file.show("Results.txt")
  • Redirect graphical output to file (win.metafile is available on Windows only; pdf and png work on every platform):
> win.metafile("Results.wmf")
> hist(weights)
> dev.off()
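A platform-independent sketch of two of these approaches: a save/load round trip and a plot written to a PDF file with the pdf device. Temporary files are used so that nothing in the working directory is overwritten; the names data_file, plot_file and restored are illustrative:

```r
weights <- c(75, 72, 84, 53, 67, 62, 85, 107)
df1 <- data.frame(weights)

# Save one specific object, remove it, and load it back
data_file <- tempfile(fileext = ".RData")
save(df1, file = data_file)
rm(df1)
restored <- load(data_file)   # load() returns the names of the restored objects
restored                      # "df1"

# Redirect graphical output to a PDF file
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
hist(weights)
dev.off()                     # closes the device and completes the file
file.exists(plot_file)        # TRUE
```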

Saving R code

We can display the last commands using history and save them via savehistory:

> history()
> savehistory()
> file.show(".Rhistory")

The history file can then be renamed and edited to taste. The commands in the modified file can be executed using the command source.
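A minimal sketch of running saved commands with source. Instead of an edited history file, it writes two example commands to a temporary script; the names script_file and avg are made up for this example:

```r
# Write a short script file containing two R commands
script_file <- tempfile(fileext = ".R")
writeLines(c("weights <- c(75, 72, 84)",
             "avg <- mean(weights)"), script_file)

# Execute the commands in the file; by default the objects they
# create end up in the workspace, as if typed at the console
source(script_file)

avg   # 77
```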
