Introduction to R and RStudio

Mahbubul Majumder, PhD
Sep 2, 2014

What is R?

  • Open source statistical programming language.
  • A collaborative effort
    • many people contribute
    • to use a package, first install it, then load it
  install.packages("package-name")
  library("package-name")
  • Install a package only once, but load it with library() every time you start R
    • to view the version of R and the packages currently loaded
  sessionInfo()
  • Want to contribute? Build a package and upload it to the Comprehensive R Archive Network (CRAN)
  • It's free and runs on Windows, macOS, and Linux/Unix
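The install-once, load-every-session pattern can be wrapped in a small check. This is only a sketch: load_package is a hypothetical helper name, not an R function, and splines is used here simply because it ships with R, so nothing needs to be downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then attach it.
load_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)               # runs only when the package is absent
  }
  library(pkg, character.only = TRUE)   # needed in every new R session
}

load_package("splines")
"package:splines" %in% search()         # TRUE once the package is attached
```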

RStudio environment

  • RStudio makes R convenient to use
  • Everything is organized and at your fingertips
  • It is especially convenient for creating dynamic documents

Most important R command



?


  • Please use it and read the documentation
  • Example: to learn about the plot() function
? plot
  • Commands are executed line by line

Modes of data in R

  • numeric
  • character
  • logical
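  • factor (strictly a class stored as integer codes; mode() reports "numeric")
  • ordered (a factor whose levels have an order)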
  • complex
  • raw
  • NULL

How do they work?

x <- "5.42"
is.numeric(x)
[1] FALSE
y <- as.numeric("5.44")
mode(y)
[1] "numeric"
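Conversions matter most for factors, which R stores as integer codes with character labels. The following is a minimal sketch with illustrative object names; it shows why converting a factor to numeric usually goes through as.character() first.

```r
# A factor is a class, not a mode: its values are stored as integer codes.
f <- factor(c("10", "20", "10"))
mode(f)    # "numeric"
class(f)   # "factor"

# Converting a factor directly returns the codes, not the labels.
codes  <- as.numeric(f)                 # 1 2 1
values <- as.numeric(as.character(f))   # 10 20 10
```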

R data structures

[Figure: data-structure]

Notice the difference

x <- c(1,2,3)
x
[1] 1 2 3
y <- c(1,2,"3")
y
[1] "1" "2" "3"
mode(y)
[1] "character"

Get properties of data structures

  • mode()
  • length(), nrow(), ncol(), dim()
  • names(), rownames(), colnames()
x <- matrix(1:12, ncol=3)
x
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
dim(x)
[1] 4 3
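The other property functions listed above behave the same way. As a small sketch, the matrix m and its row and column labels below are made up for illustration; trees is a data frame that ships with R.

```r
# A 2 x 3 matrix with illustrative row and column names.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("a", "b"), c("x", "y", "z")))

length(m)     # 6: total number of elements
nrow(m)       # 2
ncol(m)       # 3
rownames(m)   # "a" "b"
colnames(m)   # "x" "y" "z"

# For a data frame, names() returns the column names.
names(trees)  # "Girth" "Height" "Volume"
```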

R objects

  • OO programming
  • Is everything an object or method or class?
  • Assigning a value to an object
    • assign the average to an object called myValue
myValue <- mean(1:20)
myValue
[1] 10.5
  • the = operator also works for assignment
yourValue = mean(1:20)
yourValue
[1] 10.5

Generating data

  • Vectors c(), seq(), rep(), 1:n, rnorm(), rpois()
x <- seq(1,2.5,by=.5)
x
[1] 1.0 1.5 2.0 2.5
rnorm(mean=2,sd=3, n=3)
[1] 3.766 5.432 1.994
  • Matrix
y <- matrix(x,nrow=2)
y
     [,1] [,2]
[1,]  1.0  2.0
[2,]  1.5  2.5
  • List
z <- list(x, y)
z
[[1]]
[1] 1.0 1.5 2.0 2.5

[[2]]
     [,1] [,2]
[1,]  1.0  2.0
[2,]  1.5  2.5
  • Inspect the structure with str()
str(z)
List of 2
 $ : num [1:4] 1 1.5 2 2.5
 $ : num [1:2, 1:2] 1 1.5 2 2.5

Combining data

c(), rbind(), cbind(), data.frame()

x <- seq(2, 6, by = 2)
c(x,0,20)
[1]  2  4  6  0 20
rbind(x,y=1:3)
  [,1] [,2] [,3]
x    2    4    6
y    1    2    3
zz <- cbind(x,y=1:3)
zz
     x y
[1,] 2 1
[2,] 4 2
[3,] 6 3
z <- c("A",1,"B")
d <- data.frame(zz, z, stringsAsFactors = TRUE)  # strings are not converted to factors by default in R >= 4.0
d
  x y z
1 2 1 A
2 4 2 1
3 6 3 B
names(d)
[1] "x" "y" "z"
str(d)
'data.frame':   3 obs. of  3 variables:
 $ x: num  2 4 6
 $ y: num  1 2 3
 $ z: Factor w/ 3 levels "1","A","B": 2 1 3

R operators

  • Simple math + - * /
  • More math ^ %/% %%
  • Compare == < > <= >= !=
  • Logical && || & | !
9 %/% 4
[1] 2
9 %% 4
[1] 1
9==4
[1] FALSE
! (9==5)
[1] TRUE
(9==5) && (5==5)
[1] FALSE
(9==5) | (5==5)
[1] TRUE


Quiz:
What is the difference between & and &&?
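One way to explore the quiz is to run both operators side by side. The sketch below uses made-up vectors; note that in recent versions of R (4.3.0 and later) calling && or || with an operand longer than one element is an error, so && is reserved for single TRUE/FALSE conditions.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

# & compares element by element and returns a vector.
elementwise <- a & b            # TRUE FALSE FALSE

# && works on single values and stops as soon as the answer is known,
# so x > 0 is never evaluated when x is NULL.
x <- NULL
safe <- !is.null(x) && x > 0    # FALSE
```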

User defined operator

  • We can define our own operator
  • Suppose we want to use %!% to obtain the geometric mean of two numbers
  • We define our operator as follows
"%!%" <- function(x, y) {
  res <- x * y
  if (res < 0) {
    # the geometric mean is not real when the product is negative
    cat("GM is undefined for given values\n")
  } else {
    return(sqrt(res))
  }
}
  • now let us use our operator
4%!%9
[1] 6
4%!%(-9)
GM is undefined for given values

Vectorized calculations

  • Most functions work on the whole structure at once
  • Try not to use loops
  • There is often a way to avoid a loop
x <- rep(1:2,2)
x
[1] 1 2 1 2
x+4
[1] 5 6 5 6
sum(x^2)
[1] 10
y <- matrix(1:9,ncol=3)
y
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
# elementwise operation
y/y
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1
[3,]    1    1    1
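To make the advice about loops concrete, here is a small sketch that computes the same sum of squares twice, once with an explicit loop and once with a vectorized call. The variable names are illustrative.

```r
x <- 1:1000

# Loop version: accumulate one element at a time.
total <- 0
for (value in x) {
  total <- total + value^2
}

# Vectorized version: square the whole vector, then sum it.
total_vec <- sum(x^2)

all.equal(total, total_vec)   # TRUE
```

Both give the same answer; the vectorized form is shorter and typically faster on long vectors.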


Quiz:
What does y/c(1,2,3) return?

Some statistical functions

  • Summary functions:
    mean() median() quantile() sd()
  • Fitting models: lm() glm() nlme() (from the nlme package) coef() anova()
  • Distributions (density functions):
    dbeta() dbinom() dcauchy() dchisq() dexp() dgamma() dnorm() dpois()
  • Replace the prefix d (density) with p for cumulative probability, q for quantiles, or r for random numbers. For example, pbeta() computes a beta cumulative probability and rbeta() generates beta random numbers
x <- rpois(n=5, lambda=2)
x
[1] 0 2 2 2 3
mean(x)
[1] 1.8
ppois(x, lambda=2)
[1] 0.1353 0.6767 0.6767 0.6767 0.8571
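The same prefix convention applies to every distribution family. As a brief sketch with the standard normal distribution (the object names and the seed are arbitrary choices for illustration):

```r
d <- dnorm(0)        # density at 0, about 0.3989
p <- pnorm(1.96)     # P(X <= 1.96), about 0.975
q <- qnorm(0.975)    # quantile with 97.5% of the mass below it, about 1.96

set.seed(1)          # arbitrary seed so the draws are reproducible
r <- rnorm(3)        # three random draws from N(0, 1)
```

Note that pnorm() and qnorm() are inverses of each other: pnorm(qnorm(0.975)) returns 0.975.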

Vector and matrix operations

x <- matrix(1:12,ncol=3)
x
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
colMeans(x)
[1]  2.5  6.5 10.5
  • Transpose a matrix
t(x)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
  • Matrix multiplication
y <- rep(1,4)
y
[1] 1 1 1 1
y %*% x
     [,1] [,2] [,3]
[1,]   10   26   42
diag(x)
[1]  1  6 11
  • More functions: solve(), eigen(), and ginv(), which requires the MASS package

Missing data

  • Missing data are reported as NA
    • To check if there is any missing data
x <- c(1,4,6,NA)
is.na(x)
[1] FALSE FALSE FALSE  TRUE
  • Functions may return NA
mean(x)
[1] NA
mean(x, na.rm=TRUE)
[1] 3.667

Working with NA

  • Functions related to NA: is.na(), na.action(), na.omit()
  • Many functions accept the argument na.rm = TRUE or FALSE
x <-  c(3,6,9,NA,3)
x/3
[1]  1  2  3 NA  1
sum(x, na.rm=TRUE)
[1] 21
y <- na.omit(x)
sum(y)
[1] 21
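A few more checks are often useful before deciding how to handle missing values. This sketch uses a small made-up data frame; anyNA() and complete.cases() are base R functions.

```r
x <- c(3, 6, 9, NA, 3)

n_missing <- sum(is.na(x))   # 1: count of missing values
anyNA(x)                     # TRUE: at least one value is missing
NA == NA                     # NA: comparisons with NA are themselves missing

# For a data frame, complete.cases() flags rows without any NA.
d <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
complete <- d[complete.cases(d), ]   # keeps only the first row
```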

User defined functions in R

get_square <- function(x) {
  if (x > 100) {
    return("Big number")
  } else {
    return(x^2)
  }
}

get_square(5)
[1] 25
get_square(500)
[1] "Big number"
get_square(c(25,200)) # what is wrong here?
[1]   625 40000
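One reading of the question: if() looks at a single condition, so it cannot decide separately for each element of a vector. Older versions of R used only the first element and gave a warning; from R 4.2.0 onward a condition of length greater than one is an error. A possible vectorized alternative is sketched below; get_square_vec is a hypothetical name, and returning a character vector throughout is a design choice made here to keep the result type consistent.

```r
# Vectorized sketch: ifelse() evaluates the condition for every element.
get_square_vec <- function(x) {
  ifelse(x > 100, "Big number", as.character(x^2))
}

get_square_vec(c(25, 200))   # "625" "Big number"
```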

Recursive functions

foo <- function(x){
  print(x)
  if(x>1) foo(x-1)
}



foo(10) = ?



  • Recursive functions can cause a stack overflow
  • Deep recursion should be avoided for large computations
moo <- function(x){
  if(x>1) moo(x-1)
  print(x)
}



moo(10) = ?

Loops and controls in R

  • R has powerful functions; if you use them properly, you often don't need a loop
  • Loops can cost extra computing time
  • But if you have to use one, go ahead
  • Quiz:
    What is the output of this loop?
for (i in c("A","B", "C")){
  if(i > "B") print(i)
}

What is wrong here?

if(4==5) print("A") 
  else print("B")
  • Control flow
if(cond) expr 
if(cond) cons.expr  else  
  alt.expr

for(var in seq) expr 
while(cond) expr 
repeat expr 
break 
next

Getting data into R

  • read.table() scan() load()
  • Reading data directly from the web
url <- "http://mahbub.stat.iastate.edu/ecg_fractal/ecg_cvp.csv"
dat <- read.table(url, header=T, sep=",")
head(dat)
     tm    ECG   CVP
1 0.000 -0.140 4.781
2 0.003 -0.153 4.781
3 0.006 -0.151 4.831
4 0.008 -0.137 4.831
5 0.011 -0.131 4.880
6 0.014 -0.136 4.930
  • more functions read.csv() readLines()
  • ? read.table
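The same reading functions can be tried without a network connection by passing the data as text. In this sketch, csv_text and dat_small are illustrative names, and the three rows are copied from the head(dat) output shown above.

```r
# A few rows of the ECG data, typed in as text for illustration.
csv_text <- "tm,ECG,CVP
0.000,-0.140,4.781
0.003,-0.153,4.781
0.006,-0.151,4.831"

# read.csv() is read.table() with header = TRUE and sep = "," as defaults.
dat_small <- read.csv(text = csv_text)
str(dat_small)
```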

First steps to explore data

As soon as you read in a data object, inspect it with:

  • head(), tail()
  • summary(), str()
  • plot()
  • Example: the built-in data frame trees
head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7
plot(trees)

[Figure: scatterplot matrix of the trees data]

Case study: A quick cluster analysis

  • The built-in data set mtcars has 32 records and 11 variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
  • Clustering the cars
d <- dist(as.matrix(mtcars))
hc <- hclust(d)
groups <- cutree(hc, k = 6)
plot(hc)
rect.hclust(hc,  border="blue",k=6)

[Figure: cluster dendrogram of mtcars with the six clusters outlined in blue]
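Once the tree is cut, the group labels can be inspected directly. The sketch below repeats the clustering from the slide and then, as an assumption that is not part of the original analysis, standardizes the variables with scale() first, since the columns of mtcars are measured on very different scales. The object names ending in _scaled are illustrative.

```r
# Clustering as on the slide: raw variables, six groups.
d <- dist(as.matrix(mtcars))
hc <- hclust(d)
groups <- cutree(hc, k = 6)
table(groups)            # number of cars in each cluster

# Variant (an assumption, not from the slide): standardize each variable first.
d_scaled <- dist(scale(mtcars))
hc_scaled <- hclust(d_scaled)
groups_scaled <- cutree(hc_scaled, k = 6)
head(groups_scaled)      # cluster label for the first few cars
```

Comparing table(groups) with table(groups_scaled) shows how much the grouping depends on the measurement scale.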

Reading assignments and resources