Manipulation of Data

Mahbubul Majumder, PhD
Sep 18, 2014

What is data manipulation?

Is it data doctoring?
Why do we need this?
- sorting, rearranging
- merging and grouping
- creating new variables
- change the variable types
- aggregating
How can we do this effectively?
- grammar of data manipulation

Changing variable types

m <- c("5","7","9","4","8")
mf <- factor(m)
mf

[1] 5 7 9 4 8
Levels: 4 5 7 8 9

as.numeric(mf)

[1] 2 3 5 1 4

as.numeric(m)

[1] 5 7 9 4 8

as.numeric(as.character(mf))

[1] 5 7 9 4 8

What did we learn?
- character to factor
- character to numeric
- factor to character
- factor to numeric conversion ??
Ordered factor levels

mf[1]<mf[2]

[1] NA

vo <- factor(m, ordered=1:5)
vo

[1] 5 7 9 4 8
Levels: 4 < 5 < 7 < 8 < 9

Sorting, ordering and arranging data

x <- c(4,5,8,9,3)
order(x)

[1] 5 1 2 3 4

x[order(x)]

[1] 3 4 5 8 9

x[order(-x)]

[1] 9 8 5 4 3

y <- c("A","F","G","E","D")
y[order(x)]

[1] "D" "A" "F" "G" "E"

qplot(y,x) + geom_bar(stat="identity")

plot of chunk unnamed-chunk-5

z <- reorder(y,x)
qplot(z,x) + geom_bar(stat="identity")

plot of chunk unnamed-chunk-5

Rearranging data

head(tips)

  total_bill  tip    sex smoker day   time size
1      16.99 1.01 Female     No Sun Dinner    2
2      10.34 1.66   Male     No Sun Dinner    3
3      21.01 3.50   Male     No Sun Dinner    3
4      23.68 3.31   Male     No Sun Dinner    2
5      24.59 3.61 Female     No Sun Dinner    4
6      25.29 4.71   Male     No Sun Dinner    4

ordered_tips <- tips[order(-tips$tip),]
head(ordered_tips)

    total_bill   tip  sex smoker  day   time size
171      50.81 10.00 Male    Yes  Sat Dinner    3
213      48.33  9.00 Male     No  Sat Dinner    4
24       39.42  7.58 Male     No  Sat Dinner    4
60       48.27  6.73 Male     No  Sat Dinner    4
142      34.30  6.70 Male     No Thur  Lunch    6
184      23.17  6.50 Male    Yes  Sun Dinner    4

Aggregating data

Aggregation can be done using any function

Note: The function may return more than one row

aggregating-data

Apply family

apply() tapply() sapply()

m <- matrix(1:6,ncol=3)
df <- data.frame(m)
df

  X1 X2 X3
1  1  3  5
2  2  4  6

apply(df,2,sum)

X1 X2 X3 
 3  7 11

apply(df,1,sum)

[1]  9 12

x <- c(3,4,6,7,8,9)
y <- c(1,1,2,2,3,3)
tapply(x,y,mean)

  1   2   3 
3.5 6.5 8.5

foo <- function(x){
  return(2^x)
  }
sapply(x,foo)

[1]   8  16  64 128 256 512

?mapply ?lapply ?vapply
Quiz:
sapply(x,'+',1)
mapply('+',x,1)

Merging and augmentation of data

df1

df2

  w1   w2
1  b john
2  c Eric

Suppose we want to merge based on columns v1 and w1. How can we do that?

merge(df1,df2,by.x='v1',by.y='w1')

  v1 v2   w2
1  b  3 john
2  b  5 john
3  c  4 Eric

Columns v1 and w1 should be comparable
If columns v1 and w1 are texts, it is wise to change them to lower case to make sure there is no trouble in comparing.
? tolower()
? merge()
Quiz: When and why we should do the following?

merge(df1,df2,by.x='v1',by.y='w1',
      all.x=TRUE)

Standardization of data

We have some data that shows weights of different animals

df

  mice.wt lion.wt whale.wt
1       1     301     2988
2       3     280     3036
3       1     312     3047
4       7     269     2956
5       4     308     3000
6       1     283     3021

How can we compare?
We need to change them in such a way that they become unit free
One way to do it is to standardize
(x - center)/scale

scale(df)

     mice.wt lion.wt whale.wt
[1,] -0.7634  0.5116  -0.5954
[2,]  0.0694 -0.7046   0.8335
[3,] -0.7634  1.1486   1.1610
[4,]  1.7351 -1.3416  -1.5480
[5,]  0.4858  0.9169  -0.2382
[6,] -0.7634 -0.5309   0.3870
attr(,"scaled:center")
 mice.wt  lion.wt whale.wt 
   2.833  292.167 3008.000 
attr(,"scaled:scale")
 mice.wt  lion.wt whale.wt 
   2.401   17.268   33.592

Now three columns are comparable
Quiz:
Give a situation where we need to do this