Manipulation of Data

Mahbubul Majumder, PhD
Sep 18, 2014

What is data manipulation?

  • Is it data doctoring?

  • Why do we need this?

    • sorting, rearranging
    • merging and grouping
    • creating new variables
    • change the variable types
    • aggregating
  • How can we do this effectively?

    • grammar of data manipulation

Changing variable types

m <- c("5","7","9","4","8")
mf <- factor(m)
mf
[1] 5 7 9 4 8
Levels: 4 5 7 8 9
as.numeric(mf)
[1] 2 3 5 1 4
as.numeric(m)
[1] 5 7 9 4 8
as.numeric(as.character(mf))
[1] 5 7 9 4 8
  • What did we learn?

    • character to factor
    • character to numeric
    • factor to character
    • factor to numeric conversion ??
  • Ordered factor levels

mf[1]<mf[2]
[1] NA
vo <- factor(m, ordered=1:5)
vo
[1] 5 7 9 4 8
Levels: 4 < 5 < 7 < 8 < 9

Sorting, ordering and arranging data

x <- c(4,5,8,9,3)
order(x)
[1] 5 1 2 3 4
x[order(x)]
[1] 3 4 5 8 9
x[order(-x)]
[1] 9 8 5 4 3
y <- c("A","F","G","E","D")
y[order(x)]
[1] "D" "A" "F" "G" "E"
qplot(y,x) + geom_bar(stat="identity")

plot of chunk unnamed-chunk-5

z <- reorder(y,x)
qplot(z,x) + geom_bar(stat="identity")

plot of chunk unnamed-chunk-5

Rearranging data

head(tips)
  total_bill  tip    sex smoker day   time size
1      16.99 1.01 Female     No Sun Dinner    2
2      10.34 1.66   Male     No Sun Dinner    3
3      21.01 3.50   Male     No Sun Dinner    3
4      23.68 3.31   Male     No Sun Dinner    2
5      24.59 3.61 Female     No Sun Dinner    4
6      25.29 4.71   Male     No Sun Dinner    4
ordered_tips <- tips[order(-tips$tip),]
head(ordered_tips)
    total_bill   tip  sex smoker  day   time size
171      50.81 10.00 Male    Yes  Sat Dinner    3
213      48.33  9.00 Male     No  Sat Dinner    4
24       39.42  7.58 Male     No  Sat Dinner    4
60       48.27  6.73 Male     No  Sat Dinner    4
142      34.30  6.70 Male     No Thur  Lunch    6
184      23.17  6.50 Male    Yes  Sun Dinner    4

Aggregating data





Aggregation can be done using any function



Note: The function may return more than one row

aggregating-data

Apply family

  • apply() tapply() sapply()
m <- matrix(1:6,ncol=3)
df <- data.frame(m)
df
  X1 X2 X3
1  1  3  5
2  2  4  6
apply(df,2,sum)
X1 X2 X3 
 3  7 11 
apply(df,1,sum)
[1]  9 12
x <- c(3,4,6,7,8,9)
y <- c(1,1,2,2,3,3)
tapply(x,y,mean)
  1   2   3 
3.5 6.5 8.5 
foo <- function(x){
  return(2^x)
  }
sapply(x,foo)
[1]   8  16  64 128 256 512
  • ?mapply ?lapply ?vapply
  • Quiz:
    sapply(x,'+',1)
    mapply('+',x,1)

Merging and augmentation of data

df1
  v1 v2
1  a  5
2  a  8
3  b  3
4  c  4
5  b  5
df2
  w1   w2
1  b john
2  c Eric
  • Suppose we want to merge based on columns v1 and w1. How can we do that?
merge(df1,df2,by.x='v1',by.y='w1')
  v1 v2   w2
1  b  3 john
2  b  5 john
3  c  4 Eric
  • Columns v1 and w1 should be comparable
  • If columns v1 and w1 are texts, it is wise to change them to lower case to make sure there is no trouble in comparing.
  • ? tolower()
  • ? merge()
  • Quiz: When and why we should do the following?
merge(df1,df2,by.x='v1',by.y='w1',
      all.x=TRUE)

Standardization of data

  • We have some data that shows weights of different animals
df
  mice.wt lion.wt whale.wt
1       1     301     2988
2       3     280     3036
3       1     312     3047
4       7     269     2956
5       4     308     3000
6       1     283     3021
  • How can we compare?

  • We need to change them in such a way that they become unit free

  • One way to do it is to standardize
    (x - center)/scale

scale(df)
     mice.wt lion.wt whale.wt
[1,] -0.7634  0.5116  -0.5954
[2,]  0.0694 -0.7046   0.8335
[3,] -0.7634  1.1486   1.1610
[4,]  1.7351 -1.3416  -1.5480
[5,]  0.4858  0.9169  -0.2382
[6,] -0.7634 -0.5309   0.3870
attr(,"scaled:center")
 mice.wt  lion.wt whale.wt 
   2.833  292.167 3008.000 
attr(,"scaled:scale")
 mice.wt  lion.wt whale.wt 
   2.401   17.268   33.592 
  • Now three columns are comparable

  • Quiz:
    Give a situation where we need to do this