Mahbubul Majumder, PhD
Sep 18, 2014
Is it data doctoring?
Why do we need this?
How can we do this effectively?
m <- c("5","7","9","4","8")
mf <- factor(m)
mf
[1] 5 7 9 4 8
Levels: 4 5 7 8 9
as.numeric(mf)
[1] 2 3 5 1 4
as.numeric(m)
[1] 5 7 9 4 8
as.numeric(as.character(mf))
[1] 5 7 9 4 8
What did we learn?
Ordered factor levels
mf[1]<mf[2]
[1] NA
vo <- factor(m, ordered=1:5)
vo
[1] 5 7 9 4 8
Levels: 4 < 5 < 7 < 8 < 9
x <- c(4,5,8,9,3)
order(x)
[1] 5 1 2 3 4
x[order(x)]
[1] 3 4 5 8 9
x[order(-x)]
[1] 9 8 5 4 3
y <- c("A","F","G","E","D")
y[order(x)]
[1] "D" "A" "F" "G" "E"
qplot(y,x) + geom_bar(stat="identity")
z <- reorder(y,x)
qplot(z,x) + geom_bar(stat="identity")
head(tips)
total_bill tip sex smoker day time size
1 16.99 1.01 Female No Sun Dinner 2
2 10.34 1.66 Male No Sun Dinner 3
3 21.01 3.50 Male No Sun Dinner 3
4 23.68 3.31 Male No Sun Dinner 2
5 24.59 3.61 Female No Sun Dinner 4
6 25.29 4.71 Male No Sun Dinner 4
ordered_tips <- tips[order(-tips$tip),]
head(ordered_tips)
total_bill tip sex smoker day time size
171 50.81 10.00 Male Yes Sat Dinner 3
213 48.33 9.00 Male No Sat Dinner 4
24 39.42 7.58 Male No Sat Dinner 4
60 48.27 6.73 Male No Sat Dinner 4
142 34.30 6.70 Male No Thur Lunch 6
184 23.17 6.50 Male Yes Sun Dinner 4
Aggregation can be done using any function
Note: The function may return more than one row
apply() tapply() sapply()m <- matrix(1:6,ncol=3)
df <- data.frame(m)
df
X1 X2 X3
1 1 3 5
2 2 4 6
apply(df,2,sum)
X1 X2 X3
3 7 11
apply(df,1,sum)
[1] 9 12
x <- c(3,4,6,7,8,9)
y <- c(1,1,2,2,3,3)
tapply(x,y,mean)
1 2 3
3.5 6.5 8.5
foo <- function(x){
return(2^x)
}
sapply(x,foo)
[1] 8 16 64 128 256 512
?mapply ?lapply ?vapplysapply(x,'+',1) mapply('+',x,1)df1
v1 v2
1 a 5
2 a 8
3 b 3
4 c 4
5 b 5
df2
w1 w2
1 b john
2 c Eric
v1 and w1. How can we do that?merge(df1,df2,by.x='v1',by.y='w1')
v1 v2 w2
1 b 3 john
2 b 5 john
3 c 4 Eric
v1 and w1 should be comparablev1 and w1 are texts, it is wise to change them to lower case to make sure there is no trouble in comparing.? tolower()? merge()merge(df1,df2,by.x='v1',by.y='w1',
all.x=TRUE)
df
mice.wt lion.wt whale.wt
1 1 301 2988
2 3 280 3036
3 1 312 3047
4 7 269 2956
5 4 308 3000
6 1 283 3021
How can we compare?
We need to change them in such a way that they become unit free
One way to do it is to standardize
(x - center)/scale
scale(df)
mice.wt lion.wt whale.wt
[1,] -0.7634 0.5116 -0.5954
[2,] 0.0694 -0.7046 0.8335
[3,] -0.7634 1.1486 1.1610
[4,] 1.7351 -1.3416 -1.5480
[5,] 0.4858 0.9169 -0.2382
[6,] -0.7634 -0.5309 0.3870
attr(,"scaled:center")
mice.wt lion.wt whale.wt
2.833 292.167 3008.000
attr(,"scaled:scale")
mice.wt lion.wt whale.wt
2.401 17.268 33.592
Now three columns are comparable
Quiz:
Give a situation where we need to do this