Mahbubul Majumder, PhD
Sep 18, 2014
Is it data doctoring?
Why do we need this?
How can we do this effectively?
m <- c("5","7","9","4","8")
mf <- factor(m)
mf
[1] 5 7 9 4 8
Levels: 4 5 7 8 9
as.numeric(mf)
[1] 2 3 5 1 4
as.numeric(m)
[1] 5 7 9 4 8
as.numeric(as.character(mf))
[1] 5 7 9 4 8
What did we learn?
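One way to answer this, as a minimal sketch that rebuilds m and mf from above: as.numeric() applied to a factor returns the internal integer codes, not the labels. The as.numeric(levels(mf))[mf] idiom is not in the slides; it is shown only as a common alternative that converts each level once.

```r
m  <- c("5", "7", "9", "4", "8")
mf <- factor(m)

# Internal integer codes: the position of each value among the sorted levels
codes <- as.integer(mf)

# Recover the original numbers by going through the labels
via_character <- as.numeric(as.character(mf))

# Alternative idiom (not from the slides): convert the levels once, then index
via_levels <- as.numeric(levels(mf))[mf]

print(codes)
print(via_character)
print(identical(via_character, via_levels))
```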
Ordered factor levels
mf[1]<mf[2]
[1] NA
vo <- factor(m, ordered=TRUE)
vo
[1] 5 7 9 4 8
Levels: 4 < 5 < 7 < 8 < 9
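Once the factor is ordered, comparisons are defined. The sketch below assumes the same m as above; the second part (the vector s and its explicit levels) is a hypothetical example added to show that default levels are sorted as text, which matters as soon as the labels have more than one digit.

```r
m  <- c("5", "7", "9", "4", "8")
vo <- factor(m, ordered = TRUE)

# Comparisons work on ordered factors
first_lt_second <- vo[1] < vo[2]   # compares "5" with "7" by level order
largest <- max(vo)

# Caveat: default levels are sorted as text, so "10" would sort before "9".
# Supplying the levels explicitly (hypothetical example) fixes the order.
s  <- c("9", "10", "8")
so <- factor(s, levels = c("8", "9", "10"), ordered = TRUE)
ten_gt_nine <- so[2] > so[1]

print(first_lt_second)
print(largest)
print(ten_gt_nine)
```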
x <- c(4,5,8,9,3)
order(x)
[1] 5 1 2 3 4
x[order(x)]
[1] 3 4 5 8 9
x[order(-x)]
[1] 9 8 5 4 3
y <- c("A","F","G","E","D")
y[order(x)]
[1] "D" "A" "F" "G" "E"
library(ggplot2)  # provides qplot() and geom_bar()
qplot(y,x) + geom_bar(stat="identity")
z <- reorder(y,x)
qplot(z,x) + geom_bar(stat="identity")
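The two plots differ only in the order of the factor levels, which is what the axis follows. A small sketch, using the same x and y, that inspects the levels directly without drawing anything:

```r
x <- c(4, 5, 8, 9, 3)
y <- c("A", "F", "G", "E", "D")

# Default factor levels are alphabetical
default_levels <- levels(factor(y))

# reorder() sorts the levels by the value of x attached to each label
z <- reorder(y, x)
increasing_levels <- levels(z)

# Negating x gives decreasing order
decreasing_levels <- levels(reorder(y, -x))

print(default_levels)
print(increasing_levels)
print(decreasing_levels)
```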
data(tips, package="reshape2")  # the tips data set ships with the reshape2 package
head(tips)
  total_bill  tip    sex smoker day   time size
1      16.99 1.01 Female     No Sun Dinner    2
2      10.34 1.66   Male     No Sun Dinner    3
3      21.01 3.50   Male     No Sun Dinner    3
4      23.68 3.31   Male     No Sun Dinner    2
5      24.59 3.61 Female     No Sun Dinner    4
6      25.29 4.71   Male     No Sun Dinner    4
ordered_tips <- tips[order(-tips$tip),]
head(ordered_tips)
    total_bill   tip  sex smoker  day   time size
171      50.81 10.00 Male    Yes  Sat Dinner    3
213      48.33  9.00 Male     No  Sat Dinner    4
24       39.42  7.58 Male     No  Sat Dinner    4
60       48.27  6.73 Male     No  Sat Dinner    4
142      34.30  6.70 Male     No Thur  Lunch    6
184      23.17  6.50 Male    Yes  Sun Dinner    4
Aggregation can be done using any function
Note: The function may return more than one row
apply() tapply() sapply()
m <- matrix(1:6,ncol=3)
df <- data.frame(m)
df
  X1 X2 X3
1  1  3  5
2  2  4  6
apply(df,2,sum)
X1 X2 X3
 3  7 11
apply(df,1,sum)
[1] 9 12
x <- c(3,4,6,7,8,9)
y <- c(1,1,2,2,3,3)
tapply(x,y,mean)
  1   2   3
3.5 6.5 8.5
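The earlier note that the aggregating function may return more than one value can be illustrated with the same x and y. This is only a sketch: range() is chosen here as an example of a function that returns two numbers per group, and stacking the results with do.call(rbind, ...) is one possible way to tabulate them.

```r
x <- c(3, 4, 6, 7, 8, 9)
y <- c(1, 1, 2, 2, 3, 3)

# range() returns min and max, so each group yields two numbers
by_group <- tapply(x, y, range)
group_two <- by_group[["2"]]

# Stack the per-group results: one row per group
summary_table <- do.call(rbind, by_group)
colnames(summary_table) <- c("min", "max")

print(group_two)
print(summary_table)
```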
foo <- function(x){
return(2^x)
}
sapply(x,foo)
[1] 8 16 64 128 256 512
?mapply
?lapply
?vapply
sapply(x,'+',1)
mapply('+',x,1)
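The two calls above are expected to give the same result as plain x + 1. The sketch below checks that and, as an addition not shown on the slide, contrasts lapply() (always a list) with vapply() (which checks the type and length of each result).

```r
x <- c(3, 4, 6, 7, 8, 9)

# Extra arguments after the function are passed on to every call
a <- sapply(x, `+`, 1)     # calls `+`(x[i], 1) for each element
b <- mapply(`+`, x, 1)     # the 1 is recycled along x
vectorised <- x + 1        # arithmetic in R is already vectorised

# lapply() returns a list; vapply() insists on one numeric value per element
as_list <- lapply(x, function(v) 2^v)
checked <- vapply(x, function(v) 2^v, numeric(1))

print(a)
print(b)
print(checked)
```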
df1
  v1 v2
1  a  5
2  a  8
3  b  3
4  c  4
5  b  5
df2
  w1   w2
1  b john
2  c Eric
We want to combine df1 and df2 by matching v1 and w1. How can we do that?
merge(df1,df2,by.x='v1',by.y='w1')
  v1 v2   w2
1  b  3 john
2  b  5 john
3  c  4 Eric
v1 and w1 should be comparable
If v1 and w1 are text, it is wise to convert them to lower case so that they compare without trouble.
?tolower
?merge
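A minimal sketch of the lower-case advice. The data frames left and right are hypothetical, made up here so that the keys differ only by case; they are not the df1 and df2 from the slides.

```r
# Hypothetical data: the same keys typed with inconsistent case
left  <- data.frame(v1 = c("A", "b", "C"), v2 = c(5, 3, 4))
right <- data.frame(w1 = c("a", "B"), w2 = c("john", "Eric"))

# Without normalising, no key matches exactly
unmatched <- nrow(merge(left, right, by.x = "v1", by.y = "w1"))

# Lower-case both key columns, then merge
left$v1  <- tolower(left$v1)
right$w1 <- tolower(right$w1)
merged <- merge(left, right, by.x = "v1", by.y = "w1")

print(unmatched)
print(merged)
```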
merge(df1,df2,by.x='v1',by.y='w1',
all.x=TRUE)
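The slide does not show the result of this call. As a sketch, rebuilding df1 and df2 from the printed values above (the data.frame() calls are a reconstruction, not the author's code), all.x=TRUE keeps every row of df1 and fills w2 with NA where there is no match.

```r
# Rebuild the two data frames from the values printed on the slides
df1 <- data.frame(v1 = c("a", "a", "b", "c", "b"), v2 = c(5, 8, 3, 4, 5))
df2 <- data.frame(w1 = c("b", "c"), w2 = c("john", "Eric"))

# all.x = TRUE keeps every row of df1; rows without a partner get NA in w2
left_join <- merge(df1, df2, by.x = "v1", by.y = "w1", all.x = TRUE)

# Rows of df1 that had no partner in df2
no_partner <- left_join[is.na(left_join$w2), ]

print(left_join)
print(no_partner)
```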
df
  mice.wt lion.wt whale.wt
1       1     301     2988
2       3     280     3036
3       1     312     3047
4       7     269     2956
5       4     308     3000
6       1     283     3021
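How can we compare these columns?
We need to transform them so that they become unit-free
One way to do this is to standardize each column:
(x - center)/scale
By default, scale() uses the column mean as center and the column standard deviation as scale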
scale(df)
     mice.wt lion.wt whale.wt
[1,] -0.7634  0.5116  -0.5954
[2,]  0.0694 -0.7046   0.8335
[3,] -0.7634  1.1486   1.1610
[4,]  1.7351 -1.3416  -1.5480
[5,]  0.4858  0.9169  -0.2382
[6,] -0.7634 -0.5309   0.3870
attr(,"scaled:center")
 mice.wt  lion.wt whale.wt
   2.833  292.167 3008.000
attr(,"scaled:scale")
 mice.wt  lion.wt whale.wt
   2.401   17.268   33.592
Now the three columns are comparable
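To connect the (x - center)/scale formula with the output of scale(), the sketch below standardizes each column by hand and compares the result. The data.frame() call rebuilds df from the printed values; it is a reconstruction, not the author's code.

```r
df <- data.frame(
  mice.wt  = c(1, 3, 1, 7, 4, 1),
  lion.wt  = c(301, 280, 312, 269, 308, 283),
  whale.wt = c(2988, 3036, 3047, 2956, 3000, 3021)
)

# Standardize each column by hand: subtract the mean, divide by the sd
manual <- sapply(df, function(col) (col - mean(col)) / sd(col))

# Compare with scale(), ignoring the "scaled:center"/"scaled:scale" attributes
same <- isTRUE(all.equal(manual, scale(df), check.attributes = FALSE))

# After standardizing, every column has mean 0 and standard deviation 1
col_means <- colMeans(manual)
col_sds   <- apply(manual, 2, sd)

print(same)
print(round(col_means, 10))
print(col_sds)
```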
Quiz:
Give a situation where we need to do this