Handling missing and messy data

Mahbubul Majumder, PhD
Dec 9, 2014

Missing data

If some of the values of a variable are not available, we call it missing
- very common even in a big data scenario
- missing data occurs when
  - no information is found
  - there exists recording error
  - data was not collected properly
  - data is reshaped (what does it mean?)

Identifying missing data

Data looks good. Is there any missing value here?

 time sex dose
    1   M    3
    1   F    5
    2   M   10
    2   F   11
    3   F   10

Now let us create a two way table of doses to see if the data is complete.

library(reshape2)
dcast(dat,  time ~ sex)

  time  F  M
1    1  5  3
2    2 11 10
3    3 10 NA

Types of missing data

Missing Not At Random (MNAR)
- salary information is not answered by all (why? the reason is not collected)
- usually missing information are dependent on unobserved variables
  - a disease information could be missing due to social pressure
Missing At Random (MAR)
- missing may depend on other observed variables
- males may not like to report certain information
completely random (MCAR)
- does not usually depend on the other variables

Consequence of missing data

Missing event, when not at random (MNAR) may be informative
- gives information about unobserved information
- should not be ignored
- indicates the data may not be a representative sample
- results could be biased if ignored
We can ignore missing records if it is MAR or MCAR
- in fact most of the methods assume MAR
- since MAR depends on the observed data, given the observed data missing is independent of any unobserved information

Dealing with missing data

Perform a complete case analysis. Ignore or Delete all missing observations.
Last observation carried forward: usually for longitudinal data
Imputation of missing data
- fill in the missing values
  - overall means
  - group means
  - prediction using models
- applies to the methods that are robust, but improves estimation
Multiple imputation fills the missing values by normal(mean, sd) values
- each fill constitute a complete data set
- final result combines the results of many complete data sets

Messy data

What can you tell about messy data?

Its messy, variables are hard to be determined.

Where is the data?

Which part of this is your data?

[1] "*$তথ্য*$%&*$кики%&%&19ки19#%/>[|]<7272%&#%তথ্য[|]>@*72#%#%*$<>72@*#%*$@*/*$<[|]кики><<19#%*$/#%#%*$<ки1972%&*$@**$@*19[|][|]%&киতথ্য/%&@*%&তথ্য@*//%&*$[|]/%&%&/ки<[|]%&90তথ্য<ки"

How can you extract date from this?

The meeting will held on Tuesday the 9th December, 2014. Next meeting is scheduled on Wednesday the 17th December 2014

Where is the data?

<input type="hidden" name="title" value="Special:SearchWiki">
<input type="hidden" name="uselang" value="en">
<input type="hidden" name="searchproject" value="p">

What are the variables?

time	Friday	Saturday	Sunday	Monday	Tuesday	Wednesday	Thursday
1	22	24	14	21	20	18	17
2	15	22	22	15	19	17	22
3	24	17	11	28	30	21	11
4	11	20	29	13	16	15	21
5	21	22	20	12	22	17	16
6	22	21	16	19	23	19	10
7	15	15	11	22	21	20	25
8	18	29	16	22	24	18	14
9	18	24	20	19	20	12	19
10	27	14	21	18	22	19	34

Here column headers are not variables, instead they are the values of one variable
There are only three variables here. They are time, days and values

What are the variables?

Delivery	Amount
On Sunday
10:30	43
12:30	12
12:35	30
On Monday
11:30	29
11:57	87
11.59	63
On Tuesday
11:33	19
11:15	27
12.59	54

What are the variables?

Delivery	Amount
On Sunday
10:30	43
12:30	12
12:35	30
On Monday
11:30	29
11:57	87
11.59	63
On Tuesday
11:33	19
11:15	27
12.59	54

Here multiple variables are stored in a single column. This Delivery column should be splitted into two columns such as Weekdays and time.

Days	times	Amount
Sunday	10:30	43
Sunday	12:30	12
Sunday	12:35	30
Monday	11:30	29
Monday	11:57	87
Monday	11.59	63
Tuesday	11:33	19
Tuesday	11:15	27
Tuesday	12.59	54

Tidying the messy data

Sometimes variables could be stored even in both rows and columns
Multiple types of experimental units could be stored in the same table
one type of experimental unit could be stored in multiple tables
Please refer to Hadley's lecture on tidy data
http://stat405.had.co.nz/lectures/18-tidy-data.pdf
For more details refer to the documentation of R package tidyR

install.packages("tidyr")
library(tidyr)

RStudio blog introducing tydyr package
http://blog.rstudio.org/2014/07/22/introducing-tidyr/

Massive amount of messy data

Large amount of data, no single machine can handle to clean
- use cluster computing
- hadoop & pig can help you
  - pig allows you to use regular expression
Problem of character encoding in text data
- we have seen in our last couple of HW
- you can use pig to clean by using regular expression

grunt> dat = LOAD 'input.txt' as myText;
grunt> out = FOREACH dat GENERATE REGEX_EXTRACT(myText,'(\\w+)',1);
grunt> dump out;

Reading assignment and references

R Little and D Rubin (2002) Statistical Analysis with Missing Data, Second Edition
Hadley's lecture on tidy data
http://stat405.had.co.nz/lectures/18-tidy-data.pdf
RStudio blog introducing tydyr package
http://blog.rstudio.org/2014/07/22/introducing-tidyr/
Demonstration of tidyr by Hadley
https://rpubs.com/m_dev/tidyr-intro-and-demos

time	Friday	Saturday	Sunday	Monday	Tuesday	Wednesday	Thursday
1	22	24	14	21	20	18	17
2	15	22	22	15	19	17	22
3	24	17	11	28	30	21	11
4	11	20	29	13	16	15	21
5	21	22	20	12	22	17	16
6	22	21	16	19	23	19	10
7	15	15	11	22	21	20	25
8	18	29	16	22	24	18	14
9	18	24	20	19	20	12	19
10	27	14	21	18	22	19	34

time	Friday	Saturday	Sunday	Monday	Tuesday	Wednesday	Thursday
1	22	24	14	21	20	18	17
2	15	22	22	15	19	17	22
3	24	17	11	28	30	21	11
4	11	20	29	13	16	15	21
5	21	22	20	12	22	17	16
6	22	21	16	19	23	19	10
7	15	15	11	22	21	20	25
8	18	29	16	22	24	18	14
9	18	24	20	19	20	12	19
10	27	14	21	18	22	19	34

time	Friday	Saturday	Sunday	Monday	Tuesday	Wednesday	Thursday
1	22	24	14	21	20	18	17
2	15	22	22	15	19	17	22
3	24	17	11	28	30	21	11
4	11	20	29	13	16	15	21
5	21	22	20	12	22	17	16
6	22	21	16	19	23	19	10
7	15	15	11	22	21	20	25
8	18	29	16	22	24	18	14
9	18	24	20	19	20	12	19
10	27	14	21	18	22	19	34