Handling missing and messy data

Mahbubul Majumder, PhD
Dec 9, 2014

Missing data

  • If some of the values of a variable are not available, we call them missing
    • very common even in a big data scenario
    • missing data occur when
      • no information is found
      • there is a recording error
      • data were not collected properly
      • data are reshaped (for example, casting long data into a wide table creates cells with no corresponding observation)

Identifying missing data

  • Data looks good. Is there any missing value here?
 time sex dose
    1   M    3
    1   F    5
    2   M   10
    2   F   11
    3   F   10
  • Now let us create a two-way table of doses to see whether the data are complete.
library(reshape2)
dcast(dat, time ~ sex, value.var = "dose")
  time  F  M
1    1  5  3
2    2 11 10
3    3 10 NA
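  • As a minimal sketch (assuming the five-row data frame shown above is named dat), the missing cell can also be located programmatically:
library(reshape2)

# rebuild the small example data set shown above
dat <- data.frame(time = c(1, 1, 2, 2, 3),
                  sex  = c("M", "F", "M", "F", "F"),
                  dose = c(3, 5, 10, 11, 10))

# cast to the wide two-way table and locate the NA cell
wide <- dcast(dat, time ~ sex, value.var = "dose")
which(is.na(wide), arr.ind = TRUE)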

Types of missing data

  • Missing Not At Random (MNAR)

    • salary information is not answered by everyone (why? the reason is not collected)
    • the missing values usually depend on unobserved variables
      • disease status could be missing due to social pressure
  • Missing At Random (MAR)

    • missingness may depend on other observed variables
    • males may not like to report certain information
  • Missing Completely At Random (MCAR)

    • missingness does not depend on the other variables, observed or unobserved

Consequence of missing data

  • A missing value, when not missing at random (MNAR), may itself be informative

    • it carries information about unobserved variables
    • should not be ignored
    • indicates the data may not be a representative sample
    • results could be biased if it is ignored
  • We can ignore missing records if they are MAR or MCAR

    • in fact most methods assume MAR
    • since under MAR the missingness depends only on the observed data, given the observed data the missingness is independent of any unobserved information

Dealing with missing data

  • Perform a complete case analysis: ignore or delete all observations with missing values.

  • Last observation carried forward: usually for longitudinal data

  • Imputation of missing data (a group-mean sketch follows this list)

    • fill in the missing values with
      • overall means
      • group means
      • predictions from models
    • best used with methods that are robust to imputation error, but it can improve estimation
  • Multiple imputation fills each missing value with draws from a distribution, for example normal(mean, sd)

    • each set of fills constitutes a complete data set
    • the final result combines the analyses across the completed data sets
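  • A minimal sketch of group-mean imputation in R (the data frame and variable names here are made up for illustration):
# toy data: one dose value is missing for a male subject
dat <- data.frame(sex  = c("M", "F", "M", "F", "F", "M"),
                  dose = c(3, 5, 10, 11, 10, NA))

# compute the mean dose within each sex group, ignoring NAs,
# and use it to fill the missing values of that group
group_mean <- ave(dat$dose, dat$sex, FUN = function(x) mean(x, na.rm = TRUE))
dat$dose[is.na(dat$dose)] <- group_mean[is.na(dat$dose)]
dat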

Messy data



What can you tell about messy data?



It's messy; the variables are hard to determine.

Where is the data?

  • Which part of this is your data?
[1] "*$তথ্য*$%&*$кики%&%&19ки19#%/>[|]<7272%&#%তথ্য[|]>@*72#%#%*$<>72@*#%*$@*/*$<[|]кики><<19#%*$/#%#%*$<ки1972%&*$@**$@*19[|][|]%&киতথ্য/%&@*%&তথ্য@*//%&*$[|]/%&%&/ки<[|]%&90তথ্য<ки"
  • How can you extract the date from this? (see the sketch after these examples)
The meeting will held on Tuesday the 9th December, 2014. Next meeting is scheduled on Wednesday the 17th December 2014
  • Where is the data?
<input type="hidden" name="title" value="Special:SearchWiki">
<input type="hidden" name="uselang" value="en">
<input type="hidden" name="searchproject" value="p">
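  • As a minimal sketch, one way to pull the dates out of the meeting sentence above is a regular expression in R (the pattern below is only one possible choice):
# the messy sentence stored as a character string
txt <- "The meeting will held on Tuesday the 9th December, 2014. Next meeting is scheduled on Wednesday the 17th December 2014"

# match pieces that look like "9th December, 2014" or "17th December 2014"
m <- gregexpr("[0-9]{1,2}(st|nd|rd|th)? [A-Z][a-z]+,? [0-9]{4}", txt)
regmatches(txt, m)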

What are the variables?

time Friday Saturday Sunday Monday Tuesday Wednesday Thursday
   1     22       24     14     21      20        18       17
   2     15       22     22     15      19        17       22
   3     24       17     11     28      30        21       11
   4     11       20     29     13      16        15       21
   5     21       22     20     12      22        17       16
   6     22       21     16     19      23        19       10
   7     15       15     11     22      21        20       25
   8     18       29     16     22      24        18       14
   9     18       24     20     19      20        12       19
  10     27       14     21     18      22        19       34
  • Here the column headers are not variables; they are the values of a single variable (day of the week).
    There are really only three variables: time, day and value. A sketch of the reshaping follows.
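  • A minimal sketch of that reshaping with reshape2 (only the first few rows of the table are rebuilt here for illustration):
library(reshape2)

# a few rows of the wide table above
wide <- data.frame(time = 1:3,
                   Friday = c(22, 15, 24), Saturday = c(24, 22, 17),
                   Sunday = c(14, 22, 11), Monday = c(21, 15, 28),
                   Tuesday = c(20, 19, 30), Wednesday = c(18, 17, 21),
                   Thursday = c(17, 22, 11))

# melt the day-of-week columns into a single day/value pair, keeping time as id
long <- melt(wide, id.vars = "time",
             variable.name = "day", value.name = "value")
head(long)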

What are the variables?

Delivery Amount
On Sunday
10:30 43
12:30 12
12:35 30
On Monday
11:30 29
11:57 87
11.59 63
On Tuesday
11:33 19
11:15 27
12.59 54
  • Here multiple variables are stored in a single column. The Delivery column should be split into two columns, say Days and times, as in the tidy version below (a sketch of the code follows it).
Days    times Amount
Sunday  10:30     43
Sunday  12:30     12
Sunday  12:35     30
Monday  11:30     29
Monday  11:57     87
Monday  11.59     63
Tuesday 11:33     19
Tuesday 11:15     27
Tuesday 12.59     54
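  • A minimal sketch of that clean-up in base R (the object names messy and tidy are made up for illustration):
# the messy table: day labels and delivery times share the Delivery column
messy <- data.frame(
  Delivery = c("On Sunday", "10:30", "12:30", "12:35",
               "On Monday", "11:30", "11:57", "11.59",
               "On Tuesday", "11:33", "11:15", "12.59"),
  Amount   = c(NA, 43, 12, 30, NA, 29, 87, 63, NA, 19, 27, 54),
  stringsAsFactors = FALSE)

# rows starting with "On " carry the day; repeat each day over the
# delivery rows that follow it, then drop the day-only rows
is_day <- grepl("^On ", messy$Delivery)
day    <- rep(sub("^On ", "", messy$Delivery[is_day]),
              times = diff(c(which(is_day), nrow(messy) + 1)) - 1)
tidy   <- data.frame(Days = day,
                     times = messy$Delivery[!is_day],
                     Amount = messy$Amount[!is_day])
tidy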

Tidying the messy data

  • Sometimes variables are stored in both rows and columns
  • Multiple types of experimental units could be stored in the same table
  • One type of experimental unit could be stored in multiple tables
  • Please refer to Hadley's lecture on tidy data
    http://stat405.had.co.nz/lectures/18-tidy-data.pdf

  • For more details refer to the documentation of the R package tidyr; a small example follows the installation lines below

install.packages("tidyr")
library(tidyr)
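  • As a minimal sketch, the same reshaping done with melt() earlier can be written with tidyr (assuming the wide data frame built in that sketch):
# gather every column except time into day/value pairs
long <- gather(wide, day, value, -time)
head(long)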

Massive amount of messy data

  • When the data are too large for a single machine to clean

    • use cluster computing
    • Hadoop and Pig can help
      • Pig lets you apply regular expressions
  • Problem of character encoding in text data

    • we have seen this in our last couple of homework assignments
    • you can clean it in Pig using regular expressions
grunt> dat = LOAD 'input.txt' AS (myText:chararray);
grunt> out = FOREACH dat GENERATE REGEX_EXTRACT(myText,'(\\w+)',1);
grunt> dump out;

Reading assignment and references