Exploratory Data Visualization

Mahbubul Majumder, PhD
Sep 9, 2014

Exploratory data analysis (EDA)

  • Learn about underlying structure of the data

  • Identify the variables (some could be hidden!)

  • Spot potential outliers or anomalies in the data

  • Detect missing values or systematic patterns in the data

  • Discover the unexpected

  • Assess how valuable the data are for answering the question of interest

  • Identify features the model can't find

  • Check whether the assumptions of a model are satisfied
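
A few of these checks can be sketched in base R. The data frame `demo` below is made up for illustration and is not part of any data set used in these slides. `boxplot.stats()` flags points beyond 1.5 times the hinge spread, which is only one common rule of thumb for outliers.

```r
# Hypothetical toy data (not from the tips data set)
demo <- data.frame(
  bill  = c(10, 12, 11, 13, 12, 95, NA),
  party = c(2, 2, 3, 4, NA, 2, 2)
)

# Structure: variable names and types
str(demo)

# Number of missing values per variable
n_missing <- colSums(is.na(demo))

# Candidate outliers by the boxplot rule (1.5 x hinge spread)
outliers <- boxplot.stats(demo$bill)$out

print(n_missing)
print(outliers)
```

A flagged point is only a candidate: whether it is an error or an interesting case is a judgment for the analyst.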

Tips data: a case study

  • Cook et al.: Interactive and Dynamic Graphics for Data Analysis
  • Does the tip rate depend on the party size?
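# Read the tips data and compute the tip as a fraction of the total bill
tips <- read.csv("http://www.ggobi.org/book/data/tips.csv")
tips$rate <- with(tips, tip / totbill)
# Regress the tip rate on the party size
model <- lm(rate ~ size, data = tips)
summary(model)$coefficients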
             Estimate Std. Error t value  Pr(>|t|)
(Intercept)  0.184375   0.011191  16.475 2.094e-41
size        -0.009173   0.004085  -2.245 2.565e-02
  • When the party size increases by 1 person, the tip rate decreases by about 1 percentage point

    tip rate = 0.18 - 0.01 * size
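
The fitted line can be evaluated by hand. The sketch below copies the two coefficients from the regression output above; the helper `predict_rate()` and the range of party sizes 1 to 6 are chosen here for illustration only.

```r
# Coefficients copied from the regression output above
intercept <- 0.184375
slope     <- -0.009173

# Hypothetical helper: predicted tip rate for a given party size
predict_rate <- function(size) intercept + slope * size

# Party sizes chosen for illustration
sizes <- 1:6
predicted <- predict_rate(sizes)

print(round(data.frame(size = sizes, rate = predicted), 3))
```

Each extra person lowers the predicted rate by the same amount, which is all a straight-line model can express.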

Tips data: explore the tips

  • Let us explore the distribution of tips
library(ggplot2) 
ggplot(tips) +
  geom_histogram(aes(tip), binwidth=1) + 
  scale_x_continuous(
    breaks=seq(2,10,by=2))

  • Now change the binwidth to 0.1
ggplot(tips) +
  geom_histogram(aes(tip), binwidth=0.1) +
  scale_x_continuous(
    breaks=seq(2,10,by=2))
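
Why the bin width matters can be sketched without drawing anything. The values in `tip_demo` are made up for illustration, and base R `hist()` with `plot = FALSE` stands in for `geom_histogram()` here; the two do not necessarily place bin edges identically.

```r
# Hypothetical tip amounts in dollars (made up for illustration)
tip_demo <- c(1.00, 1.50, 2.00, 2.00, 2.00, 2.50, 3.00, 3.00, 3.48, 4.00)

# The same values binned with a wide and a narrow bin width
coarse <- hist(tip_demo, breaks = seq(0, 5, by = 1),   plot = FALSE)
fine   <- hist(tip_demo, breaks = seq(0, 5, by = 0.1), plot = FALSE)

# Number of bins in each version
length(coarse$counts)
length(fine$counts)

# Number of bins that actually contain data
sum(coarse$counts > 0)
sum(fine$counts > 0)
```

Narrow bins separate values that wide bins lump together, so patterns such as a preference for round amounts only become visible at the smaller bin width.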

Tips data: explore the relationship

  • Adding layers one at a time
ggplot(tips,
       aes(totbill,tip)) +
  geom_point()

ggplot(tips,
       aes(totbill,tip)) +
  geom_point() + 
  geom_smooth(method="lm", se=FALSE)

Tips data: include more variables

ggplot(tips,
       aes(totbill,tip)) +
  geom_point() + 
  geom_smooth(method="lm", se=FALSE) +
  facet_grid(sex~smoker)

Tips data: fun in exploring

  • The exploratory study reveals more interesting features

  • The model was not giving those insights

  • How was it possible?

    • building a complex plot piece by piece
    • it did not take much to get to the next level of complexity
    • gives flexibility in whatever the next level could be
    • provides an interface for including many other sets of data layers
    • there should be a standard procedure
  • Let us learn it now or never

Grammar of graphics

  • Leland Wilkinson (1999)
    • grammar of graphics
  • Hadley Wickham (2005)
    • continued working with the grammar
    • layered grammar of graphics (2010)
    • implemented in R as the ggplot2 package:
  install.packages("ggplot2")
  library(ggplot2)
  • poetry of graphics

gg Basics

[Figure: grammar of graphics]

Note: One variable could be mapped to multiple aesthetics
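
The note can be illustrated with a small sketch. The data frame `demo` and its column names are hypothetical; the point is only that the single variable `group` drives two aesthetics, colour and shape, in the same `aes()` call.

```r
library(ggplot2)

# Hypothetical toy data (not the tips data)
demo <- data.frame(
  x     = c(1, 2, 3, 4),
  y     = c(2, 4, 3, 5),
  group = c("a", "a", "b", "b")
)

# `group` is mapped to two aesthetics at once: colour and shape
p_demo <- ggplot(demo, aes(x = x, y = y, colour = group, shape = group)) +
  geom_point(size = 3)

p_demo
```

Because both aesthetics encode the same variable, ggplot2 draws one combined legend rather than two.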

Creating a plot piece by piece

p <- ggplot(data=tips, aes(x=totbill,y=tip))
p + geom_point(size=3) + geom_smooth(method="lm", se=FALSE)

Creating a plot piece by piece

p <- ggplot(data=tips, aes(x=totbill,y=tip,color=sex))
p + geom_point(size=3) + geom_smooth(method="lm", se=FALSE)

Creating a plot piece by piece

p + geom_point(size=3) + geom_smooth(method="lm", se=FALSE) + facet_wrap(~time)
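
What "piece by piece" means for the plot object can be sketched as follows. The data frame `demo` is a made-up stand-in for the tips data, and the sketch assumes that the layers of a plot can be reached with `p$layers`, as in the ggplot2 versions this was written against.

```r
library(ggplot2)

# Hypothetical toy data standing in for the tips data
demo <- data.frame(
  totbill = c(10, 15, 20, 25, 30, 35),
  tip     = c(1.5, 2.0, 3.5, 3.0, 4.5, 5.0),
  time    = c("Day", "Night", "Day", "Night", "Day", "Night")
)

# Start from data and aesthetic mappings only: no layers yet
p0 <- ggplot(demo, aes(x = totbill, y = tip))

# Each `+` returns a new plot object with one more piece
p1 <- p0 + geom_point(size = 3)
p2 <- p1 + geom_smooth(method = "lm", se = FALSE)
p3 <- p2 + facet_wrap(~time)

# Count the layers stored in each object
n_layers <- sapply(list(p0, p1, p2, p3), function(p) length(p$layers))
print(n_layers)
```

The geoms add layers, while the facet changes how the panels are laid out without adding a layer. Every intermediate object stays available, so any step can be reused as the starting point for a different plot.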

Some useful functions

  • Geometric functions:
    geom_point() geom_line() geom_bar() geom_histogram() geom_density() geom_boxplot()

  • Statistics:
    stat_smooth() stat_quantile()

  • Faceting:
    facet_wrap() facet_grid()

  • Scaling:
    scale_x_continuous() scale_y_continuous()

  • Theme:
    theme_bw() theme_gray()

  • For more, please visit http://docs.ggplot2.org/current/index.html

Example: multilayer and theme

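# after_stat(density) and linewidth need a recent ggplot2 (3.4 or later);
# older versions use ..density.. and size instead
ggplot(tips, aes(totbill)) +
  geom_histogram(aes(y = after_stat(density))) +
  geom_density(linewidth=1.5, color="steelblue") +
  theme_bw(18)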
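
The reason for putting the histogram on the density scale can be checked numerically. The values in `bill_demo` are made up for illustration, and base R `hist()` is used here in place of `geom_histogram()`.

```r
# Hypothetical bill amounts (made up for illustration)
bill_demo <- c(8, 11, 12, 14, 15, 15, 17, 18, 21, 24, 29, 35)

h <- hist(bill_demo, breaks = seq(5, 35, by = 5), plot = FALSE)

# Raw counts add up to the number of observations
sum(h$counts)

# On the density scale, bar height times bar width adds up to 1,
# which puts the bars on the same scale as a density curve
bar_area <- sum(h$density * diff(h$breaks))
bar_area
```

With raw counts on the y axis, the density curve would sit almost flat along the bottom of the plot, because its total area is 1 while the bars have a much larger total area.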

Example: legend and labels

ggplot(tips, aes(factor(size),tip,color=factor(size))) + 
  geom_boxplot(aes(fill=factor(size)), alpha=0.3) +
  xlab("Party size") +
  theme(legend.position = "none")

Reading assignment and references