Manipulation and visualization of text data

Mahbubul Majumder, PhD
Oct 7, 2014

Manipulation of text

  • Regular expressions help to search for text

  • How can we use them conveniently?

install.packages("stringr")
library(stringr)
  • The stringr package makes string manipulation easy and convenient

    • easy to understand
    • syntax is intuitive and easy to remember
      str_locate() str_detect() str_extract()
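
Of the three functions named above, str_locate() is the only one these slides do not demonstrate. A minimal sketch of what it returns, reusing the pattern 'o?go?' from the following slides; the variable names words, loc and matched are illustrative only:

```r
library(stringr)

# Illustrative inputs: the pattern 'o?go?' matches a 'g' with an
# optional 'o' on either side.
words <- c('google', 'logo', 'dig')

# str_locate() returns a matrix with one row per string and the
# columns "start" and "end" giving the position of the first match.
loc <- str_locate(words, 'o?go?')
loc

# The positions can be passed to str_sub() to recover the matched text,
# which is what str_extract() does in a single step.
matched <- str_sub(words, loc[, "start"], loc[, "end"])
matched
```
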

Detecting matched text

pattern <- 'o?go?'
myText <- c('google','logo','dig', 'blog', 'boogie', 'God' )

library(stringr)

str_detect(myText, pattern)
[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
str_detect(myText, fixed(pattern))
[1] FALSE FALSE FALSE FALSE FALSE FALSE
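# ignore.case() was deprecated in stringr 1.0.0; regex(..., ignore_case = TRUE) replaces it
str_detect(myText, regex(pattern, ignore_case = TRUE))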
[1] TRUE TRUE TRUE TRUE TRUE TRUE
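
The all-FALSE result for fixed(pattern) follows from fixed() treating the pattern as a literal string rather than as a regular expression. A short sketch of the difference, assuming a made-up vector literalText whose first element happens to contain the characters o?go? verbatim:

```r
library(stringr)

pattern <- 'o?go?'

# Hypothetical input: the first element contains the five characters
# "o?go?" literally, the second does not.
literalText <- c('is it o?go? or not', 'google')

# As a regular expression, the pattern finds a 'g' in both strings.
asRegex <- str_detect(literalText, pattern)
asRegex

# As a fixed string, only the first element contains "o?go?" verbatim.
asFixed <- str_detect(literalText, fixed(pattern))
asFixed
```
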

Extracting matched text

pattern <- 'o?go?'
myText <- c('google','logo','dig', 'blog', 'boogie')
str_extract(myText, pattern)
[1] "go"  "ogo" "g"   "og"  "og" 
str_extract_all(myText, pattern)
[[1]]
[1] "go" "og"

[[2]]
[1] "ogo"

[[3]]
[1] "g"

[[4]]
[1] "og"

[[5]]
[1] "og"

Extracting email address

eText <- 'abc9@bb.com bla bla ab_c.2@mm.net bla bla bbc_ag72@kk'

Pt <- '[_a-z0-9-]+(\\.[_a-z0-9-]+)*\\@[_a-z0-9-]+\\.[_a-z0-9-]+'

str_extract(eText, Pt)  
[1] "abc9@bb.com"
elist <- str_extract_all(eText, Pt)
elist
[[1]]
[1] "abc9@bb.com"   "ab_c.2@mm.net"
  • The output above is a list. We want it unlisted, and we want only the unique addresses.
unique(unlist(elist))
[1] "abc9@bb.com"   "ab_c.2@mm.net"
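
The pattern Pt lists only lower-case letters, so an address written with capitals would be skipped. One possible way around this, sketched here with a made-up string eText2, is to wrap the pattern in regex() with ignore_case = TRUE:

```r
library(stringr)

Pt <- '[_a-z0-9-]+(\\.[_a-z0-9-]+)*\\@[_a-z0-9-]+\\.[_a-z0-9-]+'

# Hypothetical text: the first address contains capital letters.
eText2 <- 'Abc9@BB.com bla bla ab_c.2@mm.net'

# Case-sensitive matching finds only the all-lower-case address.
strict <- unlist(str_extract_all(eText2, Pt))
strict

# Case-insensitive matching finds both addresses.
relaxed <- unlist(str_extract_all(eText2, regex(Pt, ignore_case = TRUE)))
relaxed
```

Lower-casing the text first with tolower() would have a similar effect, at the cost of changing how the extracted addresses are spelled.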

Extracting words from text

  • Anchors in Regular Expressions
  • Start and end anchors ^ and $: the pattern must match the whole string, from the first character to the last.
myNum <- c(20, 152)
str_extract(myNum,'^\\d{2}$')
[1] "20" NA  
str_extract(myNum,'\\d{2}')
[1] "20" "15"
  • Word boundary anchor \\b: without it, the pattern simply matches the first 4 characters of any kind
X <- 'My @@** Ball $$ rolls'
str_extract(X,'.{4}')
[1] "My @"
  • With word boundary anchors added, the match must begin and end at a word boundary, so here it picks out the four-letter word
str_extract(X,'\\b.{4}\\b')
[1] "Ball"
str_extract_all(X, '(\\w+)')
[[1]]
[1] "My"    "Ball"  "rolls"

Stop words

  • Words that are not interesting enough
    • most common words
    • but could be interesting sometimes
## install.packages("tm")
library(tm)
  • Function stopwords() lists the stop words
stopW <- stopwords()
head(stopW)
[1] "i"      "me"     "my"     "myself" "we"     "our"   
myText <- 'This is a wonderful example for the class'
unlist(strsplit(myText,split = " ")) %in% stopW
[1] FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
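
In the output above, 'This' is not flagged because the stop word list is all lower case. A minimal sketch of removing the stop words after lower-casing, using the same sentence; the names words, isStop and kept are illustrative:

```r
library(tm)

myText <- 'This is a wonderful example for the class'
words <- unlist(strsplit(myText, split = " "))

# Lower-case the words before the lookup: the stop word list is all
# lower case, so 'This' is otherwise not recognized.
isStop <- tolower(words) %in% stopwords("en")

# Keep only the words that are not stop words.
kept <- words[!isStop]
kept
```
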

Displaying text data

  • Stop words are usually removed first, but with caution, since they can sometimes be interesting

  • Display frequency of the words

    • bar chart of frequent words
    • dot chart could be an alternative
    • size of the words proportional to count
    • time series plot of two or three most frequent words
  • Grouping the related text and display the cluster

    • related could be defined in many ways
    • other variables can be used
    • the English meaning or usage of words can define a relation
  • Word cloud display
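
The word frequency displays listed above can be sketched in a few lines. The text below is a toy example made up for illustration, and the names txt, freq and topFreq are not from the slides:

```r
library(stringr)

# Toy text, made up for illustration only.
txt <- 'the ball rolls and the ball stops and the game ends'

# Split into words and count how often each one occurs.
words <- unlist(str_extract_all(txt, '\\w+'))
freq <- sort(table(words), decreasing = TRUE)

# Bar chart of the three most frequent words.
topFreq <- head(freq, 3)
barplot(topFreq, ylab = 'count')
```

In practice the stop words would usually be dropped before counting; they are kept here so the toy text has repeated words.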

Classification of text documents

  • Based on the words we may be able to classify documents

    • speech
    • technical or essay
    • novel or poetry
    • style of a specific writer
  • Relation from use of words together (Topic)

    • in a neighborhood or side by side
    • may explain some group or a subject matter
    • associated word frequency in sports (hit ball) or in politics (hit back)
  • Correlation of some words together

    • correlation plot showing topics related to each other
    • topic cloud instead of word cloud
    • topic network graph
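
The idea of words used side by side can be illustrated by counting adjacent word pairs (bigrams). This is only a sketch under simple assumptions: the two toy texts are made up, stop words are taken to be already removed, and countBigrams is a hypothetical helper, not a function from any package:

```r
# Toy texts, made up for illustration only (stop words already removed).
docs <- c(sports   = 'batters hit ball pitchers throw ball batters hit ball',
          politics = 'they hit back and we hit back harder')

# Count adjacent word pairs (bigrams) in one piece of text.
countBigrams <- function(txt) {
  w <- unlist(strsplit(tolower(txt), ' +'))
  pairs <- paste(head(w, -1), tail(w, -1))
  sort(table(pairs), decreasing = TRUE)
}

bigrams <- lapply(docs, countBigrams)

# The same word 'hit' keeps different company in the two texts.
bigrams$sports['hit ball']
bigrams$politics['hit back']
```
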

Obama inaugural speech: A case study

Obama inaugural speech: word cloud

library(wordcloud)
wordcloud(cText, scale=c(8,0.5), max.words=150, 
          random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, 'Dark2'))

[Figure: word cloud of the Obama inaugural speech]

Wordle by Jonathan Feinberg

  • Wordle is a nice Java applet

    • just paste your raw text and it produces a nice plot of the words
    • settings such as color, font, or orientation can be changed
    • you can embed the applet in your web page
  • Multiple words can be kept together as a group using quotes '' ''

    • phrases such as “make up” and “fill in” can then be displayed together
  • It is hard to determine which words should be treated as stop words

  • Be cautious when using Wordle

    • color does not mean anything and may sometimes distract
    • consider alternative displays that may deliver the message better

Obama inaugural speech: A wordle plot

[Figure: Wordle plot of the Obama inaugural speech]

Romeo and Juliet: A case study

[Figure: Romeo and Juliet]

Relationship between most frequent words

  • How Shakespeare mentioned Love and Romeo together

[Figure: plot of the relationship between the most frequent words]

Reading assignment and references