Working with Text Data

Mahbubul Majumder, PhD
Oct 2, 2014

About text data

  • Text data is ubiquitous

    • web sites
    • blogs, social media
    • text messages, email, chat
  • Why it is important

    • most unstructured data is text
  • What are the challenges

    • a lot of meaningless text
    • not as tidy as numeric data
    • not easy to handle

Identifying and converting text

s <- "myString"
is.character(s)
[1] TRUE
x <- 5
is.character(x)
[1] FALSE
y <- as.character(x)
is.character(y)
[1] TRUE
y
[1] "5"
  • Can we convert s to numeric?
as.numeric(s)
[1] NA
  • We can convert a character to numeric only if it contains numeric text; otherwise the result is NA (with a warning)
as.numeric(y)
[1] 5
  • To check the type, use is.character()
  • To convert the type, use as.character()
  • Note: before R 4.0.0, data.frame() converts character columns to factors by default. To keep them as characters, set
    stringsAsFactors = FALSE
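
The conversion rules above can be sketched on a mixed vector. The vector `mixed` and the data frame `df` are made-up names used only for illustration.

```r
# Hypothetical input: only some entries are numeric text
mixed <- c("5", "3.2", "myString")

# as.numeric() returns NA (with a warning) for non-numeric text;
# suppressWarnings() only hides the warning, the NA remains
nums <- suppressWarnings(as.numeric(mixed))
nums

# Keep the character column as character when building a data frame
df <- data.frame(id = 1:3, txt = mixed, stringsAsFactors = FALSE)
is.character(df$txt)
```

Checking is.na(nums) afterwards is a simple way to find the entries that could not be converted.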

Exploring text

s <- c("myEssay","yourEssay")
nchar(s)
[1] 7 9
w <- "yourEssay"
w
[1] "yourEssay"
s %in% w
[1] FALSE  TRUE
which(s %in% w)
[1] 2
  • English letters
L <- letters
L[1:3]
[1] "a" "b" "c"
length(L)
[1] 26
L[23:26]
[1] "w" "x" "y" "z"
which(L %in% c("w","x"))
[1] 23 24
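
Because nchar() and %in% both work on whole vectors, they combine naturally with subsetting. A small sketch, assuming a made-up vector `titles` and lookup set `wanted`:

```r
# Hypothetical vector of titles, used only for illustration
titles <- c("myEssay", "yourEssay", "ourEssay")

# Title with the most characters (nchar() is vectorized)
longest <- titles[which.max(nchar(titles))]
longest

# Titles that appear in a lookup set
wanted <- c("ourEssay", "theirEssay")
found <- titles[titles %in% wanted]
found
```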

String concatenation

st1 <- "My name"
st2 <- "is Mahbub"
paste(st1,st2)
[1] "My name is Mahbub"
paste("var", 1:3)
[1] "var 1" "var 2" "var 3"
paste("var", 1:3, sep="_")
[1] "var_1" "var_2" "var_3"
paste("Today is", date())
[1] "Today is Fri Oct  3 00:46:10 2014"
a <- 25
paste("square of",a,"is",a^2)
[1] "square of 25 is 625"
A <- letters[1:3]
B <- c(4, 6, 9)
paste(A, B)
[1] "a 4" "b 6" "c 9"
C <- paste(A, B, sep="")
C
[1] "a4" "b6" "c9"
  • Quiz: What is the difference between
    nchar(C) and length(C)?
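
A common use of concatenation is building names from a counter. The sketch below uses paste0(), which is paste() with sep="", and sprintf() for zero-padded numbers. The file names are invented for illustration.

```r
ids <- 1:3

# paste0() concatenates without a separator
files <- paste0("data_", ids, ".csv")
files

# sprintf() gives control over number formatting, here two digits with zero padding
padded <- sprintf("data_%02d.csv", ids)
padded
```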

Splitting and collapsing text

V <- paste("s",1:5, sep="")
V
[1] "s1" "s2" "s3" "s4" "s5"
nchar(V)
[1] 2 2 2 2 2
W <- paste(V, collapse="_")
W
[1] "s1_s2_s3_s4_s5"
nchar(W)
[1] 14
sp <- strsplit(W, split="_")
sp
[[1]]
[1] "s1" "s2" "s3" "s4" "s5"
unlist(sp)
[1] "s1" "s2" "s3" "s4" "s5"
unlist(strsplit(W,"s"))
[1] ""   "1_" "2_" "3_" "4_" "5" 
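
When strsplit() is given a vector, it returns a list with one element per input string. A minimal sketch, assuming a made-up vector `dates`, shows how to pick pieces out of that list and collapse them again:

```r
# Hypothetical date strings, used only for illustration
dates <- c("2014-10-02", "2014-10-03")

# One list element per input string
parts <- strsplit(dates, split = "-")

# Third piece (the day) of every element
days <- sapply(parts, function(p) p[3])
days

# Collapse each element back with a different separator
rebuilt <- sapply(parts, paste, collapse = "/")
rebuilt
```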

Extracting part of a text

myText <- paste(letters, collapse="")
myText
[1] "abcdefghijklmnopqrstuvwxyz"
substr(myText, start=4, stop=16)
[1] "defghijklmnop"
  • Get from specific position to end of the text
myFacebookPost <- toupper(myText)
myFacebookPost
[1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
substr(myFacebookPost, start=16, stop=nchar(myFacebookPost))
[1] "PQRSTUVWXYZ"

Extracting part of a text from a text vector

myAllPosts <- c("Facebook Post", "Blog Post", "Cell Phone chat")
postLengths <- nchar(myAllPosts)
postLengths
[1] 13  9 15
  • Get rid of the last 5 characters from each of the texts
substr(myAllPosts, 1, postLengths-5)
[1] "Facebook"   "Blog"       "Cell Phone"
  • What are those last 5 characters?
substr(myAllPosts, postLengths-4, postLengths)
[1] " Post" " Post" " chat"
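
The "last n characters" idea can be wrapped in a small helper. The function name `last_n` is hypothetical. It relies on substr() accepting vectors for start and stop.

```r
# Hypothetical helper: last n characters of each string in x
last_n <- function(x, n) {
  len <- nchar(x)
  # pmax() keeps the start position at 1 for strings shorter than n
  substr(x, pmax(len - n + 1, 1), len)
}

posts <- c("Facebook Post", "Blog Post", "Cell Phone chat")
tails <- last_n(posts, 4)
tails

# A string shorter than n is returned whole
last_n("ab", 5)
```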

Replacing or substituting strings

  • Search for a pattern in a string vector and replace it with another string
sub(" Post", "", myAllPosts)
[1] "Facebook"        "Blog"            "Cell Phone chat"
sub(" chat", "", myAllPosts)
[1] "Facebook Post" "Blog Post"     "Cell Phone"   
sub("o", "O", myAllPosts)
[1] "FacebOok Post"   "BlOg Post"       "Cell PhOne chat"
gsub("o", "O", myAllPosts)
[1] "FacebOOk POst"   "BlOg POst"       "Cell PhOne chat"
  • What is the difference between sub() and gsub()?

Searching patterns in texts

  • Substitution requires searching the text and then replacing

  • What do common patterns consist of?

    • digits 0-9
    • letters a-z and A-Z
    • special (non-alphanumeric) characters | / , @ * . _ ~ : )
    • combinations, e.g. A5b2@_.N
sub('@_.', '', 'A5b2@_.N')
[1] "A5b2N"
  • How can we search without specifying the exact pattern?
    • regular expressions
    • many R functions accept regular expressions
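
By default the pattern argument of sub() is treated as a regular expression, so the dot in '@_.' matches any character. The sketch below contrasts this with fixed = TRUE, which matches the pattern literally. The second test string is made up for illustration.

```r
x <- c("A5b2@_.N", "A5b2@_xN")

# Regular expression: "." matches any single character, so both strings change
as_regex <- sub("@_.", "", x)
as_regex

# Literal matching: only the string that really contains "@_." changes
as_fixed <- sub("@_.", "", x, fixed = TRUE)
as_fixed
```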

Regular expression

  • A regular expression is a pattern that describes a set of strings
    • see ?regex
    • R has two types of regular expressions: extended and Perl-like
  • Extended regular expressions
    • the default in R functions like
      sub(), gsub(), strsplit() and grep()
    • metacharacters have special meaning
      . \ | ( ) [ { ^ $ * + ?
  • Perl-like regular expressions
    • same syntax and semantics as Perl 5.10
    • slightly different from extended regular expressions; enabled with perl = TRUE
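
A short sketch of two features that are available only in the Perl-like flavor, switched on with the perl = TRUE argument: lookahead in the pattern and case conversion in the replacement. The example words are reused from later slides only for illustration.

```r
x <- c("book", "kook")

# Lookahead (?=k): replace an o only when a k follows it; the k itself is kept
la <- gsub("o(?=k)", "0", x, perl = TRUE)
la

# \\U in the replacement upper-cases the captured text (perl = TRUE only)
up <- gsub("^([a-z])", "\\U\\1", x, perl = TRUE)
up
```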

Searching patterns in the beginning or end

  • $ matches at the end of the string
gsub('k$', '.', c('book', 'kook'))
[1] "boo." "koo."
  • ^ matches at the start of the string
gsub('^k', '.', c('book', 'kook'))
[1] "book" ".ook"
  • . matches any single character; here .k matches k together with the one character to its left
gsub('.k', '_', c('book', 'kook'))
[1] "bo_" "ko_"
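
The same anchors can be used for testing instead of replacing. A minimal sketch with grepl(), which returns TRUE or FALSE per string, and grep(value = TRUE), which returns the matching strings. The word "kiosk" is added only for illustration.

```r
words <- c("book", "kook", "kiosk")

# Which words start with k?
starts_k <- grepl("^k", words)
starts_k

# Words that both start and end with k, with anything in between
both_k <- grep("^k.*k$", words, value = TRUE)
both_k
```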

Expressing repetition quantifiers

  • ? * + {n} {n,} {n,m}
  • ? The preceding item is optional and will be matched at most once. In 'ogo?' the final o is optional, so both og and ogo match; in 'o?go?' both o's are optional, so a lone g also matches.
gsub('ogo','.', c('google','logo','dig', 'blog', 'boogie' ))
[1] "google" "l."     "dig"    "blog"   "boogie"
gsub('ogo?','.', c('google','logo','dig', 'blog', 'boogie' ))
[1] "go.le" "l."    "dig"   "bl."   "bo.ie"
gsub('o?go?','.', c('google','logo','dig', 'blog', 'boogie' ))
[1] "..le"  "l."    "di."   "bl."   "bo.ie"

Expressing repetition quantifiers * vs +

  • * The preceding item will be matched zero or more times. In 'do*' the d must be present, and any o's that follow it (possibly none) are matched along with it
gsub('do', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] ".odle"           "ran.m"           "so.oooooooooooo"
gsub('do*', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] "..le"  "ran.m" "so."  
  • + The preceding item will be matched one or more times. In 'do+' the d must be followed by at least one o; any further o's are matched as well
gsub('do+', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] ".dle"  "ran.m" "so."  

Expressing repetition quantifiers {n}

  • {n} The preceding item is matched exactly n times.
gsub('do{10}', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] "doodle" "random" "so.ooo"
  • {n,} The preceding item is matched n or more times.
gsub('do{10,}', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] "doodle" "random" "so."   
  • {n,m} The preceding item is matched at least n times, but not more than m times.
gsub('do{10,12}', '.',c('doodle', 'random', 'sodooooooooooooo'))
[1] "doodle" "random" "so.o"  
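
To see exactly what a quantifier consumed, the match can be extracted rather than replaced. A sketch using regexpr() and regmatches(), assuming R 3.3.0 or later for strrep(). The test string is rebuilt programmatically as "sod" followed by 13 o's.

```r
# "sod" followed by 13 o's
x <- paste0("sod", strrep("o", 13))

# Position and text of the first match of d followed by 10 to 12 o's
m <- regexpr("do{10,12}", x)
matched <- regmatches(x, m)
nchar(matched)   # the d plus the 12 o's the quantifier is allowed to take

# Between 2 and 5 o's: the quantifier takes as many as it can
short <- regmatches(x, regexpr("o{2,5}", x))
short
```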

Grouping the pattern

  • Capturing parentheses (): \\1 and \\2 refer back to the text matched by the first and second group
gsub('good|boy', '.', 'good boy good boy')
[1] ". . . ."
gsub('good boy', '.', 'good boy good boy')
[1] ". ."
gsub('(good) (boy) \\1 \\2', '.', 'good boy good boy')
[1] "."
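
Captured groups can also be used in the replacement string, not only in the pattern. A minimal sketch, assuming made-up "Last First" names, that swaps two words and detects an immediately repeated word:

```r
# Hypothetical names in "Last First" order
full <- c("Majumder Mahbubul", "Doe Jane")

# \\1 and \\2 in the replacement insert the captured words in swapped order
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", full)
swapped

# \\1 in the pattern requires the same text to appear again
repeated <- grepl("(\\w+) \\1", c("the the cat", "the cat"))
repeated
```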

Regular expression using metacharacters

  • [] to express one of the characters inside. [a-z0-9] or [A-Z] or [0-9]
gsub('[a-z]*[0-9]','found', c('mahbub2','unomaha4','lincoln'))
[1] "found"   "found"   "lincoln"
  • \\d to express a digit
  • () to express grouping. A or B repeated 3 times (A|B){3}
  • {} to indicate quantifiers. 3 digit number \\d{3}
  • Checking for 3 or 4 digit numbers. Without anchors the pattern also matches part of a longer number; with ^ and $ only numbers of exactly 3 or 4 digits match
gsub('(\\d{3}|\\d{4})', 'found', c('23','345','14328','3456'))
[1] "23"     "found"  "found8" "found" 
gsub('^\\d{3,4}$', 'found', c('23','345','14328','3456'))
[1] "23"    "found" "14328" "found"
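
The effect of the anchors is easier to see as a TRUE/FALSE test. A small sketch with grepl() on the same four strings:

```r
x <- c("23", "345", "14328", "3456")

# Unanchored: TRUE whenever 3 or 4 digits occur anywhere in the string
anywhere <- grepl("\\d{3,4}", x)
anywhere

# Anchored: TRUE only when the whole string is 3 or 4 digits
whole <- grepl("^\\d{3,4}$", x)
whole

# Keep only the valid entries
x[whole]
```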

Searching patterns containing metacharacters

  • We want to replace do* where * is a metacharacter. If we search for it directly we don't get the desired result because * has a special meaning.
gsub('do*', '.', c('doodle', 'random', 'Rodo*mny'))
[1] "..le"    "ran.m"   "Ro.*mny"
  • We need to remove the special meaning of * and treat it as a normal character. For this we use the escape character \
gsub('do\\*', '.', c('doodle', 'random', 'Rodo*mny'))
[1] "doodle" "random" "Ro.mny"
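  • The double backslash \\ is necessary because \ is also the escape character inside R string literals. Written as \\ in the code, it becomes a single backslash in the string, which the regular expression engine then uses to escape *.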
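
A sketch of what the string actually contains, together with two other ways of matching a literal *: a character class [*] and fixed = TRUE. All three replacements should give the same result on this input.

```r
x <- c("doodle", "random", "Rodo*mny")

# The R string "do\\*" holds four characters: d, o, \ and *
nchar("do\\*")

by_escape <- gsub("do\\*", ".", x)                # escaped metacharacter
by_class  <- gsub("do[*]", ".", x)                # * inside [] is literal
by_fixed  <- gsub("do*", ".", x, fixed = TRUE)    # no regular expression at all
by_escape
```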

More examples of regular expression

gsub('a+','.','saaaaaaaata') 
[1] "s.t."
gsub('\\d+','.','13acd123kk3') 
[1] ".acd.kk."
gsub('\\d+?','.','13acd123kk3') 
[1] "..acd...kk."
gsub('\\D+','.','13acd123kk3') 
[1] "13.123.3"
gsub('.a','.','Apple rolled again and again')  
[1] "Apple rolled..in.nd..in"
  • Word characters \\w (letters, digits, underscore) and non-word characters \\W
gsub('\\W','','Bro@wn%_fox@') 
[1] "Brown_fox"
gsub('\\w','','Bro@wn%_fox@')  
[1] "@%@"
gsub('State, [A-Z]{2}', '', 'State, NYState name')
[1] "State name"
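
Instead of replacing matches, they can be pulled out of the text. A minimal sketch using gregexpr() to find all matches and regmatches() to extract them, on one of the strings used above:

```r
txt <- "13acd123kk3"

# All runs of digits; gregexpr() returns a list, [[1]] picks the result for txt
digits <- regmatches(txt, gregexpr("\\d+", txt))[[1]]
digits

# All runs of non-digits
non_digits <- regmatches(txt, gregexpr("\\D+", txt))[[1]]
non_digits

# The extracted digit runs are still text; convert before doing arithmetic
total <- sum(as.numeric(digits))
total
```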

Matching a valid USA phone number or email address

ph <- c('402-554-2734','(515)-509-8354','56954', '34-3658',
        '532-5542985','543689 9864')

gsub('\\(*\\d{3}\\)*( |-)*\\d{3}\\.*( |-)*\\d{4}', "yes", ph)
[1] "yes"     "yes"     "56954"   "34-3658" "yes"     "yes"    
myEmail <- c('abc9@bb.com','ab_c.2@mm.net','bbc_ag72@kk')

gsub('[a-z0-9_.]+\\@.+\\.\\w+', "ok",myEmail)
[1] "ok"          "ok"          "bbc_ag72@kk"
  • How it works
    • [a-z0-9_.]+ means one or more lowercase letters, digits, underscores or dots
    • \\@ after that there should be an @
    • .+ after that one or more characters of any kind
    • \\. after that there should be a dot
    • \\w+ after that one or more word characters (letters, digits or underscore)
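
For validation it is often more convenient to get TRUE or FALSE per entry than a replaced string. The sketch below uses grepl() with anchored patterns. Both patterns, `ph_pattern` and `email_pattern`, are hypothetical simplifications written for this illustration, not the ones used above, and they are far from a complete phone or email validator.

```r
ph <- c('402-554-2734', '(515)-509-8354', '56954', '34-3658',
        '532-5542985', '543689 9864')
myEmail <- c('abc9@bb.com', 'ab_c.2@mm.net', 'bbc_ag72@kk')

# Hypothetical anchored patterns: the whole string has to fit
ph_pattern    <- '^\\(?\\d{3}\\)?[ -]?\\d{3}[ -]?\\d{4}$'
email_pattern <- '^[a-z0-9_.]+@[a-z0-9.]+\\.[a-z]+$'

valid_ph <- grepl(ph_pattern, ph)
valid_ph

valid_email <- grepl(email_pattern, myEmail)
valid_email

# With anchors, a number with an extra trailing digit is rejected
extra_digit <- grepl(ph_pattern, '402-554-27345')
extra_digit
```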

Metacharacters at a glance

Meta   Description                            Pattern   Match               No match
^      match at the beginning of the text     ^ab       able                table
$      match at the end of the text           ab$       tab                 table
.      match any single character             d.g       dog, dig            god
*      match zero or more times               ab*       ta, tab, tabbbbbb   tobbbbbb
+      match one or more times                ab+       tab, tabbbbbb       ta
?      pattern on the left is optional        ab?       a, ab               b
{}     indicate the number of repetitions     go{2}     good                god
[ ]    match one of the characters inside     [ab]      gap, big            dig
( )    group a pattern                        (ab){2}   abab                table
|      match pattern on left or right of |    (ab)|c    able, cut           at
\      escape the power of a metacharacter    d\.g      d.g                 dog
  • When we use the escape character \ in an R string we have to double it: \\
  • \\w matches any word character (letter, digit or underscore), \\W matches any non-word character
  • \\d matches any digit 0-9, \\D matches any non-digit
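
A few rows of the table can be checked mechanically with grepl(). The data frame `checks` is a made-up structure for this sketch. It holds one pattern, one string expected to match and one string expected not to match per row.

```r
# Hypothetical check table: selected rows from the metacharacter summary
checks <- data.frame(
  pattern = c("^ab",   "ab$",   "d.g", "go{2}", "d\\.g"),
  match   = c("able",  "tab",   "dog", "good",  "d.g"),
  nomatch = c("table", "table", "god", "god",   "dog"),
  stringsAsFactors = FALSE
)

# Apply grepl() row by row: pattern against the matching string ...
checks$hit <- mapply(grepl, checks$pattern, checks$match, USE.NAMES = FALSE)

# ... and against the string that should not match
checks$miss <- !mapply(grepl, checks$pattern, checks$nomatch, USE.NAMES = FALSE)

checks
```

If every entry of hit and miss is TRUE, the selected rows behave as the table states.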

Reading assignment and references