Working with Text Data

Mahbubul Majumder, PhD
Oct 2, 2014

About text data

Text data is ubiquitous
- web sites
- blogs, social media
- text messages, email, chat
Why is it important?
- most unstructured data is text
What are the challenges?
- a lot of meaningless text
- not as nice as numbers
- not easy to handle

Identifying and converting text

s <- "myString"
is.character(s)

[1] TRUE

x <- 5
is.character(x)

[1] FALSE

y <- as.character(x)
is.character(y)

[1] TRUE

y

[1] "5"

Can we convert s to numeric?

as.numeric(s)

[1] NA

We can convert a string to numeric only if it represents a number; otherwise the result is NA (with a warning)

as.numeric(y)

[1] 5

To check the type: is.character()
To convert the type: as.character()
Note: before R 4.0.0, data.frame() converts character columns to factors by default. To prevent this, set
stringsAsFactors = FALSE
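As a small illustration of the note above, the sketch below builds the same data frame both ways. The data frames `df_chr` and `df_fct` and the column `name` are made up for this example. Passing stringsAsFactors explicitly should give the same result whatever the default of the installed R version is.

```r
# Hypothetical data frames, used only to illustrate stringsAsFactors
df_chr <- data.frame(name = c("Ann", "Bob"), stringsAsFactors = FALSE)
df_fct <- data.frame(name = c("Ann", "Bob"), stringsAsFactors = TRUE)

is.character(df_chr$name)  # the column stays character
is.factor(df_fct$name)     # the column becomes a factor

# A factor can be turned back into text with as.character()
back_to_text <- as.character(df_fct$name)
back_to_text
```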

Exploring text

s <- c("myEssay","yourEssay")
nchar(s)

[1] 7 9

w <- "yourEssay"
w

[1] "yourEssay"

s %in% w

[1] FALSE  TRUE

which(s %in% w)

[1] 2
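Note that %in% compares whole strings. A minimal sketch of the contrast with a partial search is given below, assuming we want to find the fragment "Essay" inside each element. The variable names `in_result` and `grepl_result` are only for illustration.

```r
s <- c("myEssay", "yourEssay")

# %in% asks whether the whole string "Essay" is an element of s
in_result <- "Essay" %in% s

# grepl() asks whether the fragment occurs inside each element;
# fixed = TRUE treats the pattern as plain text
grepl_result <- grepl("Essay", s, fixed = TRUE)

in_result
grepl_result
```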

English letters

L <- letters
L[1:3]

[1] "a" "b" "c"

length(L)

[1] 26

L[23:26]

[1] "w" "x" "y" "z"

which(L %in% c("w","x"))

[1] 23 24

String concatenation

st1 <- "My name"
st2 <- "is Mahbub"
paste(st1,st2)

[1] "My name is Mahbub"

paste("var", 1:3)

[1] "var 1" "var 2" "var 3"

paste("var", 1:3, sep="_")

[1] "var_1" "var_2" "var_3"

paste("Today is", date())

[1] "Today is Fri Oct  3 00:46:10 2014"

a <- 25
paste("square of",a,"is",a^2)

[1] "square of 25 is 625"

A <- letters[1:3]
B <- c(4, 6, 9)
paste(A, B)

[1] "a 4" "b 6" "c 9"

C <- paste(A, B, sep="")
C

[1] "a4" "b6" "c9"
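Two other base R ways to build the same strings are sketched below. paste0() is a shorthand for paste() with sep = "", and sprintf() fills a format template. The names `C0` and `C1` are chosen only for this example.

```r
A <- letters[1:3]
B <- c(4, 6, 9)

# paste0() is shorthand for paste(..., sep = "")
C0 <- paste0(A, B)

# sprintf() builds each string from a template: %s for text, %d for an integer
C1 <- sprintf("%s%d", A, as.integer(B))

C0
C1
```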

Quiz: What are the differences?
nchar(C)=? or length(C)=?

Splitting and collapsing text

V <- paste("s",1:5, sep="")
V

[1] "s1" "s2" "s3" "s4" "s5"

nchar(V)

[1] 2 2 2 2 2

W <- paste(V, collapse="_")
W

[1] "s1_s2_s3_s4_s5"

nchar(W)

[1] 14

sp <- strsplit(W, split="_")
sp

[[1]]
[1] "s1" "s2" "s3" "s4" "s5"

unlist(sp)

[1] "s1" "s2" "s3" "s4" "s5"

unlist(strsplit(W,"s"))

[1] ""   "1_" "2_" "3_" "4_" "5"
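strsplit() returns a list because each input string may split into a different number of pieces. A small sketch with a made-up vector `x` shows this; the names `parts`, `n_pieces` and `first_piece` are assumptions for the example.

```r
# Hypothetical vector of delimited strings
x <- c("s1_s2", "s3_s4_s5")

# One list element per input string
parts <- strsplit(x, split = "_")

# Number of pieces in each string
n_pieces <- sapply(parts, length)

# First piece of each string
first_piece <- sapply(parts, `[`, 1)

n_pieces
first_piece
```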

Extracting part of a text

myText <- paste(letters, collapse="")
myText

[1] "abcdefghijklmnopqrstuvwxyz"

substr(myText, start=4, stop=16)

[1] "defghijklmnop"

Get from a specific position to the end of the text

myFacebookPost <- toupper(myText)
myFacebookPost

[1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

substr(myFacebookPost, start=16, stop=nchar(myFacebookPost))

[1] "PQRSTUVWXYZ"
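As an alternative sketch, substring() can be used when only the start position is known, since its end position defaults to the end of the string. The name `tail_part` is only for illustration.

```r
myFacebookPost <- toupper(paste(letters, collapse = ""))

# substring() needs only the start; the end defaults to the end of the text
tail_part <- substring(myFacebookPost, 16)
tail_part
```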

Extracting part of a text from a text vector

myAllPosts <- c("Facebook Post", "Blog Post", "Cell Phone chat")
postLengths <- nchar(myAllPosts)
postLengths

[1] 13  9 15

Get rid of the last 5 characters from each of the texts

substr(myAllPosts, 1, postLengths-5)

[1] "Facebook"   "Blog"       "Cell Phone"

What are those last 5 characters?

substr(myAllPosts, postLengths-4, postLengths)

[1] " Post" " Post" " chat"
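The two substr() calls above can be wrapped into small helpers. This is only a sketch: `last_chars()` and `drop_last()` are hypothetical names, not base R functions.

```r
# Hypothetical helpers built on substr() and nchar()
last_chars <- function(x, n) substr(x, nchar(x) - n + 1, nchar(x))
drop_last  <- function(x, n) substr(x, 1, nchar(x) - n)

myAllPosts <- c("Facebook Post", "Blog Post", "Cell Phone chat")

last_chars(myAllPosts, 5)
drop_last(myAllPosts, 5)
```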

Replacing or substituting strings

Search for a pattern and replace it with a given string, in every element of a string vector

sub(" Post", "", myAllPosts)

[1] "Facebook"        "Blog"            "Cell Phone chat"

sub(" chat", "", myAllPosts)

[1] "Facebook Post" "Blog Post"     "Cell Phone"

sub("o", "O", myAllPosts)

[1] "FacebOok Post"   "BlOg Post"       "Cell PhOne chat"

gsub("o", "O", myAllPosts)

[1] "FacebOOk POst"   "BlOg POst"       "Cell PhOne chat"

What is the difference between sub() and gsub()?

Searching patterns in texts

Substitution requires searching the text and then replacing the match
What can a pattern consist of?
- numbers 0-9
- letters a-z and A-Z
- non-alphanumeric characters | / , @ * . _ ~ : )
- combinations A5b2@_.N

sub('@_.', '', 'A5b2@_.N')

[1] "A5b2N"

How can we search without specifying the exact pattern?
- regular expression
- R functions can handle regular expressions

Regular expression

A regular expression is a pattern that describes a set of strings
- see ?regex
- two types of regular expressions in R: extended and Perl-like
Extended regular expressions
- can be used in R functions like
  sub(), gsub(), strsplit() and grep()
- metacharacters have special meaning
  . \ | ( ) [ { ^ $ * + ?
Perl-like regular expressions
- same syntax and semantics as Perl 5.10
- enabled with perl = TRUE
- slightly different from extended regular expressions
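One visible difference between the two flavours is sketched below. In Perl-like mode the replacement string may contain \\U to upper-case what follows, which the default mode does not support. The text `txt` is made up for this example.

```r
txt <- "hello world"

# \\b marks a word boundary, (\\w) captures the first letter of each word,
# and \\U upper-cases it in the replacement (needs perl = TRUE)
capitalized <- gsub("\\b(\\w)", "\\U\\1", txt, perl = TRUE)
capitalized
```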

Searching patterns in the beginning or end

$ Matches at the end of the string

gsub('k$', '.', c('book', 'kook'))

[1] "boo." "koo."

^ Matches at the start of the string

gsub('^k', '.', c('book', 'kook'))

[1] "book" ".ook"

. Matches any single character. Here .k matches k together with the single character to its left

gsub('.k', '_', c('book', 'kook'))

[1] "bo_" "ko_"
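When we only want to know where a pattern occurs, without replacing anything, grepl() and grep() can be used with the same patterns. A minimal sketch with the same two words:

```r
words <- c("book", "kook")

# grepl() returns TRUE/FALSE for each element
starts_with_k <- grepl("^k", words)

# grep() returns the positions of the matching elements
ends_with_k <- grep("k$", words)

# value = TRUE returns the matching elements themselves
k_words <- grep("^k", words, value = TRUE)

starts_with_k
ends_with_k
k_words
```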

Expressing repetition quantifiers

? * + {n} {n,} {n,m}
? The preceding item is optional and will be matched at most once. Here we replace the pattern ogo, first as is, then with the last o optional (ogo?), and then with both o's optional (o?go?).

gsub('ogo','.', c('google','logo','dig', 'blog', 'boogie' ))

[1] "google" "l."     "dig"    "blog"   "boogie"

gsub('ogo?','.', c('google','logo','dig', 'blog', 'boogie' ))

[1] "go.le" "l."    "dig"   "bl."   "bo.ie"

gsub('o?go?','.', c('google','logo','dig', 'blog', 'boogie' ))

[1] "..le"  "l."    "di."   "bl."   "bo.ie"

Expressing repetition quantifiers * vs +

* The preceding item will be matched zero or more times. Here we replace d together with any o's that follow it; the o's are optional

gsub('do', '.', c('doodle', 'random', 'sodooooooooooooo'))

[1] ".odle"           "ran.m"           "so.oooooooooooo"

gsub('do*', '.', c('doodle', 'random', 'sodooooooooooooo'))

[1] "..le"  "ran.m" "so."

+ The preceding item will be matched one or more times. Here we match d followed by at least one o; any further o's are optional

gsub('do+', '.', c('doodle', 'random', 'sodooooooooooooo'))

[1] ".dle"  "ran.m" "so."

Expressing repetition quantifiers {n}

{n} The preceding item is matched exactly n times.

gsub('do{10}', '.', c('doodle', 'random', 'sodooooooooooooo'))

[1] "doodle" "random" "so.ooo"

{n,} The preceding item is matched n or more times.

gsub('do{10,}', '.', c('doodle', 'random', 'sodooooooooooooo'))

[1] "doodle" "random" "so."

{n,m} The preceding item is matched at least n times, but not more than m times.

gsub('do{10,12}', '.',c('doodle', 'random', 'sodooooooooooooo'))

[1] "doodle" "random" "so.o"
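To see exactly which part of a string each quantifier consumes, the matched text can be extracted instead of replaced. The sketch below uses regexpr() with regmatches() on a made-up string `x`; the helper `matched()` is a hypothetical name for this example.

```r
x <- "dooooo"  # d followed by five o's

# Hypothetical helper: return the first piece of x matched by the pattern
matched <- function(pattern) regmatches(x, regexpr(pattern, x))

matched("do?")      # d and at most one o
matched("do*")      # d and as many o's as possible
matched("do{2}")    # d and exactly two o's
matched("do{2,4}")  # d and between two and four o's
matched("do{3,}")   # d and three or more o's
```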

Grouping the pattern

Capturing parentheses (). \\1 and \\2 refer back to the text captured by the first and second group

gsub('good|boy', '.', 'good boy good boy')

[1] ". . . ."

gsub('good boy', '.', 'good boy good boy')

[1] ". ."

gsub('(good) (boy) \\1 \\2', '.', 'good boy good boy')

[1] "."
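Captured groups can also be reused in the replacement string, not only in the pattern. A short sketch, with the names `swapped` and `wrapped` chosen only for illustration:

```r
# Swap two words: \\1 is the first captured word, \\2 the second
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "good boy")

# Keep the captured text and wrap it in brackets
wrapped <- gsub("(boy)", "<\\1>", "good boy good boy")

swapped
wrapped
```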

Regular expressions using metacharacters

[] matches one of the characters inside, e.g. [a-z0-9] or [A-Z] or [0-9]

gsub('[a-z]*[0-9]','found', c('mahbub2','unomaha4','lincoln'))

[1] "found"   "found"   "lincoln"

\\d to express a digit
() to express grouping. A or B repeated 3 times: (A|B){3}
{} to indicate quantifiers. A 3-digit number: \\d{3}
Checking for 3- or 4-digit numbers. Without anchors the pattern also matches inside longer numbers; ^ and $ restrict the match to the whole string

gsub('(\\d{3}|\\d{4})', 'found', c('23','345','14328','3456'))

[1] "23"     "found"  "found8" "found"

gsub('^\\d{3,4}$', 'found', c('23','345','14328','3456'))

[1] "23"    "found" "14328" "found"
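The effect of the anchors is easier to see with grepl(), which returns TRUE or FALSE instead of a replaced string. A minimal sketch on the same numbers, with variable names chosen for this example:

```r
nums <- c("23", "345", "14328", "3456")

# Unanchored: TRUE whenever 3 or 4 digits occur anywhere in the string
contains_3_or_4 <- grepl("\\d{3,4}", nums)

# Anchored: TRUE only when the whole string is 3 or 4 digits
is_3_or_4 <- grepl("^\\d{3,4}$", nums)

contains_3_or_4
is_3_or_4
nums[is_3_or_4]
```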

Searching patterns containing metacharacters

We want to replace the literal text do*, where * is a metacharacter. Searching for it directly does not give the desired result because * has a special meaning.

gsub('do*', '.', c('doodle', 'random', 'Rodo*mny'))

[1] "..le"    "ran.m"   "Ro.*mny"

We need to remove the special meaning of * so that it is treated as a normal character. For this we use the escape character \

gsub('do\\*', '.', c('doodle', 'random', 'Rodo*mny'))

[1] "doodle" "random" "Ro.mny"

The double backslash \\ is necessary because \ is also the escape character inside R strings. The R string '\\' holds a single backslash, which the regular expression engine then reads as its escape character.
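Besides the backslash, there are two other common ways to match a metacharacter literally, sketched below: putting it inside a character class, or switching off regular expressions altogether with fixed = TRUE. The variable names are only for illustration.

```r
x <- c("doodle", "random", "Rodo*mny")

# Escape the star with a backslash (written \\ inside an R string)
by_escape <- gsub("do\\*", ".", x)

# Inside [ ] the star is an ordinary character
by_class <- gsub("do[*]", ".", x)

# fixed = TRUE treats the whole pattern as plain text
by_fixed <- gsub("do*", ".", x, fixed = TRUE)

by_escape

# The R string "\\*" holds two characters: one backslash and one star
nchar("\\*")
```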

More examples of regular expression

gsub('a+','.','saaaaaaaata')

[1] "s.t."

gsub('\\d+','.','13acd123kk3')

[1] ".acd.kk."

gsub('\\d+?','.','13acd123kk3')

[1] "..acd...kk."

gsub('\\D+','.','13acd123kk3')

[1] "13.123.3"

gsub('.a','.','Apple rolled again and again')

[1] "Apple rolled..in.nd..in"

Word and non-word characters

gsub('\\W','','Bro@wn%_fox@')

[1] "Brown_fox"

gsub('\\w','','Bro@wn%_fox@')

[1] "@%@"

gsub('State, [A-Z]{2}', '', 'State, NYState name')

[1] "State name"
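Note that \\w covers letters, digits and the underscore, so it is not the same as "letters only". A small sketch of the difference, using a made-up string `x`:

```r
x <- "Bro@wn%_fox@2014"

# \\W removes everything except letters, digits and underscore
word_chars <- gsub("\\W", "", x)

# A negated character class keeps letters only
letters_only <- gsub("[^A-Za-z]", "", x)

word_chars
letters_only
```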

Matching valid US phone numbers or email addresses

ph <- c('402-554-2734','(515)-509-8354','56954', '34-3658',
        '532-5542985','543689 9864')

gsub('\\(*\\d{3}\\)*( |-)*\\d{3}\\.*( |-)*\\d{4}', "yes", ph)

[1] "yes"     "yes"     "56954"   "34-3658" "yes"     "yes"
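The gsub() call above replaces a phone number wherever it occurs, even inside a longer text. If the goal is to validate whole strings, one possible sketch is an anchored pattern with grepl(). The pattern `strict` is an assumed, stricter variant written for this example, not the one from the slide: it allows optional parentheses and at most one space or hyphen between the groups.

```r
ph <- c('402-554-2734', '(515)-509-8354', '56954', '34-3658',
        '532-5542985', '543689 9864')

# Assumed stricter pattern: the whole string must be a phone number
strict <- "^\\(?\\d{3}\\)?[ -]?\\d{3}[ -]?\\d{4}$"
is_phone <- grepl(strict, ph)
is_phone

# Unanchored replacement also fires inside a longer text
inside_text <- gsub("\\d{3}-\\d{3}-\\d{4}", "yes", "call 402-554-2734 now")
inside_text

# The anchored pattern rejects the longer text
grepl(strict, "call 402-554-2734 now")
```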

myEmail <- c('abc9@bb.com','ab_c.2@mm.net','bbc_ag72@kk')

gsub('[a-z0-9_.]+\\@.+\\.\\w+', "ok",myEmail)

[1] "ok"          "ok"          "bbc_ag72@kk"

How it works
- [a-z0-9_.]+ one or more lowercase letters, digits, underscores or dots
- \\@ followed by an @
- .+ followed by one or more characters of any kind
- \\. followed by a literal dot
- \\w+ followed by one or more word characters (letters, digits or underscore)
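The breakdown above can be mirrored in code by assembling the pattern from its pieces. This is only a sketch: the vector `parts` and its element names are made up for the example, and the assembled pattern is the same one used in the gsub() call.

```r
myEmail <- c('abc9@bb.com', 'ab_c.2@mm.net', 'bbc_ag72@kk')

# Hypothetical names for the five pieces of the pattern
parts <- c(
  user   = "[a-z0-9_.]+",  # letters, digits, underscore or dot
  at     = "\\@",          # the @ sign
  domain = ".+",           # any characters
  dot    = "\\.",          # a literal dot
  suffix = "\\w+"          # word characters
)
pattern <- paste(parts, collapse = "")
pattern

# grepl() reports which addresses match
looks_like_email <- grepl(pattern, myEmail)
looks_like_email
```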

Metacharacters at a glance

Meta	Description	Pattern	Matches	Does not match
^	match at the beginning of the text	^ab	able	table
$	match at the end of the text	ab$	tab	table
.	match any single character	d.g	dog, dig	god
*	match the preceding item zero or more times	ab*	ta, tab, tabbbbbb	tobbbbbb
+	match the preceding item one or more times	ab+	tab, tabbbbbb	ta
?	the preceding item is optional	ab?	a, ab	b
{}	specify the number of repetitions	go{2}	good	god
[ ]	match one of the characters inside	[ab]	gap, big	dig
( )	group a pattern	(ab){2}	abab	table
|	match the pattern on the left or right of |	(ab)|c	able, cut	at
\	remove the special meaning of a metacharacter	d\.g	d.g	dog

When we use the escape character \ in R we have to write the double backslash \\
\\w matches any word character (letter, digit or underscore), \\W matches any non-word character
\\d matches any digit 0-9, \\D matches any non-digit
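A few rows of the table can be checked directly with grepl(). The sketch below collects the results in a list called `checks`, a name chosen only for this example. In each case the first elements should match and the last should not.

```r
checks <- list(
  start  = grepl("^ab",     c("able", "table")),
  end    = grepl("ab$",     c("tab", "table")),
  any    = grepl("d.g",     c("dog", "dig", "god")),
  repeat2 = grepl("go{2}",  c("good", "god")),
  group  = grepl("(ab){2}", c("abab", "table")),
  escape = grepl("d\\.g",   c("d.g", "dog"))
)
checks
```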

Reading assignment and references

A nice explanation of special characters in regular expressions
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
A nice way of visualizing regular expressions. Notice that in R we use the double escape \\, but this applet uses a single \. Just type /regular expression/
http://www.regexplained.co.uk/