Mahbubul Majumder, PhD
Oct 7, 2014
Regular expressions help us search for text
How can we use them conveniently?
install.packages("stringr")
library(stringr)
Makes string manipulation easy and convenient
str_locate() str_detect() str_extract()
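str_locate() is listed above but not demonstrated in these slides. A minimal sketch, assuming stringr is installed; the vector name words is hypothetical and chosen only for illustration:

```r
library(stringr)

# Hypothetical example vector (not from the slides)
words <- c('google', 'logo', 'God')

# str_locate() returns an integer matrix with columns "start" and "end"
# giving the position of the first match in each string (NA if no match)
loc <- str_locate(words, 'o?go?')
loc
```

Each row corresponds to one input string, so the result can be combined with str_sub() to pull out or replace the matched region.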
pattern <- 'o?go?'
myText <- c('google','logo','dig', 'blog', 'boogie', 'God' )
library(stringr)
str_detect(myText, pattern)
[1] TRUE TRUE TRUE TRUE TRUE FALSE
str_detect(myText, fixed(pattern))
[1] FALSE FALSE FALSE FALSE FALSE FALSE
str_detect(myText, regex(pattern, ignore_case = TRUE))  # ignore.case() is deprecated in stringr >= 1.0
[1] TRUE TRUE TRUE TRUE TRUE TRUE
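For comparison, the same three checks can be sketched in base R with grepl(), which needs no extra package. This is an illustrative equivalent, not part of the original slides:

```r
pattern <- 'o?go?'
myText <- c('google', 'logo', 'dig', 'blog', 'boogie', 'God')

# Regular-expression match (case sensitive, so 'God' does not match)
as_regex <- grepl(pattern, myText)

# Literal match: looks for the exact characters "o?go?"
as_fixed <- grepl(pattern, myText, fixed = TRUE)

# Case-insensitive regular-expression match
no_case <- grepl(pattern, myText, ignore.case = TRUE)
```

Note the argument order: grepl() takes the pattern first, while the stringr functions take the text first.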
pattern <- 'o?go?'
myText <- c('google','logo','dig', 'blog', 'boogie')
str_extract(myText, pattern)
[1] "go" "ogo" "g" "og" "og"
str_extract_all(myText, pattern)
[[1]]
[1] "go" "og"
[[2]]
[1] "ogo"
[[3]]
[1] "g"
[[4]]
[1] "og"
[[5]]
[1] "og"
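If only the number of matches is needed, rather than the matches themselves, stringr's str_count() may be convenient. A small sketch using the same pattern and text as above:

```r
library(stringr)

pattern <- 'o?go?'
myText <- c('google', 'logo', 'dig', 'blog', 'boogie')

# Number of non-overlapping matches per string;
# this equals the length of each element returned by str_extract_all()
nMatch <- str_count(myText, pattern)
nMatch
```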
eText <- 'abc9@bb.com bla bla ab_c.2@mm.net bla bla bbc_ag72@kk'
Pt <- '[_a-z0-9-]+(\\.[_a-z0-9-]+)*\\@[_a-z0-9-]+\\.[_a-z0-9-]+'
str_extract(eText, Pt)
[1] "abc9@bb.com"
elist <- str_extract_all(eText, Pt)
elist
[[1]]
[1] "abc9@bb.com" "ab_c.2@mm.net"
unique(unlist(elist))
[1] "abc9@bb.com" "ab_c.2@mm.net"
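Once the addresses are extracted, a natural next step is to separate the user name from the domain. The sketch below assumes stringr is installed; the names emails and parts are hypothetical:

```r
library(stringr)

eText <- 'abc9@bb.com bla bla ab_c.2@mm.net bla bla bbc_ag72@kk'
Pt <- '[_a-z0-9-]+(\\.[_a-z0-9-]+)*\\@[_a-z0-9-]+\\.[_a-z0-9-]+'

# Flatten the list returned by str_extract_all() into a character vector
emails <- unlist(str_extract_all(eText, Pt))

# Split each address at '@' into exactly two pieces (a character matrix)
parts <- str_split_fixed(emails, '@', 2)
colnames(parts) <- c('user', 'domain')
parts
```

The pattern only covers lower-case letters, digits, underscores and hyphens, so it is a teaching pattern rather than a complete email validator.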
^ and $ anchor the pattern to the start and end of the string, so the pattern must span from the first to the last character.
myNum <- c(20, 152)
str_extract(myNum,'^\\d{2}$')
[1] "20" NA
str_extract(myNum,'\\d{2}')
[1] "20" "15"
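Anchors are often used for validation, where the whole string must fit the pattern. A minimal base R sketch; the vector ids and its values are hypothetical:

```r
# Hypothetical inputs to validate
ids <- c('20', '152', '7', 'a1')

# TRUE only when the entire string consists of exactly two digits
is_two_digit <- grepl('^\\d{2}$', ids)

# Without anchors, any string containing two consecutive digits passes
has_two_digits <- grepl('\\d{2}', ids)
```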
\\b marks a word boundary; without it, the code will just look for any 4 characters.
X <- 'My @@** Ball $$ rolls'
str_extract(X,'.{4}')
[1] "My @"
str_extract(X,'\\b.{4}\\b')
[1] "Ball"
str_extract_all(X, '(\\w+)')
[[1]]
[1] "My" "Ball" "rolls"
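Combining \\b with \\w restricts the match to whole words of a given length. The sketch below uses base R with perl = TRUE; the longer example sentence is an assumption added for illustration:

```r
# Hypothetical sentence, extended from the slide example
X <- 'My @@** Ball $$ rolls over that hill'

# \\b\\w{4}\\b: exactly four word characters with a word boundary on each side
m <- gregexpr('\\b\\w{4}\\b', X, perl = TRUE)
fourLetter <- regmatches(X, m)[[1]]
fourLetter
```

Here 'My' and 'rolls' are skipped because they are not exactly four characters long.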
## install.packages("tm")
library(tm)
stopwords() lists the stop words
stopW <- stopwords()
head(stopW)
[1] "i" "me" "my" "myself" "we" "our"
myText <- 'This is a wonderful example for the class'
unlist(strsplit(myText,split = " ")) %in% stopW
[1] FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
We usually remove stop words, but we have to be cautious
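One reason for caution shows up in the example above: 'This' is not flagged because the comparison is case sensitive. A minimal sketch of removing stop words after lower-casing follows. To keep it self-contained, stopW here is a small hand-written subset standing in for the list returned by stopwords(); that subset is an assumption, not the real tm list:

```r
# Illustrative subset only; in practice use stopW <- tm::stopwords()
stopW <- c('i', 'me', 'my', 'this', 'is', 'a', 'for', 'the')

myText <- 'This is a wonderful example for the class'
words <- unlist(strsplit(myText, split = ' '))

# Compare in lower case so that 'This' is treated like 'this'
isStop <- tolower(words) %in% stopW
kept <- words[!isStop]
kept
```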
Display frequency of the words
Group related text and display the clusters
Word cloud display
Based on the words we may be able to classify documents
Relations inferred from words used together (topics)
Correlation between words that occur together
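The first idea in the list, displaying word frequencies, can be sketched in base R as follows. The documents in docs are made up for illustration, and splitting on non-word characters is a simplifying assumption:

```r
# Hypothetical toy documents
docs <- c('the cat sat on the mat', 'the dog sat')

# Lower-case, then split each document on runs of non-word characters
words <- unlist(strsplit(tolower(docs), '\\W+'))

# Count occurrences and sort from most to least frequent
freq <- sort(table(words), decreasing = TRUE)
head(freq, 3)

# A quick visual display of the counts
barplot(freq, las = 2, main = 'Word frequency')
```

The same frequency table is the kind of input a word cloud is built from, with counts mapped to font size.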
library(wordcloud)
wordcloud(cText, scale = c(8, 0.5), max.words = 150,
          random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, 'Dark2'))
Wordle is a nice Java applet
Multiple words can be kept together as a group using quotes '' ''
It is hard to determine which words should be treated as stop words
Be cautious when using Wordle
Jonathan Feinberg's website for creating Wordle plots
http://www.wordle.net/
It is interesting to learn how a Wordle is created
http://static.mrfeinberg.com/bv_ch03.pdf
Text is beautiful
http://textisbeautiful.net/