Mahbubul Majumder, PhD
Oct 2, 2014
Text data is ubiquitous
Why is it important?
What are the challenges?
s <- "myString"
is.character(s)
[1] TRUE
x <- 5
is.character(x)
[1] FALSE
y <- as.character(x)
is.character(y)
[1] TRUE
y
[1] "5"
Can we convert s to numeric?
as.numeric(s)
[1] NA
as.numeric(y)
[1] 5
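A minimal sketch of what happens when a character vector mixes numeric and non-numeric strings. as.numeric() returns NA for entries it cannot parse and raises a warning; suppressWarnings() is used here only to keep the output quiet, and the vector name mixed is made up for illustration.

```r
# Hypothetical vector mixing numeric strings with plain text
mixed <- c("5", "myString", "3.14")

# Entries that cannot be parsed become NA (R also issues a coercion warning)
converted <- suppressWarnings(as.numeric(mixed))
converted

# Flag and inspect the entries that could not be converted
failed <- is.na(converted)
mixed[failed]
```

Checking is.na() after the conversion is a simple way to find the text entries that need cleaning first.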
Things to remember: is.character(), as.character(), stringsAsFactors = FALSE
s <- c("myEssay","yourEssay")
nchar(s)
[1] 7 9
w <- "yourEssay"
w
[1] "yourEssay"
s %in% w
[1] FALSE  TRUE
which(s %in% w)
[1] 2
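The stringsAsFactors = FALSE option matters when text is stored in a data frame. A small sketch with made-up object and column names; note that since R 4.0.0 the default is already FALSE, while older versions converted strings to factors unless told otherwise.

```r
essays <- c("myEssay", "yourEssay")

# Keep the text as character
df_chr <- data.frame(title = essays, stringsAsFactors = FALSE)
# Convert the text to a factor (the default before R 4.0.0)
df_fct <- data.frame(title = essays, stringsAsFactors = TRUE)

is.character(df_chr$title)
is.factor(df_fct$title)

# nchar() works directly on the character column; the factor column
# is converted back with as.character() first
nchar(df_chr$title)
nchar(as.character(df_fct$title))
```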
L <- letters
L[1:3]
[1] "a" "b" "c"
length(L)
[1] 26
L[23:26]
[1] "w" "x" "y" "z"
which(L %in% c("w","x"))
[1] 23 24
st1 <- "My name"
st2 <- "is Mahbub"
paste(st1,st2)
[1] "My name is Mahbub"
paste("var", 1:3)
[1] "var 1" "var 2" "var 3"
paste("var", 1:3, sep="_")
[1] "var_1" "var_2" "var_3"
paste("Today is", date())
[1] "Today is Fri Oct  3 00:46:10 2014"
a <- 25
paste("square of",a,"is",a^2)
[1] "square of 25 is 625"
A <- letters[1:3]
B <- c(4, 6, 9)
paste(A, B)
[1] "a 4" "b 6" "c 9"
C <- paste(A, B, sep="")
C
[1] "a4" "b6" "c9"
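Pasting with sep="" is common enough that base R provides paste0() as a shorthand. A brief sketch; the names C1 and C2 are only for illustration.

```r
A <- letters[1:3]
B <- c(4, 6, 9)

# paste0() behaves like paste(..., sep = "")
C1 <- paste(A, B, sep = "")
C2 <- paste0(A, B)

C2
identical(C1, C2)
```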
nchar(C) = ? or length(C) = ?
V <- paste("s",1:5, sep="")
V
[1] "s1" "s2" "s3" "s4" "s5"
nchar(V)
[1] 2 2 2 2 2
W <- paste(V, collapse="_")
W
[1] "s1_s2_s3_s4_s5"
nchar(W)
[1] 14
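The difference between nchar() and length() can be seen by running both on a vector and on its collapsed version. A short sketch using the same V and W:

```r
V <- paste("s", 1:5, sep = "")
W <- paste(V, collapse = "_")

length(V)  # number of elements in the vector
nchar(V)   # number of characters in each element
length(W)  # collapse produced a single string
nchar(W)   # number of characters in that single string
```

length() counts elements, while nchar() counts characters within each element.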
sp <- strsplit(W, split="_")
sp
[[1]]
[1] "s1" "s2" "s3" "s4" "s5"
unlist(sp)
[1] "s1" "s2" "s3" "s4" "s5"
unlist(strsplit(W,"s"))
[1] ""   "1_" "2_" "3_" "4_" "5" 
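strsplit() returns a list because each input string can split into a different number of pieces. A sketch with a made-up vector ids, showing one way to work with that list:

```r
# Hypothetical input: two strings with different numbers of pieces
ids <- c("s1_s2", "s3_s4_s5")

pieces <- strsplit(ids, split = "_")

# Number of pieces in each string
n_pieces <- sapply(pieces, length)
n_pieces

# First piece of each string
first_piece <- sapply(pieces, `[`, 1)
first_piece
```

unlist() is convenient for a single string; for several strings, keeping the list preserves which piece came from which input.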
myText <- paste(letters, collapse="")
myText
[1] "abcdefghijklmnopqrstuvwxyz"
substr(myText, start=4, stop=16)
[1] "defghijklmnop"
myFacebookPost <- toupper(myText)
myFacebookPost
[1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
substr(myFacebookPost, start=16, stop=nchar(myFacebookPost))
[1] "PQRSTUVWXYZ"
myAllPosts <- c("Facebook Post", "Blog Post", "Cell Phone chat")
postLengths <- nchar(myAllPosts)
postLengths
[1] 13  9 15
substr(myAllPosts, 1, postLengths-5)
[1] "Facebook"   "Blog"       "Cell Phone"
substr(myAllPosts, postLengths-4, postLengths)
[1] " Post" " Post" " chat"
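Because substr() is vectorized over start and stop, taking the last few characters of every string can be wrapped in a small helper. The function last_n below is hypothetical, written only for this illustration.

```r
# Hypothetical helper: the last n characters of each string in x
last_n <- function(x, n) {
  len <- nchar(x)
  substr(x, len - n + 1, len)
}

myAllPosts <- c("Facebook Post", "Blog Post", "Cell Phone chat")
endings <- last_n(myAllPosts, 4)
endings
```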
Search for a pattern and replace it by myString in a string vector
sub(" Post", "", myAllPosts)
[1] "Facebook"        "Blog"            "Cell Phone chat"
sub(" chat", "", myAllPosts)
[1] "Facebook Post" "Blog Post"     "Cell Phone"   
sub("o", "O", myAllPosts)
[1] "FacebOok Post"   "BlOg Post"       "Cell PhOne chat"
gsub("o", "O", myAllPosts)
[1] "FacebOOk POst"   "BlOg POst"       "Cell PhOne chat"
What is the difference between sub() and gsub()?
Substitution requires searching the text and then replacing
What does the common pattern consist of?
Digits: 0-9
Letters: a-z, A-Z
Special characters: | / , @ * . _ ~ : )
Example text: A5b2@_.N
sub('@_.', '', 'A5b2@_.N')
[1] "A5b2N"
R functions can handle regular expressions (regex)
Two types of regex in R: extended (the default) and Perl-like
R functions like sub(), gsub(), strsplit() and grep() accept them
Metacharacters: . \ | ( ) [ { ^ $ * + ?
$ Search the end of string
gsub('k$', '.', c('book', 'kook'))
[1] "boo." "koo."
^ Search the start of string
gsub('^k', '.', c('book', 'kook'))
[1] "book" ".ook"
. Search k and any single character on the left of k (. means any single character)
gsub('.k', '_', c('book', 'kook'))
[1] "bo_" "ko_"
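The same anchors work for searching without replacing. A minimal sketch using grepl() and grep(); the vector words is made up for illustration.

```r
words <- c('book', 'kook', 'koala')

# Logical vector: which words start with k?
starts_with_k <- grepl('^k', words)
starts_with_k

# grep() returns the positions of the words ending with k ...
ends_with_k <- grep('k$', words)
ends_with_k

# ... or the matching words themselves with value = TRUE
grep('k$', words, value = TRUE)
```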
Quantifiers: ? * + {n} {n,} {n,m}
? The preceding item is optional and will be matched at most once. Here we first replace the pattern ogo, then make an o optional so that it does not have to be matched.
gsub('ogo','.', c('google','logo','dig', 'blog', 'boogie' ))
[1] "google" "l."     "dig"    "blog"   "boogie"
gsub('ogo?','.', c('google','logo','dig', 'blog', 'boogie' ))
[1] "go.le" "l."    "dig"   "bl."   "bo.ie"
gsub('o?go?','.', c('google','logo','dig', 'blog', 'boogie' ))
[1] "..le"  "l."    "di."   "bl."   "bo.ie"
* The preceding item will be matched zero or more times. Here we replace d together with any o's that follow it, so every o is optional.
gsub('do', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] ".odle"           "ran.m"           "so.oooooooooooo"
gsub('do*', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] "..le"  "ran.m" "so."  
+ The preceding item will be matched one or more times. Here we match do and leave any further o's optional.
gsub('do+', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] ".dle"  "ran.m" "so."  
{n} The preceding item is matched exactly n times.
gsub('do{10}', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] "doodle" "random" "so.ooo"
{n,} The preceding item is matched n or more times.
gsub('do{10,}', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] "doodle" "random" "so."   
{n,m} The preceding item is matched at least n times, but not more than m times.
gsub('do{10,12}', '.',c('doodle', 'random', 'sodooooooooooooo'))
[1] "doodle" "random" "so.o"  
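To see exactly which text a quantifier captured, the match can be extracted instead of replaced. A sketch using regexpr() and regmatches() from base R; the variable names are illustrative.

```r
x <- c('doodle', 'random')

# regexpr() locates the first match in each string;
# regmatches() pulls out the matched text
m_star <- regmatches(x, regexpr('do*', x))
m_plus <- regmatches(x, regexpr('do+', x))

# Strings without a match are dropped from the result:
# only 'doodle' contains d followed by exactly two o's
m_two <- regmatches(x, regexpr('do{2}', x))

m_star
m_plus
m_two
```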
() to express grouping, | to match the pattern on its left or right
gsub('good|boy', '.', 'good boy good boy')
[1] ". . . ."
gsub('good boy', '.', 'good boy good boy')
[1] ". ."
gsub('(good) (boy) \\1 \\2', '.', 'good boy good boy')
[1] "."
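Groups can also be reused in the replacement string, where \\1 and \\2 stand for the text captured by the first and second group. A small sketch; the inputs are made up.

```r
# Swap the two captured words in every match
swapped <- gsub('(good) (boy)', '\\2 \\1', 'good boy good boy')
swapped

# A back-reference inside the pattern finds a repeated word,
# and the replacement keeps a single copy of it
deduped <- gsub('(good) \\1', '\\1', 'good good boy')
deduped
```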
Use [] to express one of the characters inside, e.g. [a-z0-9] or [A-Z] or [0-9]
gsub('[a-z]*[0-9]','found', c('mahbub2','unomaha4','lincoln'))
[1] "found"   "found"   "lincoln"
Use \\d to express a digit
Use () to express grouping. A or B replicated 3 times: (A|B){3}
Use {} to indicate quantifiers. A 3-digit number: \\d{3}
gsub('(\\d{3}|\\d{4})', 'found', c('23','345','14328','3456'))
[1] "23"     "found"  "found8" "found" 
gsub('^\\d{3,4}$', 'found', c('23','345','14328','3456'))
[1] "23"    "found" "14328" "found"
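The effect of the anchors is easy to see when the pattern is used for validation rather than replacement. A sketch with grepl(), reusing the same four strings under the made-up name codes:

```r
codes <- c('23', '345', '14328', '3456')

# Without anchors, any string containing 3 digits in a row passes
contains_3 <- grepl('\\d{3,4}', codes)
contains_3

# With ^ and $, the whole string must consist of 3 or 4 digits
is_valid <- grepl('^\\d{3,4}$', codes)
is_valid

codes[is_valid]
```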
Suppose we want to search for the literal text do*, where * is a metacharacter. If we search for it directly we don't get the desired result, because * has a special meaning.
gsub('do*', '.', c('doodle', 'random', 'Rodo*mny'))
[1] "..le"    "ran.m"   "Ro.*mny"
We need to de-specialize * and make it behave like a normal character. For this we use the escape character \
gsub('do\\*', '.', c('doodle', 'random', 'Rodo*mny'))
[1] "doodle" "random" "Ro.mny"
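Besides the \\ escape, base R offers two other ways to treat * literally: placing it inside a character class, or switching regular expressions off with fixed = TRUE. A sketch comparing the three; the variable names are illustrative.

```r
x <- c('doodle', 'random', 'Rodo*mny')

escaped  <- gsub('do\\*', '.', x)               # escape with \\
in_class <- gsub('do[*]', '.', x)               # * is literal inside []
literal  <- gsub('do*', '.', x, fixed = TRUE)   # pattern taken as plain text

escaped
identical(escaped, in_class)
identical(escaped, literal)
```

fixed = TRUE is the simplest choice when the whole pattern is plain text; escaping is needed when only part of the pattern should be literal.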
The double \\ is necessary since we need to de-specialize the special character \ too. Together \\ works as one escape character.
gsub('a+','.','saaaaaaaata') 
[1] "s.t."
gsub('\\d+','.','13acd123kk3') 
[1] ".acd.kk."
gsub('\\d+?','.','13acd123kk3') 
[1] "..acd...kk."
gsub('\\D+','.','13acd123kk3') 
[1] "13.123.3"
gsub('.a','.','Apple rolled again and again')  
[1] "Apple rolled..in.nd..in"
gsub('\\W','','Bro@wn%_fox@') 
[1] "Brown_fox"
gsub('\\w','','Bro@wn%_fox@')  
[1] "@%@"
gsub('State, [A-Z]{2}', '', 'State, NYState name')
[1] "State name"
ph <- c('402-554-2734','(515)-509-8354','56954', '34-3658',
        '532-5542985','543689 9864')
gsub('\\(*\\d{3}\\)*( |-)*\\d{3}\\.*( |-)*\\d{4}', "yes", ph)
[1] "yes"     "yes"     "56954"   "34-3658" "yes"     "yes"    
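An alternative sketch for the same task, not the approach used above: strip every non-digit with \\D and then check that exactly 10 digits remain. This assumes that any entry with 10 digits counts as a phone number, which is a looser rule than the pattern above.

```r
ph <- c('402-554-2734', '(515)-509-8354', '56954', '34-3658',
        '532-5542985', '543689 9864')

# Remove everything that is not a digit
digits <- gsub('\\D', '', ph)
digits

# Assumed rule: a phone number has exactly 10 digits
is_phone <- nchar(digits) == 10
is_phone
```

For these six strings the result agrees with the entries replaced by "yes" above.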
myEmail <- c('abc9@bb.com','ab_c.2@mm.net','bbc_ag72@kk')
gsub('[a-z0-9_.]+\\@.+\\.\\w+', "ok",myEmail)
[1] "ok"          "ok"          "bbc_ag72@kk"
[a-z0-9_.]+ means lowercase letters, digits, underscores or dots, one or more times
\\@ after that there should be an @
.+ after that any text, one or more characters long
\\. after that there should be a dot
\\w+ after that one or more word characters (letters, digits or underscore)

| Meta | Description | Pattern | Match | notMatch |
|---|---|---|---|---|
| ^ | match from the beginning of the text | ^ab | able | table |
| $ | match from the end of the text | ab$ | tab | table |
| . | match any single character | d.g | dog, dig | god |
| * | match zero or more times | ab* | ta, tab, tabbbbbb | tobbbbbb |
| + | match one or more times | ab+ | tab, tabbbbbb | ta |
| ? | pattern on the left is optional | ab? | a, ab | b |
| {} | used to indicate repetition | go{2} | good | god |
| [ ] | match one of the characters inside | [ab] | gap, big | dig |
| ( ) | group a pattern | (ab){2} | abab | table |
| \| | match the pattern on its left or right | (ab)\|c | able, cut | at |
| \ | escape the power of a metacharacter | d\.g | d.g | dog |
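Several rows of the table can be checked directly with grepl(). A sketch in which the data frame checks and its column names are made up for illustration:

```r
# One row per pattern, with an example that should match and one that should not
checks <- data.frame(
  pattern  = c("^ab", "ab$", "d.g", "ab+", "go{2}", "(ab){2}", "d\\.g"),
  match    = c("able", "tab", "dog", "tab", "good", "abab", "d.g"),
  no_match = c("table", "table", "god", "ta", "god", "table", "dog"),
  stringsAsFactors = FALSE
)

# Apply each pattern to its own pair of examples
checks$match_ok    <- mapply(grepl, checks$pattern, checks$match)
checks$no_match_ok <- !mapply(grepl, checks$pattern, checks$no_match)

checks
```

Both logical columns should be TRUE in every row if the patterns behave as the table describes.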
To use the \ character in R we have to use the double escape \\
\\w matches any word character (letter, digit or underscore), \\W matches any non-word character
\\d matches any digit 0-9, \\D matches any non-digit
A nice explanation of special characters in regular expressions:
 https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
A nice way of visualizing a regular expression. Note that for R we use the double escape \\, but for this applet use a single \. Just type /regular expression/ 
 http://www.regexplained.co.uk/