Mahbubul Majumder, PhD
Oct 2, 2014
Text data is ubiquitous
Why important
What are the challenges
s <- "myString"
is.character(s)
[1] TRUE
x <- 5
is.character(x)
[1] FALSE
y <- as.character(x)
is.character(y)
[1] TRUE
y
[1] "5"
Can we convert s to numeric?
as.numeric(s)
[1] NA
as.numeric(y)
[1] 5
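A minimal sketch of checking which elements of a character vector convert cleanly, since as.numeric() returns NA (with a warning) for text that is not a number. The vector v is a hypothetical example, not taken from the slides.

```r
# Hypothetical input: a character vector mixing numeric-looking and other text
v <- c("5", "3.14", "myString")

# as.numeric() yields NA (with a warning) where the text is not a number;
# suppressWarnings() only silences that warning for this check
num <- suppressWarnings(as.numeric(v))
num

# Flag the elements that could not be converted
failed <- is.na(num)
v[failed]
```

Checking is.na() on the result is one simple way to spot bad entries before doing arithmetic on converted text.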
is.character()
as.character()
stringsAsFactors = FALSE
s <- c("myEssay","yourEssay")
nchar(s)
[1] 7 9
w <- "yourEssay"
w
[1] "yourEssay"
s %in% w
[1] FALSE TRUE
which(s %in% w)
[1] 2
L <- letters
L[1:3]
[1] "a" "b" "c"
length(L)
[1] 26
L[23:26]
[1] "w" "x" "y" "z"
which(L %in% c("w","x"))
[1] 23 24
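The which(... %in% ...) idiom above can also be written with match(), a base R function not used in the slides. This is a small sketch for comparison; the variable names pos and idx are only illustrative.

```r
s <- c("myEssay", "yourEssay")
w <- "yourEssay"

# %in% returns one logical value per element of the left-hand vector
s %in% w

# match() returns the position of each element of w inside s (NA if absent)
pos <- match(w, s)
pos

# Positions of several letters at once
idx <- match(c("w", "x"), letters)
idx
```

match() reports the first position of each searched value, while which(x %in% y) reports every position in x that holds any of the values in y.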
st1 <- "My name"
st2 <- "is Mahbub"
paste(st1,st2)
[1] "My name is Mahbub"
paste("var", 1:3)
[1] "var 1" "var 2" "var 3"
paste("var", 1:3, sep="_")
[1] "var_1" "var_2" "var_3"
paste("Today is", date())
[1] "Today is Fri Oct 3 00:46:10 2014"
a <- 25
paste("square of",a,"is",a^2)
[1] "square of 25 is 625"
A <- letters[1:3]
B <- c(4, 6, 9)
paste(A, B)
[1] "a 4" "b 6" "c 9"
C <- paste(A, B, sep="")
C
[1] "a4" "b6" "c9"
What are nchar(C) and length(C)?
V <- paste("s",1:5, sep="")
V
[1] "s1" "s2" "s3" "s4" "s5"
nchar(V)
[1] 2 2 2 2 2
W <- paste(V, collapse="_")
W
[1] "s1_s2_s3_s4_s5"
nchar(W)
[1] 14
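A short sketch contrasting the sep and collapse arguments of paste(), reusing V and W from above. paste0() is a base R shorthand that does not appear in the slides.

```r
V <- paste("s", 1:5, sep = "")

# sep joins the arguments element by element: still a vector of 5 strings
length(V)

# collapse joins the elements of the result into one single string
W <- paste(V, collapse = "_")
length(W)
nchar(W)

# paste0() is shorthand for paste(..., sep = "")
same <- identical(paste0("s", 1:5), V)
same
```

length() counts the elements of the vector, while nchar() counts the characters inside each element.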
sp <- strsplit(W, split="_")
sp
[[1]]
[1] "s1" "s2" "s3" "s4" "s5"
unlist(sp)
[1] "s1" "s2" "s3" "s4" "s5"
unlist(strsplit(W,"s"))
[1] "" "1_" "2_" "3_" "4_" "5"
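strsplit() returns a list because it is vectorized over its input. The sketch below shows a round trip through paste() and what happens with more than one input string. The second input vector is a made-up example.

```r
W <- "s1_s2_s3_s4_s5"

# strsplit() returns a list with one element per input string
parts <- strsplit(W, split = "_")[[1]]

# Pasting the pieces back with the same separator recovers the original
back <- paste(parts, collapse = "_")
identical(back, W)

# With several input strings, each one gets its own list element
sp2 <- strsplit(c("a_b", "c_d_e"), split = "_")
nPieces <- sapply(sp2, length)
nPieces
```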
myText <- paste(letters, collapse="")
myText
[1] "abcdefghijklmnopqrstuvwxyz"
substr(myText, start=4, stop=16)
[1] "defghijklmnop"
myFacebookPost <- toupper(myText)
myFacebookPost
[1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
substr(myFacebookPost, start=16, stop=nchar(myFacebookPost))
[1] "PQRSTUVWXYZ"
myAllPosts <- c("Facebook Post", "Blog Post", "Cell Phone chat")
postLengths <- nchar(myAllPosts)
postLengths
[1] 13 9 15
substr(myAllPosts, 1, postLengths-5)
[1] "Facebook" "Blog" "Cell Phone"
substr(myAllPosts, postLengths-4, postLengths)
[1] " Post" " Post" " chat"
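Because substr() and nchar() are both vectorized, the pattern above can be wrapped in a small helper. lastChars is a hypothetical function name introduced only for illustration. The sketch also shows the replacement form of substr().

```r
myAllPosts <- c("Facebook Post", "Blog Post", "Cell Phone chat")

# Hypothetical helper (not from the slides): last n characters of each string
lastChars <- function(x, n) substr(x, nchar(x) - n + 1, nchar(x))
last4 <- lastChars(myAllPosts, 4)
last4

# substr() can also appear on the left of an assignment to overwrite
# part of a string in place
x <- "abcdef"
substr(x, 1, 3) <- "XYZ"
x
```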
Search for pattern and replace it by myString in a string vector
sub(" Post", "", myAllPosts)
[1] "Facebook" "Blog" "Cell Phone chat"
sub(" chat", "", myAllPosts)
[1] "Facebook Post" "Blog Post" "Cell Phone"
sub("o", "O", myAllPosts)
[1] "FacebOok Post" "BlOg Post" "Cell PhOne chat"
gsub("o", "O", myAllPosts)
[1] "FacebOOk POst" "BlOg POst" "Cell PhOne chat"
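What is the difference between sub() and gsub()? sub() replaces only the first match in each string, gsub() replaces every match.
Substitution requires searching the text and then replacing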
What does the common pattern consist of?
digits 0-9
letters a-Z
special characters | / , @ * . _ ~ : )
For example, A5b2@_.N
sub('@_.', '', 'A5b2@_.N')
[1] "A5b2N"
Regular expressions (regex)
R supports two types of regular expressions: extended and Perl-like
R functions like sub(), gsub(), strsplit() and grep() can handle regular expressions
Metacharacters: . \ | ( ) [ { ^ $ * + ?
$ Search at the end of the string
gsub('k$', '.', c('book', 'kook'))
[1] "boo." "koo."
^ Search at the start of the string
gsub('^k', '.', c('book', 'kook'))
[1] "book" ".ook"
. Search for k together with any single character to its left (. means any single character)
gsub('.k', '_', c('book', 'kook'))
[1] "bo_" "ko_"
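To test whether a pattern matches without replacing anything, base R offers grepl(), a relative of grep() that returns TRUE or FALSE for each element. A small sketch with a made-up word list:

```r
# Hypothetical word list for illustration
words <- c('book', 'kook', 'kiosk')

# Does the string start with k?
startsK <- grepl('^k', words)
startsK

# Does the string end with k?
endsK <- grepl('k$', words)
endsK

# Starts and ends with k, with any characters in between
bothK <- grepl('^k.*k$', words)
bothK
```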
Quantifiers: ? * + {n} {n,} {n,m}
? The preceding item is optional and will be matched at most once. Here we first replace the pattern ogo, then make the last o optional so that it does not have to be matched.
gsub('ogo','.', c('google','logo','dig', 'blog', 'boogie' ))
[1] "google" "l." "dig" "blog" "boogie"
gsub('ogo?','.', c('google','logo','dig', 'blog', 'boogie' ))
[1] "go.le" "l." "dig" "bl." "bo.ie"
gsub('o?go?','.', c('google','logo','dig', 'blog', 'boogie' ))
[1] "..le" "l." "di." "bl." "bo.ie"
* The preceding item will be matched zero or more times. Here we replace d and make all the following o's optional
gsub('do', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] ".odle" "ran.m" "so.oooooooooooo"
gsub('do*', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] "..le" "ran.m" "so."
+ The preceding item will be matched one or more times. Here we want to match do and leave the other o's optional
gsub('do+', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] ".dle" "ran.m" "so."
{n} The preceding item is matched exactly n times.
gsub('do{10}', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] "doodle" "random" "so.ooo"
{n,} The preceding item is matched n or more times.
gsub('do{10,}', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] "doodle" "random" "so."
{n,m} The preceding item is matched at least n times, but not more than m times.
gsub('do{10,12}', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] "doodle" "random" "so.o"
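The quantifiers can be compared side by side on one small vector. This sketch uses grepl() with ^ and $ so that the whole string has to fit the pattern. The vector x is a made-up example.

```r
# Hypothetical test strings with 0, 1, 2 and 3 o's between g and d
x <- c('gd', 'god', 'good', 'goood')

optional   <- grepl('^go?d$', x)      # zero or one o
zeroOrMore <- grepl('^go*d$', x)      # zero or more o's
oneOrMore  <- grepl('^go+d$', x)      # one or more o's
exactlyTwo <- grepl('^go{2}d$', x)    # exactly two o's
twoOrMore  <- grepl('^go{2,}d$', x)   # two or more o's
oneToTwo   <- grepl('^go{1,2}d$', x)  # between one and two o's

# One row per string, one column per quantifier
data.frame(x, optional, zeroOrMore, oneOrMore, exactlyTwo, twoOrMore, oneToTwo)
```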
Grouping with ()
gsub('good|boy', '.', 'good boy good boy')
[1] ". . . ."
gsub('good boy', '.', 'good boy good boy')
[1] ". ."
gsub('(good) (boy) \\1 \\2', '.', 'good boy good boy')
[1] "."
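Back-references such as \\1 and \\2 can be used in the replacement string as well as in the pattern. A minimal sketch; the second input vector is a made-up example.

```r
# In the replacement, \\1 and \\2 stand for the text captured by
# the first and second group, so this swaps the two words
swapped <- gsub('(good) (boy)', '\\2 \\1', 'good boy good boy')
swapped

# Inside the pattern, \\1 must repeat exactly what group 1 matched,
# which detects a word immediately followed by itself
repeated <- grepl('(\\w+) \\1', c('bye bye', 'good boy'))
repeated
```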
[] to express one of the characters inside, for example [a-z0-9] or [A-Z] or [0-9]
gsub('[a-z]*[0-9]','found', c('mahbub2','unomaha4','lincoln'))
[1] "found" "found" "lincoln"
\\d to express a digit
() to express grouping. A or B replicated 3 times: (A|B){3}
{} to indicate quantifiers. 3 digit number: \\d{3}
gsub('(\\d{3}|\\d{4})', 'found', c('23','345','14328','3456'))
[1] "23" "found" "found8" "found"
gsub('^\\d{3,4}$', 'found', c('23','345','14328','3456'))
[1] "23" "found" "14328" "found"
Suppose we want to search for the literal text do*, where * is a metacharacter. If we search for it directly we don't get the desired result, because * has a special meaning.
gsub('do*', '.', c('doodle', 'random', 'Rodo*mny'))
[1] "..le" "ran.m" "Ro.*mny"
We need to de-specialize * and make it behave like a normal character. For this we use the escape character \
gsub('do\\*', '.', c('doodle', 'random', 'Rodo*mny'))
[1] "doodle" "random" "Ro.mny"
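Besides the backslash escape, base R offers two other ways to treat a metacharacter literally: putting it in a character class, or passing fixed = TRUE. This sketch shows that all three give the same result on the example above. The variable names are only illustrative.

```r
x <- c('doodle', 'random', 'Rodo*mny')

# Escape the metacharacter with a double backslash
esc <- gsub('do\\*', '.', x)

# Inside a character class, * loses its special meaning
cls <- gsub('do[*]', '.', x)

# fixed = TRUE treats the whole pattern as literal text, not as a regex
lit <- gsub('do*', '.', x, fixed = TRUE)

esc
identical(esc, cls)
identical(esc, lit)
```

fixed = TRUE is convenient when the search text contains no regex features at all, because nothing needs escaping.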
The double backslash \\ is necessary since, inside an R string, we need to de-specialize the special character \ too. Together \\ works as one escape character.
gsub('a+','.','saaaaaaaata')
[1] "s.t."
gsub('\\d+','.','13acd123kk3')
[1] ".acd.kk."
gsub('\\d+?','.','13acd123kk3')
[1] "..acd...kk."
gsub('\\D+','.','13acd123kk3')
[1] "13.123.3"
gsub('.a','.','Apple rolled again and again')
[1] "Apple rolled..in.nd..in"
gsub('\\W','','Bro@wn%_fox@')
[1] "Brown_fox"
gsub('\\w','','Bro@wn%_fox@')
[1] "@%@"
gsub('State, [A-Z]{2}', '', 'State, NYState name')
[1] "State name"
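The \\d+ and \\d+? examples above differ in greediness: + takes as many characters as it can, while +? takes as few as it can. A minimal sketch on a made-up HTML-like string:

```r
# Hypothetical input for illustration
txt <- '<b>bold</b>'

# Greedy: .+ grabs as much as possible, so the match runs to the last '>'
greedy <- sub('<.+>', '', txt)
greedy

# Lazy: .+? stops at the first '>' that completes a match
lazy <- sub('<.+?>', '', txt)
lazy
```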
ph <- c('402-554-2734','(515)-509-8354','56954', '34-3658',
'532-5542985','543689 9864')
gsub('\\(*\\d{3}\\)*( |-)*\\d{3}\\.*( |-)*\\d{4}', "yes", ph)
[1] "yes" "yes" "56954" "34-3658" "yes" "yes"
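The pattern above is not anchored, so it also matches a phone number embedded in a longer string. The sketch below is a hypothetical stricter variant, not the author's pattern: it allows optional parentheses and at most one space or dash between groups, and uses ^ and $ so the whole string has to be a phone number.

```r
ph <- c('402-554-2734', '(515)-509-8354', '56954', '34-3658',
        '532-5542985', '543689 9864')

# Assumed stricter pattern for illustration
pat <- '^\\(?\\d{3}\\)?[ -]?\\d{3}[ -]?\\d{4}$'

# grepl() gives TRUE/FALSE per element, which can be used for filtering
valid <- grepl(pat, ph)
ph[valid]

# A string with one digit too many is rejected by the anchored pattern
# but still matched by the unanchored pattern from the slides
tooLong <- '402-554-27345'
strict <- grepl(pat, tooLong)
loose  <- grepl('\\(*\\d{3}\\)*( |-)*\\d{3}\\.*( |-)*\\d{4}', tooLong)
c(strict = strict, loose = loose)
```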
myEmail <- c('abc9@bb.com','ab_c.2@mm.net','bbc_ag72@kk')
gsub('[a-z0-9_.]+\\@.+\\.\\w+', "ok",myEmail)
[1] "ok" "ok" "bbc_ag72@kk"
[a-z0-9_.]+ means lowercase letters, digits, underscores or dots, one or more times
\\@ after that there should be an @
.+ after that any text, one or more characters
\\. after that there should be a dot
\\w+ after that one or more word characters (letters, digits or underscore)

| Meta | Description | Pattern | Match | notMatch |
|---|---|---|---|---|
| ^ | match from the beginning of the text | ^ab | able | table |
| $ | match from the end of the text | ab$ | tab | table |
| . | match any single character | d.g | dog, dig | god |
| * | match the preceding item zero or more times | ab* | ta, tab, tabbbbbb | tobbbbbb |
| + | match the preceding item one or more times | ab+ | tab, tabbbbbb | ta |
| ? | pattern on the left is optional | ab? | a, ab | b |
| {} | used to indicate replication | go{2} | good | god |
| [ ] | use one of the characters inside | [ab] | gap, big | dig |
| ( ) | group a pattern | (ab){2} | abab | table |
| \| | match the pattern on the left or right of \| | (ab)\|c | able, cut | at |
| \ | escape the power of a metacharacter | d\.g | d.g | dog |
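The rows of the table can be checked in R with grepl(), which returns TRUE where the pattern matches. This is a small verification sketch. Note the double backslash needed in the last pattern when it is written as an R string.

```r
grepl('^ab', c('able', 'table'))          # TRUE FALSE
grepl('ab$', c('tab', 'table'))           # TRUE FALSE
grepl('d.g', c('dog', 'dig', 'god'))      # TRUE TRUE FALSE
grepl('ab+', c('tab', 'tabbbbbb', 'ta'))  # TRUE TRUE FALSE
grepl('go{2}', c('good', 'god'))          # TRUE FALSE
grepl('[ab]', c('gap', 'big', 'dig'))     # TRUE TRUE FALSE
grepl('(ab){2}', c('abab', 'table'))      # TRUE FALSE
grepl('(ab)|c', c('able', 'cut', 'at'))   # TRUE TRUE FALSE
grepl('d\\.g', c('d.g', 'dog'))           # TRUE FALSE
```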
To use the \ character in R we have to use the double escape \\
\\w matches any word character (letter, digit or underscore), \\W matches any non-word character
\\d matches any digit 0-9, \\D matches any non-digit
A nice explanation of special characters in regular expressions:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
A nice way of visualizing a regular expression. Note that in R we use the double escape \\, but this applet uses a single \. Just type /regular expression/
http://www.regexplained.co.uk/