Mahbubul Majumder, PhD
Oct 2, 2014
Text data is ubiquitous
Why important
What are the challenges
s <- "myString"
is.character(s)
[1] TRUE
x <- 5
is.character(x)
[1] FALSE
y <- as.character(x)
is.character(y)
[1] TRUE
y
[1] "5"
Convert s to numeric?
as.numeric(s)
[1] NA
as.numeric(y)
[1] 5
Functions and options to remember: is.character(), as.character(), and stringsAsFactors = FALSE
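The stringsAsFactors = FALSE option matters when text is stored in a data frame. A minimal sketch, assuming a small made-up data frame (the column names and values are illustrative and not from the slides):

```r
# Hypothetical data; names and values are made up for illustration
df <- data.frame(name = c("Ann", "Bob"), score = c(90, 85),
                 stringsAsFactors = FALSE)
is.character(df$name)   # TRUE: the text column stays character

# With stringsAsFactors = TRUE the same column is stored as a factor
df2 <- data.frame(name = c("Ann", "Bob"), stringsAsFactors = TRUE)
is.factor(df2$name)     # TRUE
```

Passing the argument explicitly gives the same result regardless of the default used by the installed R version.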
s <- c("myEssay","yourEssay")
nchar(s)
[1] 7 9
w <- "yourEssay"
w
[1] "yourEssay"
s %in% w
[1] FALSE TRUE
which(s %in% w)
[1] 2
L <- letters
L[1:3]
[1] "a" "b" "c"
length(L)
[1] 26
L[23:26]
[1] "w" "x" "y" "z"
which(L %in% c("w","x"))
[1] 23 24
st1 <- "My name"
st2 <- "is Mahbub"
paste(st1,st2)
[1] "My name is Mahbub"
paste("var", 1:3)
[1] "var 1" "var 2" "var 3"
paste("var", 1:3, sep="_")
[1] "var_1" "var_2" "var_3"
paste("Today is", date())
[1] "Today is Fri Oct 3 00:46:10 2014"
a <- 25
paste("square of",a,"is",a^2)
[1] "square of 25 is 625"
A <- letters[1:3]
B <- c(4, 6, 9)
paste(A, B)
[1] "a 4" "b 6" "c 9"
C <- paste(A, B, sep="")
C
[1] "a4" "b6" "c9"
What is nchar(C)? What is length(C)?
V <- paste("s",1:5, sep="")
V
[1] "s1" "s2" "s3" "s4" "s5"
nchar(V)
[1] 2 2 2 2 2
W <- paste(V, collapse="_")
W
[1] "s1_s2_s3_s4_s5"
nchar(W)
[1] 14
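The difference between sep and collapse can be seen from the length of the result. A small sketch reusing the vector V from above:

```r
V <- paste("s", 1:5, sep = "")

# sep joins the arguments element by element; the result is still a vector
length(paste(V, "x", sep = "_"))            # 5

# collapse joins the elements of the result into one single string
length(paste(V, collapse = "_"))            # 1

# both can be used in the same call
paste("s", 1:3, sep = "", collapse = "+")   # "s1+s2+s3"
```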
sp <- strsplit(W, split="_")
sp
[[1]]
[1] "s1" "s2" "s3" "s4" "s5"
unlist(sp)
[1] "s1" "s2" "s3" "s4" "s5"
unlist(strsplit(W,"s"))
[1] "" "1_" "2_" "3_" "4_" "5"
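strsplit() returns a list because it is vectorized over its input: each string gets its own list element. A short sketch, assuming an illustrative vector named posts (the name is hypothetical):

```r
# Illustrative input; the variable name is an assumption
posts <- c("Facebook Post", "Blog Post", "Cell Phone chat")
parts <- strsplit(posts, split = " ")

length(parts)                     # 3, one list element per input string
sapply(parts, length)             # 2 2 3, number of pieces in each string
sapply(parts, function(p) p[1])   # "Facebook" "Blog" "Cell"
```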
myText <- paste(letters, collapse="")
myText
[1] "abcdefghijklmnopqrstuvwxyz"
substr(myText, start=4, stop=16)
[1] "defghijklmnop"
myFacebookPost <- toupper(myText)
myFacebookPost
[1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
substr(myFacebookPost, start=16, stop=nchar(myFacebookPost))
[1] "PQRSTUVWXYZ"
myAllPosts <- c("Facebook Post", "Blog Post", "Cell Phone chat")
postLengths <- nchar(myAllPosts)
postLengths
[1] 13 9 15
substr(myAllPosts, 1, postLengths-5)
[1] "Facebook" "Blog" "Cell Phone"
substr(myAllPosts, postLengths-4, postLengths)
[1] " Post" " Post" " chat"
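Since substr() is vectorized over start and stop, the last example can be wrapped in a small helper. This is only a sketch; lastChars is a hypothetical function name, not part of base R:

```r
# Hypothetical helper: return the last n characters of each string
lastChars <- function(x, n) {
  len <- nchar(x)
  substr(x, len - n + 1, len)
}

lastChars(c("Facebook Post", "Blog Post", "Cell Phone chat"), 4)
# "Post" "Post" "chat"
```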
Search for pattern and replace it by myString in a string vector
sub(" Post", "", myAllPosts)
[1] "Facebook" "Blog" "Cell Phone chat"
sub(" chat", "", myAllPosts)
[1] "Facebook Post" "Blog Post" "Cell Phone"
sub("o", "O", myAllPosts)
[1] "FacebOok Post" "BlOg Post" "Cell PhOne chat"
gsub("o", "O", myAllPosts)
[1] "FacebOOk POst" "BlOg POst" "Cell PhOne chat"
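What is the difference between sub() and gsub()? sub() replaces only the first match in each string, while gsub() replaces every match.
Substitution requires searching the text and then replacing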
What does the common pattern consist of?
Digits: 0-9
Letters: a-z, A-Z
Special characters: | / , @ * . _ ~ : )
Example: A5b2@_.N
sub('@_.', '', 'A5b2@_.N')
[1] "A5b2N"
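In the pattern '@_.' the dot is not a literal dot: it matches any single character. A short sketch showing the difference, using the fixed = TRUE argument of sub() to treat the pattern as plain text (the input 'A5b2@_xN' is a made-up variation of the example):

```r
# The dot matches any character, so the "x" is removed as well
sub('@_.', '', 'A5b2@_xN')                 # "A5b2N"

# With fixed = TRUE the pattern is plain text; "@_." does not occur here
sub('@_.', '', 'A5b2@_xN', fixed = TRUE)   # "A5b2@_xN"

# The literal text "@_." is present in the original example
sub('@_.', '', 'A5b2@_.N', fixed = TRUE)   # "A5b2N"
```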
Can R functions handle regular expressions (regex)?
R supports two types of regular expressions, extended and Perl-like.
R functions like sub(), gsub(), strsplit() and grep() accept regular expressions.
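grep() has not been shown so far, so here is a minimal sketch of how it and its logical counterpart grepl() report matches (the vector words is illustrative):

```r
words <- c('book', 'dig', 'kook')

grep('oo', words)                 # 1 3, positions of the matching elements
grep('oo', words, value = TRUE)   # "book" "kook", the matching elements
grepl('oo', words)                # TRUE FALSE TRUE

# perl = TRUE switches to Perl-like regular expressions
grepl('oo', words, perl = TRUE)   # TRUE FALSE TRUE
```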
Metacharacters: . \ | ( ) [ { ^ $ * + ?
$: Search at the end of the string
gsub('k$', '.', c('book', 'kook'))
[1] "boo." "koo."
^: Search at the start of the string
gsub('^k', '.', c('book', 'kook'))
[1] "book" ".ook"
.: Search for k and any single character to the left of k (. means any single character)
gsub('.k', '_', c('book', 'kook'))
[1] "bo_" "ko_"
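The three metacharacters can be combined in one pattern. A small sketch, assuming we want four-character words that start and end with k (the word list is made up):

```r
words <- c('book', 'kook', 'kick', 'k')

# ^k : starts with k,  .. : any two characters,  k$ : ends with k
grepl('^k..k$', words)   # FALSE TRUE TRUE FALSE
```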
Quantifiers: ? * + {n} {n,} {n,m}
?: The preceding item is optional and will be matched at most once. Here we replace the pattern ogo, but with the last o made optional so that it does not have to be matched.
gsub('ogo','.', c('google','logo','dig', 'blog', 'boogie' ))
[1] "google" "l." "dig" "blog" "boogie"
gsub('ogo?','.', c('google','logo','dig', 'blog', 'boogie' ))
[1] "go.le" "l." "dig" "bl." "bo.ie"
gsub('o?go?','.', c('google','logo','dig', 'blog', 'boogie' ))
[1] "..le" "l." "di." "bl." "bo.ie"
*: The preceding item will be matched zero or more times. Here we replace d and leave all the following o's optional.
gsub('do', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] ".odle" "ran.m" "so.oooooooooooo"
gsub('do*', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] "..le" "ran.m" "so."
+: The preceding item will be matched one or more times. Here we want to match do and leave any further o's optional.
gsub('do+', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] ".dle" "ran.m" "so."
{n}: The preceding item is matched exactly n times.
gsub('do{10}', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] "doodle" "random" "so.ooo"
{n,}: The preceding item is matched n or more times.
gsub('do{10,}', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] "doodle" "random" "so."
{n,m}: The preceding item is matched at least n times, but not more than m times.
gsub('do{10,12}', '.', c('doodle', 'random', 'sodooooooooooooo'))
[1] "doodle" "random" "so.o"
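To see exactly which text a quantifier matched, the match can be extracted instead of replaced. A sketch using base R's regexpr() and regmatches() on a shorter, made-up input vector:

```r
x <- c('doodle', 'random', 'sod')

# regexpr() locates the first match in each string; regmatches() extracts it.
# Strings without a match are dropped from the result.
regmatches(x, regexpr('do*', x))     # "doo" "do" "d"
regmatches(x, regexpr('do+', x))     # "doo" "do"   ('sod' has no match)
regmatches(x, regexpr('do{2}', x))   # "doo"
```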
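() and |: grouping and alternation. \\1 and \\2 refer back to the text matched by the first and second group.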
gsub('good|boy', '.', 'good boy good boy')
[1] ". . . ."
gsub('good boy', '.', 'good boy good boy')
[1] ". ."
gsub('(good) (boy) \\1 \\2', '.', 'good boy good boy')
[1] "."
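Back-references such as \\1 and \\2 can also be used in the replacement string, which makes it possible to rearrange the matched groups. A minimal sketch:

```r
# \\1 and \\2 in the replacement stand for the first and second group
sub('(good) (boy)', '\\2 \\1', 'good boy')   # "boy good"

# gsub() applies the swap to every match
gsub('(a)(b)', '\\2\\1', 'abab')             # "baba"
```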
[]: to express one of the characters inside, for example [a-z0-9] or [A-Z] or [0-9]
gsub('[a-z]*[0-9]','found', c('mahbub2','unomaha4','lincoln'))
[1] "found" "found" "lincoln"
\\d: to express a digit
(): to express grouping. A or B replicated 3 times: (A|B){3}
{}: to indicate quantifiers. A 3-digit number: \\d{3}
gsub('(\\d{3}|\\d{4})', 'found', c('23','345','14328','3456'))
[1] "23" "found" "found8" "found"
gsub('^\\d{3,4}$', 'found', c('23','345','14328','3456'))
[1] "23" "found" "14328" "found"
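The anchored pattern is often more useful as a test than as a replacement. A sketch using grepl() on the same input to check which strings consist of exactly three or four digits:

```r
x <- c('23', '345', '14328', '3456')

grepl('^\\d{3,4}$', x)      # FALSE TRUE FALSE TRUE
x[grepl('^\\d{3,4}$', x)]   # "345" "3456"

# Without the anchors, any string containing three digits in a row matches
grepl('\\d{3,4}', x)        # FALSE TRUE TRUE TRUE
```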
Suppose we want to search for the literal text do*, where * is a metacharacter. If we search for it directly we don't get the desired result because * has a special meaning.
gsub('do*', '.', c('doodle', 'random', 'Rodo*mny'))
[1] "..le" "ran.m" "Ro.*mny"
We need to remove the special meaning of * and make it act like a normal character. For this we use the escape character \
gsub('do\\*', '.', c('doodle', 'random', 'Rodo*mny'))
[1] "doodle" "random" "Ro.mny"
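The escaped pattern can be inspected directly, and there are two other common ways to match a literal *. These alternatives are not from the slides; they are a sketch using a character class and the fixed = TRUE argument:

```r
x <- c('doodle', 'random', 'Rodo*mny')

# In R source code '\\' is a single backslash character
nchar('do\\*')                      # 4: d, o, one backslash, *

# Inside [] the * loses its special meaning
gsub('do[*]', '.', x)               # "doodle" "random" "Ro.mny"

# fixed = TRUE treats the whole pattern as plain text
gsub('do*', '.', x, fixed = TRUE)   # "doodle" "random" "Ro.mny"
```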
\\ is necessary since we need to de-specialize the special character \ too. Together, \\ works as one escape character.
gsub('a+','.','saaaaaaaata')
[1] "s.t."
gsub('\\d+','.','13acd123kk3')
[1] ".acd.kk."
gsub('\\d+?','.','13acd123kk3')
[1] "..acd...kk."
gsub('\\D+','.','13acd123kk3')
[1] "13.123.3"
gsub('.a','.','Apple rolled again and again')
[1] "Apple rolled..in.nd..in"
gsub('\\W','','Bro@wn%_fox@')
[1] "Brown_fox"
gsub('\\w','','Bro@wn%_fox@')
[1] "@%@"
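Besides \\w and \\d, R also accepts POSIX character classes written inside brackets. A short sketch on the same inputs; note that the underscore counts as punctuation but is also a word character:

```r
gsub('[[:punct:]]', '', 'Bro@wn%_fox@')    # "Brownfox"
gsub('[[:alpha:]]', '', 'Bro@wn%_fox@')    # "@%_@"
gsub('[[:digit:]]+', '.', '13acd123kk3')   # ".acd.kk."
```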
gsub('State, [A-Z]{2}', '', 'State, NYState name')
[1] "State name"
ph <- c('402-554-2734','(515)-509-8354','56954', '34-3658',
'532-5542985','543689 9864')
gsub('\\(*\\d{3}\\)*( |-)*\\d{3}\\.*( |-)*\\d{4}', "yes", ph)
[1] "yes" "yes" "56954" "34-3658" "yes" "yes"
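For validation it can be clearer to test the whole string rather than replace the match. A sketch that adds ^ and $ to the same pattern and uses grepl(); the variable name phonePattern and the last test string are assumptions made for illustration:

```r
ph <- c('402-554-2734','(515)-509-8354','56954', '34-3658',
        '532-5542985','543689 9864')

# Same pattern as above, anchored so the whole string must be a phone number
phonePattern <- '^\\(*\\d{3}\\)*( |-)*\\d{3}\\.*( |-)*\\d{4}$'

grepl(phonePattern, ph)                        # TRUE TRUE FALSE FALSE TRUE TRUE

# The anchors reject a number that is surrounded by other text
grepl(phonePattern, 'call 402-554-2734 now')   # FALSE
```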
myEmail <- c('abc9@bb.com','ab_c.2@mm.net','bbc_ag72@kk')
gsub('[a-z0-9_.]+\\@.+\\.\\w+', "ok",myEmail)
[1] "ok" "ok" "bbc_ag72@kk"
[a-z0-9_.]+ means one or more lowercase letters, digits, underscores or dots
\\@ after that there should be an @
.+ after that any text, one or more characters
\\. after that there should be a dot
\\w+ after that one or more word characters (letters, digits or underscore)

Meta | Description | Pattern | Match | notMatch |
---|---|---|---|---|
^ | match at the beginning of the text | ^ab | able | table |
$ | match at the end of the text | ab$ | tab | table |
. | match any single character | d.g | dog, dig | god |
* | match zero or more times | ab* | ta, tab, tabbbbbb | tobbbbbb |
+ | match one or more times | ab+ | tab, tabbbbbb | ta |
? | the item on the left is optional | ab? | a, ab | b |
{} | indicate the number of repetitions | go{2} | good | god |
[ ] | match one of the characters inside | [ab] | gap, big | dig |
( ) | group a pattern | (ab){2} | abab | table |
| | match the pattern on its left or right | (ab)|c | able, cut | at |
\ | escape the special meaning of a metacharacter | d\.g | d.g | dog |
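The rows of the table can be checked with grepl(). The helper below is a hypothetical convenience function written for this sketch, not part of base R; note the double escape in the last pattern:

```r
# Hypothetical helper: TRUE if pattern matches every string in `match`
# and none of the strings in `notMatch`
check <- function(pattern, match, notMatch) {
  all(grepl(pattern, match)) && !any(grepl(pattern, notMatch))
}

check('^ab',     'able',                        'table')      # TRUE
check('ab$',     'tab',                         'table')      # TRUE
check('d.g',     c('dog', 'dig'),               'god')        # TRUE
check('ab*',     c('ta', 'tab', 'tabbbbbb'),    'tobbbbbb')   # TRUE
check('ab+',     c('tab', 'tabbbbbb'),          'ta')         # TRUE
check('ab?',     c('a', 'ab'),                  'b')          # TRUE
check('go{2}',   'good',                        'god')        # TRUE
check('[ab]',    c('gap', 'big'),               'dig')        # TRUE
check('(ab){2}', 'abab',                        'table')      # TRUE
check('(ab)|c',  c('able', 'cut'),              'at')         # TRUE
check('d\\.g',   'd.g',                         'dog')        # TRUE
```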
To use the \ character in R we have to use the double escape \\
\\w matches any word character (letter, digit or underscore), \\W matches any non-word character
\\d matches any digit 0-9, \\D matches any non-digit
A nice explanation of special characters in regular expressions:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
A nice way of visualizing a regular expression. Notice that for R we use the double escape \\ but this applet uses a single \. Just type /regular expression/
http://www.regexplained.co.uk/