In the previous chapters we talked about regex functions available in the package "stringr". In this chapter we are going to describe more regular expression functions, but this time from the "base" package (i.e. native regex functions in R).
Let’s begin by reviewing the first five grep()-like functions: grep(), grepl(), regexpr(), gregexpr(), and regexec(). The goal is the same for all these functions: finding a match. The difference between them is in the format of the output. Essentially these functions require two main arguments: a pattern (i.e. regular expression), and a text to match. The basic usage for these functions is:
grep(pattern, text)
grepl(pattern, text)
regexpr(pattern, text)
gregexpr(pattern, text)
regexec(pattern, text)
Each function has additional arguments, but the important things to keep in mind are a pattern and some text.
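As a quick preview of how the output formats differ, the sketch below calls all five functions on the same input and records the storage type of each result. The variable name out_types is only illustrative; the sample text and pattern are the ones used in the examples of this chapter.

```r
# sample text and pattern used throughout this chapter
text <- c("one word", "a sentence", "you and me", "three two one")
pat <- "one"

# storage type of the value returned by each function
out_types <- c(
  grep     = typeof(grep(pat, text)),
  grepl    = typeof(grepl(pat, text)),
  regexpr  = typeof(regexpr(pat, text)),
  gregexpr = typeof(gregexpr(pat, text)),
  regexec  = typeof(regexec(pat, text))
)
out_types
```

grep() and regexpr() give integer vectors, grepl() gives a logical vector, and gregexpr() and regexec() give lists.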
grep()
grep() is perhaps the most basic function that allows us to match a pattern in a string vector. The first argument in grep() is a regular expression that specifies the pattern to match. The second argument is a character vector with the text strings on which to search. The output is the indices of the elements of the text vector for which there is a match. If no matches are found, the output is an empty integer vector.
# some text
text <- c("one word", "a sentence", "you and me", "three two one")
# pattern
pat <- "one"
# default usage
grep(pat, text)
[1] 1 4
As you can tell from the output in the previous example, grep() returns a numeric vector. This indicates that the 1st and 4th elements contained a match. In contrast, the 2nd and the 3rd elements did not.
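The no-match case mentioned earlier is worth seeing once. In this small sketch the pattern "zebra" is an arbitrary choice that does not occur in any element, and the names no_match and found_any are illustrative.

```r
text <- c("one word", "a sentence", "you and me", "three two one")

# a pattern that does not occur in any element
no_match <- grep("zebra", text)
no_match           # integer(0)
length(no_match)   # 0

# checking the length is a simple way to test whether anything matched
found_any <- length(no_match) > 0
found_any          # FALSE
```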
We can use the argument value to modify the way in which the output is presented. If we choose value = TRUE, instead of returning the indices, grep() returns the content of the string vector:
# with 'value' (showing matched text)
grep(pat, text, value = TRUE)
[1] "one word" "three two one"
Another interesting argument to play with is invert. We can use this parameter to obtain the unmatched strings by setting its value to TRUE:
# with 'invert' (showing unmatched parts)
grep(pat, text, invert = TRUE)
[1] 2 3
# same with 'value'
grep(pat, text, invert = TRUE, value = TRUE)
[1] "a sentence" "you and me"
In summary, grep() can be used to subset a character vector to get only the elements containing (or not containing) the matched pattern.
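To make the subsetting idea concrete, here is a minimal sketch showing that indexing with the result of grep() gives the same elements as value = TRUE. The names idx, by_index, by_value and rest are illustrative.

```r
text <- c("one word", "a sentence", "you and me", "three two one")
pat <- "one"

# subset through the indices returned by grep()
idx <- grep(pat, text)
by_index <- text[idx]

# the same subset obtained directly with value = TRUE
by_value <- grep(pat, text, value = TRUE)
identical(by_index, by_value)   # TRUE

# the elements that do not match
rest <- grep(pat, text, invert = TRUE, value = TRUE)
rest
```

Using invert = TRUE is safer than text[-idx]: when nothing matches, idx is empty and text[-idx] returns an empty vector instead of the whole text.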
grepl()
The function grepl() enables us to perform a similar task as grep(). The difference is that the output is not a vector of numeric indices but a logical vector (TRUE / FALSE). Hence you can think of grepl() as grep-logical. Using the same text vector of the previous examples, here’s the behavior of grepl():
# some text
text <- c("one word", "a sentence", "you and me", "three two one")
# pattern
pat <- "one"
# default usage
grepl(pat, text)
[1] TRUE FALSE FALSE TRUE
Note that we get a logical vector of the same length as the character vector. Those elements that matched the pattern have a value of TRUE; those that didn’t match the pattern have a value of FALSE.
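Because the result is a logical vector, it can be used directly for subsetting and counting. The following is a small sketch of those uses; the name hits is illustrative.

```r
text <- c("one word", "a sentence", "you and me", "three two one")
pat <- "one"

hits <- grepl(pat, text)

text[hits]     # elements that match
text[!hits]    # elements that do not match
sum(hits)      # number of matching elements
which(hits)    # indices of the matches, as grep(pat, text) would give
```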
regexpr()
To find exactly where the pattern is found in a given string, we can use the regexpr() function. This function returns more detailed information than grep(), telling us:
which elements of the text vector actually contain the regex pattern, and
the position of the substring that is matched by the regular expression pattern.
# some text
text <- c("one word", "a sentence", "you and me", "three two one")
# default usage
regexpr("one", text)
[1] 1 -1 -1 11
attr(,"match.length")
[1] 3 -1 -1 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
At first glance the output from regexpr() may look a bit messy, but it is simple to interpret. What we have in the output is an integer vector followed by three attributes. The integer vector has the same length as text and gives the starting position of the first match in each element. In this example the number 1 indicates that the pattern "one" starts at position 1 of the first element in text. The value -1 means that there was no match; the number 11 indicates the starting position of the substring that was matched in the fourth element of text.
The attribute "match.length" gives us the length of the match in each element of text. Again, a value of -1 means that there was no match in that element. The attribute "index.type" reports the unit in which positions and lengths are expressed ("chars" in this example). Finally, the attribute "useBytes" has a value of TRUE, which means that the matching was done byte-by-byte rather than character-by-character.
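The starting positions and match lengths are enough to pull the matched substrings out of the text. The sketch below does this by hand with substring(), and then with the base function regmatches(), which accepts the output of regexpr(). The names m, start, len, hit and by_position are illustrative.

```r
text <- c("one word", "a sentence", "you and me", "three two one")

m <- regexpr("one", text)

start <- as.vector(m)             # starting positions, attributes dropped
len <- attr(m, "match.length")    # length of each match
hit <- start != -1                # TRUE where a match was found

# extract by position: from start to start + length - 1
by_position <- substring(text[hit], start[hit], start[hit] + len[hit] - 1)
by_position

# regmatches() returns the matched substrings for the elements that matched
regmatches(text, m)
```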
gregexpr()
The function gregexpr() does practically the same thing as regexpr(): it identifies where a pattern is within a string vector, by searching each element separately. The difference is that gregexpr() reports every match rather than only the first one, so its output takes the form of a list. In other words, gregexpr() returns a list of the same length as text, each element of which is of the same form as the return value for regexpr(), except that the starting positions of every (disjoint) match are given.
# some text
text <- c("one word", "a sentence", "you and me", "three two one")
# pattern
pat <- "one"
# default usage
gregexpr(pat, text)
[[1]]
[1] 1
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[2]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[3]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[4]]
[1] 11
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
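In the example above each element contains at most one match, so the output looks just like that of regexpr() wrapped in a list. The difference shows up when an element contains the pattern more than once. Here is a small sketch with a made-up vector x (not part of the original example):

```r
# made-up strings: two matches, one match, and no match
x <- c("one by one", "none", "two")

g <- gregexpr("one", x)

# starting positions of every match in the first element
as.vector(g[[1]])

# regmatches() also accepts gregexpr() output and returns a list
matches <- regmatches(x, g)
matches

# number of matches per element
n_matches <- lengths(matches)
n_matches
```

The first element yields two starting positions (1 and 8), the second yields one, and the third yields none.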
regexec()
The function regexec() is very close to gregexpr() in the sense that the output is also a list of the same length as text. Each element of the list contains the starting position of the first match (followed by the positions of any parenthesized subexpressions in the pattern). A value of -1 reflects that there is no match. In addition, each element of the list has the attribute "match.length" giving the lengths of the matches (or -1 for no match):
# some text
text <- c("one word", "a sentence", "you and me", "three two one")
# pattern
pat <- "one"
# default usage
regexec(pat, text)
[[1]]
[1] 1
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[2]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[3]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[4]]
[1] 11
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
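With a plain pattern like "one" the output of regexec() carries no more information than that of regexpr(). It becomes more useful when the pattern contains parenthesized groups, because regexec() also reports where each group matched. The following sketch uses a made-up vector x and pattern (neither appears in the original example) together with regmatches():

```r
# made-up strings: one that matches and one that does not
x <- c("key=value", "no match here")

# two parenthesized groups separated by an equals sign
m <- regexec("([a-z]+)=([a-z]+)", x)

# positions of the whole match, then of each group
as.vector(m[[1]])
attr(m[[1]], "match.length")

# whole match followed by the text captured by each group
parts <- regmatches(x, m)
parts[[1]]
parts[[2]]    # empty: no match in the second element
```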
Sometimes finding a pattern in a given string vector is all we want. However, there are occasions in which we might also be interested in replacing one pattern with another one. For this purpose we can use the substitution functions sub() and gsub(). The difference between sub() and gsub() is that the former replaces only the first occurrence of a pattern whereas the latter replaces all occurrences.
The replacement functions require three main arguments: a regex pattern to be matched, a replacement for the matched pattern, and the text where matches are sought. The basic usage is:
sub(pattern, replacement, text)
gsub(pattern, replacement, text)
sub()
The function sub() replaces the first occurrence of a pattern in a given text. This means that if there is more than one occurrence of the pattern in each element of a string vector, only the first one will be replaced. For example, suppose we have the following text vector containing various strings:
Rstring = c("The R Foundation",
"for Statistical Computing",
"R is FREE software",
"R is a collaborative project")
Imagine that our aim is to replace the pattern "R" with a new pattern "RR". If we use sub(), this is what we obtain:
# string
Rstring <- c("The R Foundation",
             "for Statistical Computing",
             "R is FREE software",
             "R is a collaborative project")
# substitute 'R' with 'RR'
sub("R", "RR", Rstring)
[1] "The RR Foundation" "for Statistical Computing"
[3] "RR is FREE software" "RR is a collaborative project"
As you can tell, only the first occurrence of the letter R is replaced in each element of the text vector. Note that the word FREE in the third element also contains an R but it was not replaced. This is because it was not the first occurrence of the pattern.
gsub()
To replace not only the first pattern occurrence, but all of the occurrences, we should use gsub() (think of it as general or global substitution). If we take the same vector Rstring and patterns of the last example, this is what we obtain when we apply gsub():
# string
Rstring <- c("The R Foundation",
             "for Statistical Computing",
             "R is FREE software",
             "R is a collaborative project")
# substitute
gsub("R", "RR", Rstring)
[1] "The RR Foundation" "for Statistical Computing"
[3] "RR is FRREE software" "RR is a collaborative project"
The obtained output is almost the same as with sub(), except for the third element in Rstring. Now the occurrence of R in the word FREE is taken into account and gsub() changes it to FRREE.
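The two functions can also be compared programmatically. The sketch below flags the elements of Rstring on which sub() and gsub() disagree, and then repeats the comparison with a character-class pattern on a single word. The names first_only, all_occ, differs, one_vowel and no_vowels are illustrative, and the vowel example is an addition that does not come from the original text.

```r
Rstring <- c("The R Foundation",
             "for Statistical Computing",
             "R is FREE software",
             "R is a collaborative project")

first_only <- sub("R", "RR", Rstring)    # first occurrence only
all_occ    <- gsub("R", "RR", Rstring)   # every occurrence

# TRUE where the two results differ
differs <- first_only != all_occ
differs

# the same contrast with a regular expression: drop lowercase vowels
one_vowel <- sub("[aeiou]", "", "collaborative")
no_vowels <- gsub("[aeiou]", "", "collaborative")
one_vowel
no_vowels
```

Only the third element differs, since it is the only one in which R occurs more than once.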
Besides the operations of finding patterns and replacing patterns, another common task is splitting a string based on a pattern. To do this R comes with the function strsplit(), which is designed to split the elements of a character vector into substrings according to regex matches.
If you check the help documentation, help(strsplit), you will see that the basic usage of strsplit() requires two main arguments:
strsplit(x, split)
x is the character vector and split is the regular expression pattern. However, in order to keep the same notation that we’ve been using with the other grep() functions, it is better if we think of x as text, and split as pattern. In this way we can express the usage of strsplit() as:
strsplit(text, pattern)
One of the typical tasks in which we can use strsplit() is when we want to break a string into individual components (i.e. words). For instance, if we wish to separate each word within a given sentence, we can do that by specifying a blank space " " as the splitting pattern:
# a sentence
sentence <- c("R is a collaborative project with many contributors")
# split into words
strsplit(sentence, " ")
[[1]]
[1] "R" "is" "a" "collaborative"
[5] "project" "with" "many" "contributors"
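Since the splitting pattern is a regular expression, it can describe more than a single character. A common case is text with irregular spacing, where splitting on one blank space leaves empty strings behind. Here is a minimal sketch with a made-up string messy (not part of the original example):

```r
# made-up sentence with runs of several spaces
messy <- "R  is a   collaborative project"

# splitting on a single space produces empty strings between consecutive spaces
by_space <- strsplit(messy, " ")[[1]]
by_space

# splitting on one or more spaces gives only the words
by_run <- strsplit(messy, " +")[[1]]
by_run
```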
Another basic example consists of breaking apart the portions of a telephone number by splitting those sets of digits joined by a dash "-":
# telephone numbers
tels <- c("510-548-2238", "707-231-2440", "650-752-1300")
# split each number into its portions
strsplit(tels, "-")
[[1]]
[1] "510" "548" "2238"
[[2]]
[1] "707" "231" "2440"
[[3]]
[1] "650" "752" "1300"
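The list returned by strsplit() usually needs one more step before it is convenient to work with. The sketch below shows two possible follow-ups: picking the first portion of each number, and stacking the portions into a matrix. The names parts, area_codes and tel_matrix are illustrative.

```r
tels <- c("510-548-2238", "707-231-2440", "650-752-1300")

parts <- strsplit(tels, "-")

# first portion of every telephone number
area_codes <- sapply(parts, `[`, 1)
area_codes

# one row per telephone number, one column per portion
tel_matrix <- do.call(rbind, parts)
dim(tel_matrix)
```

Stacking with rbind() assumes that every element splits into the same number of pieces, which holds for these three numbers.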