14  Regex Functions in R

In the previous chapters we talked about regex functions available in the package "stringr". In this chapter we are going to describe more regular expression functions but this time from the "base" package (i.e. native regex functions in R).

14.1 Pattern Finding Functions

Let’s begin by reviewing the first five grep()-like functions: grep(), grepl(), regexpr(), gregexpr(), and regexec(). The goal is the same for all these functions: finding a match. The difference between them is in the format of the output. Essentially, these functions require two main arguments: a pattern (i.e. a regular expression) and the text to search. The basic usage for these functions is:

 grep(pattern, text)
 grepl(pattern, text)
 regexpr(pattern, text)
 gregexpr(pattern, text)
 regexec(pattern, text)

Each function has additional arguments, but the important things to keep in mind are a pattern and some text.
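
As a rough illustration of those additional arguments, the sketch below uses ignore.case and fixed, two arguments that grep() and grepl() accept. The vector mixed is made up for this example and does not appear elsewhere in the chapter.

```r
# a made-up vector with mixed capitalization (illustrative only)
mixed <- c("One word", "a sentence", "three two ONE")

# matching is case-sensitive by default, so nothing matches here
default_hits <- grep("one", mixed)
default_hits
# integer(0)

# ignore.case = TRUE matches regardless of capitalization
nocase_hits <- grep("one", mixed, ignore.case = TRUE)
nocase_hits
# [1] 1 3

# fixed = TRUE treats the pattern as a literal string, not a regex;
# without it, "." would match any character
literal_hits <- grepl(".", c("a.b", "ab"), fixed = TRUE)
literal_hits
# [1]  TRUE FALSE
```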

14.1.1 Function grep()

grep() is perhaps the most basic function that allows us to match a pattern in a string vector. The first argument in grep() is a regular expression that specifies the pattern to match. The second argument is a character vector with the text strings on which to search. The output is the indices of the elements of the text vector for which there is a match. If no matches are found, the output is an empty integer vector.

# some text
text <- c("one word", "a sentence", "you and me", "three two one")

# pattern
pat <- "one"

# default usage
grep(pat, text)
[1] 1 4

As you can tell from the output in the previous example, grep() returns a vector of integer indices. Here it indicates that the 1st and 4th elements contained a match. In contrast, the 2nd and the 3rd elements did not.

We can use the argument value to modify the way in which the output is presented. If we choose value = TRUE, instead of returning the indices, grep() returns the content of the string vector:

# with 'value' (showing matched text)
grep(pat, text, value = TRUE)
[1] "one word"      "three two one"

Another interesting argument to play with is invert. We can use this parameter to obtain the unmatched strings by setting its value to TRUE:

# with 'invert' (showing unmatched parts)
grep(pat, text, invert = TRUE)
[1] 2 3
# same with 'value'
grep(pat, text, invert = TRUE, value = TRUE)
[1] "a sentence" "you and me"

In summary, grep() can be used to subset a character vector to get only the elements containing (or not containing) the matched pattern.
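
The subsetting idea can be sketched as follows, reusing the text and pat objects from the examples above. The names idx and none are introduced only for this illustration.

```r
text <- c("one word", "a sentence", "you and me", "three two one")
pat <- "one"

# indexing with the output of grep() is equivalent to value = TRUE
idx <- grep(pat, text)
same <- identical(text[idx], grep(pat, text, value = TRUE))
same
# [1] TRUE

# caution: negative indexing breaks down when nothing matches,
# because text[-integer(0)] selects nothing at all
none <- grep("xyz", text)
dropped <- text[-none]
length(dropped)
# [1] 0

# invert = TRUE handles the no-match case as expected
kept <- grep("xyz", text, invert = TRUE, value = TRUE)
length(kept)
# [1] 4
```

This is one reason to prefer invert = TRUE over negative indexing when the pattern might not match anything.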

14.1.2 Function grepl()

The function grepl() enables us to perform a similar task as grep(). The difference is that the output is not a vector of numeric indices, but a vector of logical values (TRUE / FALSE). Hence you can think of grepl() as grep-logical. Using the same text vector of the previous examples, here’s the behavior of grepl():

# some text
text <- c("one word", "a sentence", "you and me", "three two one")

# pattern
pat <- "one"

# default usage
grepl(pat, text)
[1]  TRUE FALSE FALSE  TRUE

Note that we get a logical vector of the same length as the character vector. Those elements that matched the pattern have a value of TRUE; those that didn’t match the pattern have a value of FALSE.
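
Because the result is a logical vector, it can be used directly for subsetting and counting. The following is a small sketch with the same text and pat; the name hits is introduced only for this example.

```r
text <- c("one word", "a sentence", "you and me", "three two one")
pat <- "one"

hits <- grepl(pat, text)

# keep the elements that match
matched <- text[hits]
matched
# [1] "one word"      "three two one"

# keep the elements that do not match
unmatched <- text[!hits]
unmatched
# [1] "a sentence" "you and me"

# count how many elements contain the pattern
n_hits <- sum(hits)
n_hits
# [1] 2
```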

14.1.3 Function regexpr()

To find exactly where the pattern is found in a given string, we can use the regexpr() function. This function returns more detailed information than grep(), telling us:

  1. which elements of the text vector actually contain the regex pattern, and

  2. the position of the first substring in each element that is matched by the regular expression pattern.

# some text
text <- c("one word", "a sentence", "you and me", "three two one")

# default usage
regexpr("one", text)
[1]  1 -1 -1 11
attr(,"match.length")
[1]  3 -1 -1  3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

At first glance the output from regexpr() may look a bit messy but it’s very simple to interpret. What we have in the output is an integer vector followed by three attributes. The integer vector has the same length as text and gives the starting positions of the first match. In this example the number 1 indicates that the pattern "one" starts at position 1 of the first element in text. The value -1 means that there was no match; the number 11 indicates the position of the substring that was matched in the fourth element of text.

The attribute "match.length" gives us the length of the match in each element of text. Again, a negative value of -1 means that there was no match in that element. Finally, the attribute "useBytes" has a value of TRUE which means that the matching was done byte-by-byte rather than character-by-character.
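
The positions and lengths can be turned back into the matched substrings. The sketch below shows two ways of doing so: the base function regmatches(), and a manual extraction with substring(). The variable names are chosen only for this illustration.

```r
text <- c("one word", "a sentence", "you and me", "three two one")
m <- regexpr("one", text)

# regmatches() returns the matched substrings,
# dropping the elements that have no match
matched <- regmatches(text, m)
matched
# [1] "one" "one"

# the same extraction by hand, keeping one result per element
start <- as.integer(m)
len <- attr(m, "match.length")
by_hand <- substring(text, start, start + len - 1)
by_hand
# [1] "one" ""    ""    "one"
```

The manual version yields an empty string for the elements without a match, which keeps the result aligned with text.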

14.1.4 Function gregexpr()

The function gregexpr() does practically the same thing as regexpr(): it identifies where a pattern is within a string vector, by searching each element separately. The difference is that gregexpr() reports every match rather than only the first one, so its output takes the form of a list. In other words, gregexpr() returns a list of the same length as text, each element of which is of the same form as the return value for regexpr(), except that the starting positions of every (disjoint) match are given.

# some text
text <- c("one word", "a sentence", "you and me", "three two one")

# pattern
pat <- "one"

# default usage
gregexpr(pat, text)
[[1]]
[1] 1
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[2]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[3]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[4]]
[1] 11
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
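
In the example above every element has at most one match, so the results look the same as those of regexpr(). A vector with repeated matches makes the difference visible. The vector x below is made up for this sketch.

```r
# a made-up vector in which the first element contains the pattern twice
x <- c("one by one", "no match here", "one")

# regexpr() reports only the first match in each element
first <- as.integer(regexpr("one", x))
first
# [1]  1 -1  1

# gregexpr() reports every match in each element
all_matches <- gregexpr("one", x)
positions <- as.integer(all_matches[[1]])
positions
# [1] 1 8

# number of matches per element, via regmatches()
counts <- lengths(regmatches(x, all_matches))
counts
# [1] 2 0 1
```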

14.1.5 Function regexec()

The function regexec() is close to gregexpr() in the sense that the output is also a list of the same length as text. Unlike gregexpr(), however, regexec() reports only the first match in each element, followed by the positions of any parenthesized subexpressions in the pattern. Each element of the list contains the starting position of the match. A value of -1 reflects that there is no match. In addition, each element of the list has the attribute "match.length" giving the lengths of the matches (or -1 for no match):

# some text
text <- c("one word", "a sentence", "you and me", "three two one")

# pattern
pat <- "one"

# default usage
regexec(pat, text)
[[1]]
[1] 1
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[2]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[3]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[4]]
[1] 11
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
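
With a plain pattern such as "one", the output of regexec() carries no more information than that of regexpr(). Its output becomes more useful when the pattern contains parenthesized groups. The sketch below uses a made-up vector and pattern for illustration and pairs regexec() with regmatches().

```r
# a made-up vector: one telephone number and one non-matching string
tels <- c("510-548-2238", "not a number")

# three parenthesized groups, one for each block of digits
m <- regexec("([0-9]{3})-([0-9]{3})-([0-9]{4})", tels)

# positions of the full match followed by those of each group
starts <- as.integer(m[[1]])
starts
# [1] 1 1 5 9

# regmatches() returns the full match and then each group
pieces <- regmatches(tels, m)
pieces[[1]]
# [1] "510-548-2238" "510"          "548"          "2238"
pieces[[2]]
# character(0)
```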

14.2 Pattern Replacement Functions

Sometimes finding a pattern in a given string vector is all we want. However, there are occasions in which we might also be interested in replacing one pattern with another one. For this purpose we can use the substitution functions sub() and gsub(). The difference between sub() and gsub() is that the former replaces only the first occurrence of a pattern whereas the latter replaces all occurrences.

The replacement functions require three main arguments: a regex pattern to be matched, a replacement for the matched pattern, and the text where matches are sought. The basic usage is:

sub(pattern, replacement, text)
gsub(pattern, replacement, text)

14.2.1 Replacing first occurrence with sub()

The function sub() replaces the first occurrence of a pattern in a given text. This means that if an element of a string vector contains more than one occurrence of the pattern, only the first one will be replaced. For example, suppose we have the following text vector containing various strings:

 Rstring = c("The R Foundation", 
             "for Statistical Computing", 
             "R is FREE software",
             "R is a collaborative project")

Imagine that our aim is to replace the pattern "R" with the string "RR". If we use sub(), this is what we obtain:

# string
Rstring <- c("The R Foundation", 
            "for Statistical Computing", 
            "R is FREE software",
            "R is a collaborative project")

# substitute 'R' with 'RR'
sub("R", "RR", Rstring)
[1] "The RR Foundation"             "for Statistical Computing"    
[3] "RR is FREE software"           "RR is a collaborative project"

As you can tell, only the first occurrence of the letter R is replaced in each element of the text vector. Note that the word FREE in the third element also contains an R but it was not replaced. This is because it was not the first occurrence of the pattern.

14.2.2 Replacing all occurrences with gsub()

To replace not only the first pattern occurrence, but all of the occurrences, we should use gsub() (think of it as general or global substitution). If we take the same vector Rstring and patterns of the last example, this is what we obtain when we apply gsub():

# string
Rstring <- c("The R Foundation", 
            "for Statistical Computing", 
            "R is FREE software",
            "R is a collaborative project")

# substitute
gsub("R", "RR", Rstring)
[1] "The RR Foundation"             "for Statistical Computing"    
[3] "RR is FRREE software"          "RR is a collaborative project"

The obtained output is almost the same as with sub(), except for the third element in Rstring. Now the occurrence of R in the word FREE is taken into account and gsub() changes it to FRREE.
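
To see at a glance which elements are affected differently, the two results can be compared element by element. This is only a small sketch; the names first_only and all_occ are introduced for illustration.

```r
Rstring <- c("The R Foundation",
             "for Statistical Computing",
             "R is FREE software",
             "R is a collaborative project")

first_only <- sub("R", "RR", Rstring)   # first occurrence per element
all_occ <- gsub("R", "RR", Rstring)     # every occurrence per element

# TRUE where the two functions give different results
differs <- first_only != all_occ
differs
# [1] FALSE FALSE  TRUE FALSE

# the only element with more than one "R"
Rstring[differs]
# [1] "R is FREE software"
```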

14.3 Splitting Character Vectors

Besides the operations of finding patterns and replacing patterns, another common task is splitting a string based on a pattern. To do this R comes with the function strsplit() which is designed to split the elements of a character vector into substrings according to regex matches.

If you check the help documentation, help(strsplit), you will see that the basic usage of strsplit() requires two main arguments:

 strsplit(x, split)

x is the character vector and split is the regular expression pattern. However, in order to keep the same notation that we’ve been using with the other grep() functions, it is better if we think of x as text, and split as pattern. In this way we can express the usage of strsplit() as:

 strsplit(text, pattern)

One of the typical tasks in which we can use strsplit() is breaking a string into individual components (e.g. words). For instance, if we wish to separate each word within a given sentence, we can do that by specifying a blank space " " as the splitting pattern:

# a sentence
sentence <- c("R is a collaborative project with many contributors")

# split into words
strsplit(sentence, " ")
[[1]]
[1] "R"             "is"            "a"             "collaborative"
[5] "project"       "with"          "many"          "contributors" 

Another basic example is breaking apart the portions of a telephone number by splitting those sets of digits joined by a dash "-":

# telephone numbers
tels <- c("510-548-2238", "707-231-2440", "650-752-1300")

# split each number into its portions
strsplit(tels, "-")
[[1]]
[1] "510"  "548"  "2238"

[[2]]
[1] "707"  "231"  "2440"

[[3]]
[1] "650"  "752"  "1300"
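
Since strsplit() returns a list, a follow-up step is often needed to work with the pieces. The sketch below shows two possible approaches, and also a caveat that follows from split being a regular expression: a character such as "." has a special meaning unless fixed = TRUE is used. The names parts, area, mat, and literal are introduced only for this example.

```r
tels <- c("510-548-2238", "707-231-2440", "650-752-1300")
parts <- strsplit(tels, "-")

# extract the first piece of every element
area <- sapply(parts, "[", 1)
area
# [1] "510" "707" "650"

# bind the pieces into a character matrix, one row per number
mat <- do.call(rbind, parts)
dim(mat)
# [1] 3 3

# split is a regex, so "." alone would match any character;
# fixed = TRUE splits on the literal dot instead
literal <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]
literal
# [1] "a" "b" "c"
```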