13  Regex Functions in "stringr"

In the previous chapters we talked about regular expressions in general; we discussed the particular way in which R works with regex patterns; and we quickly presented some functions to manipulate strings with regular expressions. In this chapter we are going to describe in more detail the functions for regular expressions available in the "stringr" package.

As you know, we have already presented some of the functions in the R package "stringr" for regular expressions. As we mentioned, all these functions share a common usage structure:

 str_function(string, pattern)

The two main arguments are: a string vector to be processed, and a single pattern (i.e. regular expression) to match. Moreover, all the function names begin with the prefix str_, followed by the name of the action to be performed. For example, to locate the position of the first match, we should use str_locate(); to locate the positions of all matches we should use str_locate_all().

13.1 Detecting patterns with str_detect()

For detecting whether a pattern is present (or absent) in a string vector, we can use the function str_detect(). This function plays the same role as grepl() in base R (early versions of "stringr" implemented it as a wrapper of grepl()):

# some objects
some_objs <- c("pen", "pencil", "marker", "spray")

# detect the pattern "pen"
str_detect(some_objs, "pen")
[1]  TRUE  TRUE FALSE FALSE
# select detected matches
some_objs[str_detect(some_objs, "pen")]
[1] "pen"    "pencil"

As you can see, the output of str_detect() is a logical vector (TRUE/FALSE) of the same length as the input string vector. You get TRUE if a match is detected in a string, and FALSE otherwise. Here’s a more elaborate example in which the pattern matches dates of the form day-month-year, with the month written as text:

# some strings
strings <- c("12 Jun 2002", " 8 September 2004 ", "22-July-2009 ", 
            "01 01 2001", "date", "02.06.2000", 
            "xxx-yyy-zzzz", "$2,600")

# date pattern (month as text)
dates = "([0-9]{1,2})[- .]([a-zA-Z]+)[- .]([0-9]{4})"

# detect dates
str_detect(strings, dates)
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
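
Because the output of str_detect() is a logical vector, it can be summed or negated like any other logical vector. The following lines are a small sketch of that idea; the object name has_pen is our own choice for illustration:

```r
library(stringr)

# some objects
some_objs <- c("pen", "pencil", "marker", "spray")

# logical vector: TRUE where the pattern "pen" is found
has_pen <- str_detect(some_objs, "pen")

# number of elements in which the pattern is present
sum(has_pen)

# elements in which the pattern is absent
some_objs[!has_pen]
```

Here sum() counts the TRUE values, and the negation operator ! selects the elements without a match.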

13.2 Extract first match with str_extract()

For extracting the piece of a string that matches a pattern, we can use the function str_extract(). More precisely, this function extracts the first piece of each string that matches a given pattern. For example, imagine that we have a character vector with some tweets about Paris, and that we want to extract the hashtags. We can do this simply by defining a #hashtag pattern like #[a-zA-Z]{1,}:

# tweets about 'Paris'
paris_tweets <- c(
  "#Paris is chock-full of cultural and culinary attractions",
  "Some time in #Paris along Canal St.-Martin famous by #Amelie",
  "While you're in #Paris, stop at cafe: http://goo.gl/yaCbW",
  "Paris, the city of light")

# hashtag pattern
hash <- "#[a-zA-Z]{1,}"

# extract (first) hashtag
str_extract(paris_tweets, hash)
[1] "#Paris" "#Paris" "#Paris" NA      

As you can tell, the output of str_extract() is a vector of the same length as string. Those elements that don’t match the pattern are indicated as NA. Note that str_extract() only returns the first match in each element: it didn’t extract the hashtag "#Amelie".
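
Since unmatched elements come back as NA, a common follow-up step is to drop them. Here is a minimal sketch of that step; the name first_hash is just an illustrative choice:

```r
library(stringr)

# tweets about 'Paris'
paris_tweets <- c(
  "#Paris is chock-full of cultural and culinary attractions",
  "Some time in #Paris along Canal St.-Martin famous by #Amelie",
  "While you're in #Paris, stop at cafe: http://goo.gl/yaCbW",
  "Paris, the city of light")

# first hashtag of each tweet (NA when there is none)
first_hash <- str_extract(paris_tweets, "#[a-zA-Z]{1,}")

# keep only the hashtags that were actually found
first_hash[!is.na(first_hash)]
```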

13.3 Extract all matches with str_extract_all()

In addition to str_extract(), "stringr" also provides the function str_extract_all(). As its name indicates, we use str_extract_all() to extract all the matches in a string vector. Taking the same vector as in the previous example, we can extract all the hashtag matches like so:

# extract (all) hashtags
str_extract_all(paris_tweets, "#[a-zA-Z]{1,}")
[[1]]
[1] "#Paris"

[[2]]
[1] "#Paris"  "#Amelie"

[[3]]
[1] "#Paris"

[[4]]
character(0)

Compared to str_extract(), the output of str_extract_all() is a list of the same length as string. In addition, those elements that don’t match the pattern are indicated with an empty character vector character(0) instead of NA.
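
Because the result is a list, base R functions such as lengths() and unlist() are handy for summarizing it. The following sketch assumes we want the number of hashtags per tweet and a single flat vector of hashtags; the name all_hash is our own:

```r
library(stringr)

# tweets about 'Paris'
paris_tweets <- c(
  "#Paris is chock-full of cultural and culinary attractions",
  "Some time in #Paris along Canal St.-Martin famous by #Amelie",
  "While you're in #Paris, stop at cafe: http://goo.gl/yaCbW",
  "Paris, the city of light")

# list with all the hashtags of each tweet
all_hash <- str_extract_all(paris_tweets, "#[a-zA-Z]{1,}")

# number of hashtags in each tweet
lengths(all_hash)

# all hashtags in a single character vector
unlist(all_hash)
```

Note that the empty element character(0) simply contributes nothing when the list is flattened with unlist().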

13.4 Extract first match group with str_match()

Closely related to str_extract(), the package "stringr" offers another extracting function: str_match(). This function not only extracts the matched pattern but also shows each of the capture groups, that is, the parts of the pattern enclosed in parentheses.

# string vector
strings <- c("12 Jun 2002", " 8 September 2004 ", "22-July-2009 ", 
            "01 01 2001", "date", "02.06.2000", 
            "xxx-yyy-zzzz", "$2,600")

# date pattern (month as text)
dates = "([0-9]{1,2})[- .]([a-zA-Z]+)[- .]([0-9]{4})"

# extract first matched group
str_match(strings, dates)
     [,1]               [,2] [,3]        [,4]  
[1,] "12 Jun 2002"      "12" "Jun"       "2002"
[2,] "8 September 2004" "8"  "September" "2004"
[3,] "22-July-2009"     "22" "July"      "2009"
[4,] NA                 NA   NA          NA    
[5,] NA                 NA   NA          NA    
[6,] NA                 NA   NA          NA    
[7,] NA                 NA   NA          NA    
[8,] NA                 NA   NA          NA    

Note that the output is not a vector but a character matrix. The first column is the complete match, the other columns are each of the captured groups. For those unmatched elements, there is a missing value NA.
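
Since each captured group sits in its own column, we can pick the columns apart with ordinary matrix indexing. As a sketch (the names m and parts, and the shortened input vector, are chosen here only for illustration):

```r
library(stringr)

# a shorter version of the string vector
strings <- c("12 Jun 2002", " 8 September 2004 ", "22-July-2009 ", "date")

# date pattern (month as text)
dates <- "([0-9]{1,2})[- .]([a-zA-Z]+)[- .]([0-9]{4})"

# matrix: full match in column 1, groups in columns 2 to 4
m <- str_match(strings, dates)

# arrange the captured groups in a data frame
parts <- data.frame(
  day   = as.integer(m[, 2]),
  month = m[, 3],
  year  = as.integer(m[, 4]),
  stringsAsFactors = FALSE)
parts
```

The unmatched element "date" produces a row of NA values in the data frame.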

13.5 Extract all matched groups with str_match_all()

If what we’re looking for is extracting all the matches (and their groups) in a string vector, instead of using str_match() we should use str_match_all():

# tweets about 'Paris'
paris_tweets <- c(
  "#Paris is chock-full of cultural and culinary attractions",
  "Some time in #Paris along Canal St.-Martin famous by #Amelie",
  "While you're in #Paris, stop at cafe: http://goo.gl/yaCbW",
  "Paris, the city of light")

# match (all) hashtags in 'paris_tweets'
str_match_all(paris_tweets, "#[a-zA-Z]{1,}")
[[1]]
     [,1]    
[1,] "#Paris"

[[2]]
     [,1]     
[1,] "#Paris" 
[2,] "#Amelie"

[[3]]
     [,1]    
[1,] "#Paris"

[[4]]
     [,1]

Compared to str_match(), the output of str_match_all() is a list. Note also that each element of the list is a matrix with as many rows as hashtag matches. In turn, those elements that don’t match the pattern are indicated with an empty matrix (zero rows) instead of NA.
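
The matrices become more interesting when the pattern contains a capture group. The sketch below uses a slightly modified pattern, "#([a-zA-Z]+)", which is our own variation of the hashtag pattern: the parentheses capture the hashtag text without the # symbol.

```r
library(stringr)

# tweets about 'Paris'
paris_tweets <- c(
  "#Paris is chock-full of cultural and culinary attractions",
  "Some time in #Paris along Canal St.-Martin famous by #Amelie",
  "While you're in #Paris, stop at cafe: http://goo.gl/yaCbW",
  "Paris, the city of light")

# match all hashtags, capturing the text after '#'
tags <- str_match_all(paris_tweets, "#([a-zA-Z]+)")

# second tweet: column 1 is the full match, column 2 the captured group
tags[[2]]

# captured text only
tags[[2]][, 2]

# number of matches per tweet (zero rows for the last tweet)
sapply(tags, nrow)
```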

13.6 Locate first match with str_locate()

Besides detecting, extracting and matching regex patterns, "stringr" allows us to locate occurrences of patterns. For locating the position of the first occurrence of a pattern in a string vector, we should use str_locate().

# locate position of (first) hashtag
str_locate(paris_tweets, "#[a-zA-Z]{1,}")
     start end
[1,]     1   6
[2,]    14  19
[3,]    17  22
[4,]    NA  NA

The output of str_locate() is a matrix with two columns and as many rows as elements in the (string) vector. The first column of the output is the start position, while the second column is the end position.

In the previous example, the result is a matrix with 4 rows and 2 columns. The first row corresponds to the hashtag of the first tweet. It starts at position 1 and ends at position 6. The second row corresponds to the hashtag of the second tweet; its start position is the 14th character, and its end position is the 19th character. The fourth row corresponds to the fourth tweet. Since there are no hashtags the values in that row are NA’s.
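
The start and end positions can be passed to str_sub() to recover the located text. This is only a sketch to check the positions; in practice str_extract() gives the same result directly:

```r
library(stringr)

# tweets about 'Paris'
paris_tweets <- c(
  "#Paris is chock-full of cultural and culinary attractions",
  "Some time in #Paris along Canal St.-Martin famous by #Amelie",
  "While you're in #Paris, stop at cafe: http://goo.gl/yaCbW",
  "Paris, the city of light")

# positions of the first hashtag in each tweet
loc <- str_locate(paris_tweets, "#[a-zA-Z]{1,}")

# substrings between the start and end positions
found <- str_sub(paris_tweets, loc[, "start"], loc[, "end"])
found
```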

13.7 Locate all matches with str_locate_all()

To locate not just the first but all the occurrences of a pattern in a string vector, we should use str_locate_all():

# locate (all) hashtags in 'paris_tweets'
str_locate_all(paris_tweets, "#[a-zA-Z]{1,}")
[[1]]
     start end
[1,]     1   6

[[2]]
     start end
[1,]    14  19
[2,]    54  60

[[3]]
     start end
[1,]    17  22

[[4]]
     start end

Compared to str_locate(), the output of str_locate_all() is a list of the same length as the provided string vector. Each of the list elements is in turn a matrix with two columns. Those elements that don’t match the pattern are indicated with an empty matrix (zero rows) instead of NA.

Looking at the obtained result from applying str_locate_all() to paris_tweets, you can see that the second element contains the start and end positions for both hashtags #Paris and #Amelie. In turn, the fourth element appears empty since its associated tweet contains no hashtags.
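
As a quick check of these positions, we can count the rows of each matrix and use str_sub() on the second tweet. The names locs, n_tags, and second are our own choices for this sketch:

```r
library(stringr)

# tweets about 'Paris'
paris_tweets <- c(
  "#Paris is chock-full of cultural and culinary attractions",
  "Some time in #Paris along Canal St.-Martin famous by #Amelie",
  "While you're in #Paris, stop at cafe: http://goo.gl/yaCbW",
  "Paris, the city of light")

# positions of all hashtags
locs <- str_locate_all(paris_tweets, "#[a-zA-Z]{1,}")

# number of hashtags per tweet = number of rows in each matrix
n_tags <- sapply(locs, nrow)
n_tags

# recover both hashtags of the second tweet from their positions
second <- locs[[2]]
second_tags <- str_sub(paris_tweets[2], second[, "start"], second[, "end"])
second_tags
```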

13.8 Replace first match with str_replace()

For replacing the first occurrence of a matched pattern in a string, we can use str_replace(). Its usage has the following form:

str_replace(string, pattern, replacement)

In addition to the two main inputs shared with the rest of the functions, str_replace() requires a third argument that indicates the replacement string.

Say we have the city names of San Francisco, Barcelona, Naples and Paris in a vector. And let’s suppose that we want to replace the first vowel in each name with a semicolon. Here’s how we can do that:

# city names
cities <- c("San Francisco", "Barcelona", "Naples", "Paris")

# replace first matched vowel
str_replace(cities, "[aeiou]", ";")
[1] "S;n Francisco" "B;rcelona"     "N;ples"        "P;ris"        

Now, suppose that we want to replace the first consonant in each name. We just need to modify the pattern with a negated class:

# replace first matched consonant
str_replace(cities, "[^aeiou]", ";")
[1] ";an Francisco" ";arcelona"     ";aples"        ";aris"        
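
Keep in mind that the negated class [^aeiou] matches any character that is not a lowercase vowel, which includes uppercase vowels, spaces, and digits. The following sketch contrasts it with an explicit class of consonants; the vector places and the pattern stored in consonant are our own illustration, and the class treats "y" as a consonant:

```r
library(stringr)

# names that start with an uppercase vowel or a consonant
places <- c("Oslo", "Athens", "San Francisco")

# the negated class also matches the uppercase vowels "O" and "A"
negated <- str_replace(places, "[^aeiou]", ";")
negated

# explicit class of lowercase and uppercase consonants
consonant <- "[b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z]"
explicit <- str_replace(places, consonant, ";")
explicit
```

With the explicit class, the first replaced character in "Oslo" is "s" and in "Athens" it is "t".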

13.9 Replace all matches with str_replace_all()

For replacing all occurrences of a matched pattern in a string, we can use str_replace_all(). Once again, consider a vector with some city names, and let’s suppose that we want to replace all the vowels in each name:

# city names
cities <- c("San Francisco", "Barcelona", "Naples", "Paris")

# replace all matched vowels
str_replace_all(cities, pattern="[aeiou]", ";")
[1] "S;n Fr;nc;sc;" "B;rc;l;n;"     "N;pl;s"        "P;r;s"        

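Alternatively, to replace all consonants with a semicolon in each name, we just need to change the pattern to a negated class. Note that [^aeiou] matches every character that is not a lowercase vowel, so the space in "San Francisco" is replaced as well: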

# replace all matched consonants
str_replace_all(cities, pattern="[^aeiou]", ";")
[1] ";a;;;;a;;i;;o" ";a;;e;o;a"     ";a;;e;"        ";a;i;"        

13.10 String splitting with str_split()

Similar to strsplit(), "stringr" gives us the function str_split() to separate a character vector into a number of pieces. This function has the following usage:

str_split(string, pattern, n = Inf)

The argument n is the maximum number of pieces to return. The default value (n = Inf) implies that all possible split positions are used.

Let’s see the same example of strsplit() in which we wish to split up a sentence into individual words:

# a sentence
sentence <- c("R is a collaborative project with many contributors")

# split into words
str_split(sentence, " ")
[[1]]
[1] "R"             "is"            "a"             "collaborative"
[5] "project"       "with"          "many"          "contributors" 

Likewise, we can break apart the portions of a telephone number by splitting those sets of digits joined by a dash "-":

# telephone numbers
tels = c("510-548-2238", "707-231-2440", "650-752-1300")

# split each number into its portions
str_split(tels, "-")
[[1]]
[1] "510"  "548"  "2238"

[[2]]
[1] "707"  "231"  "2440"

[[3]]
[1] "650"  "752"  "1300"

The result is a list of character vectors. Each element of the string vector corresponds to an element in the resulting list. In turn, each of the list elements will contain the split vectors (i.e. number of pieces) occurring from the matches.
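
To pull the same piece out of every list element, we can combine str_split() with sapply(). As a small sketch, here is one way to get the first set of digits of each telephone number; the names pieces and area_codes are our own:

```r
library(stringr)

# telephone numbers
tels <- c("510-548-2238", "707-231-2440", "650-752-1300")

# list with the portions of each number
pieces <- str_split(tels, "-")

# first portion of each number
area_codes <- sapply(pieces, `[`, 1)
area_codes
```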

In order to show the use of the argument n, let’s consider a vector with flavors "chocolate", "vanilla", "cinnamon", "mint", and "lemon". Suppose we want to split each flavor name defining as pattern the class of vowels:

# string
flavors <- c("chocolate", "vanilla", "cinnamon", "mint", "lemon")

# split by vowels
str_split(flavors, "[aeiou]")
[[1]]
[1] "ch" "c"  "l"  "t"  ""  

[[2]]
[1] "v"  "n"  "ll" ""  

[[3]]
[1] "c"  "nn" "m"  "n" 

[[4]]
[1] "m"  "nt"

[[5]]
[1] "l" "m" "n"

Now let’s modify the maximum number of pieces to n = 2. This means that str_split() will split each element into a maximum of 2 pieces. Here’s what we obtain:

# split by first vowel
str_split(flavors, "[aeiou]", n=2)
[[1]]
[1] "ch"     "colate"

[[2]]
[1] "v"     "nilla"

[[3]]
[1] "c"      "nnamon"

[[4]]
[1] "m"  "nt"

[[5]]
[1] "l"   "mon"
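
Limiting the number of pieces is useful when the separator may also appear inside the last piece. The following sketch uses made-up "key=value" strings (the vector settings is purely hypothetical) in which the value itself contains an equals sign:

```r
library(stringr)

# hypothetical key=value strings; the values also contain "="
settings <- c("title=R = fun", "url=http://example.com/?q=1")

# split only at the first "=" by asking for at most 2 pieces
kv <- str_split(settings, "=", n = 2)
kv
```

With n = 2 everything after the first "=" stays together in the second piece.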

13.11 String splitting with str_split_fixed()

In addition to str_split(), there is also the str_split_fixed() function that splits up a string into a fixed number of pieces. Its usage has the following form:

str_split_fixed(string, pattern, n)

Note that the argument n does not have a default value. In other words, we need to specify an integer to indicate the number of pieces.

Consider again the same vector of flavors, and the letter "n" as the pattern to match. Let’s see the behavior of str_split_fixed() with n = 2.

# string
flavors <- c("chocolate", "vanilla", "cinnamon", "mint", "lemon")

# split flavors into 2 pieces
str_split_fixed(flavors, "n", 2)
     [,1]        [,2]   
[1,] "chocolate" ""     
[2,] "va"        "illa" 
[3,] "ci"        "namon"
[4,] "mi"        "t"    
[5,] "lemo"      ""     

The output is a character matrix with as many columns as n = 2. Since "chocolate" does not contain any letter "n", its corresponding value in the second column remains empty "". The value of the second column associated to "lemon" is also empty, but for a different reason: this flavor ends with "n", so it is split up into "lemo" and "".

If we change the value n = 3, we will obtain a matrix with three columns:

# split flavors into 3 pieces
str_split_fixed(flavors, "n", 3)
     [,1]        [,2]   [,3]  
[1,] "chocolate" ""     ""    
[2,] "va"        "illa" ""    
[3,] "ci"        ""     "amon"
[4,] "mi"        "t"    ""    
[5,] "lemo"      ""     ""
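
The matrix output is convenient when every string has the same number of pieces. As a sketch, we can split the telephone numbers used earlier into three columns; the column names area, prefix, and line are labels chosen here for illustration:

```r
library(stringr)

# telephone numbers
tels <- c("510-548-2238", "707-231-2440", "650-752-1300")

# split each number into exactly 3 pieces
parts <- str_split_fixed(tels, "-", 3)

# label the columns (names are our own choice)
colnames(parts) <- c("area", "prefix", "line")
parts

# a whole column can then be selected by name
parts[, "area"]
```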