10  Anchors

In this chapter, we discuss the regex topics known as anchors and quantifiers.

# packages used in this chapter
library(stringr)

10.1 Anchors

Anchors are metacharacters that assert a position in the string, such as its beginning or end, rather than matching an actual character.

Anchor       Description                                                                        Example
^            Matches a line starting with the substring.                                        ^New
$            Matches a line ending with the substring.                                          y$
^ $          Matches a line that starts and ends with the substring, i.e., an exact match.      ^Hi There$
\\A          Matches input starting with the substring.                                         \\AHello
\\Z and \\z  Matches input ending with the substring; \\Z also matches before a final newline.  End\\Z
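
The difference between ^ and \\A only shows up on multi-line input. The following is a minimal sketch, assuming a made-up two-line string called greetings that is not part of this chapter's data: with multiline = TRUE the caret matches at the start of every line, whereas \\A still matches only at the start of the whole input.

```r
library(stringr)

# Hypothetical two-line input, used only for this illustration
greetings <- "Hello world\nHello again"

# ^ with multiline = TRUE matches at the start of each line
caret_count <- str_count(greetings, regex("^Hello", multiline = TRUE))

# \\A matches only at the start of the whole input, even in multiline mode
input_start_count <- str_count(greetings, regex("\\AHello", multiline = TRUE))

caret_count
input_start_count
```

Here caret_count should be 2 and input_start_count should be 1, since only the first Hello sits at the very beginning of the input.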

As an example, we will consider a simple character vector universities containing names of some universities.

universities <- c(
  "University of California, Berkeley",
  "University of California, San Francisco",
  "San Francisco State University", 
  "California State University")

universities
[1] "University of California, Berkeley"     
[2] "University of California, San Francisco"
[3] "San Francisco State University"         
[4] "California State University"            

10.1.1 Start of String

Let’s try to detect university names that begin with University. To do this, we use str_detect() from the "stringr" package, with a regex pattern that looks for the word University at the beginning of the text. The caret ^ metacharacter is the starting anchor; we place it before the string we want to match: "^University"

str_detect(universities, "^University")
[1]  TRUE  TRUE FALSE FALSE

As you can tell, only the first and second elements in universities are being matched.

str_extract(universities, "^University")
[1] "University" "University" NA           NA          
universities[str_detect(universities, "^University")]
[1] "University of California, Berkeley"     
[2] "University of California, San Francisco"
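
To see what the anchor contributes, it can help to run the same pattern with and without it. This is a small sketch that reuses the four names from the text; the variable names unanchored and anchored are introduced here only for illustration.

```r
library(stringr)

# Same four names as in the text above
universities <- c(
  "University of California, Berkeley",
  "University of California, San Francisco",
  "San Francisco State University",
  "California State University")

# Without an anchor, the pattern may match anywhere in the string
unanchored <- str_detect(universities, "University")

# With ^, the match has to begin at the first character
anchored <- str_detect(universities, "^University")

unanchored
anchored
```

Every name contains the word University somewhere, so the unanchored pattern matches all four elements; the anchored pattern keeps only the first two.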

10.1.2 End of String

Now let’s try to detect university names that end with the word University. To do this, we use the metacharacter $ as the ending anchor, forming the pattern: "University$".

str_detect(universities, regex("University$"))
[1] FALSE FALSE  TRUE  TRUE

Compared to the previous example, now only the third and fourth elements in universities are being matched.

str_extract(universities, "University$")
[1] NA           NA           "University" "University"
universities[str_detect(universities, "University$")]
[1] "San Francisco State University" "California State University"   
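
The two anchors can also be combined to require an exact match, as in the ^Hi There$ row of the table. The sketch below assumes a hypothetical vector schools, made up for this illustration, in which only the first element is exactly the target name.

```r
library(stringr)

# Hypothetical names, not part of the chapter's data
schools <- c(
  "California State University",
  "California State University, Fresno",
  "The California State University")

# ^ and $ together: the whole string must equal the pattern
exact <- str_detect(schools, "^California State University$")

exact
```

Only the first element is matched: the second has extra text after the name and the third has extra text before it.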

To make things more interesting, we have modified the content of the vector universities so that it now consists of a single string spanning multiple lines (notice the newline characters \n).

universities <- c(
  "University of California, Berkeley
  \nUniversity of California, San Francisco
  \nSan Francisco State University
  \nCalifornia State University\n")

cat(universities)
University of California, Berkeley
  
University of California, San Francisco
  
San Francisco State University
  
California State University

Say we are interested in extracting university names that end with the word University. As a first step, we can try using str_detect() with the pattern University$

str_detect(universities, "University$")
[1] TRUE
universities[str_detect(universities, "University$")]
[1] "University of California, Berkeley\n  \nUniversity of California, San Francisco\n  \nSan Francisco State University\n  \nCalifornia State University\n"

Something was matched, but what exactly? str_view() can give us the answer to this question:

str_view(universities, "University$")

Notice that the match occurred only at the end of the string in universities. But what if we are really interested in matching the names San Francisco State University and California State University?

We shall use str_extract_all() instead of str_extract() to extract all occurrences of the pattern. In addition, the pattern should be specified inside the regex() function, using its multiline argument so that ^ and $ match at the start and end of each line rather than only at the start and end of the whole input. Here’s how:

str_extract_all(universities, 
                regex("[A-Za-z ]*University$", multiline = TRUE))
[[1]]
[1] "San Francisco State University" "California State University"   
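
The effect of multiline = TRUE can be checked by running the same pattern in both modes. This is a minimal sketch on a hypothetical three-line string campus_lines (introduced here so that the universities vector from the text stays untouched); the names default_mode and multiline_mode are likewise only illustrative.

```r
library(stringr)

# Hypothetical three-line input, used only for this comparison
campus_lines <- paste(
  "University of California, Berkeley",
  "San Francisco State University",
  "California State University",
  sep = "\n")

# Default mode: $ matches only at the end of the whole input
default_mode <- str_extract_all(campus_lines, "[A-Za-z ]*University$")[[1]]

# Multiline mode: $ also matches at the end of every line
multiline_mode <- str_extract_all(
  campus_lines,
  regex("[A-Za-z ]*University$", multiline = TRUE))[[1]]

default_mode
multiline_mode
```

In default mode only the last line qualifies, while in multiline mode both lines that end with University are returned.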

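Lastly, let’s try to extract the last word of our input from the previous example.

Here it does not matter much whether we use str_extract() or str_extract_all(), because there is at most one match at the end of the input. The only difference is the shape of the result: str_extract() returns a character vector, whereas str_extract_all() returns a list of character vectors.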

str_extract(universities, regex("[A-Za-z]+\\Z", multiline = TRUE))
[1] "University"

Notice that \\Z works even in the presence of a terminating newline \n. However, \\z only matches at the very end of the input, so it will not work until we remove the terminating \n.

str_extract(universities, regex("[A-Za-z ]+\\z", multiline = TRUE))
[1] NA

With the newline terminator removed from the input, \\z works just as well.

universities <- c(
  "University of Southern California
  \nCalifornia State University
  \nStanford University
  \nUniversity of California, Berkeley")

str_extract(universities, regex("[A-Za-z ]+\\Z", multiline = TRUE))
[1] " Berkeley"
str_extract(universities, regex("[A-Za-z ]+\\z", multiline = TRUE))
[1] " Berkeley"
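
The three end anchors can be compared side by side. The sketch below assumes two hypothetical strings, with_newline and without_newline, that differ only in a trailing \n; the helper function ends_with_university() is also made up for this illustration.

```r
library(stringr)

# Hypothetical inputs that differ only in the trailing newline
with_newline    <- "Stanford University\n"
without_newline <- "Stanford University"

# Illustrative helper: test one string against $, \\Z and \\z
ends_with_university <- function(x) {
  c(dollar  = str_detect(x, "University$"),
    upper_z = str_detect(x, "University\\Z"),
    lower_z = str_detect(x, "University\\z"))
}

res_with    <- ends_with_university(with_newline)
res_without <- ends_with_university(without_newline)

res_with
res_without
```

With the trailing newline present, $ and \\Z still match but \\z does not; without it, all three anchors match.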