13 Anchors

by Chitra Venkatesh

Anchors are metacharacters that assert a position in the input, such as the beginning or end of a string, rather than matching any characters themselves.

Anchor	Description	Example
`^`	Matches a line starting with the substring.	`^New`
`$`	Matches a line ending with the substring.	`y$`
`^ $`	Matches a line that starts and ends with the substring, i.e., an exact match.	`^Hi There$`
`\\A`	Matches input starting with the substring.	`\\AHello`
`\\Z` and `\\z`	Matches input ending with the substring. `\\Z` also matches if there is a newline after the substring.	`End\\Z`
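The anchors in the table can be tried out on a few short strings before moving on. The following is a minimal sketch: the sample strings and variable names are made up for illustration and are not part of the university example used in the rest of this chapter.

```r
library(stringr)

# Hypothetical sample strings, chosen only to exercise each anchor
starts_new <- str_detect(c("New York", "Brand New"), "^New")
starts_new
#> [1]  TRUE FALSE

ends_y <- str_detect(c("Sunny", "Yes"), "y$")
ends_y
#> [1]  TRUE FALSE

exact <- str_detect(c("Hi There", "Hi There!"), "^Hi There$")
exact
#> [1]  TRUE FALSE

# With multiline = TRUE, ^ matches at the start of every line,
# while \\A still matches only at the very start of the input
two_lines <- "Hello\nHello"
n_caret <- str_count(two_lines, regex("^Hello", multiline = TRUE))
n_caret
#> [1] 2
n_A <- str_count(two_lines, regex("\\AHello", multiline = TRUE))
n_A
#> [1] 1
```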

As an example, we will consider a simple character vector university containing names of some universities.

university <- c(
  "University of Southern California", 
  "California State University", 
  "Stanford University", 
  "University of California, Berkeley")

university
#> [1] "University of Southern California"  "California State University"       
#> [3] "Stanford University"                "University of California, Berkeley"

Let’s try to detect university names that begin with University. To do this, we use str_detect() from the stringr package. We also wrap the pattern in regex() to tell R explicitly that we are passing a regex pattern. This is the default behavior, so you may choose to skip it.

What about the regex pattern? Because we are interested in matching those strings that begin with the word University, we are going to use the ^ metacharacter as the starting anchor, followed by the string we want to match: ^University. Here’s how:

library(stringr)

university <- c(
  "University of Southern California", 
  "California State University", 
  "Stanford University", 
  "University of California, Berkeley")

str_detect(university, regex("^University"))
#> [1]  TRUE FALSE FALSE  TRUE
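To see that wrapping the pattern in regex() is optional, the sketch below compares both forms of the call. It also uses str_starts(), which, assuming stringr 1.4.0 or later is installed, anchors the pattern at the start for us. The variable names are illustrative only.

```r
library(stringr)

university <- c(
  "University of Southern California",
  "California State University",
  "Stanford University",
  "University of California, Berkeley")

# A plain string pattern is interpreted as a regex by default,
# so both calls return the same logical vector
with_wrapper <- str_detect(university, regex("^University"))
without_wrapper <- str_detect(university, "^University")
identical(with_wrapper, without_wrapper)
#> [1] TRUE

# str_starts() adds the starting anchor itself (stringr >= 1.4.0 assumed)
starts <- str_starts(university, "University")
starts
#> [1]  TRUE FALSE FALSE  TRUE
```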

As the output shows, only the first and fourth elements in university are matched.

Similarly, we can detect university names that end with University. To do this, we use the metacharacter $ as the ending anchor, forming the pattern: University$.

str_detect(university, regex("University$"))
#> [1] FALSE  TRUE  TRUE FALSE
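If we want the matching names themselves rather than TRUE/FALSE values, one option is str_subset(), which keeps only the elements that match. The sketch below also combines both anchors for an exact match; the variable names are illustrative and not part of the original example.

```r
library(stringr)

university <- c(
  "University of Southern California",
  "California State University",
  "Stanford University",
  "University of California, Berkeley")

# Keep only the names that end with "University"
ends_with <- str_subset(university, "University$")
ends_with
#> [1] "California State University" "Stanford University"

# Using both anchors requires the whole string to match
exact <- str_detect(university, "^Stanford University$")
exact
#> [1] FALSE FALSE  TRUE FALSE
```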

In the previous example, we detected university names that end with University. Let’s try to extract them now. To make things more interesting, we have modified university so that it is now a single string spanning multiple lines (notice the newline characters \n).

university <- c(
  "University of Southern California
  \nCalifornia State University
  \nStanford University
  \nUniversity of California, Berkeley\n")

cat(university)
#> University of Southern California
#>   
#> California State University
#>   
#> Stanford University
#>   
#> University of California, Berkeley

We use str_extract_all() instead of str_extract() to extract all occurrences of the pattern. Also note that we set multiline = TRUE in regex() so that ^ and $ match at the start and end of every line, rather than only at the start and end of the whole input.

university <- c(
  "University of Southern California
  \nCalifornia State University
  \nStanford University
  \nUniversity of California, Berkeley\n")

str_extract_all(university, regex("[A-Za-z ]*University$", multiline = TRUE))
#> [[1]]
#> [1] "California State University" "Stanford University"
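To make the role of the multiline argument concrete, the following sketch runs the same pattern with and without it. The two-line input and the variable names are hypothetical and shorter than the example above, so the difference is easy to follow.

```r
library(stringr)

# Hypothetical two-line input; only the first line ends with "University"
two_lines <- "Stanford University\nUniversity of California, Berkeley"
pattern <- "[A-Za-z ]*University$"

# Default: $ is tested only at the end of the whole input, so nothing matches
single_mode <- str_extract_all(two_lines, regex(pattern))
lengths(single_mode)
#> [1] 0

# multiline = TRUE: $ also matches just before each newline
multi_mode <- str_extract_all(two_lines, regex(pattern, multiline = TRUE))
multi_mode[[1]]
#> [1] "Stanford University"
```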

Lastly, let’s try to extract the last word of our input from the previous example.

Because the pattern matches only once, it makes little difference whether we use str_extract() or str_extract_all(). Both find the same single match, but the former returns a character vector while the latter returns a list of character vectors.

str_extract(university, regex("[A-Za-z]+\\Z", multiline = TRUE))
#> [1] "Berkeley"
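The difference in return types between the two functions can be checked directly. This is a small sketch on a hypothetical shortened input; the variable names are illustrative.

```r
library(stringr)

# Hypothetical shortened input with a terminating newline
text <- "Stanford University\nUniversity of California, Berkeley\n"
last_word <- regex("[A-Za-z]+\\Z", multiline = TRUE)

# str_extract() returns a character vector with one entry per input string
first_match <- str_extract(text, last_word)
class(first_match)
#> [1] "character"

# str_extract_all() returns a list with one character vector per input string
all_matches <- str_extract_all(text, last_word)
class(all_matches)
#> [1] "list"
all_matches[[1]]
#> [1] "Berkeley"
```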

Notice that \\Z works even in the presence of a terminating newline \n. In contrast, \\z matches only at the very end of the input, so it fails here until we remove the terminating \n.

str_extract(university, regex("[A-Za-z ]+\\z", multiline = TRUE))
#> [1] NA

With the newline terminator removed from the input, \\z works just as well.

university <- c("University of Southern California
               \nCalifornia State University
               \nStanford University
               \nUniversity of California, Berkeley")

str_extract(university, regex("[A-Za-z ]+\\Z", multiline = TRUE))
#> [1] " Berkeley"

str_extract(university, regex("[A-Za-z ]+\\z", multiline = TRUE))
#> [1] " Berkeley"
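The contrast between \\Z and \\z is easiest to see on a very short string. The sketch below uses hypothetical inputs and variable names. It also shows str_trim() as one possible way to drop a trailing newline before applying \\z, rather than editing the input by hand.

```r
library(stringr)

# Hypothetical inputs: the same text with and without a terminating newline
with_newline <- "The End\n"
without_newline <- "The End"

# \\Z tolerates a single terminating newline; \\z does not
z_upper <- str_detect(with_newline, "End\\Z")
z_upper
#> [1] TRUE
z_lower <- str_detect(with_newline, "End\\z")
z_lower
#> [1] FALSE

# Without the terminating newline, \\z matches as well
z_lower_clean <- str_detect(without_newline, "End\\z")
z_lower_clean
#> [1] TRUE

# Trimming trailing whitespace (including "\n") has the same effect
z_lower_trimmed <- str_detect(str_trim(with_newline, side = "right"), "End\\z")
z_lower_trimmed
#> [1] TRUE
```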