12  Boundaries and Look Arounds

In chapter 9 we introduced a handful of character classes such as \d which matches any digit, or \w which matches any character considered to be part of a word. Among these classes, there are two special metacharacters, \b and \B, known as boundaries, that deserve further discussion.

Likewise, we need to talk about another useful regex concept known as look arounds. These patterns allow us to match a token based on the auxiliary tokens around it, while the auxiliary tokens themselves are not part of the match.

12.1 Boundaries

Boundaries are metacharacters too, and they match based on what precedes or follows the current position.

Boundary   Description                                                                                                   Example
\\b        Matches at a word boundary, i.e., one side is a word character [A-Za-z0-9_] and the other side is not        \\bHi, \\bHi\\b
\\B        Matches where there is no word boundary, i.e., both sides are word characters [A-Za-z0-9_] or neither is     \\BHi
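To get a feel for the table, we can test a few words against each boundary pattern. This is only a sketch: it assumes the stringr package is loaded, and the vector words and the data frame checks are made up for illustration.

```r
library(stringr)

# hypothetical test words, not taken from the text
words <- c("is", "This", "island", "irresistible")

# one column per boundary pattern; TRUE means the pattern matches the word
checks <- data.frame(
  word      = words,
  starts_is = str_detect(words, "\\bis"),    # "is" right after a word boundary
  ends_is   = str_detect(words, "is\\b"),    # "is" right before a word boundary
  inside_is = str_detect(words, "\\Bis\\B")  # "is" with word characters on both sides
)
checks
```

Only "irresistible" has is strictly inside a word, so it is the only row where inside_is is TRUE.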

To understand word boundaries, let us go over a simple example. Consider the string in vector book shown below

book <- c("This book is an irresistible thesis")
book
[1] "This book is an irresistible thesis"

Suppose we are interested in matching the pattern is.

If you observe the text in book, you’ll notice that is appears in the words This, is, irresistible, and thesis. Depending on which function you use, you will be able to match just the first occurrence:

# matching first occurrence
str_view(book, "is")

or match all occurrences:

# matching all occurrences
str_view_all(book, "is")
Warning: `str_view_all()` was deprecated in stringr 1.5.0.
ℹ Please use `str_view()` instead.

Instead of matching is without any other restriction or condition, we can also think of three additional matching cases that are more concrete:

  1. Extract words that exactly match with is

  2. Extract words that contain is in between characters

  3. Extract words that end with is

Case 1: Words that exactly match is

To match (or extract) words that exactly match the word is, we use the word-boundary pattern \\b on both sides of is. Note the optional use of the ignore_case argument to regex():

str_view_all(
  string = book, 
  pattern = regex("\\bis\\b", ignore_case = TRUE))

Case 2: Words containing is in between characters

To match words that contain is in between characters, such as “irresistible”, we use the non-word-boundary pattern \\B on both sides of is:

str_view_all(
  string = book, 
  pattern = regex("\\Bis\\B", ignore_case = TRUE))

To extract the entire word “irresistible”, we must include [A-Za-z]* on either side of \\Bis\\B, otherwise the pattern \\Bis\\B will only match is:

str_extract(
  string = book, 
  pattern = regex("[A-Za-z]*\\Bis\\B[A-Za-z]*", ignore_case = TRUE))
[1] "irresistible"

Notice that we use [A-Za-z]* instead of [A-Za-z]+. Because * means zero or more, the surrounding letters are optional, so the fact that no other is got matched is due to \\B alone: the is in This and thesis ends at a word boundary, and the standalone is starts at one.

You may ask “What if we keep only [A-Za-z]*\\Bis and drop the trailing \\B[A-Za-z]*?” Here is what happens:

str_view_all(
  string = book, 
  pattern = regex("[A-Za-z]*\\Bis", ignore_case = TRUE))

The pattern [A-Za-z]*\\Bis matches “This”, “irresis”, and “thesis”, which are extracted as follows:

str_extract_all(
  string = book, 
  pattern = regex("[A-Za-z]*\\Bis", ignore_case = TRUE))
[[1]]
[1] "This"    "irresis" "thesis" 

Case 3: Match (or extract) words that end with is

If you are interested in extracting words that end with is, then you need to use a pattern with the word boundary is\\b. Now, you also need to determine if is should be part of a word with one or more preceding word-characters, or if is should not be preceded by any word-characters.

# end with "is", with preceding word-characters 
str_view_all(
  string = book, 
  pattern = regex("[A-Za-z]+is\\b", ignore_case = TRUE))
# end with "is", without preceding word-characters 
str_view_all(
  string = book, 
  pattern = regex("\\bis\\b", ignore_case = TRUE))
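
To compare the cases side by side, we can count how many times each pattern matches in book. This is a minimal sketch assuming stringr is loaded; the names in the vector counts are labels chosen here for illustration.

```r
library(stringr)

book <- "This book is an irresistible thesis"

# number of matches of each pattern in the same string
counts <- c(
  any_is    = str_count(book, "is"),        # every occurrence of "is"
  exact_is  = str_count(book, "\\bis\\b"),  # Case 1: the word "is" on its own
  inside_is = str_count(book, "\\Bis\\B"),  # Case 2: "is" in between characters
  ends_is   = str_count(book, "is\\b")      # Case 3: "is" at the end of a word
)
counts
```

Note that is\\b on its own also counts the standalone word is, which is why Case 3 asks whether preceding word-characters are required.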

12.2 Look Arounds

As the name suggests, this type of pattern allows us to look around the string in order to match the desired pattern. To be more precise, look arounds are zero-width assertions: like the anchors ^ and $, they indicate positions and do not consume any characters.

There are four types of look arounds, listed in the following table.

Look Around            Notation        Description
Positive Look Ahead    A(?=pattern)    Check if pattern follows A
Negative Look Ahead    A(?!pattern)    Check if pattern does not follow A
Positive Look Behind   (?<=pattern)A   Check if pattern precedes A
Negative Look Behind   (?<!pattern)A   Check if pattern does not precede A

In this table, A refers to a character set or group that we are trying to match.
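
The key point of the table is that the look-around part is checked but not returned. The following sketch illustrates this with a made-up string (price is a hypothetical input, not part of the examples in this chapter), assuming stringr is loaded.

```r
library(stringr)

# hypothetical input, not taken from the text
price <- "total: 100USD"

with_unit    <- str_extract(price, "[0-9]+USD")           # USD is part of the match
without_unit <- str_extract(price, "[0-9]+(?=USD)")       # USD is only checked (look ahead)
after_label  <- str_extract(price, "(?<=total: )[0-9]+")  # label is only checked (look behind)

c(with_unit, without_unit, after_label)
```

The first pattern returns the digits together with USD, while the two look-around patterns return the digits alone.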

12.2.1 Look Aheads

Let us look at some examples for Look Aheads. To do this, consider the vector heights shown below

heights <- c("40cm", "23", "60cm", "57", "133cm")
heights
[1] "40cm"  "23"    "60cm"  "57"    "133cm"

Suppose we want to extract the heights without the unit of measurement, that is, we want to obtain the numeric values but not the letters cm. We can do this by specifying a pattern to match one or more digits [0-9]+

str_extract(heights, "[0-9]+")
[1] "40"  "23"  "60"  "57"  "133"

Let’s change the format of heights by collapsing all of its elements into a single string, with height values separated by commas:

heights <- paste(heights, collapse = ", ")
heights
[1] "40cm, 23, 60cm, 57, 133cm"

With this modified heights, if we use the previous command, we only extract the first occurrence:

str_extract(heights, "[0-9]+")
[1] "40"

In order to extract all occurrences, we must use str_extract_all()

str_extract_all(heights, "[0-9]+")
[[1]]
[1] "40"  "23"  "60"  "57"  "133"

Let’s change our input to contain heights that are in inches

heights <- "40cm, 23in, 60cm, 57, 133cm, 15in, 99"
heights
[1] "40cm, 23in, 60cm, 57, 133cm, 15in, 99"

and reapply the previous command

str_extract_all(heights, "[0-9]+")
[[1]]
[1] "40"  "23"  "60"  "57"  "133" "15"  "99" 

In case we want to retrieve only those heights with unit of measurement cm, we could use a positive look ahead. In the syntax A(?=pattern), the pattern we look for would be cm. Any number that is followed by cm will be extracted, hence A is [0-9]+ (one or more digits).

Don’t forget that (?=cm) does not extract cm; it is used to assert position only!

heights <- "40cm, 23in, 60cm, 57, 133cm, 15in, 99"

str_extract_all(heights, "[0-9]+(?=cm)")
[[1]]
[1] "40"  "60"  "133"
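
The choice of A matters for short values. The sketch below compares one or more digits against two or more digits on a made-up vector (small is hypothetical and includes a one-digit height), assuming stringr is loaded.

```r
library(stringr)

# hypothetical heights, including a one-digit value
small <- c("5cm", "40cm", "23in")

one_or_more <- str_extract(small, "[0-9]+(?=cm)")       # A = one or more digits
two_or_more <- str_extract(small, "[0-9][0-9]+(?=cm)")  # A = two or more digits

one_or_more
two_or_more
```

With two or more digits required, the height 5cm is silently dropped, so [0-9]+ is the safer choice for a positive look ahead.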

In case we want to extract all heights that don’t have cm as the unit of measurement, we could try a negative look ahead. In the syntax A(?!pattern), the pattern here is cm. As a first attempt, let A be two or more digits, [0-9][0-9]+. The output is almost right, but it wrongly includes 13, which is the leading part of 133cm:

heights <- "40cm, 23in, 60cm, 57, 133cm, 15in, 99"

str_extract_all(heights, "[0-9][0-9]+(?!cm)")
[[1]]
[1] "23" "57" "13" "15" "99"

Taking A to be the character class [0-9] with quantifier + (one or more) should work too, but in the code snippet below the incorrect output becomes even more visible. Can you guess why?

heights <- "40cm, 23in, 60cm, 57, 133cm, 15in, 99"

str_extract_all(heights, regex("[0-9]+(?!cm)"))
[[1]]
[1] "4"  "23" "6"  "57" "13" "15" "99"

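This is incorrect since, in addition to the values we actually want, we extract the 4 from 40cm, the 6 from 60cm, and the 13 from 133cm. The regex engine backtracks: 40 is followed by cm, so the engine gives up one digit and accepts 4, which is followed by 0 rather than by cm.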
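
The partial matches can be reproduced in isolation. In this small sketch (the vector x is made up for illustration, and stringr is assumed to be loaded), every number is followed by cm, so ideally nothing should be extracted.

```r
library(stringr)

# every number here is followed by cm, so we would like no match at all
x <- c("40cm", "133cm")

# the quantifier gives back digits until the look ahead no longer sees cm
partial <- str_extract(x, "[0-9]+(?!cm)")
partial
```

Instead of NA, we get the leading digits of each number, which are followed by another digit and therefore pass the (?!cm) check.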

We could overcome this by specifically mentioning that the number (height) cannot be followed by:

  • the pattern cm

  • another digit (i.e., from 40cm, we should not extract 4)

To do this, we could use alternation, denoted as pipe |, which is similar to OR.

The pattern in A(?!pattern) is now cm or [0-9]+, which we can represent as (cm|[0-9]+). Note that our pattern is enclosed within parentheses.

heights <- "40cm, 23in, 60cm, 57, 133cm, 15in, 99"

str_extract_all(heights, regex("[0-9]+(?!(cm|[0-9]+))"))
[[1]]
[1] "23" "57" "15" "99"

The illustration above is typical of regular expressions. As the test cases become more complex, you will have to tweak the expression to cover all the corner cases.
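
As one possible alternative to the alternation, we could stop the engine from giving digits back by using a possessive quantifier, written ++. This is a sketch under the assumption that the regex engine supports possessive quantifiers, as the ICU engine used by stringr does; it is not the approach taken in the text above.

```r
library(stringr)

heights <- "40cm, 23in, 60cm, 57, 133cm, 15in, 99"

# alternation inside the look ahead, as in the text
with_alternation <- str_extract_all(heights, "[0-9]+(?!(cm|[0-9]+))")[[1]]

# possessive quantifier: digits are matched greedily and never given back
with_possessive <- str_extract_all(heights, "[0-9]++(?!cm)")[[1]]

with_alternation
with_possessive
```

Both patterns return the same four heights; which one reads better is largely a matter of taste.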

12.2.2 Look Behinds

As the name suggests, this type of pattern allows us to look behind the current position for the presence or absence of a pattern. This works the same way as a look ahead, except that we check the preceding characters.

Let us look at some examples.

Consider a vector courses that has the name and year of some statistics courses over different semesters.

courses <- c(
  "Stat_133 2020", 
  "Stat_154 2020", 
  "Stat_133 2019", 
  "Stat_151 2018", 
  "Stat_151 2020", 
  "Stat_154 2018")

courses
[1] "Stat_133 2020" "Stat_154 2020" "Stat_133 2019" "Stat_151 2018"
[5] "Stat_151 2020" "Stat_154 2018"

We will use a positive look-behind to extract the years associated with Stat_133. In the syntax (?<=pattern)A, pattern is (Stat_133 ), including the trailing space, and A is the year we want to extract, i.e., [0-9]{4}.

courses <- c(
  "Stat_133 2020", 
  "Stat_154 2020", 
  "Stat_133 2019", 
  "Stat_151 2018", 
  "Stat_151 2020", 
  "Stat_154 2018")

str_extract(courses, regex("(?<=(Stat_133 ))[0-9]{4}"))
[1] "2020" NA     "2019" NA     NA     NA    
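
The same idea works for other pieces of the string. As a minimal sketch (using a shortened, hypothetical version of courses, and assuming stringr is loaded), a look-behind on Stat_ extracts the course number without the prefix.

```r
library(stringr)

# shortened version of the courses vector, for illustration only
some_courses <- c("Stat_133 2020", "Stat_154 2020", "Stat_151 2018")

# digits that come right after "Stat_"; the prefix itself is not returned
numbers <- str_extract(some_courses, "(?<=Stat_)[0-9]+")

# four digits that come right after "Stat_133 " (with the trailing space)
years_133 <- str_extract(some_courses, "(?<=Stat_133 )[0-9]{4}")

numbers
years_133
```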

Similarly, we can use a negative look-behind to extract the years of courses that are not Stat_133. In the syntax (?<!pattern)A, A is [0-9]{4} and the pattern is (Stat_133 ).

courses <- c(
  "Stat_133 2020", 
  "Stat_154 2020", 
  "Stat_133 2019", 
  "Stat_151 2018", 
  "Stat_151 2020", 
  "Stat_154 2018")

str_extract(courses, "(?<!(Stat_133 ))[0-9]{4}")
[1] NA     "2020" NA     "2018" "2020" "2018"
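
The exact quantifier {4} matters in this negative look-behind. The sketch below, assuming stringr is loaded, shows what would happen with the looser [0-9]+ on a single course string (course is just the first element of courses).

```r
library(stringr)

course <- "Stat_133 2020"

# one or more digits: "133" is not preceded by "Stat_133 ", so it is accepted
loose <- str_extract(course, "(?<!Stat_133 )[0-9]+")

# exactly four digits: the only candidate, 2020, is preceded by "Stat_133 "
strict <- str_extract(course, "(?<!Stat_133 )[0-9]{4}")

loose
strict
```

With [0-9]+ the course number itself is returned, whereas [0-9]{4} correctly yields NA for a Stat_133 entry.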

12.3 Logical Operators in Regex

Regex has no dedicated logical operators; however, some of its syntax can replicate them.

We saw in one of the examples for look aheads that logical OR is expressed using the pipe symbol | in regex. This is also known as the “Alternation” operation.

There is no AND in regex, but it can be synthesized using look arounds. For the NOT operation, the caret ^ works inside character classes but cannot be used to negate groups, as explained later.

Operation   Syntax           Example
OR          Pipe symbol |    pattern1|pattern2
NOT         Caret symbol ^   [^aeiou]
AND         Synthetic AND    (?=P1)(?=P2)

12.3.1 Logical OR

Consider the vector people from a previous example.

people <- c(
  "rori", 
  "emilia", 
  "matteo", 
  "mehmet", 
  "filipe", 
  "ana", 
  "victoria")

people
[1] "rori"     "emilia"   "matteo"   "mehmet"   "filipe"   "ana"      "victoria"

If we want to extract names that are of length 3 OR of length 4, we use the pipe symbol.

str_extract(people, regex("^([A-Za-z]{3}|[A-Za-z]{4})$"))
[1] "rori" NA     NA     NA     NA     "ana"  NA    

Note that for the above example, we could use the quantifier {3,4} and still obtain the same result.

str_extract(people, regex("^([A-Za-z]{3,4})$"))
[1] "rori" NA     NA     NA     NA     "ana"  NA    
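
The parentheses around the alternation are important, because | has the lowest precedence in a regex. The following sketch (assuming stringr is loaded) compares the grouped pattern with an ungrouped variant, where ^ applies only to the first alternative and $ only to the second.

```r
library(stringr)

people <- c("rori", "emilia", "matteo", "mehmet", "filipe", "ana", "victoria")

# grouped: the whole name must have 3 or 4 letters
grouped <- str_detect(people, "^([a-z]{3}|[a-z]{4})$")

# ungrouped: "starts with 3 letters" OR "ends with 4 letters"
ungrouped <- str_detect(people, "^[a-z]{3}|[a-z]{4}$")

grouped
ungrouped
```

Without the group, every name matches, since each one starts with at least three letters.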

12.3.2 Logical NOT

Using the NOT operator ^ inside a character class, we can extract names that contain neither e nor u.

str_extract(people, regex("^[^eu]+$"))
[1] "rori"     NA         NA         NA         NA         "ana"      "victoria"

Note that the ^ operation cannot be used to negate groups. This is because ^ acts as NOT only as the first character inside a character class; anywhere else it is an anchor for the start of the string. To negate a group we can use a negative look ahead instead. Let’s say we have a group (ia) and we want to extract names that do not contain the group (ia).

str_extract(people, regex("^(?!.*ia).*$"))
[1] "rori"   NA       "matteo" "mehmet" "filipe" "ana"    NA      
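
To see why a character class is not a substitute here, we can compare the two forms of negation on the same vector. This is a sketch assuming stringr is loaded; the object names are chosen for illustration.

```r
library(stringr)

people <- c("rori", "emilia", "matteo", "mehmet", "filipe", "ana", "victoria")

# character class: rejects any name containing an i or an a anywhere
no_i_or_a <- str_extract(people, "^[^ia]+$")

# negative look ahead: rejects only names containing the sequence "ia"
no_ia <- str_extract(people, "^(?!.*ia).*$")

no_i_or_a
no_ia
```

The character class [^ia] excludes far more names, because it negates the individual characters i and a rather than the sequence ia.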

12.3.3 Logical AND

Consider a case where we want to extract names with length greater than 4 and containing the letter o. We have two conditions and we need to AND them.

  • Condition 1: Length > 4, the pattern is (?=.{5,})

  • Condition 2: Contains o, the pattern is (?=.*o)

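The two conditions can be used together. Because look aheads do not consume characters, both are checked from the same starting position; the final .* then matches the name itself.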

people <- c(
  "rori", 
  "emilia", 
  "matteo", 
  "mehmet", 
  "filipe", 
  "ana", 
  "victoria")

str_extract(people, regex("(?=.{5,})(?=.*o).*"))
[1] NA         NA         "matteo"   NA         NA         NA         "victoria"
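
As a cross-check, the same AND can be written without look arounds by combining two logical vectors in R. This is only a sketch of an alternative, assuming stringr is loaded; the object names are chosen for illustration.

```r
library(stringr)

people <- c("rori", "emilia", "matteo", "mehmet", "filipe", "ana", "victoria")

# AND inside the regex: two look aheads checked at the same position
and_regex <- people[str_detect(people, "(?=.{5,})(?=.*o).*")]

# AND outside the regex: combine two conditions with R's & operator
and_r <- people[str_length(people) > 4 & str_detect(people, "o")]

and_regex
and_r
```

Both approaches select the same names; the second is often easier to read when the conditions are unrelated to each other.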