In chapter 9 we introduced a handful of character classes such as \d, which matches any digit, or \w, which matches any character considered to be part of a word. Among these classes, there are two special metacharacters, \b and \B, known as boundaries, that deserve further discussion.
Likewise, we need to talk about another useful regex concept known as look arounds. The patterns behind this notion allow us to match tokens depending on the auxiliary tokens around them, while those auxiliary tokens themselves are not matched.
Boundaries are metacharacters too, and they match based on what precedes or follows the current position.
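Boundary | Description | Example |
---|---|---|
\\b | Matches a word boundary, i.e., a position where one side is a word character [A-Za-z0-9_] and the other side is not | \\bHi \\bHi\\b |
\\B | Matches when not at a word boundary, i.e., a position where both sides are word characters [A-Za-z0-9_] or both are not | \\BHi |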
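The example patterns in the table can be tried out with a minimal sketch like the one below. The vector greetings is a hypothetical set of test strings (it does not appear elsewhere in this chapter), and str_detect() is used only to report whether each pattern matches.

```r
library(stringr)

# Hypothetical test strings, chosen to exercise each boundary pattern
greetings <- c("Hi there", "Oh Hi", "High five", "aHi")

# \\bHi\\b: "Hi" must stand alone as a whole word
whole_word <- str_detect(greetings, "\\bHi\\b")

# \\bHi: "Hi" must start a word (the word may continue, as in "High")
starts_word <- str_detect(greetings, "\\bHi")

# \\BHi: "Hi" must be preceded by another word character (as in "aHi")
inside_word <- str_detect(greetings, "\\BHi")

whole_word
starts_word
inside_word
```

Only "aHi" satisfies \\BHi, because it is the only string in which Hi is preceded by a word character.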
To understand word boundaries, let us go over a simple example. Consider the string in vector book shown below:
book <- c("This book is an irresistible thesis")
book
[1] "This book is an irresistible thesis"
Suppose we are interested in matching the pattern is.
If you observe the text in book, you’ll notice that is appears in the words This, is, irresistible, and thesis. Depending on which function you use, you will be able to match just the first occurrence:
# matching first occurrence
str_view(book, "is")
or match all occurrences:
# matching all occurrences
str_view_all(book, "is")
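The difference between the first occurrence and all occurrences can also be inspected without the viewer. The following is a small sketch that uses str_extract(), str_extract_all(), str_count(), and str_locate() on the same book string; the variable names are illustrative only.

```r
library(stringr)

book <- c("This book is an irresistible thesis")

# First occurrence only
first_match <- str_extract(book, "is")

# Every occurrence (str_extract_all() returns a list, one element per string)
all_matches <- str_extract_all(book, "is")[[1]]

# Number of occurrences
n_matches <- str_count(book, "is")

# Start and end positions of the first occurrence (inside "This")
first_pos <- str_locate(book, "is")

first_match
all_matches
n_matches
first_pos
```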
Instead of matching is without any other restriction or condition, we can also think of three additional matching cases that are more concrete:
- Extract words that exactly match with is
- Extract words that contain is in between characters
- Extract words that end with is
Words that exactly match is
To match (or extract) words that exactly match the word is, we use the word-boundary pattern \\b on both sides of is. Note the optional use of the ignore_case argument to regex():
str_view_all(
string = book,
pattern = regex("\\bis\\b", ignore_case = TRUE))
Words that contain is in between characters
To match words that contain is in between characters, such as “irresistible”, we use the not-a-word-boundary pattern \\B on both sides of is:
str_view_all(
string = book,
pattern = regex("\\Bis\\B", ignore_case = TRUE))
To extract the entire word “irresistible”, we must include [A-z]* on either side of \\Bis\\B; otherwise the pattern \\Bis\\B will only match is:
str_extract(
string = book,
pattern = regex("[A-z]*\\Bis\\B[A-z]*", ignore_case = TRUE))
[1] "irresistible"
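Notice that we use [A-z]* instead of [A-z]+ to showcase that no other is gets matched, even though * denotes zero or more: the \\B on each side of is already requires a word character before and after it.
You may ask “What if we drop the trailing \\B and [A-z]*, keeping only [A-z]*\\Bis?” Here is what happens: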
str_view_all(
string = book,
pattern = regex("[A-z]*\\Bis", ignore_case = TRUE))
The pattern [A-z]*\\Bis matches “This”, “irresis”, and “thesis”, which are extracted as follows:
str_extract_all(
string = book,
pattern = regex("[A-z]*\\Bis", ignore_case = TRUE))
[[1]]
[1] "This" "irresis" "thesis"
Words that end with is
If you are interested in extracting words that end with is, then you need to use a pattern with the word boundary is\\b. Now, you also need to determine if is should be part of a word with one or more preceding word-characters, or if is should not be preceded by any word-characters.
# end with "is", with preceding word-characters
str_view_all(
string = book,
pattern = regex("[A-z]+is\\b", ignore_case = TRUE))
# end with "is", without preceding word-characters
str_view_all(
string = book,
pattern = regex("\\bis\\b", ignore_case = TRUE))
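As a recap of the three cases, the sketch below counts the matches of each boundary pattern in book and then extracts every whole word ending in is. The names patterns, counts, and ends_with_is are illustrative, and the class [A-Za-z] is used here in place of [A-z] so that only letters are matched.

```r
library(stringr)

book <- c("This book is an irresistible thesis")

# One pattern per case discussed above
patterns <- c(
  exact  = "\\bis\\b",   # the word "is" on its own
  inside = "\\Bis\\B",   # "is" with word characters on both sides
  ending = "is\\b"       # "is" at the end of a word
)

# Count the matches of each pattern in the string
counts <- sapply(patterns, function(p) str_count(book, p))
counts

# Whole words that end with "is", with or without preceding letters
ends_with_is <- str_extract_all(book, "\\b[A-Za-z]*is\\b")[[1]]
ends_with_is
```

The ending pattern is counted three times (This, is, thesis), while the exact and inside patterns each match once.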
Look arounds, as the name suggests, allow us to look around the current position in the string in order to match the desired pattern. To be more precise, look arounds indicate positions, just like the anchors $ and ^.
There are four types of look arounds, listed in the following table.
Look Around | Notation | Description |
---|---|---|
Positive Look Ahead | A(?=pattern) | Check if pattern follows A |
Negative Look Ahead | A(?!pattern) | Check if pattern does not follow A |
Positive Look Behind | (?<=pattern)A | Check if pattern precedes A |
Negative Look Behind | (?<!pattern)A | Check if pattern does not precede A |
In this table, A refers to a character set or group that we are trying to match.
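To make the four notations concrete, here is a minimal sketch that applies each of them to one hypothetical string, item, which is not part of the chapter's data. In every case A is a run of digits, and the look around only inspects the neighboring characters without including them in the match.

```r
library(stringr)

# Hypothetical example string (an assumption made for illustration)
item <- "cost $25, weight 70kg"

# Positive look ahead: digits followed by "kg"
ahead_pos <- str_extract(item, "[0-9]+(?=kg)")

# Negative look ahead: first run of digits not followed by "kg"
ahead_neg <- str_extract(item, "[0-9]+(?!kg)")

# Positive look behind: digits preceded by a dollar sign
behind_pos <- str_extract(item, "(?<=\\$)[0-9]+")

# Negative look behind: digits not preceded by a dollar sign.
# The \\b keeps the match from starting in the middle of "25".
behind_neg <- str_extract(item, "(?<!\\$)\\b[0-9]+")

c(ahead_pos, ahead_neg, behind_pos, behind_neg)
```

Neither the unit kg nor the dollar sign appears in any of the results, since look arounds only assert what surrounds the match.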
Let us look at some examples of look aheads. To do this, consider the vector heights shown below:
heights <- c("40cm", "23", "60cm", "57", "133cm")
heights
[1] "40cm" "23" "60cm" "57" "133cm"
Suppose we want to extract the heights without the unit of measurement, that is, we want to obtain the numeric values but not the letters cm. We can do this by specifying a pattern to match one or more digits, [0-9]+:
str_extract(heights, "[0-9]+")
[1] "40" "23" "60" "57" "133"
Let’s change the format of heights by collapsing all of its elements into a single string, with height values separated by commas:
heights <- paste(heights, collapse = ", ")
heights
[1] "40cm, 23, 60cm, 57, 133cm"
With this modified heights, if we use the previous command, we only extract the first occurrence:
str_extract(heights, "[0-9]+")
[1] "40"
In order to extract all occurrences, we must use str_extract_all():
str_extract_all(heights, "[0-9]+")
[[1]]
[1] "40" "23" "60" "57" "133"
Let’s change our input to contain heights that are in inches
heights <- "40cm, 23in, 60cm, 57, 133cm, 15in, 99"
heights
[1] "40cm, 23in, 60cm, 57, 133cm, 15in, 99"
and reapply the previous command
str_extract_all(heights, "[0-9]+")
[[1]]
[1] "40" "23" "60" "57" "133" "15" "99"
In case we want to retrieve only those heights with unit of measurement cm, we could use a positive look ahead. In the syntax A(?=pattern), the pattern we look for is cm. Any number that is followed by cm will be extracted; here A is [0-9][0-9]+, which requires at least two digits.
Don’t forget that (?=cm) does not extract cm; it is used to assert position only!
heights <- "40cm, 23in, 60cm, 57, 133cm, 15in, 99"
str_extract_all(heights, "[0-9][0-9]+(?=cm)")
[[1]]
[1] "40" "60" "133"
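Because the look ahead leaves the unit out of the match, the extracted strings can be converted to numbers directly. The sketch below assumes the simpler A of [0-9]+, which gives the same matches on this input, and uses the illustrative name cm_values.

```r
library(stringr)

heights <- "40cm, 23in, 60cm, 57, 133cm, 15in, 99"

# Digits followed by "cm"; the unit itself is not part of the match
cm_strings <- str_extract_all(heights, "[0-9]+(?=cm)")[[1]]

# Convert the matched strings to numeric values
cm_values <- as.numeric(cm_strings)
cm_values

# The values can now be used in computations, e.g., the largest height in cm
max(cm_values)
```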
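In case we want to extract all heights that don’t have cm as the unit of measurement, we could use a negative look ahead. In the syntax A(?!pattern), the pattern here is cm and A is again [0-9][0-9]+, that is, the character class [0-9] followed by [0-9] with quantifier + (one or many). Note that the output below still contains a spurious 13, taken from 133cm; we return to this problem shortly.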
heights <- "40cm, 23in, 60cm, 57, 133cm, 15in, 99"
str_extract_all(heights, "[0-9][0-9]+(?!cm)")
[[1]]
[1] "23" "57" "13" "15" "99"
Using [0-9]+ instead of [0-9][0-9]+ in the negative look ahead might look like it should work too, but in the code snippet below, we get an incorrect output. Can you guess why?
heights <- "40cm, 23in, 60cm, 57, 133cm, 15in, 99"
str_extract_all(heights, regex("[0-9]+(?!cm)"))
[[1]]
[1] "4" "23" "6" "57" "13" "15" "99"
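This is incorrect since, in addition to our actual answer, we extract the 4 from 40cm, the 6 from 60cm, and the 13 from 133cm. When the look ahead fails after the full number, the regex engine backtracks and gives up one digit; the shorter number is then followed by a digit rather than by cm, so the negative look ahead succeeds.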
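The backtracking behind this result can be observed on a single value. The sketch below uses a hypothetical one-element input, one_height, and anchors the pattern with ^ so that each attempt is tested in isolation.

```r
library(stringr)

# A single height, used only to illustrate backtracking
one_height <- "40cm"

# Attempt with both digits: "40" is followed by "cm", so the look ahead fails
two_digits <- str_detect(one_height, "^[0-9]{2}(?!cm)")

# Attempt with one digit: "4" is followed by "0", not by "cm", so it succeeds
one_digit <- str_detect(one_height, "^[0-9]{1}(?!cm)")

# With the quantifier +, the engine falls back to the one-digit match
fallback <- str_extract(one_height, "[0-9]+(?!cm)")

c(two_digits, one_digit)
fallback
```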
We could overcome this by specifically mentioning that the number (height) cannot be followed by:
- the pattern cm
- some number (i.e., from 40cm, we should not extract 4)
To do this, we could use alternation, denoted by the pipe |, which is similar to OR.
The pattern in A(?!pattern) is now cm or [0-9]+, which we can represent as (cm|[0-9]+). Note that our pattern is enclosed within parentheses.
heights <- "40cm, 23in, 60cm, 57, 133cm, 15in, 99"
str_extract_all(heights, regex("[0-9]+(?!(cm|[0-9]+))"))
[[1]]
[1] "23" "57" "15" "99"
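One way to gain confidence in a look around pattern is to compare it with a more explicit computation. The following sketch is an assumed cross-check, not a replacement for the pattern above: it splits the string on the separator, drops the entries that end in cm, and then keeps the digits.

```r
library(stringr)

heights <- "40cm, 23in, 60cm, 57, 133cm, 15in, 99"

# Result of the look around pattern discussed above
via_lookahead <- str_extract_all(heights, "[0-9]+(?!(cm|[0-9]+))")[[1]]

# Cross-check: split into individual entries, drop those ending in "cm",
# then extract the digits from the remaining entries
parts <- str_split(heights, ", ")[[1]]
no_cm <- parts[!str_detect(parts, "cm$")]
via_split <- str_extract(no_cm, "[0-9]+")

via_lookahead
identical(via_lookahead, via_split)
```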
The illustration above is typical of regular expressions. As the test cases become more complex, you will have to tweak the expression to cover all the corner cases.
Look behinds, as the name suggests, allow us to look behind the current position for the presence or absence of a pattern. This works the same way as a look ahead, except that we look at the preceding characters.
Let us look at some examples.
Consider a vector courses that has the name and year of some statistics courses over different semesters.
courses <- c(
  "Stat_133 2020",
  "Stat_154 2020",
  "Stat_133 2019",
  "Stat_151 2018",
  "Stat_151 2020",
  "Stat_154 2018")
courses
[1] "Stat_133 2020" "Stat_154 2020" "Stat_133 2019" "Stat_151 2018"
[5] "Stat_151 2020" "Stat_154 2018"
We will use a positive look-behind to extract the years associated with Stat_133. In the syntax (?<=pattern)A, pattern is (Stat_133 ), including the trailing space, and A is the year we want to extract, i.e., [0-9]{4}.
courses <- c(
  "Stat_133 2020",
  "Stat_154 2020",
  "Stat_133 2019",
  "Stat_151 2018",
  "Stat_151 2020",
  "Stat_154 2018")
str_extract(courses, regex("(?<=(Stat_133 ))[0-9]{4}"))
[1] "2020" NA "2019" NA NA NA
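Since str_extract() returns NA for the elements that do not match, a common follow-up is to keep only the matched years. The sketch below does this with is.na(); it also omits the inner parentheses of the look-behind, which are not required. The names years_133 and matched_years are illustrative.

```r
library(stringr)

courses <- c(
  "Stat_133 2020",
  "Stat_154 2020",
  "Stat_133 2019",
  "Stat_151 2018",
  "Stat_151 2020",
  "Stat_154 2018")

# Years preceded by "Stat_133 " (NA where the look-behind fails)
years_133 <- str_extract(courses, "(?<=Stat_133 )[0-9]{4}")

# Keep only the elements that actually matched
matched_years <- years_133[!is.na(years_133)]
matched_years
```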
Similarly, we can use a negative look-behind to extract the years of courses that are not Stat_133. In the syntax (?<!pattern)A, A is [0-9]{4} and the pattern is (Stat_133 ).
courses <- c(
  "Stat_133 2020",
  "Stat_154 2020",
  "Stat_133 2019",
  "Stat_151 2018",
  "Stat_151 2020",
  "Stat_154 2018")
str_extract(courses, "(?<!(Stat_133 ))[0-9]{4}")
[1] NA "2020" NA "2018" "2020" "2018"
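The choice of [0-9]{4} for the year matters in the negative look-behind. As a small sketch on a single course string (the variable names are illustrative), compare it with the looser [0-9]+, which picks up the course number instead of returning NA.

```r
library(stringr)

one_course <- "Stat_133 2020"

# With [0-9]+, the first run of digits is "133"; it is not preceded by
# "Stat_133 ", so the negative look-behind succeeds and "133" is returned
loose <- str_extract(one_course, "(?<!Stat_133 )[0-9]+")

# With [0-9]{4}, only "2020" has four digits, and it is preceded by
# "Stat_133 ", so there is no match
strict <- str_extract(one_course, "(?<!Stat_133 )[0-9]{4}")

loose
strict
```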
Regex has no dedicated logical operators; however, a few constructs can replicate them.
We saw in one of the examples for look aheads that logical OR is expressed using the pipe symbol | in regex. This is also known as the alternation operation.
There is no AND in regex, but it can be synthesized using look arounds. For the NOT operation, the ^ symbol works in character classes but cannot be used for groups, as explained later.
Operation | Syntax | Example |
---|---|---|
OR | Pipe symbol \| | pattern1\|pattern2 |
NOT | Caret symbol ^ | [^aeiou] |
AND | Synthetic AND | (?=P1)(?=P2) |
Consider the vector people from a previous example.
people <- c(
  "rori",
  "emilia",
  "matteo",
  "mehmet",
  "filipe",
  "ana",
  "victoria")
people
[1] "rori" "emilia" "matteo" "mehmet" "filipe" "ana" "victoria"
If we want to display names that are of length 3 OR of length 4, we use the pipe symbol.
str_extract(people, regex("^([A-z]{3}|[A-z]{4})$"))
[1] "rori" NA NA NA NA "ana" NA
Note that for the above example, we could use the quantifier {3,4} and still obtain the same result.
str_extract(people, regex("^([A-z]{3,4})$"))
[1] "rori" NA NA NA NA "ana" NA
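The same selection can be cross-checked without a regex by measuring the length of each name. The sketch below assumes that every name consists of letters only, so that name length and the number of matched letters agree; by_length and by_regex are illustrative names.

```r
library(stringr)

people <- c("rori", "emilia", "matteo", "mehmet", "filipe", "ana", "victoria")

# Names with exactly 3 or 4 characters, using str_length()
by_length <- people[str_length(people) %in% c(3, 4)]

# The same selection with the {3,4} quantifier; str_subset() keeps matches only
by_regex <- str_subset(people, "^[A-Za-z]{3,4}$")

by_length
identical(by_length, by_regex)
```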
Using the NOT ^, we could extract names that don’t contain e or u.
str_extract(people, regex("^[^eu]+$"))
[1] "rori" NA NA NA NA "ana" "victoria"
Note that ^ cannot be used to negate groups: outside a character class, ^ is interpreted as an anchor rather than as a NOT operator. In such cases we could use a negative look around. Let’s say we have a group (ia) and we want to extract names that do not contain (ia).
str_extract(people, regex("^(?!.*ia).*$"))
[1] "rori" NA "matteo" "mehmet" "filipe" "ana" NA
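When the goal is only to keep the strings that lack a group, the negation can also be done outside the regex. The sketch below compares the negative look ahead with str_subset() and its negate argument; both return only the matching names, without the NA placeholders.

```r
library(stringr)

people <- c("rori", "emilia", "matteo", "mehmet", "filipe", "ana", "victoria")

# Negation inside the regex: a negative look ahead for "ia"
without_ia_regex <- str_subset(people, "^(?!.*ia).*$")

# Negation outside the regex: match "ia" and invert the selection
without_ia_negate <- str_subset(people, "ia", negate = TRUE)

without_ia_regex
identical(without_ia_regex, without_ia_negate)
```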
Consider a case where we want to extract names with length greater than 4 and containing the letter o. We have two conditions and we need to AND them.
- Condition 1: Length > 4, the pattern is (?=.{5,})
- Condition 2: Contains o, the pattern is (?=.*o)
The two conditions can be used together.
people <- c(
  "rori",
  "emilia",
  "matteo",
  "mehmet",
  "filipe",
  "ana",
  "victoria")
str_extract(people, regex("(?=.{5,})(?=.*o).*"))
[1] NA NA "matteo" NA NA NA "victoria"
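The synthetic AND can be checked against an explicit logical AND of the two conditions. The following sketch evaluates each condition separately with str_length() and str_detect() and combines them with &; the names by_lookahead and by_logical are illustrative.

```r
library(stringr)

people <- c("rori", "emilia", "matteo", "mehmet", "filipe", "ana", "victoria")

# Synthetic AND: both look aheads must succeed at the same position
by_lookahead <- str_subset(people, "(?=.{5,})(?=.*o)")

# Explicit AND: evaluate each condition separately and combine with &
longer_than_4 <- str_length(people) > 4
contains_o <- str_detect(people, "o")
by_logical <- people[longer_than_4 & contains_o]

by_lookahead
identical(by_lookahead, by_logical)
```

Note that rori contains an o but is rejected by both approaches, because it fails the length condition.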