39 Regular Expressions
In chapter Strings you learned some basic and intermediate functions for handling and working with text in R. These are very useful functions and they allow you to do many interesting things. However, if you truly want to unleash the power of string manipulation, you need to take things to the next level and learn about regular expressions.
You will need the package "stringr".
Most of the material in this chapter is borrowed from Gaston Sanchez’s book Handling Strings in R (with permission from the author).
39.1 What are Regular Expressions?
The name “Regular Expression” does not say much. However, regular expressions are all about text. Think about how much text is all around you in our modern digital world: email, text messages, news articles, blogs, computer code, contacts in your address book. All these things are text. Regular expressions are a tool that allows us to work with this text by describing text patterns.
A regular expression is a special text string for describing a certain amount of text. This “certain amount of text” receives the formal name of pattern. In other words, a regular expression is a set of symbols that describes a text pattern. More formally we say that a regular expression is a pattern that describes a set of strings. In addition to this first meaning, the term regular expression can also be used in a slightly different but related way: as the formal language of these symbols that needs to be interpreted by a regular expression processor. Because the term “regular expression” is rather long, most people use the word regex as a shortcut term. And you will even find the plural regexes.
It is also worth noting what regular expressions are not. They’re not a programming language. They may look like some sort of programming language because they are a formal language with a defined set of rules that gets a computer to do what we want it to do. However, there are no variables in regex and you can’t do computations like adding 1 + 1.
39.1.1 A word of caution about regex
If you have never used regular expressions before, their syntax may seem a bit scary and cryptic. You will see strings formed by a bunch of letters, digits, and other punctuation symbols combined in seemingly nonsensical ways. As with any other topic that has to do with programming and data analysis, learning the principles of regex and becoming fluent in defining regex patterns takes time and requires a lot of practice. The more you use them, the better you will become at defining more complex patterns and getting the most out of them.
Regular expressions are a broad topic, and there are books entirely dedicated to the subject. The material offered in this chapter is not extensive and there are many subtopics that we don’t cover here. Despite the initial barriers that you may encounter when entering the regex world, the pain and frustration of learning this tool will pay off in your data science career.
39.1.2 Regular Expressions in R
Tools for working with regular expressions can be found in virtually all programming languages (e.g. Perl, Python, Java, Ruby). R has some functions for working with regular expressions, but it does not provide the wide range of capabilities that other languages do. Nevertheless, these functions can take you very far with some workarounds (and a bit of patience).
One of the best tools you must have in your toolkit is the R package "stringr" (by Hadley Wickham). It provides functions that have similar behavior to those of the base distribution in R. But it also provides many more facilities for working with regular expressions.
39.2 Regex Basics
The main purpose of working with regular expressions is to describe patterns that are used to match against text strings. Simply put, working with regular expressions is nothing more than pattern matching. The result of a match is either successful or not.
The simplest version of pattern matching is to search for one occurrence (or all occurrences) of some specific characters in a string. For example, we might want to search for the word "programming" in a large text document, or we might want to search for all occurrences of the string "apply" in a series of files containing R scripts.
Typically, regular expression patterns consist of a combination of alphanumeric characters as well as special characters. A regex pattern can be as simple as a single character, or it can be formed by several characters with a more complex structure. In all cases we construct regular expressions much in the same form in which we construct arithmetic expressions, by using various operators to combine smaller expressions.
39.3 Literal Characters
The simplest match of all is a literal character.
A literal character match is one in which a given character such as the letter "R" matches the letter R. This type of match is the most basic type of regular expression operation: just matching plain text.
The following examples are extremely basic but they will help you get a good understanding of regex.
Consider some text that contains the word book, stored in a character vector this_book.
The first regular expression we are going to work with is "book". This pattern is formed by a letter b, followed by a letter o, followed by another letter o, followed by a letter k. As you may guess, this pattern matches the word book in the character vector this_book.
To have a visual representation of the actual pattern that is matched, you should use the function str_view() from the package "stringr" (you may need to upgrade to a recent version of RStudio).
As you can tell, the pattern "book" doesn’t match the entire content in the vector this_book; it just matches those four letters.
It may seem really simple, but there is one detail worth highlighting: regex searches are case sensitive by default. This means that the pattern "Book" would not match book in this_book.
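The literal match described above can be sketched as follows. The actual contents of this_book are not shown in this section, so the sentence assigned below is only an assumption for illustration; str_detect() and str_extract() are functions from "stringr".

```r
library(stringr)

# Assumed contents: the original value of this_book is not shown here
this_book <- "This book is mine"

# str_detect() tells you whether the pattern occurs anywhere in the string
has_book <- str_detect(this_book, "book")   # TRUE: the literal pattern is found
has_Book <- str_detect(this_book, "Book")   # FALSE: matching is case sensitive

# str_extract() returns only the piece of text that was matched
matched <- str_extract(this_book, "book")   # "book", not the whole sentence
```

Calling str_view(this_book, "book") on the same vector should highlight those same four letters.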
39.4 Metacharacters
The next topic that you should learn about regular expressions has to do with metacharacters. As you just learned, the most basic type of regular expressions are the literal characters which are characters that match themselves. However, not all characters match themselves. Any character that is not a literal character is a metacharacter.
Metacharacters are characters that have a special meaning and they allow you to transform literal characters in very powerful ways.
Below is the list of metacharacters in Extended Regular Expressions (EREs):
. \ | ( ) [ ] { } $ - ^ * + ?
- the dot "."
- the backslash "\"
- the bar "|"
- left or opening parenthesis "("
- right or closing parenthesis ")"
- left or opening bracket "["
- right or closing bracket "]"
- left or opening brace "{"
- right or closing brace "}"
- the dollar sign "$"
- the dash, hyphen or minus sign "-"
- the caret or hat "^"
- the star or asterisk "*"
- the plus sign "+"
- the question mark "?"
Simply put, everything else that you need to know about regular expressions besides literal characters is how these metacharacters work. The good news is that there are only a few metacharacters to learn. The bad news is that some metacharacters can have more than one meaning, and learning those meanings takes time and requires hours of practice. The meaning of a metacharacter greatly depends on the context: how you use it and where you use it. As if that were not complicated enough, metacharacters are also where the different regex engines vary the most.
39.4.1 The Wild Metacharacter
The first metacharacter you should learn about is the dot or period ".", better known as the wild metacharacter. This metacharacter is used to match ANY character except for a new line.
For example, consider the pattern "p.n", that is, p wildcard n. This pattern will match pan, pen, and pin, but it will not match prun or plan. The dot only matches one single character.
Let’s see another example using the vector c("not", "note", "knot", "nut") and the pattern "n.t". The pattern "n.t" matches not in the first three elements, and nut in the last element.
If you specify a pattern "no.", then just the first three elements of the vector will be matched.
And if you define a pattern "kn.", then only the third element is matched.
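The following sketch checks the wildcard examples above with str_detect() and str_extract() from "stringr". The vector names words and nots are made up for this illustration.

```r
library(stringr)

# "p.n": a p, then any single character, then an n
words <- c("pan", "pen", "pin", "prun", "plan")
pn_hits <- str_detect(words, "p.n")
# TRUE TRUE TRUE FALSE FALSE

nots <- c("not", "note", "knot", "nut")

# Which piece of each element does "n.t" match?
nt_match <- str_extract(nots, "n.t")
# "not" "not" "not" "nut"

# "no." matches only the first three elements
no_hits <- str_detect(nots, "no.")
# TRUE TRUE TRUE FALSE

# "kn." matches only the third element
kn_hits <- str_detect(nots, "kn.")
# FALSE FALSE TRUE FALSE
```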
The wild metacharacter is probably the most used metacharacter, and it is also the most abused one, being the source of many mistakes. Here is a basic example with the regular expression formed by "5.00". If you think that this pattern will match only the number five with two decimal places after it, you will be surprised to find out that it matches not only 5.00 but also 5100 and 5-00. Why? Because "." is the metacharacter that matches any single character (other than a new line).
You will learn how to fix this mistake in the next section, but it illustrates
an important fact about regular expressions: the challenge consists of matching
what you want, but also in matching only what you want. You don’t want to
specify a pattern that is overly permissive. You want to find the thing you’re
looking for, but only that thing.
39.4.2 Escaping metacharacters
What if you just want to match the dot character? For example, say you have a character vector fives that contains the string 5.00 along with look-alikes such as 5100 and 5-00.
If you try the pattern "5.00", it will match all of the elements in fives.
To actually match the dot character, what you need to do is escape the metacharacter. In most languages, the way to escape a metacharacter is by adding a backslash character in front of the metacharacter: "\.".
When you use a backslash in front of a metacharacter you are “escaping” the character: the character no longer has a special meaning, and it will match itself.
However, R is a bit different. Instead of using a backslash you have to use two backslashes: "5\\.00". This is because the backslash "\", which is another metacharacter, is also the escape character inside R strings. The first backslash escapes the second one, so that the string R hands to the regex engine contains a single backslash followed by the dot. Therefore, to match just the element 5.00 in fives in R, you use the pattern "5\\.00".
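Here is a small sketch of the difference between the unescaped and the escaped dot. The contents of fives are assumed for illustration, since the original listing is not shown in this section.

```r
library(stringr)

# Assumed contents of fives (not shown in this section)
fives <- c("5.00", "5100", "5-00", "5 00")

# Unescaped dot: matches any character, so every element matches
loose <- str_detect(fives, "5.00")
# TRUE TRUE TRUE TRUE

# Escaped dot: only a literal "." is accepted between 5 and 00
strict <- str_detect(fives, "5\\.00")
# TRUE FALSE FALSE FALSE

# The R string "5\\.00" holds a single backslash; this is what the
# regex engine actually receives
writeLines("5\\.00")   # prints 5\.00
n_chars <- nchar("5\\.00")   # 5 characters: 5, \, ., 0, 0
```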
39.5 Character Sets
The opening and closing brackets [ ] are metacharacters that let you define a character set. These square brackets indicate a character set which will match any one of the various characters that are inside the set.
Keep in mind that a character set will match only one character. The order of the characters inside the set does not matter; what matters is just the presence of the characters inside the brackets. So for example if you have a set defined by "[AEIOU]", that will match any one upper case vowel.
Consider the following pattern that includes a character set: "p[aeiou]n", and a vector pns with the words pan, pen, and pin.
The set "p[aeiou]n" matches all elements in pns. Now let’s use the same set with another, longer vector pnx.
As you can tell, this time only the first three elements in pnx are matched. Notice also that paun is not matched. This is because the character set matches only one character, either a or u but not au.
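The behavior of the character set can be sketched like this. The vector pns follows the text; the contents of pnx are an assumption (only paun and the fact that the first three elements match are stated in this section).

```r
library(stringr)

pns <- c("pan", "pen", "pin")

# Assumed contents of pnx (the original listing is not shown here)
pnx <- c("pan", "pen", "pin", "p0n", "p.n", "p1n", "paun")

# One lower case vowel between p and n
pns_hits <- str_detect(pns, "p[aeiou]n")
# TRUE TRUE TRUE

pnx_hits <- str_detect(pnx, "p[aeiou]n")
# TRUE TRUE TRUE FALSE FALSE FALSE FALSE
# "paun" fails because the set stands for exactly one character
```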
If you are interested in matching all capital letters in English, you can define a set formed as:
[ABCDEFGHIJKLMNOPQRSTUVWXYZ]
Likewise, you can define a set with only lower case letters in English:
[abcdefghijklmnopqrstuvwxyz]
If you are interested in matching any digit, you can also specify a character set like this:
[0123456789]
39.6 Character ranges
The previous examples that show character sets containing all the capital letters or all lower case letters are very convenient but require a lot of typing. Character ranges are going to help you solve that problem, by giving you a convenient shortcut based on the dash metacharacter "-" to indicate a range of characters. A character range consists of a character set with two characters separated by a dash or minus "-" sign.
Let’s see how you can reexpress the examples in the previous section as character ranges. The set of all digits can be expressed as a character range using the following pattern:
[0-9]
Likewise, the set of all lower case letters abcd…xyz is compactly represented with the character range:
[a-z]
And the character set of all upper case letters ABC…XYZ is formed by
[A-Z]
Note that the dash is only a metacharacter when it is inside a character set; outside the character set it is just a literal dash.
So how do you use character ranges? To illustrate the concept, let’s create a vector basic with some simple strings, and see what the different ranges match.
Now consider another vector, triplets.
You can use a series of character ranges to match various occurrences of a certain type of character. For example, to match three consecutive digits you can define a pattern "[0-9][0-9][0-9]"; to match three consecutive lower case letters you can use the pattern "[a-z][a-z][a-z]"; and the same idea applies to a pattern that matches three consecutive upper case letters "[A-Z][A-Z][A-Z]".
Observe that the element ":-)" is not matched by any of the character ranges that we have seen so far.
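A short sketch of what the ranges match. The contents of basic and triplets are assumed for illustration (the original listings are not shown in this section; only the element ":-)" is mentioned in the text).

```r
library(stringr)

# Assumed contents of the two vectors
basic <- c("1", "a", "A", "&", "-", "^")
triplets <- c("123", "abc", "ABC", ":-)")

# Single-character ranges against basic
digit_hits <- str_detect(basic, "[0-9]")   # only "1"
lower_hits <- str_detect(basic, "[a-z]")   # only "a"
upper_hits <- str_detect(basic, "[A-Z]")   # only "A"

# Three consecutive characters of the same kind against triplets
three_digits <- str_detect(triplets, "[0-9][0-9][0-9]")   # only "123"
three_lower  <- str_detect(triplets, "[a-z][a-z][a-z]")   # only "abc"
three_upper  <- str_detect(triplets, "[A-Z][A-Z][A-Z]")   # only "ABC"
```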
Character ranges can be defined in multiple ways. For example, the range "[1-3]" indicates any one digit 1, 2, or 3. Another range may be defined as "[5-9]", comprising any one digit 5, 6, 7, 8 or 9. The same idea applies to letters. You can define shorter ranges other than "[a-z]". One example is "[a-d]", which consists of any one letter a, b, c, or d.
39.7 Negative Character Sets
A common situation when working with regular expressions consists of matching characters that are NOT part of a certain set. This type of matching can be done using a negative character set: by matching any one character that is not in the set. To define this type of set you are going to use the metacharacter caret "^". If you are using a QWERTY keyboard, the caret symbol should be located on the key with the number 6.
The caret "^" is one of those metacharacters that have more than one meaning depending on where it appears in a pattern. If you use a caret in the first position inside a character set, e.g. [^aeiou], it means negation. In other words, the caret in [^aeiou] means “not any one of lower case vowels.”
Let’s use the vector basic previously defined.
To match those elements that are NOT upper case letters, you define a negative character range "[^A-Z]".
It is important that the caret is the first character inside the character set; otherwise the set is not a negative one. For example, the pattern "[A-Z^]" means “any one upper case letter or the caret character,” which is completely different from the negative set "[^A-Z]" that negates any one upper case letter.
If you want to match any character except the caret, then you need to use a character set with two carets: "[^^]". The first caret works as a negative operator, the second caret is the caret character itself.
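The three patterns discussed above can be compared side by side. The contents of basic are again assumed for illustration.

```r
library(stringr)

# Assumed contents of basic (the original listing is not shown here)
basic <- c("1", "a", "A", "&", "-", "^")

# Caret first: anything that is NOT an upper case letter
not_upper <- str_detect(basic, "[^A-Z]")
# TRUE TRUE FALSE TRUE TRUE TRUE

# Caret not first: an upper case letter OR a literal caret
upper_or_caret <- str_detect(basic, "[A-Z^]")
# FALSE FALSE TRUE FALSE FALSE TRUE

# Two carets: any character except the caret itself
not_caret <- str_detect(basic, "[^^]")
# TRUE TRUE TRUE TRUE TRUE FALSE
```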
39.8 Character Classes
Closely related to character sets and character ranges, regular expressions provide another useful construct called character classes which, as their name indicates, are used to match a certain class of characters. The most common character classes in most regex engines are:
Character | Matches | Same as |
---|---|---|
\\d | any digit | [0-9] |
\\D | any nondigit | [^0-9] |
\\w | any character considered part of a word | [a-zA-Z0-9_] |
\\W | any character not considered part of a word | [^a-zA-Z0-9_] |
\\s | any whitespace character | [ \f\n\r\t\v] |
\\S | any nonwhitespace character | [^ \f\n\r\t\v] |
You can think of character classes as another type of metacharacters, or as shortcuts for special character sets.
The following table shows the characters that represent whitespaces:
Character | Description |
---|---|
\f | form feed |
\n | line feed |
\r | carriage return |
\t | tab |
\v | vertical tab |
Sometimes you have to deal with nonprinting whitespace characters. In these situations you probably will end up using the whitespace character class \\s. A common example is when you have to match tab characters, or line breaks.
The operating system Windows uses \r\n as an end-of-line marker. In contrast, Unix-like operating systems (including macOS) use \n.
Tab characters \t are commonly used as a field-separator for data files. But most text editors render them as blank space.
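As a brief sketch, here is how some of these character classes behave with "stringr". The vector samples and the string "Room 101" are made up for this illustration.

```r
library(stringr)

# Hypothetical test strings; the fourth one contains a tab character
samples <- c("R2", "data_frame", "a b", "tab\there", "!")

# \\d: does the string contain a digit?
has_digit <- str_detect(samples, "\\d")
# TRUE FALSE FALSE FALSE FALSE

# \\s: does the string contain whitespace (space, tab, ...)?
has_space <- str_detect(samples, "\\s")
# FALSE FALSE TRUE TRUE FALSE

# \\W: does the string contain a non-word character?
# The underscore counts as a word character, so "data_frame" is FALSE
has_nonword <- str_detect(samples, "\\W")
# FALSE FALSE TRUE TRUE TRUE

# Classes combine like any other pattern piece: three consecutive digits
room <- str_extract("Room 101", "\\d\\d\\d")
# "101"
```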
39.9 POSIX Character Classes
We finish this chapter with the introduction of another type of character classes known as POSIX character classes. These are yet another class construct that is supported by the regex engine in R.
Class | Description | Same as |
---|---|---|
[:alnum:] | any letter or digit | [a-zA-Z0-9] |
[:alpha:] | any letter | [a-zA-Z] |
[:digit:] | any digit | [0-9] |
[:lower:] | any lower case letter | [a-z] |
[:upper:] | any upper case letter | [A-Z] |
[:space:] | any whitespace including space | [\f\n\r\t\v ] |
[:punct:] | any punctuation symbol | |
[:print:] | any printable character | |
[:graph:] | any printable character excluding space | |
[:xdigit:] | any hexadecimal digit | [a-fA-F0-9] |
[:cntrl:] | ASCII control characters | |
Notice that a POSIX character class is formed by an opening bracket [, followed by a colon :, followed by some keyword, followed by another colon :, and finally a closing bracket ].
In order to use them in R, you have to wrap a POSIX class inside a regex character set. That is, you have to surround a POSIX class with an extra pair of brackets, as in "[[:alpha:]]".
Once again, refer to the pnx vector to illustrate the use of POSIX classes.
Let’s start with the [:alpha:] class, and see what it matches in pnx. Then let’s test it with [:digit:].
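A minimal sketch of POSIX classes wrapped in brackets follows. The contents of pnx are assumed, as the original listing is not shown in this section, and the patterns place each class between p and n so that the differences between classes are easy to see.

```r
library(stringr)

# Assumed contents of pnx (the original listing is not shown here)
pnx <- c("pan", "pen", "pin", "p0n", "p.n", "p1n", "paun")

# A single letter between p and n
alpha_hits <- str_detect(pnx, "p[[:alpha:]]n")
# TRUE TRUE TRUE FALSE FALSE FALSE FALSE

# A single digit between p and n
digit_hits <- str_detect(pnx, "p[[:digit:]]n")
# FALSE FALSE FALSE TRUE FALSE TRUE FALSE

# Extract the first digit found in each element (NA when there is none)
first_digit <- str_extract(pnx, "[[:digit:]]")
# NA NA NA "0" NA "1" NA
```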