8  Getting Started with Regular Expressions

So far you have learned some basic and intermediate functions for handling and working with text in R. These are very useful functions and they allow you to do many interesting things. However, if you truly want to unleash the power of string manipulation, you need to take things to the next level and learn about regular expressions.

In this chapter, we use functions from the package "stringr":

library(stringr)

8.1 What are Regular Expressions?

The name “Regular Expression” does not say much. However, regular expressions are all about text. Think about how much text is all around you in our modern digital world: email, text messages, news articles, blogs, computer code, contacts in your address book. All these things are text. Regular expressions are a tool that allows us to work with this text by describing text patterns.

A regular expression is a special text string for describing a certain amount of text. This “certain amount of text” receives the formal name of pattern. In other words, a regular expression is a set of symbols that describes a text pattern. More formally we say that a regular expression is a pattern that describes a set of strings. In addition to this first meaning, the term regular expression can also be used in a slightly different but related way: as the formal language of these symbols that needs to be interpreted by a regular expression processor. Because the term “regular expression” is rather long, most people use the word regex as a shortcut term. And you will even find the plural regexes.

It is also worth noting what regular expressions are not. They’re not a programming language. They may look like some sort of programming language because they are a formal language with a defined set of rules that gets a computer to do what we want it to do. However, there are no variables in regex and you can’t do computations like adding 1 + 1.

8.1.1 What are Regular Expressions used for?

We use regular expressions to work with text. Some common uses involve testing if a phone number has the correct number of digits, if a date follows a specific format (e.g. mm/dd/yy), if an email address is in a valid format, or if a password has numbers and special characters. You could also use regular expressions to search a document for gray spelt either as “gray” or “grey”. You could search a document and replace all occurrences of “Will”, “Bill”, or “W.” with William. Or you could count only those occurrences of the word “analysis” that are immediately preceded by the word “data”, “computer” or “statistical”. You could use them to convert a comma-delimited file into a tab-delimited file or to find duplicate words in a text.

In each of these cases, you are going to use a regular expression to write up a description of what you are looking for using symbols. In the case of a phone number, that pattern might be three digits followed by a dash, followed by three digits and another dash, followed by four digits. Once you have defined a pattern, the regex processor will use your description to return matching results or, in the case of a test, to return TRUE or FALSE depending on whether or not it matched.
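To make the phone number description concrete, here is a minimal sketch of how it could be written as a pattern and tested with str_detect(). The phone numbers are made up for illustration, and the pattern uses bracket and brace syntax that has not been introduced yet, so treat it as a preview rather than as something you are expected to read fluently at this point.

```r
library(stringr)

# hypothetical phone numbers, used only for illustration
phones <- c("555-123-4567", "5551234567", "555-1234")

# three digits, a dash, three digits, another dash, four digits
phone_pattern <- "[0-9]{3}-[0-9]{3}-[0-9]{4}"

# TRUE where the pattern is found, FALSE otherwise
phone_ok <- str_detect(phones, phone_pattern)
phone_ok
```

Only the first element follows the described format, so it is the only one for which the test succeeds. Note that this pattern does not check what comes before or after the number; a stricter test would need additional syntax.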

8.1.2 A word of caution about regex

If you have never used regular expressions before, their syntax may seem a bit scary and cryptic. You will see strings formed by a bunch of letters, digits, and other punctuation symbols combined in seemingly nonsensical ways. As with any other topic that has to do with programming and data analysis, learning the principles of regex and becoming fluent in defining regex patterns takes time and requires a lot of practice. The more you use them, the better you will become at defining more complex patterns and getting the most out of them.

Regular expressions are a wide topic and there are books entirely dedicated to this subject. The material offered in this book is not extensive and there are many subtopics that I don’t cover here. Despite the initial barriers that you may encounter when entering the regex world, the pain and frustration of learning this tool will pay off in your data science career.

8.1.3 About Regular Expressions in R

Tools for working with regular expressions can be found in virtually all programming languages (e.g. Perl, Python, Java, Ruby, etc). R has some functions for working with regular expressions but it does not provide the wide range of capabilities that some of those languages do. Nevertheless, R's functions can take you very far with some workarounds (and a bit of patience).

Although I am assuming that you are new to regex, I won’t cover everything there is to know about regular expressions. Instead, I will focus on how R works with regular expressions, as well as the R syntax that you will have to use for regex operations.

One of the best tools you can have in your toolkit is the R package "stringr" (by Hadley Wickham). It provides functions that behave similarly to those of the base distribution in R, but it also provides many more facilities for working with regular expressions.

To know more about regular expressions in general, you can find some useful information in the following resources:

  • Regex Wikipedia: For those readers who have no experience with regular expressions, a good place to start is by checking the Wikipedia entry.

http://en.wikipedia.org/wiki/Regular_expression

  • Regular-Expressions.info website (by Jan Goyvaerts): An excellent website full of information about regular expressions. It contains many different topics, resources, lots of examples, and tutorials, covered at both beginner and advanced levels.

http://www.regular-expressions.info

  • Mastering Regular Expressions (by Jeffrey Friedl): I wasn’t sure whether to include this reference but I think it deserves to be considered as well. This is perhaps the authoritative book on regular expressions. The only issue is that it is a book better suited to readers already experienced with regex.

http://regex.info/book.html

8.2 Regex Basics

The main purpose of working with regular expressions is to describe patterns that are used to match against text strings. Simply put, working with regular expressions is nothing more than pattern matching. The result of a match is either successful or not.

The simplest version of pattern matching is to search for one occurrence (or all occurrences) of some specific characters in a string. For example, we might want to search for the word "programming" in a large text document, or we might want to search for all occurrences of the string "apply" in a series of files containing R scripts.

Typically, regular expression patterns consist of a combination of alphanumeric characters as well as special characters. A regex pattern can be as simple as a single character, or it can be formed by several characters with a more complex structure. In all cases we construct regular expressions much in the same form in which we construct arithmetic expressions, by using various operators to combine smaller expressions.

8.3 Literal Characters

We’re going to start with the simplest match of all: a literal character. A literal character match is one in which a given character such as the letter "R" matches the letter R. This type of match is the most basic type of regular expression operation: just matching plain text.

8.3.1 Matching Literal Characters

The following examples are extremely basic but they will help you get a good understanding of regex.

Consider the following text stored in a character vector this_book:

this_book <- 'This book is mine'

The first regular expression we are going to work with is "book". This pattern is formed by a letter b, followed by a letter o, followed by another letter o, followed by a letter k. As you may guess, this pattern matches the word book in the character vector this_book. To have a visual representation of the actual pattern that is matched, you should use the function str_view() from the package "stringr" (you may need to upgrade to a recent version of RStudio):

str_view(this_book, 'book')

As you can tell, the pattern "book" doesn’t match the entire content in the vector this_book; it just matches those four letters.

It may seem really simple but there are a couple of details to be highlighted. The first is that regex searches are case sensitive by default. This means that the pattern "Book" would not match book in this_book. Likewise, the pattern "book" does not match Book in the following string:

str_view("This Book is mine.", 'book')

You can change the matching task so that it is case insensitive but we will talk about it later.
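If you prefer a TRUE/FALSE answer over a visual one, the same check can be sketched with str_detect(). The variable names below are hypothetical. The last line is only a preview of case-insensitive matching, using the regex() helper from "stringr" with its ignore_case argument.

```r
library(stringr)

book_title <- "This Book is mine."

# lowercase pattern does not match the capitalized word
lower_match <- str_detect(book_title, "book")

# pattern with the same capitalization does match
upper_match <- str_detect(book_title, "Book")

# preview: ask stringr to ignore case when matching
any_case <- str_detect(book_title, regex("book", ignore_case = TRUE))

c(lower_match, upper_match, any_case)
```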

Let’s add more text to this_book:

this_book <- 'This book is mine. I wrote this book with bookdown.'

Let’s use str_view() to see what pieces of text are matched in this_book with the pattern "book":

str_view(this_book, "book")

As you can tell, only the first occurrence of book was matched. (Depending on your version of "stringr", str_view() may instead highlight every occurrence; recent releases do so by default.) Returning the first match is a common behavior of regular expressions: they return a match as fast as possible. You can think of this behavior as the “eager principle”, that is, regular expressions are eager and they will give preference to an early match. This is a minor but important detail and we will come back to this behavior of regular expressions.
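The difference between stopping at the first match and collecting every match can also be seen numerically. The sketch below uses str_locate(), str_locate_all() and str_count() from "stringr", which are described later in this chapter; the variable names are just for illustration.

```r
library(stringr)

this_book <- 'This book is mine. I wrote this book with bookdown.'

# position (start, end) of the first match only
first_match <- str_locate(this_book, "book")

# positions of every match; the result is a list with one matrix per string
all_matches <- str_locate_all(this_book, "book")[[1]]

# number of matches in the string
n_matches <- str_count(this_book, "book")

first_match
all_matches
n_matches
```

The first match starts at character 6, and there are three matches in total, the last one being the first four letters of bookdown.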

All the letters and digits in the English alphabet are considered literal characters. They are called literal because they match themselves.

str_view("I had 3 quesadillas for lunch", "3")

Here is another example:

transport <- c("car", "bike", "boat", "airplane")

The first pattern to test is the letter "a":

str_view(transport, "a")

When you execute the previous command, you should be able to see that the letter "a" is highlighted in the words car, boat and airplane.

8.4 R Functions for Regular Expressions

In order to move on with the discussion of regular expressions, we need to talk about some of the functions available in R for regex.

8.4.1 Regex Functions in "base" Package

R contains a set of functions in the "base" package that we can use to find pattern matches. The following table lists these functions with a brief description:

Function     Purpose                    Characteristic
grep()       finding regex matches      which elements are matched (index or value)
grepl()      finding regex matches      which elements are matched (TRUE, FALSE)
regexpr()    finding regex matches      positions of the first match
gregexpr()   finding regex matches      positions of all matches
regexec()    finding regex matches      hybrid of regexpr() and gregexpr()
sub()        replacing regex matches    only first match is replaced
gsub()       replacing regex matches    all matches are replaced
strsplit()   splitting regex matches    split vector according to matches

The first five functions listed in the previous table are used for finding pattern matches in character vectors. The goal is the same for all these functions: finding a match. The difference between them is in the format of the output. The next two functions—sub() and gsub()— are used for substitution: looking for matches with the purpose of replacing them. The last function, strsplit(), is used to split elements of a character vector into substrings according to regex matches.

Basically, all regex functions require two main arguments: a pattern (i.e. regular expression), and a text to match. Each function has other additional arguments but the main ones are a pattern and some text. In particular, the pattern is basically a character string containing a regular expression to be matched in the given text.
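As a rough illustration of how the outputs differ, the sketch below runs several of these base functions with the same literal pattern "a". The vector and the variable names are only for illustration. Notice that in the base functions the pattern comes first and the text second.

```r
transport <- c("car", "bike", "boat", "airplane")

# indices of the elements that contain a match
idx <- grep("a", transport)

# the matching elements themselves
vals <- grep("a", transport, value = TRUE)

# one TRUE/FALSE per element
hits <- grepl("a", transport)

# position of the first match in each element (-1 means no match)
first_pos <- as.integer(regexpr("a", transport))

# replace only the first match, or all matches, in each element
sub_first <- sub("a", "A", transport)
sub_all <- gsub("a", "A", transport)

# split a string wherever the pattern matches
pieces <- strsplit("car-bike-boat", "-")[[1]]
```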

You can check the documentation of all the grep()-like functions by typing help(grep) (or alternatively ?grep).

# help documentation for main regex functions
help(grep)

8.4.2 Regex Functions in Package "stringr"

The R package "stringr" also provides several functions for regex operations (see table below). More specifically, "stringr" provides pattern matching functions to detect, locate, extract, match, replace and split strings.

Function             Description
str_detect()         Detect the presence or absence of a pattern in a string
str_extract()        Extract first piece of a string that matches a pattern
str_extract_all()    Extract all pieces of a string that match a pattern
str_match()          Extract first matched group from a string
str_match_all()      Extract all matched groups from a string
str_locate()         Locate the position of the first occurrence of a pattern in a string
str_locate_all()     Locate the position of all occurrences of a pattern in a string
str_replace()        Replace first occurrence of a matched pattern in a string
str_replace_all()    Replace all occurrences of a matched pattern in a string
str_split()          Split up a string into a variable number of pieces
str_split_fixed()    Split up a string into a fixed number of pieces

One of the important things to keep in mind is that all pattern matching functions in "stringr" have the following general form:

str_function(string, pattern)

The two main arguments are: a string vector to be processed, and a single pattern (i.e. regular expression) to match. Moreover, all the function names begin with the prefix str_, followed by the name of the action to be performed. For example, to locate the position of the first occurrence, we should use str_locate(); to locate the positions of all matches we should use str_locate_all().
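The common str_function(string, pattern) form can be sketched with a handful of these functions applied to the same vector and the same literal pattern. The variable names are hypothetical; the replacement functions take a third argument with the replacement text.

```r
library(stringr)

transport <- c("car", "bike", "boat", "airplane")

# is there a match in each element?
detected <- str_detect(transport, "a")

# the first piece of text that matches (NA when there is no match)
extracted <- str_extract(transport, "a")

# start and end position of the first match in each element
located <- str_locate(transport, "a")

# replace the first match, or all matches
replaced_first <- str_replace(transport, "a", "A")
replaced_all <- str_replace_all(transport, "a", "A")

# split a string wherever the pattern matches
split_up <- str_split("car-bike-boat", "-")[[1]]
```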

8.4.3 Matching Literal Characters With "stringr" Functions

Having introduced the regex functions available in "stringr", let’s continue describing how to match literal characters. We had defined a string this_book:

this_book <- 'This book is mine. I wrote this book with bookdown.'

We can use the function str_detect() to look for the pattern "book"

str_detect(string = this_book, pattern = "book")
[1] TRUE

If there is a match, then str_detect() returns TRUE. Conversely, if there is no match, str_detect() will return FALSE:

str_detect(string = this_book, pattern = 'tablet')
[1] FALSE

As mentioned before, all the letters and digits in the English alphabet are considered literal characters, so a digit such as "3" also matches itself:

str_detect(string = "I had 3 quesadillas for lunch", pattern = "3")

Here is another example:

transport <- c("car", "bike", "boat", "airplane")

The first pattern to test is the letter "a":

str_view(string = transport, pattern = "a")

When you execute the previous command, you should be able to see that the letter "a" is highlighted in the words car, boat and airplane.

8.5 Metacharacters

The next topic that you should learn about regular expressions has to do with metacharacters. As you just learned, the most basic type of regular expressions are the literal characters which are characters that match themselves. However, not all characters match themselves. Any character that is not a literal character is a metacharacter.

8.5.1 About Metacharacters

Metacharacters are characters that have a special meaning and they allow you to transform literal characters in very interesting ways. Sometimes they act like mathematical operators: transforming literal characters into powerful expressions.

Below is the list of metacharacters in Extended Regular Expressions (EREs):

.   \   |   (   )   [   ]   {   }   $   -    ^   *   +   ?
  • the dot .
  • the backslash \
  • the bar |
  • left or opening parenthesis (
  • right or closing parenthesis )
  • left or opening bracket [
  • right or closing bracket ]
  • left or opening brace {
  • right or closing brace }
  • the dollar sign $
  • the dash, hyphen or minus sign -
  • the caret or hat ^
  • the star or asterisk *
  • the plus sign +
  • the question mark ?

For example, the pattern "money$" does not match “money$”, because the dollar sign stands for the end of the string rather than for itself. Likewise, the pattern "what?" does not match the question mark in “what?”; it matches only “what”, because the question mark makes the preceding t optional. Except for a few cases, metacharacters have a special meaning and purpose when working with regular expressions.

We’re going to be working with these characters throughout the rest of the book. Simply put, everything else that you need to know about regular expressions besides literal characters is how these metacharacters work. The good news is that there are only a few metacharacters to learn. The bad news is that some metacharacters can have more than one meaning. And learning those meanings definitely takes time and requires hours of practice. The meaning of a metacharacter greatly depends on the context in which you use it, how you use it, and where you use it. As if that were not enough complication, metacharacters are also where the different regex engines vary the most.
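A short sketch can make the special behavior of "$" and "?" visible. The escaped versions in the last two lines use the double backslash, which is explained later in this chapter; the variable names are only for illustration.

```r
library(stringr)

# "$" means "end of the string", so it does not match the dollar character
dollar_plain <- str_detect("money$", "money$")

# "?" makes the preceding "t" optional, so the question mark is not matched
question_plain <- str_extract("what?", "what?")

# escaping the metacharacters turns them back into literal characters
dollar_escaped <- str_detect("money$", "money\\$")
question_escaped <- str_extract("what?", "what\\?")

dollar_plain      # FALSE
question_plain    # "what"
dollar_escaped    # TRUE
question_escaped  # "what?"
```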

8.6 The Wildcard Metacharacter

The first metacharacter you should learn about is the dot or period ".", better known as the wildcard metacharacter.

Like in many card games where one of the cards in the deck is wild and can be used to replace other types of cards, there is also a wild character in regex that has the same purpose—hence its name.

This metacharacter is used to match ANY character except for a new line.

For example, consider the pattern "p.n", that is, p wildcard n. This pattern will match pan, pen, and pin, but it will not match prun or plan. The dot only matches one single character.

pns <- c('pan', 'pen', 'pin', 'plan', 'prun', 'p n', 'p\nn')

str_detect(string = pns, pattern = 'p.n')
[1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE

Observe that "p.n" even matches the blank space in "p n", but it does not match the new line character "\n". The reason why it does not match plan is that its third character is an a and not an n.
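By default the dot stops at a new line, as the FALSE for 'p\nn' in the output above shows. If you do want the dot to match a new line as well, one option in "stringr" is the regex() helper with its dotall argument. The following is a small sketch of that option, with hypothetical variable names.

```r
library(stringr)

pns <- c('p n', 'p\nn')

# default behavior: the dot matches a space but not a new line
default_dot <- str_detect(pns, "p.n")

# with dotall = TRUE the dot also matches the new line character
dotall_dot <- str_detect(pns, regex("p.n", dotall = TRUE))

default_dot
dotall_dot
```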

Let’s see another example using the vector c("not", "note", "knot", "nut") and the pattern "n.t"

not <- c("not", "note", "knot", "nut")

str_view(not, "n.t")

The pattern "n.t" matches not in the first three elements, and nut in the last element.

If you specify a pattern "no.", then just the first three elements in not will be matched.

str_view(not, "no.")

And if you define a pattern "kn.", then only the third element is matched.

str_view(not, "kn.")
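Since str_view() produces a visual display, it can help to see the same three patterns as plain text. The sketch below uses str_extract(), which returns the piece of each element that was matched, or NA when there is no match. The result names are hypothetical.

```r
library(stringr)

not <- c("not", "note", "knot", "nut")

# text matched by each of the three patterns
nt_matches <- str_extract(not, "n.t")
no_matches <- str_extract(not, "no.")
kn_matches <- str_extract(not, "kn.")

nt_matches  # "not" "not" "not" "nut"
no_matches  # "not" "not" "not" NA
kn_matches  # NA    NA    "kno" NA
```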

The wild metacharacter is probably the most used metacharacter, and it is also the most abused one, being the source of many mistakes. Here is a basic example with the regular expression formed by "5.00". If you think that this pattern will match five with two decimal places after it, you will be surprised to find out that it matches not only 5.00 but also 5100 and 5-00. Why? Because "." is the metacharacter that matches any single character (other than a new line), including digits and dashes. You will learn how to fix this mistake in the next section, but it illustrates an important fact about regular expressions: the challenge consists of matching what you want, but also of matching only what you want. You don’t want to specify a pattern that is overly permissive. You want to find the thing you’re looking for, but only that thing.

As an experiment, try writing a pattern that will match silver, sliver, and slider. See if you can use wildcards to come up with a regular expression that will match all three of those.

sil <- c('silver', 'sliver', 'slider')

# your pattern
pat <- ...

# test it
str_detect(string = sil, pattern = pat)

8.6.1 Escaping metacharacters

What if you just want to match the character dot? For example, say you have the following vector:

fives <- c("5.00", "5100", "5-00", "5 00")

If you try the pattern "5.00", it will match all of the elements in fives.

str_view(fives, "5.00")

To actually match the dot character, what you need to do is escape the metacharacter. In most languages, the way to escape a metacharacter is by adding a backslash character in front of the metacharacter: "\.". When you use a backslash in front of a metacharacter you are “escaping” the character. This means that the character no longer has a special meaning, and it will match itself.

However, R is a bit different. Instead of using a backslash you have to use two backslashes: "5\\.00". This is because the backslash "\", which is another metacharacter, also has a special meaning in R: it is the escape character of R strings. When you type "\\" in R code, the resulting string contains a single backslash, which is what the regex engine needs to see in front of the dot. Therefore, to match just the element 5.00 in fives in R, you do it like so:

str_view(fives, "5\\.00")
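To see what the regex engine actually receives, you can print the pattern with writeLines() and count its characters. The sketch below also shows fixed() from "stringr", which treats the whole pattern as literal text; it is offered as an alternative when no metacharacters are needed, not as a replacement for escaping. Variable names are hypothetical.

```r
library(stringr)

fives <- c("5.00", "5100", "5-00", "5 00")

# typed with two backslashes, stored with one
dot_pattern <- "5\\.00"
n_chars <- nchar(dot_pattern)   # 5 characters: 5 \ . 0 0
writeLines(dot_pattern)         # prints 5\.00

# the escaped dot matches only a literal dot
escaped <- str_detect(fives, dot_pattern)

# fixed() compares the pattern as plain text, so no escaping is needed
literal <- str_detect(fives, fixed("5.00"))

escaped
literal
```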

The following list shows the general regex metacharacters and how to escape them in R:

Metacharacter   Literal meaning            Escape in R
.               the period or dot          "\\."
$               the dollar sign            "\\$"
*               the asterisk or star       "\\*"
+               the plus sign              "\\+"
?               the question mark          "\\?"
|               the vertical bar           "\\|"
\               the backslash              "\\\\"
^               the caret or hat           "\\^"
[               the opening bracket        "\\["
]               the closing bracket        "\\]"
{               the opening brace          "\\{"
}               the closing brace          "\\}"
(               the opening parenthesis    "\\("
)               the closing parenthesis    "\\)"

Here are some silly examples that show how to escape metacharacters in R so that each one can be replaced with an empty string "":

# dollar
str_replace("$Peace-Love", "\\$", "")
[1] "Peace-Love"
# dot
str_replace("Peace.Love", "\\.", "")
[1] "PeaceLove"
# plus
str_replace("Peace+Love", "\\+", "")
[1] "PeaceLove"
# caret
str_replace("Peace^Love", "\\^", "")
[1] "PeaceLove"
# vertical bar
str_replace("Peace|Love", "\\|", "")
[1] "PeaceLove"
# opening round bracket
str_replace("Peace(Love)", "\\(", "")
[1] "PeaceLove)"
# closing round bracket
str_replace("Peace(Love)", "\\)", "")
[1] "Peace(Love"
# opening square bracket
str_replace("Peace[Love]", "\\[", "")
[1] "PeaceLove]"
# closing square bracket
str_replace("Peace[Love]", "\\]", "")
[1] "Peace[Love"
# opening curly bracket
str_replace("Peace{Love}", "\\{", "")
[1] "PeaceLove}"
# closing curly bracket
str_replace("Peace{Love}", "\\}", "")
[1] "Peace{Love"
# backslash (the string contains a single backslash, typed as "\\" in R)
str_replace("Peace\\Love", "\\\\", "")
[1] "PeaceLove"
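The escapes listed in the table can also be checked in one go. The following sketch builds each escaped pattern by pasting a backslash in front of the metacharacter and then confirms that every pattern matches its own character. The vector names are hypothetical, and the check relies on the regex engine used by "stringr".

```r
library(stringr)

# the metacharacters from the table, each as a one-character string
metas <- c(".", "$", "*", "+", "?", "|", "\\", "^",
           "[", "]", "{", "}", "(", ")")

# put a single backslash in front of each one
escaped <- paste0("\\", metas)

# each escaped pattern should match its own literal character
self_match <- str_detect(metas, escaped)

# removing the match should leave an empty string every time
removed <- str_replace(metas, escaped, "")

all(self_match)
all(removed == "")
```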