10 Character Sets

In this chapter we will talk about character sets. You will learn about a couple more metacharacters, the opening and closing brackets [ ], that will help you define a character set.

These square brackets indicate a character set, which will match any one of the characters inside the set. Keep in mind that a character set will match only one character. The order of the characters inside the set does not matter; what matters is just the presence of the characters inside the brackets. So for example, if you have a set defined by "[AEIOU]", that will match any one upper case vowel.

10.1 Defining character sets

Consider the following pattern that includes a character set: "p[aeiou]n", and a vector with the words pan, pen, and pin:
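As a minimal sketch of this example, assume the three words are stored in a vector named pns and that base R's grepl() does the matching. The choice of grepl() is an assumption made for illustration; any regex function that accepts this pattern would behave the same way.

```r
# Vector with the three words mentioned in the text
pns <- c("pan", "pen", "pin")

# grepl() returns TRUE for each element that contains a match
pns_matched <- grepl("p[aeiou]n", pns)
pns_matched
# TRUE TRUE TRUE
```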

The set "p[aeiou]n" matches all elements in pns. Now let’s use the same set with another vector pnx:
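The contents of pnx sketched here are an assumption, reconstructed from the elements this chapter mentions (pan, pen, pin, p0n, p.n, p1n and paun). Again, grepl() is just one convenient way to check the matches.

```r
# Assumed contents of pnx, based on the elements named in the chapter
pnx <- c("pan", "pen", "pin", "p0n", "p.n", "p1n", "paun")

# Only p + one lower case vowel + n is matched
pnx_matched <- grepl("p[aeiou]n", pnx)
pnx[pnx_matched]
# "pan" "pen" "pin"
```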

As you can tell, this time only the first three elements in pnx are matched. Notice also that paun is not matched. This is because the character set matches only one character, either a or u but not au.

If you are interested in matching all capital letters in English, you can define a set formed as:

[ABCDEFGHIJKLMNOPQRSTUVWXYZ]

Likewise, you can define a set with only lower case letters in English:

[abcdefghijklmnopqrstuvwxyz]

If you are interested in matching any digit, you can also specify a character set like this:

[0123456789]

10.2 Character ranges

The previous examples that show character sets containing all the capital letters or all lower case letters are very convenient but require a lot of typing. Character ranges are going to help you solve that problem, by giving you a convenient shortcut based on the dash metacharacter "-" to indicate a range of characters. A character range consists of a character set with two characters separated by a dash or minus "-" sign.

Let’s see how you can reexpress the examples in the previous section as character ranges. The set of all digits can be expressed as a character range using the following pattern:

[0-9]

Likewise, the set of all lower case letters abcd…xyz is compactly represented with the character range:

[a-z]

And the character set of all upper case letters ABC…XYZ is formed by

[A-Z]

Note that the dash is only a metacharacter when it is inside a character set; outside the character set it is just a literal dash.

So how do you use character ranges? To illustrate the concept, let’s create a basic vector with some simple strings, and see what the different ranges match:
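One possible version of this vector is sketched below. The elements of basic are an assumption (a digit, a lower case letter, an upper case letter and a few symbols), and perl = TRUE is used so that the ranges are interpreted by the PCRE engine.

```r
# Hypothetical basic vector: one digit, two letters, three symbols
basic <- c("1", "a", "A", "&", "-", "^")

# perl = TRUE selects the PCRE engine
basic[grepl("[0-9]", basic, perl = TRUE)]   # "1"
basic[grepl("[a-z]", basic, perl = TRUE)]   # "a"
basic[grepl("[A-Z]", basic, perl = TRUE)]   # "A"
```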

Now consider the following vector triplets:

You can use a series of character ranges to match various occurrences of a certain type of character. For example, to match three consecutive digits you can define a pattern "[0-9][0-9][0-9]"; to match three consecutive lower case letters you can use the pattern "[a-z][a-z][a-z]"; and the same idea applies to a pattern that matches three consecutive upper case letters "[A-Z][A-Z][A-Z]".
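The three patterns can be tried out with a short sketch. The contents of triplets are assumed here (three digits, three lower case letters, three upper case letters, and the emoticon mentioned in the text).

```r
# Assumed contents of the triplets vector
triplets <- c("123", "abc", "ABC", ":-)")

# Each pattern matches exactly one of the first three elements
three_digits <- grepl("[0-9][0-9][0-9]", triplets, perl = TRUE)
three_lower  <- grepl("[a-z][a-z][a-z]", triplets, perl = TRUE)
three_upper  <- grepl("[A-Z][A-Z][A-Z]", triplets, perl = TRUE)

triplets[three_digits]   # "123"
triplets[three_lower]    # "abc"
triplets[three_upper]    # "ABC"
```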

Observe that the element ":-)" is not matched by any of the character ranges that we have seen so far.

Character ranges can be defined in multiple ways. For example, the range "[1-3]" indicates any one digit 1, 2, or 3. Another range may be defined as "[5-9]", comprising any one digit 5, 6, 7, 8 or 9. The same idea applies to letters. You can define ranges shorter than "[a-z]". One example is "[a-d]", which consists of any one letter a, b, c, or d.
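A quick way to check what a shorter range covers is to test it against all digits or all letters. The sketch below uses the built-in constant letters and a helper vector digits that is introduced only for illustration.

```r
# All single digits as strings (helper vector for this example)
digits <- as.character(0:9)

digits[grepl("[1-3]", digits, perl = TRUE)]    # "1" "2" "3"
digits[grepl("[5-9]", digits, perl = TRUE)]    # "5" "6" "7" "8" "9"

# letters is R's built-in vector of lower case letters
letters[grepl("[a-d]", letters, perl = TRUE)]  # "a" "b" "c" "d"
```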

10.3 Negative Character Sets

A common situation when working with regular expressions consists of matching characters that are NOT part of a certain set. This type of matching can be done using a negative character set, which matches any one character that is not in the set. To define this type of set you are going to use the metacharacter caret "^". If you are using a QWERTY keyboard, the caret symbol should be located over the key with the number 6.

The caret "^" is one of those metacharacters that have more than one meaning depending on where they appear in a pattern. If you use a caret in the first position inside a character set, e.g. [^aeiou], it means negation. In other words, the caret in [^aeiou] means “not any one of the lower case vowels.”

Let’s use the basic vector previously defined:

To match those elements that are NOT upper case letters, you define a negative character range "[^A-Z]":
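Here is a minimal sketch of the negative range, using an assumed basic vector made of a digit, two letters and three symbols.

```r
# Hypothetical basic vector (contents assumed for illustration)
basic <- c("1", "a", "A", "&", "-", "^")

# Every element except the upper case "A" is matched
not_upper <- grepl("[^A-Z]", basic, perl = TRUE)
basic[not_upper]
# "1" "a" "&" "-" "^"
```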

It is important that the caret is the first character inside the character set, otherwise the set is not a negative one:
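To see the difference, the sketch below moves the caret to the last position of the set. The basic vector is again an assumed one.

```r
# Hypothetical basic vector (contents assumed for illustration)
basic <- c("1", "a", "A", "&", "-", "^")

# The caret is NOT first, so it is just a literal character in the set
upper_or_caret <- grepl("[A-Z^]", basic, perl = TRUE)
basic[upper_or_caret]
# "A" "^"
```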

In the example above, the pattern "[A-Z^]" means “any one upper case letter or the caret character,” which is completely different from the negative set "[^A-Z]" that matches any one character that is not an upper case letter.

If you want to match any character except the caret, then you need to use a character set with two carets: "[^^]". The first caret works as a negative operator, the second caret is the caret character itself:
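A short sketch of the double caret set, once more with an assumed basic vector:

```r
# Hypothetical basic vector (contents assumed for illustration)
basic <- c("1", "a", "A", "&", "-", "^")

# First caret negates the set, second caret is the literal character
not_caret <- grepl("[^^]", basic, perl = TRUE)
basic[not_caret]
# "1" "a" "A" "&" "-"
```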

10.4 Metacharacters inside character sets

Now that you know what character sets are, how to define character ranges, and how to specify negative character sets, we need to talk about what happens when including metacharacters inside character sets.

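Except for the caret in the first position of the character set, most metacharacters inside a character set lose their special meaning and are treated as literal characters. This implies that you do not need to escape them using backslashes. There are a few exceptions, which are described later in this section.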

To illustrate the use of metacharacters inside character sets, let’s use the pnx vector:

The character set formed by "p[ae.io]n" includes the dot character. Remember that, in general, the period is the wildcard metacharacter and it matches any type of character. However, the period in this example is inside a character set, and because of that, it loses its wildcard behavior.
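The contrast between the literal dot inside a set and the wildcard dot outside a set can be sketched as follows. The contents of pnx are assumed from the elements named in this chapter.

```r
# Assumed contents of pnx, based on the elements named in the chapter
pnx <- c("pan", "pen", "pin", "p0n", "p.n", "p1n", "paun")

# Inside the set, the dot is a literal dot
dot_in_set <- grepl("p[ae.io]n", pnx, perl = TRUE)
pnx[dot_in_set]
# "pan" "pen" "pin" "p.n"

# Outside a set, the dot is a wildcard for any one character
dot_wildcard <- grepl("p.n", pnx, perl = TRUE)
pnx[dot_wildcard]
# "pan" "pen" "pin" "p0n" "p.n" "p1n"
```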

As you can tell, "p[ae.io]n" matches pan, pen, pin and p.n, but not p0n or p1n because the dot is the literal dot, not a wildcard character anymore.

Not all metacharacters become literal characters when they appear inside a character set. The exceptions are the closing bracket ], the dash -, the caret ^, and the backslash \.

The closing bracket ] is used to close the character set. Thus, if you want to use a literal closing bracket inside a character set you must escape it: [aei\\]ou]. Remember that in R you use a double backslash for escaping purposes. This is also why the backslash \, or double backslash in R, does not become a literal character.

Another interesting case has to do with the dash or hyphen - character. As you know, the dash inside a character set is used to define a range of characters: e.g. [0-9], [x-z], and [K-P]. As a general rule, if you want to include a literal dash as part of a character set, you should escape it: "[a-z\\-]".

Let’s modify the basic vector by adding an opening and a closing bracket:

How do you match each of the characters that have a special meaning inside a character set?
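One way to answer this is sketched below, assuming a basic vector extended with "[" and "]". Whether a backslash escape is honored inside a set depends on the regex engine, so this sketch uses perl = TRUE (the PCRE engine), where these escapes are supported.

```r
# Hypothetical basic vector with the two brackets appended
basic <- c("1", "a", "A", "&", "-", "^", "[", "]")

# Escaped closing bracket
basic[grepl("[\\]]", basic, perl = TRUE)]      # "]"

# Escaped opening bracket
basic[grepl("[\\[]", basic, perl = TRUE)]      # "["

# Range a-z plus an escaped literal dash
basic[grepl("[a-z\\-]", basic, perl = TRUE)]   # "a" "-"

# Escaped caret in the first position is a literal caret
basic[grepl("[\\^]", basic, perl = TRUE)]      # "^"

# A literal backslash needs four backslashes in an R string
has_backslash <- grepl("[\\\\]", c("a\\b", "ab"), perl = TRUE)
has_backslash                                  # TRUE FALSE
```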

10.5 Character Classes

Closely related with character sets and character ranges, regular expressions provide another useful construct called character classes which, as their name indicates, are used to match a certain class of characters. The most common character classes in most regex engines are:

Character   Matches                                        Same as
\\d         any digit                                      [0-9]
\\D         any nondigit                                   [^0-9]
\\w         any character considered part of a word        [a-zA-Z0-9_]
\\W         any character not considered part of a word    [^a-zA-Z0-9_]
\\s         any whitespace character                       [ \f\n\r\t\v]
\\S         any nonwhitespace character                    [^ \f\n\r\t\v]

You can think of character classes as another type of metacharacters, or as shortcuts for special character sets.
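The following sketch tries the classes on a small vector. The vector samples is made up for illustration, and perl = TRUE is used so that the classes are interpreted by the PCRE engine.

```r
# Made-up sample: digit, letter, underscore, space, tab, punctuation
samples <- c("7", "x", "_", " ", "\t", "!")

is_digit <- grepl("\\d", samples, perl = TRUE)
is_word  <- grepl("\\w", samples, perl = TRUE)
is_space <- grepl("\\s", samples, perl = TRUE)

samples[is_digit]          # "7"
samples[is_word]           # "7" "x" "_"
samples[is_space]          # " " "\t"

# The upper case versions negate the class
not_digit <- grepl("\\D", samples, perl = TRUE)
identical(not_digit, !is_digit)   # TRUE for these one-character strings
```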

The following table shows the characters that represent whitespaces:

Character   Description
\f          form feed
\n          line feed
\r          carriage return
\t          tab
\v          vertical tab

Sometimes you have to deal with nonprinting whitespace characters. In these situations you probably will end up using the whitespace character class \\s. A common example is when you have to match tab characters, or line breaks.

Windows uses \r\n as an end-of-line marker. In contrast, Unix-like operating systems (including macOS) use \n.

Tab characters \t are commonly used as a field separator in data files, but most text editors render them as ordinary white space.
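To make these situations concrete, here is a small sketch with made-up strings: a tab-separated header line, a piece of text with Windows line endings, and a string with mixed whitespace.

```r
# Made-up tab-separated header line
header <- "id\tname\tscore"
fields <- strsplit(header, "\\t", perl = TRUE)[[1]]
fields
# "id" "name" "score"

# Made-up text with a Windows end-of-line marker
windows_text <- "first line\r\nsecond line"
unix_text <- gsub("\\r\\n", "\n", windows_text, perl = TRUE)

# Collapse any run of whitespace into a single space
messy <- "a\t\tb \n c"
tidy <- gsub("\\s+", " ", messy, perl = TRUE)
tidy
# "a b c"
```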

10.6 POSIX Character Classes

We finish this chapter with the introduction of another type of character classes known as POSIX character classes. These are yet another class construct that is supported by the regex engine in R.

Class        Description                                Same as
[:alnum:]    any letter or digit                        [a-zA-Z0-9]
[:alpha:]    any letter                                 [a-zA-Z]
[:digit:]    any digit                                  [0-9]
[:lower:]    any lower case letter                      [a-z]
[:upper:]    any upper case letter                      [A-Z]
[:space:]    any whitespace including space             [\f\n\r\t\v ]
[:punct:]    any punctuation symbol
[:print:]    any printable character
[:graph:]    any printable character excluding space
[:xdigit:]   any hexadecimal digit                      [a-fA-F0-9]
[:cntrl:]    ASCII control characters

Notice that a POSIX character class is formed by an opening bracket [, followed by a colon :, followed by some keyword, followed by another colon :, and finally a closing bracket ].

In order to use them in R, you have to wrap a POSIX class inside a character set. That is, you have to surround a POSIX class with another pair of brackets, as in "[[:alpha:]]".

Once again, refer to the pnx vector to illustrate the use of POSIX classes:

Let’s start with the [:alpha:] class, and see what it matches in pnx:
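A minimal sketch, assuming pnx holds the elements named earlier in this chapter. Since every element contains the letter p, the class is placed between p and n to make the result more informative.

```r
# Assumed contents of pnx, based on the elements named in the chapter
pnx <- c("pan", "pen", "pin", "p0n", "p.n", "p1n", "paun")

# Note the double brackets: the POSIX class sits inside a character set
alpha_between <- grepl("p[[:alpha:]]n", pnx, perl = TRUE)
pnx[alpha_between]
# "pan" "pen" "pin"
```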

Now let’s test it with [:digit:]:
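The same sketch works with the digit class. The contents of pnx are again assumed, and regmatches() with regexpr() is just one way to pull out the matched characters.

```r
# Assumed contents of pnx, based on the elements named in the chapter
pnx <- c("pan", "pen", "pin", "p0n", "p.n", "p1n", "paun")

# Elements with a digit between p and n
digit_between <- grepl("p[[:digit:]]n", pnx, perl = TRUE)
pnx[digit_between]
# "p0n" "p1n"

# Extract the first digit found in each element that has one
found_digits <- regmatches(pnx, regexpr("[[:digit:]]", pnx, perl = TRUE))
found_digits
# "0" "1"
```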