12 Character Sets
In this chapter we will talk about character sets. You will learn about
two more metacharacters, the opening and closing brackets [ ], that
will help you define a character set.
These square brackets indicate a character set which will match any one of
the various characters that are inside the set. Keep in mind that a character
set will match only one character. The order of the characters inside the set
does not matter; what matters is just the presence of the characters inside the
brackets. So, for example, the set "[AEIOU]" matches any one upper case vowel.
12.1 Defining character sets
Consider the pattern "p[aeiou]n", which includes a character set,
and a vector pns with the words pan, pen, and pin.
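A minimal sketch of this example in base R. The choice of grepl() with perl = TRUE as the matching function is an assumption made for illustration; any regex function in R could be used instead.

```r
# The three words mentioned in the text
pns <- c("pan", "pen", "pin")

# TRUE for each element containing p, one lower case vowel, then n
pns_matched <- grepl("p[aeiou]n", pns, perl = TRUE)
print(pns_matched)
```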
The set "p[aeiou]n" matches all elements in pns. Now let’s use the same
set with another vector, pnx, which also contains strings such as p0n, p.n, p1n, and paun.
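The following sketch applies the same pattern to pnx. The exact contents of pnx are an assumption: the text only establishes that pan, pen, and pin come first and that p0n, p.n, p1n, and paun also appear. grepl() with perl = TRUE is likewise an illustrative choice.

```r
# Assumed contents of pnx (only the first three positions are fixed by the text)
pnx <- c("pan", "pen", "pin", "p0n", "p.n", "p1n", "paun")

# Keep the elements that contain p, exactly one lower case vowel, then n
pnx_matched <- pnx[grepl("p[aeiou]n", pnx, perl = TRUE)]
print(pnx_matched)
```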
As you can tell, this time only the first three elements in pnx
are matched.
Notice also that paun is not matched. This is because the character set
matches only one character, either a or u but not au.
If you are interested in matching all capital letters in English, you can define a set formed as:
[ABCDEFGHIJKLMNOPQRSTUVWXYZ]
Likewise, you can define a set with only lower case letters in English:
[abcdefghijklmnopqrstuvwxyz]
If you are interested in matching any digit, you can also specify a character set like this:
[0123456789]
12.2 Character ranges
The previous examples, which show character sets containing all the capital
letters or all the lower case letters, work but require a lot of
typing. Character ranges solve that problem by
giving you a convenient shortcut based on the dash metacharacter "-" to
indicate a range of characters. A character range consists of a character set
with two characters separated by a dash (or minus sign) "-".
Let’s see how you can reexpress the examples in the previous section as character ranges. The set of all digits can be expressed as a character range using the following pattern:
[0-9]
Likewise, the set of all lower case letters abcd…xyz is compactly represented with the character range:
[a-z]
And the character set of all upper case letters ABC…XYZ is formed by
[A-Z]
Note that the dash is only a metacharacter when it is inside a character set; outside the character set it is just a literal dash.
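The difference between a dash inside and outside a set can be checked with a short sketch. The test strings and the use of grepl() with perl = TRUE are assumptions chosen for illustration.

```r
# Inside a set, the dash defines a range: any one of a, b, or c
in_set <- grepl("[a-c]", c("a", "b", "c", "-"), perl = TRUE)

# Outside a set, the dash is literal: the pattern is the three characters a-c
outside_set <- grepl("a-c", c("a", "b", "a-c"), perl = TRUE)

print(in_set)
print(outside_set)
```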
So how do you use a character range? To illustrate the concept,
let’s create a vector named basic with some simple strings, and see
what the different ranges match.
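A sketch of this step in base R. The contents of basic are an assumption (one digit, one lower case letter, one upper case letter, and a few symbols), as is the use of grepl() with perl = TRUE.

```r
# Assumed contents of the basic vector
basic <- c("1", "a", "A", "&", "-", "^")

# Each range selects a different element
digits <- basic[grepl("[0-9]", basic, perl = TRUE)]
lower  <- basic[grepl("[a-z]", basic, perl = TRUE)]
upper  <- basic[grepl("[A-Z]", basic, perl = TRUE)]

print(list(digits = digits, lower = lower, upper = upper))
```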
Now consider a vector named triplets whose elements are three-character strings.
You can use a series of character ranges to match various occurrences of
a certain type of character. For example, to match three consecutive digits
you can define a pattern "[0-9][0-9][0-9]"; to match three consecutive
lower case letters you can use the pattern "[a-z][a-z][a-z]"; and the
same idea applies to a pattern that matches three consecutive upper case
letters, "[A-Z][A-Z][A-Z]".
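The three patterns can be tried on a small vector. The contents of triplets below are an assumption (the text only confirms that it contains the element ":-)"), and grepl() with perl = TRUE is an illustrative choice.

```r
# Assumed contents of the triplets vector
triplets <- c("123", "abc", "ABC", ":-)")

three_digits <- triplets[grepl("[0-9][0-9][0-9]", triplets, perl = TRUE)]
three_lower  <- triplets[grepl("[a-z][a-z][a-z]", triplets, perl = TRUE)]
three_upper  <- triplets[grepl("[A-Z][A-Z][A-Z]", triplets, perl = TRUE)]

print(c(three_digits, three_lower, three_upper))
```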
Observe that the element ":-)" is not matched by any of the character ranges
that we have seen so far.
Character ranges can be defined in multiple ways. For example, the range
"[1-3]" indicates any one digit 1, 2, or 3. Another range may be defined
as "[5-9]", comprising any one digit 5, 6, 7, 8, or 9. The same idea applies
to letters. You can define ranges shorter than "[a-z]". One example
is "[a-d]", which consists of any one letter a, b, c, or d.
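A short sketch of these partial ranges, using a made-up vector x of single characters and grepl() with perl = TRUE (both assumptions for illustration):

```r
# Hypothetical test vector of single characters
x <- c("1", "2", "3", "5", "9", "a", "d", "e")

one_to_three <- x[grepl("[1-3]", x, perl = TRUE)]
five_to_nine <- x[grepl("[5-9]", x, perl = TRUE)]
a_to_d       <- x[grepl("[a-d]", x, perl = TRUE)]

print(list(one_to_three, five_to_nine, a_to_d))
```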
12.3 Negative Character Sets
A common situation when working with regular expressions consists of matching
characters that are NOT part of a certain set. This type of matching can be
done using a negative character set, which matches any one character that is not
in the set. To define this type of set you use the caret metacharacter "^".
If you are using a QWERTY keyboard, the caret symbol should be
located on the key with the number 6.
The caret "^" is one of those metacharacters that have more than one meaning
depending on where it appears in a pattern. If you use a caret in the first
position inside a character set, e.g. [^aeiou], it means negation. In
other words, [^aeiou] means “any one character that is not a lower case vowel.”
Let’s use the basic vector previously defined.
To match those elements that are NOT upper case letters, you define a negative
character range "[^A-Z]".
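A sketch of the negative range applied to basic. The contents of basic are assumed here, and grepl() with perl = TRUE is an illustrative choice.

```r
# Assumed contents of the basic vector
basic <- c("1", "a", "A", "&", "-", "^")

# Keep every element containing a character that is not an upper case letter
not_upper <- basic[grepl("[^A-Z]", basic, perl = TRUE)]
print(not_upper)
```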
It is important that the caret is the first character inside the character set, otherwise the set is not a negative one:
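To see the effect of the caret's position, this sketch moves it to the end of the set. The contents of basic are an assumption, and grepl() with perl = TRUE is used only for illustration.

```r
# Assumed contents of the basic vector
basic <- c("1", "a", "A", "&", "-", "^")

# The caret is not first, so it is an ordinary member of the set
upper_or_caret <- basic[grepl("[A-Z^]", basic, perl = TRUE)]
print(upper_or_caret)
```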
The pattern "[A-Z^]" means “any one upper case letter
or the caret character,” which is completely different from the negative set
"[^A-Z]", which matches any one character that is not an upper case letter.
If you want to match any character except the caret, then you need to use a
character set with two carets: "[^^]". The first caret works as the negation
operator; the second caret is the caret character itself.
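A sketch of the double-caret set, again with assumed contents for basic and grepl() with perl = TRUE as an illustrative matching function:

```r
# Assumed contents of the basic vector
basic <- c("1", "a", "A", "&", "-", "^")

# First caret negates the set; second caret is the literal character
not_caret <- basic[grepl("[^^]", basic, perl = TRUE)]
print(not_caret)
```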
12.4 Metacharacters inside character sets
Now that you know what character sets are, how to define character ranges, and how to specify negative character sets, we need to talk about what happens when including metacharacters inside character sets.
Apart from the caret in the first position of the character set, most metacharacters lose their special meaning inside a character set and are treated as literal characters. This implies that you do not need to escape them using backslashes. The few exceptions are described later in this section.
To illustrate the use of metacharacters inside character sets, let’s use
the pnx vector:
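The sketch below contrasts a dot inside a set with a dot outside one. The contents of pnx are an assumption consistent with the strings named in the text, and grepl() with perl = TRUE is an illustrative choice.

```r
# Assumed contents of pnx
pnx <- c("pan", "pen", "pin", "p0n", "p.n", "p1n", "paun")

# Inside the set, the dot is a literal dot
dot_in_set <- pnx[grepl("p[ae.io]n", pnx, perl = TRUE)]

# Outside a set, the dot is a wildcard for any single character
dot_wildcard <- pnx[grepl("p.n", pnx, perl = TRUE)]

print(dot_in_set)
print(dot_wildcard)
```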
The character set formed by "p[ae.io]n" includes the dot character. Remember
that, in general, the period is the wildcard metacharacter and it matches
any type of character. However, the period in this example is inside a
character set, and because of that, it loses its wildcard behavior.
As you can tell, "p[ae.io]n" matches pan, pen, pin and p.n,
but not p0n or p1n because the dot is the literal dot, not a wildcard
character anymore.
Not all metacharacters become literal characters when they appear inside a
character set. The exceptions are the closing bracket ], the dash -,
the caret ^, and the backslash \.
The closing bracket ] is used to close the character set. Thus, if you
want to use a literal right bracket inside a character set you must escape it:
[aei\\]ou]. Remember that in R you use a double backslash for escaping
purposes. This is also why the backslash \, or double backslash in R,
does not become a literal character.
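A sketch of both cases: an escaped closing bracket and a literal backslash inside a set. The test vector x is made up for illustration, and perl = TRUE is passed so that backslash escapes inside brackets are handled by the PCRE engine.

```r
# Hypothetical test vector; "\\" is a single backslash character in R
x <- c("a", "]", "[", "\\")

# Escaped closing bracket: the set holds a, e, i, ], o, u
has_right_bracket_or_vowel <- grepl("[aei\\]ou]", x, perl = TRUE)

# Four backslashes in the R string give an escaped backslash in the regex
has_backslash <- grepl("[\\\\]", x, perl = TRUE)

print(has_right_bracket_or_vowel)
print(has_backslash)
```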
Another interesting case has to do with the dash or hyphen - character. As
you know, the dash inside a character set is used to define a range of
characters: e.g. [0-9], [x-z], and [K-P]. As a general rule, if you
want to include a literal dash as part of a set, you should escape it:
"[a-z\\-]".
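A sketch of a set that contains a literal dash. The test vector is made up, and perl = TRUE is an assumption so that the escaped dash is interpreted by PCRE. The second pattern relies on the common convention that a dash placed last in a set is also literal.

```r
# Hypothetical test vector
x <- c("b", "-", "B", "1")

# Escaped dash: any lower case letter or a literal dash
escaped_dash <- grepl("[a-z\\-]", x, perl = TRUE)

# Dash in last position: also read as a literal dash
trailing_dash <- grepl("[a-z-]", x, perl = TRUE)

print(escaped_dash)
print(trailing_dash)
```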
Let’s modify the basic vector by adding opening and closing brackets.
How do you match each of the characters that have a special meaning inside a character set?
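One possible answer is to escape each special character inside the set. In this sketch the contents of basic are assumed, match_one() is a hypothetical helper defined only for this example, and perl = TRUE is used so the escapes are handled by PCRE.

```r
# Assumed contents of the modified basic vector
basic <- c("1", "a", "A", "&", "-", "^", "[", "]")

# Hypothetical helper: elements of basic that match a pattern
match_one <- function(pattern) basic[grepl(pattern, basic, perl = TRUE)]

caret_only   <- match_one("[\\^]")
dash_only    <- match_one("[\\-]")
closing_only <- match_one("[\\]]")
opening_only <- match_one("[\\[]")
all_special  <- match_one("[\\^\\-\\[\\]]")

print(all_special)
```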
12.5 Character Classes
Closely related to character sets and character ranges, regular expressions provide another useful construct called character classes which, as their name indicates, are used to match a certain class of characters. The most common character classes in most regex engines are:
Character | Matches | Same as |
---|---|---|
\\d | any digit | [0-9] |
\\D | any nondigit | [^0-9] |
\\w | any character considered part of a word | [a-zA-Z0-9_] |
\\W | any character not considered part of a word | [^a-zA-Z0-9_] |
\\s | any whitespace character | [ \f\n\r\t\v] |
\\S | any nonwhitespace character | [^ \f\n\r\t\v] |
You can think of character classes as another type of metacharacters, or as shortcuts for special character sets.
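A sketch of the shorthand classes on a made-up vector of single characters. grepl() with perl = TRUE is an assumption for illustration.

```r
# Hypothetical test vector: digit, letter, underscore, space, tab, symbol
x <- c("7", "x", "_", " ", "\t", "&")

digit_chars    <- x[grepl("\\d", x, perl = TRUE)]
word_chars     <- x[grepl("\\w", x, perl = TRUE)]
space_chars    <- x[grepl("\\s", x, perl = TRUE)]
non_word_chars <- x[grepl("\\W", x, perl = TRUE)]

print(word_chars)
```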
The following table shows the characters that represent whitespaces:
Character | Description |
---|---|
\f | form feed |
\n | line feed |
\r | carriage return |
\t | tab |
\v | vertical tab |
Sometimes you have to deal with nonprinting whitespace characters. In these
situations you probably will end up using the whitespace character class \\s.
A common example is when you have to match tab characters, or line breaks.
The operating system Windows uses \r\n as an end-of-line marker. In contrast,
Unix-like operating systems (including Mac OS) use \n.
Tab characters \t are commonly used as a field separator for data files, but
most text editors render them as blank space.
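As an illustration, the sketch below takes a made-up tab-separated record with a Windows-style line ending, detects the \r\n marker, strips the trailing whitespace, and splits the fields on tabs. The record and the variable names are hypothetical, and perl = TRUE is an assumption.

```r
# Hypothetical tab-separated record ending in a Windows line break
record <- "id\tname\tage\r\n"

# Does the record end with carriage return + line feed?
has_crlf <- grepl("\\r\\n$", record, perl = TRUE)

# Remove trailing whitespace, then split on tab characters
trimmed <- sub("\\s+$", "", record, perl = TRUE)
fields  <- strsplit(trimmed, "\\t", perl = TRUE)[[1]]

print(has_crlf)
print(fields)
```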
12.6 POSIX Character Classes
We finish this chapter with the introduction of another type of character class, known as POSIX character classes. These are yet another class construct that is supported by the regex engine in R.
Class | Description | Same as |
---|---|---|
[:alnum:] | any letter or digit | [a-zA-Z0-9] |
[:alpha:] | any letter | [a-zA-Z] |
[:digit:] | any digit | [0-9] |
[:lower:] | any lower case letter | [a-z] |
[:upper:] | any upper case letter | [A-Z] |
[:space:] | any whitespace including space | [\f\n\r\t\v ] |
[:punct:] | any punctuation symbol | |
[:print:] | any printable character | |
[:graph:] | any printable character excluding space | |
[:xdigit:] | any hexadecimal digit | [a-fA-F0-9] |
[:cntrl:] | ASCII control characters | |
Notice that a POSIX character class is formed by an opening bracket [,
followed by a colon :, followed by some keyword, followed by another
colon :, and finally a closing bracket ].
In order to use them in R, you have to wrap a POSIX class inside a character set. That is, you have to surround a POSIX class with a second pair of brackets, as in "[[:alpha:]]".
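A sketch of wrapped POSIX classes, using gsub() to delete whatever each class matches. The input string is made up, and gsub() with perl = TRUE is an illustrative choice.

```r
# Hypothetical input string
x <- "Hello, world! 123"

# Remove punctuation
no_punct <- gsub("[[:punct:]]", "", x, perl = TRUE)

# A POSIX class can be negated inside the outer brackets: keep letters only
letters_only <- gsub("[^[:alpha:]]", "", x, perl = TRUE)

# Several POSIX classes can share one set: remove digits and whitespace
no_digits_spaces <- gsub("[[:digit:][:space:]]", "", x, perl = TRUE)

print(c(no_punct, letters_only, no_digits_spaces))
```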
Once again, refer to the pnx vector to illustrate the use of POSIX classes.
Let’s start with the [:alpha:] class, and see what it matches in pnx:
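A sketch with the [:alpha:] class placed between p and n, so that the result shows which middle characters count as letters. The contents of pnx and the surrounding p…n pattern are assumptions for illustration, as is grepl() with perl = TRUE.

```r
# Assumed contents of pnx
pnx <- c("pan", "pen", "pin", "p0n", "p.n", "p1n", "paun")

# p, any one letter, then n
alpha_matched <- pnx[grepl("p[[:alpha:]]n", pnx, perl = TRUE)]
print(alpha_matched)
```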
Now let’s test the [:digit:] class:
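The same kind of sketch for [:digit:], again with assumed contents for pnx and an assumed p…n pattern around the class:

```r
# Assumed contents of pnx
pnx <- c("pan", "pen", "pin", "p0n", "p.n", "p1n", "paun")

# p, any one digit, then n
digit_matched <- pnx[grepl("p[[:digit:]]n", pnx, perl = TRUE)]
print(digit_matched)
```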