9 Character Sets

In this chapter we will talk about character sets. You will learn about a couple of more metacharacters, the opening and closing brackets [ ], that will help you define a character set.

These square brackets indicate a character set which will match any one of the various characters that are inside the set. Keep in mind that a character set will match only one character. The order of the characters inside the set does not matter; what matter is just the presence of the characters inside the brackets. So for example if you have a set defined by "[AEIOU]", that will match any one upper case vowel.

9.1 Defining character sets

Let’s bring back the vector pns

pns <- c('pan', 'pen', 'pin', 'pon', 'pun')
pns

[1] "pan" "pen" "pin" "pon" "pun"

Consider the following pattern that includes a character set: "p[aeiou]n". As you can tell, this pattern is formed by the character p, folled by the set [aeiou], followed by the character n. What does this pattern match? Let’s find out with str_view()

str_view(pns, "p[aeiou]n")

The set "p[aeiou]n" matches all elements in pns.

For comparison purposes, let’s create another vector pnx

pnx <- c('pan', 'pen', 'pin', 'p0n', 'p.n', 'p1n', 'paun')
pnx

[1] "pan"  "pen"  "pin"  "p0n"  "p.n"  "p1n"  "paun"

And let’s test again the pattern "p[aeiou]n" to see what elements are matched

str_view(pnx, "p[aeiou]n")

This time only the first three elements in pnx are matched. Notice also that paun is not matched. This is because the character set matches only one character, either a or u but not au.

If you are interested in matching all capital letters in English, you can define a set formed as:

[ABCDEFGHIJKLMNOPQRSTUVWXYZ]

Likewise, you can define a set with only lower case letters in English:

[abcdefghijklmnopqrstuvwxyz]

If you are interested in matching any digit, you can also specify a character set like this:

[0123456789]

9.2 Character ranges

The previous examples that show character sets containing all the capital letters or all lower case letters are very convenient but require a lot of typing. Character ranges are going to help you solve that problem, by giving you a convenient shortcut based on the dash metacharacter "-" to indicate a range of characters. A character range consists of a character set with two characters separated by a dash or minus "-" sign.

Let’s see how you can reexpress the examples in the previous section as character ranges. The set of all digits can be expressed as a character range using the following pattern:

[0-9]

Likewise, the set of all lower case letters abcd…xyz is compactly represented with the character range:

[a-z]

And the character set of all upper case letters ABD…XYZ is formed by

[A-Z]

Note that the dash is only a metacharacter when it is inside a character set; outside the character set it is just a literal dash.

So how do you use character range? To illustrate the concept of character ranges let’s create a basic vector with some simple strings, and see what the different ranges match:

basic <- c('1', 'a', 'A', '&', '-', '^')

First, we use the range [0-9] to match any single digit

# digits
str_view(basic, '[0-9]')

Then we use the range of lower case letters [a-z]

# lower case letters
str_view(basic, '[a-z]')

And finally the range of upper case letters [A-Z]

# upper case letters
str_view(basic, '[A-Z]')

Now consider the following vector triplets:

triplets <- c('123', 'abc', 'ABC', ':-)')

You can use a series of character ranges to match various occurrences of a certain type of character. For example, to match three consecutive digits you can define a pattern "[0-9][0-9][0-9]"; to match three consecutive lower case letters you can use the pattern "[a-z][a-z][a-z]"; and the same idea applies to a pattern that matches three consecutive upper case letters "[A-Z][A-Z][A-Z]".

str_view(triplets, '[0-9][0-9][0-9]')

str_view(triplets, '[A-Z][A-Z][A-Z]')

Observe that the element ":-)" is not matched by any of the character ranges that we have seen so far.

Character ranges can be defined in multiple ways. For example, the range "[1-3]" indicates any one digit 1, 2, or 3. Another range may be defined as "[5-9]" comprising any one digit 5, 6, 7, 8 or 9. The same idea applies to letters. You can define shorter ranges other than "[a-z]". One example is "[a-d]" which consists of any one letter a, b, c, and d.

Pattern	Description
`[aeiou]`	match any one lower case vowel
`[AEIOU]`	match any one upper case vowel
`[0123456789]`	match any digit
`[0-9]`	match any digit (same as previous class)
`[a-z]`	match any lower case ASCII letter
`[A-Z]`	match any upper case ASCII letter
`[a-zA-Z0-9]`	match any of the above classes
`[^aeiou]`	match anything other than a lowercase vowel
`[^0-9]`	match anything other than a digit

9.3 Negative Character Sets

A common situation when working with regular expressions consists of matching characters that are NOT part of a certain set. This type of matching can be done using a negative character set: by matching any one character that is not in the set. To define this type of sets you are going to use the metacharacter caret "^". If you are using a QWERTY keyboard, the caret symbol should be located over the key with the number 6.

The caret "^" is one of those metacharacters that have more than one meaning depending on where it appears in a pattern. If you use a caret in the first position inside a character set, e.g. [^aeiou], it means negation. In other words, the caret in [^aeiou] means “not any one of lower case vowels.”

Let’s use the basic vector previously defined:

basic <- c('1', 'a', 'A', '&', '-', '^')
basic

[1] "1" "a" "A" "&" "-" "^"

To match those elements that are NOT upper case letters, you define a negative character range "[^A-Z]":

str_view(basic, '[^A-Z]')

It is important that the caret is the first character inside the character set, otherwise the set is not a negative one:

str_view(basic, '[A-Z^]')

In the example above, the pattern "[A-Z^]" means “any one upper case letter or the caret character.” Which is completely different from the negative set "[^A-Z]" that negates any one upper case letter.

If you want to match any character except the caret, then you need to use a character set with two carets: "[^^]". The first caret works as a negative operator, the second caret is the caret character itself:

str_view(basic, '[^^]')

9.4 Metacharacters Inside Character Sets

Now that you know what character sets are, how to define character ranges, and how to specify negative character sets, we need to talk about what happens when including metacharacters inside character sets.

Except for the caret in the first position of the character set, any other metacharacter inside a character set is already escaped. This implies that you do not need to escape them using backslashes.

To illustrate the use of metacharacters inside character sets, let’s use the pnx vector:

pnx <- c('pan', 'pen', 'pin', 'p0n', 'p.n', 'p1n', 'paun')
pnx

The character set formed by "p[ae.io]n" includes the dot character. Remember that, in general, the period is the wildcard metacharacter and it matches any type of character. However, the period in this example is inside a character set, and because of that, it loses its wildcard behavior.

str_view(pnx, "p[ae.io]n")

As you can tell, "p[ae.io]n" matches pan, pen, pin and p.n, but not p0n or p1n because the dot is the literal dot, not a wildcard character anymore.

Not all metacharacters become literal characters when they appear inside a character set. The exceptions are the closing bracket ], the dash -, the caret ^, and the backslash \.

The closing bracket ] is used to enclose the character set. Thus, if you want to use a literal right bracket inside a character set you must escape it:

"[aei\\[ou]"

Remember that in R you use double backslash for escaping purposes. This is also why the backslash \, or double backslash in R, does not become a literal character.

Another interesting case has to do with the dash or hyphen - character. As you know, the dash inside a character set is used to define a range of characters: e.g. [0-9], [x-z], and [K-P]. As a general rule, if you want to include a literal dash as part of a range, you should escape it: "[a-z\\-]".

Let’s modify the basic vector by adding an opening and ending brackets:

basic <- c('1', 'a', 'A', '&', '-', '^', '[', ']')

How do you match each of the characters that have a special meaning inside a character set?

# matching a literal caret
str_view(basic, "[a\\^]")

# matching a literal dash
str_view(basic, "[a\\-]")

# matching a literal opening bracket
str_view(basic, "[a\\[]")

# matching a literal closing bracket
str_view(basic, "[a\\]]")

9.5 Character Classes

Closely related with character sets and character ranges, regular expressions provide another useful construct called character classes which, as their name indicates, are used to match a certain class of characters. The most common character classes in most regex engines are:

Character	Matches	Same as
`\\d`	any digit	`[0-9]`
`\\D`	any nondigit	`[^0-9]`
`\\s`	any whitespace character	`[\f\n\r\t\v]`
`\\S`	any nonwhitespace character	`[^\f\n\r\t\v]`
`\\w`	any character considered part of a word	`[a-zA-Z0-9_]`
`\\W`	any character not considered part of a word	`[^a-zA-Z0-9_]`
`\\b`	any word boundary
`\\B`	any non-(word boundary)

You can think of character classes as another type of metacharacters, or as shortcuts for special character sets.

# replace digit with '_'
str_replace("the dandelion war 2010", "\\d", "_")

[1] "the dandelion war _010"

str_replace_all("the dandelion war 2010", "\\d", "_")

[1] "the dandelion war ____"

# replace non-digit with '_'
str_replace("the dandelion war 2010", "\\D", "_")

[1] "_he dandelion war 2010"

str_replace_all("the dandelion war 2010", "\\D", "_")

[1] "__________________2010"

Spaces and non-spaces

# replace space with '_'
str_replace("the dandelion war 2010", "\\s", "_")

[1] "the_dandelion war 2010"

str_replace_all("the dandelion war 2010", "\\s", "_")

[1] "the_dandelion_war_2010"

# replace non-space with '_'
str_replace("the dandelion war 2010", "\\S", "_")

[1] "_he dandelion war 2010"

str_replace_all("the dandelion war 2010", "\\S", "_")

[1] "___ _________ ___ ____"

Words and non-words

# replace word with '_'
str_replace("the dandelion war 2010", "\\b", "_")

[1] "_the dandelion war 2010"

str_replace_all("the dandelion war 2010", "\\b", "_")

[1] "_the_ _dandelion_ _war_ _2010_"

# replace non-word with '_'
str_replace("the dandelion war 2010", "\\B", "_")

[1] "t_he dandelion war 2010"

str_replace_all("the dandelion war 2010", "\\B", "_")

[1] "t_h_e d_a_n_d_e_l_i_o_n w_a_r 2_0_1_0"

Word boundaries and non-word-boundaries

# replace word boundary with '_'
str_replace("the dandelion war 2010", "\\w", "_")

[1] "_he dandelion war 2010"

str_replace_all("the dandelion war 2010", "\\w", "_")

[1] "___ _________ ___ ____"

# replace non-word boundary with '_'
str_replace("the dandelion war 2010", "\\W", "_")

[1] "the_dandelion war 2010"

str_replace_all("the dandelion war 2010", "\\W", "_")

[1] "the_dandelion_war_2010"

The following table shows the characters that represent whitespaces:

Character	Description
`\f`	form feed
`\n`	line feed
`\r`	carriage return
`\t`	tab
`\v`	vertical tab

Sometimes you have to deal with nonprinting whitespace characters. In these situations you probably will end up using the whitespace character class \\s. A common example is when you have to match tab characters, or line breaks.

The operating system Windows uses \r\n as an end-of-line marker. In contrast, Unix-like operating systems (including Mac OS) use \n.

Tab characters \t are commonly used as a field-separator for data files. But most text editors render them as whitespaces.

9.6 POSIX Character Classes

We finish this chapter with the introduction of another type of character classes known as POSIX character classes. These are yet another class construct that is supported by the regex engine in R.

Class	Description	Same as
`[:alnum:]`	any letter or digit	`[a-zA-Z0-9]`
`[:alpha:]`	any letter	`[a-zA-Z]`
`[:digit:]`	any digit	`[0-9]`
`[:lower:]`	any lower case letter	`[a-z]`
`[:upper:]`	any upper case letter	`[A-Z]`
`[:space:]`	any whitespace including space	`[\f\n\r\t\v ]`
`[:punct:]`	any punctuation symbol
`[:print:]`	any printable character
`[:graph:]`	any printable character excluding space
`[:xdigit:]`	any hexadecimal digit	`[a-fA-F0-9]`
`[:cntrl:]`	ASCII control characters

Notice that a POSIX character class is formed by an opening bracket [, followed by a colon :, followed by some keyword, followed by another colon :, and finally a closing bracket ].

In order to use them in R, you have to wrap a POSIX class inside a regex character class. That is, you have to surround a POSIX class with brackets.

Once again, refer to the pnx vector to illustrate the use of POSIX classes:

pnx <- c('pan', 'pen', 'pin', 'p0n', 'p.n', 'p1n', 'paun')

Let’s start with the [:alpha:] class, and see what does it match in pnx:

str_view(pnx, "[[:alpha:]]")

Now let’s test it with [:digit:]

str_view(pnx, "[[:digit:]]")

Here’s another example, suppose we are dealing with the following string:

# la vie (string)
la_vie <- "La vie en #FFC0CB (rose);\nCes't la vie! \ttres jolie"

# if you print 'la_vie'
print(la_vie)

[1] "La vie en #FFC0CB (rose);\nCes't la vie! \ttres jolie"

# if you cat 'la_vie'
cat(la_vie)

La vie en #FFC0CB (rose);
Ces't la vie!   tres jolie

Here’s what would happen to the string la_vie if we apply some substitutions with the POSIX character classes:

# remove space characters
str_replace_all(la_vie, pattern="[[:blank:]]", replacement='')

[1] "Lavieen#FFC0CB(rose);\nCes'tlavie!tresjolie"

# remove digits
str_replace_all(la_vie, pattern="[[:punct:]]", replacement='')

[1] "La vie en FFC0CB rose\nCest la vie \ttres jolie"

# remove digits
str_replace_all(la_vie, pattern="[[:xdigit:]]", replacement='')

[1] "L vi n # (ros);\ns't l vi! \ttrs joli"

# remove printable characters
str_replace_all(la_vie, pattern="[[:print:]]", replacement='')

[1] "\n\t"

# remove non-printable characters
str_replace_all(la_vie, pattern="[^[:print:]]", replacement='')

[1] "La vie en #FFC0CB (rose);Ces't la vie! tres jolie"

# remove graphical characters
str_replace_all(la_vie, pattern="[[:graph:]]", replacement='')

[1] "    \n   \t "

# remove non-graphical characters
str_replace_all(la_vie, pattern="[^[:graph:]]", replacement='')

[1] "Lavieen#FFC0CB(rose);Ces'tlavie!tresjolie"