15 Strings

Now that we’ve seen vectors and factors, let’s spend some time talking about character data, also referred to as “strings” in the programming world.

Most of the material in this chapter is borrowed from Gaston Sanchez’s book Handling Strings in R (with permission from the author).

15.1 Text Everywhere

At its heart, computing involves working with numbers. That’s the main reason why computers were invented: to facilitate mathematical operations on numbers, from basic arithmetic to more complex operations (e.g. trigonometry, algebra, calculus). Nowadays, however, we use computers to work with data that are not just numbers. We use them to write a variety of documents, to create and edit images and videos, and to manipulate sound, among many other tasks.

Today, there is a considerable amount of information and data in the form of text. Look at any website: pretty much all of the content is text and images, with some videos here and there, and maybe some tables and/or lists of numbers.

Likewise, most of the time you are going to be working with text files: script files, reports, data files, source code files, etc. All the R script files that you use are essentially plain text files. I bet you have a CSV file, or a file in some other field-delimited format (or even in HTML, XML, JSON, etc.), with some fields containing characters. In all of these cases what you are working with is essentially a bunch of characters.

15.2 Character Strings in R

In R, a piece of text is represented as a sequence of characters (letters, numbers, and symbols). The data type R provides for storing sequences of characters is character. Formally, the mode of an object that holds character strings in R is "character".

You express character strings by surrounding text within double quotes:

"a character string using double quotes"

or you can also surround text within single quotes:

'a character string using single quotes'

The important thing is that you must match the type of quotes that you are using. A starting double quote must have an ending double quote. Likewise, a string with an opening single quote must be closed with a single quote.

Typing strings directly in R, as in the examples above, is not very useful. Typically, you are going to create objects or variables containing some strings. For example, you can create a variable string that stores some string:

string <- 'do more with less'
string
#> [1] "do more with less"

Notice that when you print a character object, R displays it using double quotes (regardless of whether the string was created using single or double quotes). This allows you to quickly identify when an object contains character values.

When writing strings, you can insert single quotes in a string with double quotes, and vice versa:

# single quotes within double quotes
ex1 <- "The 'R' project for statistical computing"

# double quotes within single quotes
ex2 <- 'The "R" project for statistical computing'

However, you cannot directly insert single quotes in a string with single quotes, nor can you insert double quotes in a string with double quotes (don’t do this!):

ex3 <- "This "is" totally unacceptable"

ex4 <- 'This 'is' absolutely wrong'

In both cases R will give you an error due to the unexpected presence of either a double quote within double quotes, or a single quote within single quotes.

If you really want to include a double quote as part of the string, you need to escape the double quote using a backslash \ before it:

"The \"R\" project for statistical computing"
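
The same backslash trick works for single quotes inside a single-quoted string. The short sketch below (the object names q1 and q2 are only illustrative) also suggests that the backslash is part of how the string is typed and printed, not part of its contents: cat() displays the raw characters, and nchar() does not count the backslashes.

```r
# escaping double quotes inside double quotes
q1 <- "The \"R\" project"

# escaping single quotes inside single quotes
q2 <- 'The \'R\' project'

# print() shows the escapes; cat() shows the actual characters
print(q1)
cat(q1, "\n")

# the backslashes are not counted as characters
nchar(q1)
#> [1] 15
```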

15.2.1 Empty String

The most basic type of string is the empty string produced by consecutive quotation marks: "". Technically, "" is a string with no characters in it, hence the name “empty string”:

# empty string
empty_str <- ""
empty_str
#> [1] ""

# class
class(empty_str)
#> [1] "character"

15.2.2 Empty Character Vector

Another basic string structure is the empty character vector produced by the function character() and its argument length = 0:

# empty character vector
empty_chr <- character(0)
empty_chr
#> character(0)

# class
class(empty_chr)
#> [1] "character"

It is important not to confuse the empty character vector character(0) with the empty string ""; one of the main differences between them is that they have different lengths:

# length of empty string
length(empty_str)
#> [1] 1

# length of empty character vector
length(empty_chr)
#> [1] 0

Notice that the empty string empty_str has length 1, while the empty character vector empty_chr has length 0.
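
To make the distinction more concrete, here is a small sketch (the object name three_empty is just for illustration). Calling character() with a positive length creates a vector filled with empty strings, and combining values with c() shows that character(0) contributes no element while "" does:

```r
# character(3) creates a vector with three empty strings
three_empty <- character(3)
three_empty
#> [1] "" "" ""

length(three_empty)
#> [1] 3

# combining character(0) with a string adds no extra element
length(c(character(0), "a"))
#> [1] 1

# combining "" with a string keeps the empty string as an element
length(c("", "a"))
#> [1] 2
```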

Related to character(), R provides two other functions: as.character() and is.character(). The former coerces objects to type "character", and the latter tests whether an R object is of type "character". For instance, let’s define two objects a and b as follows:

# define two objects 'a' and 'b'
a <- "test me"
b <- 8 + 9

To test if a and b are of type "character" use the function is.character():

# are 'a' and 'b' characters?
is.character(a)
#> [1] TRUE

is.character(b)
#> [1] FALSE

The function as.character() is a coercing method. For better or worse, R allows you to convert (i.e. coerce) non-character objects into character strings with the function as.character():

# converting 'b' as character
b <- as.character(b)
b
#> [1] "17"
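
Coercion is vectorized, so as.character() converts every element of a vector. The sketch below uses a couple of made-up vectors (nums and lgls) to show this, along with the reverse conversion using as.numeric(), which only makes sense for strings that look like numbers:

```r
# numeric vector to character
nums <- as.character(c(1.5, 2, 300))
nums
#> [1] "1.5" "2"   "300"

# logical vector to character
lgls <- as.character(c(TRUE, FALSE))
lgls
#> [1] "TRUE"  "FALSE"

# converting a number-like string back to numeric
as.numeric("17") + 1
#> [1] 18
```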

15.2.3 The workhorse function `paste()`

The function paste() is perhaps one of the most important functions that you can use to create and build strings. paste() takes one or more R objects, converts them to "character", and then it concatenates (pastes) them to form one or several character strings. Its usage has the following form:

paste(..., sep = " ", collapse = NULL)

The argument ... means that it takes any number of objects. The argument sep is a character string that is used as a separator. The argument collapse is an optional string to indicate if you want all the terms to be collapsed into a single string. Here is a simple example with paste():

# paste
PI <- paste("The life of", pi)

PI
#> [1] "The life of 3.14159265358979"

As you can see, the default separator is a blank space (sep = " "). But you can select another character, for example sep = "-":

# paste
IloveR <- paste("I", "love", "R", sep = "-")

IloveR
#> [1] "I-love-R"

If you give paste() objects of different length, then it will apply a recycling rule. For example, if you paste a single character "X" with the sequence 1:5, and separator sep = ".", this is what you get:

# paste with objects of different lengths
paste("X", 1:5, sep = ".")
#> [1] "X.1" "X.2" "X.3" "X.4" "X.5"

To see the effect of the collapse argument, let’s compare the difference with collapsing and without it:

# paste with collapsing
paste(1:3, c("!","?","+"), sep = '', collapse = "")
#> [1] "1!2?3+"

# paste without collapsing
paste(1:3, c("!","?","+"), sep = '')
#> [1] "1!" "2?" "3+"

One of the potential problems with paste() is that it coerces missing values NA into the character "NA":

# with missing values NA
evalue <- paste("the value of 'e' is", exp(1), NA)

evalue
#> [1] "the value of 'e' is 2.71828182845905 NA"

In addition to paste(), there’s also the function paste0() which is a wrapper of paste() equivalent to:

paste(..., sep = "", collapse)

# pasting without a separator using paste0
paste0("let's", "collapse", "all", "these", "words")
#> [1] "let'scollapseallthesewords"
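
As a quick illustration of how sep and collapse play different roles, the following sketch (object names are illustrative) first builds several strings with paste0(), and then joins them into a single string with collapse:

```r
# paste0() is vectorized, just like paste()
terms <- paste0("x", 1:3)
terms
#> [1] "x1" "x2" "x3"

# adding collapse joins the resulting elements into one string
formula_str <- paste0("x", 1:3, collapse = " + ")
formula_str
#> [1] "x1 + x2 + x3"

# collapse also works on a single vector
abc <- paste(c("a", "b", "c"), collapse = ", ")
abc
#> [1] "a, b, c"
```

In short, sep goes between the arguments being pasted together, while collapse goes between the elements of the resulting vector.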

15.3 Basic Manipulations

Besides creating and printing strings, there are a number of very handy functions in R for doing some basic manipulation of strings. In this chapter we will review the following functions:

Function	Description
`nchar()`	number of characters
`tolower()`	convert to lower case
`toupper()`	convert to upper case
`casefold()`	case folding
`chartr()`	character translation
`abbreviate()`	abbreviation
`substring()`	substrings of a character vector
`substr()`	substrings of a character vector

15.3.1 Counting characters

One of the main functions for manipulating character strings is nchar() which counts the number of characters in a string. In other words, nchar() provides the length of a string:

# how many characters?
nchar(c("How", "many", "characters?"))
#> [1]  3  4 11

# how many characters?
nchar("How many characters?")
#> [1] 20

Notice that the white spaces between words in the second example are also counted as characters.

It is important not to confuse nchar() with length(). While the former gives us the number of characters, the latter only gives the number of elements in a vector.

# how many elements?
length(c("How", "many", "characters?"))
#> [1] 3

# how many elements?
length("How many characters?")
#> [1] 1
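
The difference is easy to see with the empty string and the empty character vector from earlier in the chapter. This short sketch also shows that nchar() coerces non-character input (here the number 2024, chosen arbitrarily) before counting:

```r
# the empty string has zero characters (but length 1)
nchar("")
#> [1] 0

# the empty character vector has no elements to count
nchar(character(0))
#> integer(0)

# numbers are converted to character before counting
nchar(2024)
#> [1] 4

# one count per element, including empty strings
nchar(c("one", "", "three"))
#> [1] 3 0 5
```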

15.3.2 Casefolding

R comes with three functions for text casefolding.

tolower()
toupper()
casefold()

The function tolower() converts any upper case characters into lower case:

# to lower case
tolower(c("aLL ChaRacterS in LoweR caSe", "ABCDE"))
#> [1] "all characters in lower case" "abcde"

The opposite function of tolower() is toupper(). As you may guess, this function converts any lower case characters into upper case:

# to upper case
toupper(c("All ChaRacterS in Upper Case", "abcde"))
#> [1] "ALL CHARACTERS IN UPPER CASE" "ABCDE"

The third function for case-folding is casefold() which is a wrapper for both tolower() and toupper(). Its usage has the following form:

casefold(x, upper = FALSE)

By default, casefold() converts all characters to lower case, but you can use the argument upper = TRUE to indicate the opposite (characters in upper case):

# lower case folding
casefold("aLL ChaRacterS in LoweR caSe")
#> [1] "all characters in lower case"

# upper case folding
casefold("All ChaRacterS in Upper Case", upper = TRUE)
#> [1] "ALL CHARACTERS IN UPPER CASE"

I’ve found the case-folding functions to be very helpful when I write functions that take a character input which may be specified in lower or upper case, or perhaps as a mix of both cases. For instance, consider the function temp_convert() that takes a temperature value in degrees Fahrenheit, and a character string indicating the name of the scale to convert to.

temp_convert <- function(deg = 1, to = "celsius") {
  switch(to,
         "celsius" = (deg - 32) * (5/9),
         "kelvin" = (deg + 459.67) * (5/9),
         "reaumur" = (deg - 32) * (4/9),
         "rankine" = deg + 459.67)
}

Here is how you call temp_convert() to convert 10 Fahrenheit degrees into celsius degrees:

temp_convert(deg = 10, to = "celsius")
#> [1] -12.2

temp_convert() works fine when the argument to = 'celsius'. But what happens if you try temp_convert(30, 'Celsius') or temp_convert(30, 'CELSIUS')?

To have a more flexible function temp_convert() you can apply tolower() to the argument to, and in this way guarantee that the string provided by the user is always in lower case:

temp_convert <- function(deg = 1, to = "celsius") {
  switch(tolower(to),
         "celsius" = (deg - 32) * (5/9),
         "kelvin" = (deg + 459.67) * (5/9),
         "reaumur" = (deg - 32) * (4/9),
         "rankine" = deg + 459.67)
}

Now all three of the following calls are equivalent:

temp_convert(30, 'celsius')
temp_convert(30, 'Celsius')
temp_convert(30, 'CELSIUS')
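
One caveat: when none of the names match and there is no default branch, switch() returns NULL invisibly, so a misspelled scale fails silently. A possible variation, sketched below under the hypothetical name temp_convert_strict(), adds an unnamed last argument to switch() that acts as the default and raises an error:

```r
# hypothetical variant of temp_convert() with an explicit error
temp_convert_strict <- function(deg = 1, to = "celsius") {
  switch(tolower(to),
         "celsius" = (deg - 32) * (5/9),
         "kelvin" = (deg + 459.67) * (5/9),
         "reaumur" = (deg - 32) * (4/9),
         "rankine" = deg + 459.67,
         # default branch: reached only when no name matches
         stop("unknown temperature scale: ", to))
}

# upper and lower case give the same result
temp_convert_strict(212, "celsius")
#> [1] 100
temp_convert_strict(212, "CELSIUS")
#> [1] 100

# a misspelled scale now fails loudly instead of returning NULL
# temp_convert_strict(212, "celcius")
```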

15.3.3 Translating characters

There’s also the function chartr() which stands for character translation. chartr() takes three arguments: an old string, a new string, and a character vector x:

chartr(old, new, x)

The way chartr() works is by replacing the characters in old that appear in x by those indicated in new. For example, suppose we want to translate the letter "a" (lower case) with "A" (upper case) in the sentence "This is a boring string":

# replace 'a' by 'A'
chartr("a", "A", "This is a boring string")
#> [1] "This is A boring string"

It is important to note that old and new must have the same number of characters, otherwise you will get a nasty error message like this one:

# incorrect use
chartr("ai", "X", "This is a bad example")
#> Error in chartr("ai", "X", "This is a bad example"): 'old' is longer than 'new'

Here’s a more interesting example with old = "aei" and new = "#!?". This implies that any 'a' in 'x' will be replaced by '#', any 'e' in 'x' will be replaced by '!', and any 'i' in 'x' will be replaced by '?':

# multiple replacements
crazy <- c("Here's to the crazy ones", "The misfits", "The rebels")
chartr("aei", "#!?", crazy)
#> [1] "H!r!'s to th! cr#zy on!s" "Th! m?sf?ts"             
#> [3] "Th! r!b!ls"
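
Because chartr() applies all the translations at once, character by character, it can be used to swap characters. The sketch below uses a made-up vector of binary strings (bits) and flips every "0" to "1" and every "1" to "0" in a single call:

```r
# swap 0s and 1s in one pass
bits <- c("0110", "111")
flipped <- chartr("01", "10", bits)
flipped
#> [1] "1001" "000"
```

Doing the same with two consecutive replacements (first "0" by "1", then "1" by "0") would not work, since the second step would also undo the first one.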

15.3.4 Abbreviating strings

Another useful function for basic manipulation of character strings is abbreviate(). Its usage has the following structure:

abbreviate(names.org, minlength = 4, dot = FALSE, strict = FALSE,
            method = c("left.keep", "both.sides"))

Although there are several arguments, the main parameter is the character vector (names.org) which will contain the names that we want to abbreviate:

# some color names
some_colors <- colors()[1:4]
some_colors
#> [1] "white"         "aliceblue"     "antiquewhite"  "antiquewhite1"

# abbreviate (default usage)
colors1 <- abbreviate(some_colors)
colors1
#>         white     aliceblue  antiquewhite antiquewhite1 
#>        "whit"        "alcb"        "antq"        "ant1"

# abbreviate with 'minlength'
colors2 <- abbreviate(some_colors, minlength = 5)
colors2
#>         white     aliceblue  antiquewhite antiquewhite1 
#>       "white"       "alcbl"       "antqw"       "antq1"

# abbreviate with 'minlength' and 'method'
colors3 <- abbreviate(some_colors, minlength = 3, method = "both.sides")
colors3
#>         white     aliceblue  antiquewhite antiquewhite1 
#>         "wht"         "alc"         "ant"         "an1"
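
As the printed output suggests, abbreviate() returns a named character vector: the names are the original strings and the values are the abbreviations. If you only need the abbreviations, one option is to drop the names with unname(). A small sketch, using two of the color names above (the object names two_colors and abbr are illustrative):

```r
two_colors <- c("white", "aliceblue")
abbr <- abbreviate(two_colors)

# the names hold the original strings
names(abbr)
#> [1] "white"     "aliceblue"

# unname() keeps just the abbreviations
unname(abbr)
#> [1] "whit" "alcb"
```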

A common use for abbreviate() is when plotting names of objects or variables in a graphic. I will use the built-in data set mtcars to show you a simple example with a scatterplot between variables mpg and disp:

plot(mtcars$mpg, mtcars$disp, type = "n")
text(mtcars$mpg, mtcars$disp, rownames(mtcars))

The names of the cars are all over the plot. In this situation you may want to consider using abbreviate() to shrink the names of the cars and produce a less “crowded” plot:

plot(mtcars$mpg, mtcars$disp, type = "n")
text(mtcars$mpg, mtcars$disp, abbreviate(rownames(mtcars)))

15.3.5 Replacing strings

One common operation when working with strings is the extraction and replacement of some characters. There are various ways in which characters can be replaced. If the replacement is based on the positions that characters occupy in the string, you can use the functions substr() and substring().

substr() extracts or replaces substrings in a character vector. Its usage has the following form:

substr(x, start, stop)

x is a character vector, start indicates the position of the first character to be extracted or replaced, and stop indicates the position of the last character to be extracted or replaced:

# extract 'bcd'
substr("abcdef", 2, 4)
#> [1] "bcd"

# replace 2nd letter with hash symbol
x <- c("may", "the", "force", "be", "with", "you")
substr(x, 2, 2) <- "#"
x
#> [1] "m#y"   "t#e"   "f#rce" "b#"    "w#th"  "y#u"

# replace 2nd and 3rd letters with happy face
y <- c("may", "the", "force", "be", "with", "you")
substr(y, 2, 3) <- ":)"
y
#> [1] "m:)"   "t:)"   "f:)ce" "b:"    "w:)h"  "y:)"

# replacement with recycling
z <- c("may", "the", "force", "be", "with", "you")
substr(z, 2, 3) <- c("#", "```")
z
#> [1] "m#y"   "t``"   "f#rce" "b`"    "w#th"  "y``"
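
A detail worth noticing in these examples is that replacement with substr() never changes the number of characters of the string. The sketch below (with an arbitrary string w) shows that a replacement longer than the selected range gets truncated, while a shorter replacement only overwrites as many characters as it has:

```r
# replacement longer than the range: only "AB" is used
w <- "hello"
substr(w, 2, 3) <- "ABCDE"
w
#> [1] "hABlo"

# replacement shorter than the range: only one character changes
v <- "hello"
substr(v, 2, 4) <- "Z"
v
#> [1] "hZllo"

# the number of characters stays the same
nchar(c(w, v))
#> [1] 5 5
```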

Closely related to substr() is the function substring() which extracts or replaces substrings in a character vector. Its usage has the following form:

substring(text, first, last = 1000000L)

text is a character vector, first indicates the position of the first character to be extracted or replaced, and last indicates the position of the last character to be extracted or replaced:

# same as 'substr'
substring("ABCDEF", 2, 4)
#> [1] "BCD"
substr("ABCDEF", 2, 4)
#> [1] "BCD"

# extract each letter
substring("ABCDEF", 1:6, 1:6)
#> [1] "A" "B" "C" "D" "E" "F"

# multiple replacement with recycling
text6 <- c("more", "emotions", "are", "better", "than", "less")
substring(text6, 1:3) <- c(" ", "zzz")
text6
#> [1] " ore"     "ezzzions" "ar "      "zzzter"   "t an"     "lezz"
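
Two practical differences between the functions can be sketched as follows. First, because last has a large default value, substring() can extract "from a given position to the end" without specifying where the string ends. Second, substring() recycles a single string across several starting positions, whereas substr() returns a result with the same length as x:

```r
# from the 3rd character to the end
substring("ABCDEF", 3)
#> [1] "CDEF"

# with substr() the end position must be given explicitly
substr("ABCDEF", 3, nchar("ABCDEF"))
#> [1] "CDEF"

# substring() recycles the string over several start positions
substring("ABCDEF", 1:3)
#> [1] "ABCDEF" "BCDEF"  "CDEF"

# substr() returns one result per element of x (here just one)
substr("ABCDEF", 1:3, 6)
#> [1] "ABCDEF"
```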

15.4 Exercises

The following exercises are based on the row names of the data frame USArrests (this data set comes with R).

# a few rows of USArrests
head(USArrests)
#>            Murder Assault UrbanPop Rape
#> Alabama      13.2     236       58 21.2
#> Alaska       10.0     263       48 44.5
#> Arizona       8.1     294       80 31.0
#> Arkansas      8.8     190       50 19.5
#> California    9.0     276       91 40.6
#> Colorado      7.9     204       78 38.7

For convenience, let’s create a character vector states containing the row names of USArrests:

states <- rownames(USArrests)
head(states)
#> [1] "Alabama"    "Alaska"     "Arizona"    "Arkansas"   "California"
#> [6] "Colorado"

Here are some functions that you may need to use in order to answer the exercises listed below.

nchar()
tolower()
toupper()
casefold()
paste()
paste0()
substr()

1) Use nchar() on states to get the number of characters of each state.

2) There are 3 functions to do case-folding: tolower(), toupper(), and casefold(). Apply each function on states to see what happens.

3) Use paste() to form a new vector with the first five states and their number of characters like this:

"Alabama = 7" "Alaska = 6" "Arizona = 7" "Arkansas = 8" "California = 10"

4) Take the first five states, and find out how to use paste()’s argument collapse, in order to get the following output:

"AlabamaAlaskaArizonaArkansasCalifornia"

5) Use substr() on states to shorten the state names using the first three letters, e.g. "Ala" "Ala" "Ari" "Ark" ...

6) How would you shorten the state names using the first letter and the last three letters? For instance: "Aama" "Aska" "Aona" "Asas" ...

7) How would you extract those elements in states that have four characters: e.g. "Iowa" "Ohio" "Utah"?

8) How would you extract those elements in states that have ten characters: e.g. "California" "New Jersey" "New Mexico" "Washington"?

9) Imagine that you need to generate the names of 10 data CSV files. All the files have the same prefix name but each of them has a different number: file1.csv, file2.csv, … , file10.csv. How can you use paste() and paste0() to generate a character vector with these names in R?

10) Consider the following vector letrs which contains various letters randomly generated:

# random vector of letters
set.seed(1)
letrs <- sample(letters, size = 100, replace = TRUE)
head(letrs)
#> [1] "y" "d" "g" "a" "b" "w"

Write code in R to count the number of vowels in vector letrs. Hint: see binary operator %in% (help documentation ?"%in%").
Write code in R to count the number of consonants in vector letrs. Hint: see binary operator %in% (help documentation ?"%in%").