6 Basic String Manipulations

6.1 Introduction

In this chapter you will learn about the functions for doing what I call basic manipulations. By “basic” I mean transforming and processing strings in ways that do not require the use of regular expressions. More advanced manipulations involve defining patterns of text and matching such patterns. This is the essential idea behind regular expressions, which is the content of part 5 in this book.

6.2 Basic Manipulations

Besides creating and printing strings, there are a number of very handy functions in R for doing some basic manipulation of strings. In this chapter we will review the following functions:

Function        Description
nchar()         number of characters
tolower()       convert to lower case
toupper()       convert to upper case
casefold()      case folding
chartr()        character translation
abbreviate()    abbreviation
substring()     substrings of a character vector
substr()        substrings of a character vector

6.3 Counting characters

One of the main functions for manipulating character strings is nchar() which counts the number of characters in a string. In other words, nchar() provides the length of a string:
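Here is a small illustration, using strings I made up for the purpose. The first call counts the characters of each element in a vector; the second counts the characters of one longer string:

```r
# Example 1: a vector with three strings, one count per element
word_counts <- nchar(c("How", "many", "characters?"))
word_counts
# 3 4 11

# Example 2: a single string containing two white spaces
sentence_count <- nchar("How many characters?")
sentence_count
# 20
```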

Notice that the white spaces between words in the second example are also counted as characters.

It is important not to confuse nchar() with length(). While the former gives us the number of characters, the latter only gives the number of elements in a vector.
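To make the difference concrete, here is a short comparison on an illustrative vector (the strings are my own choice):

```r
words <- c("How", "many", "characters?")

# length() counts the elements of the vector
n_elements <- length(words)
n_elements
# 3

# nchar() counts the characters of each element
n_chars <- nchar(words)
n_chars
# 3 4 11

# A single string is a vector with just one element
length("How many characters?")
# 1
```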

6.4 Casefolding

R comes with three functions for text casefolding.

  1. tolower()
  2. toupper()
  3. casefold()

The function tolower() converts any upper case characters into lower case:
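A quick example with made-up strings:

```r
# Every upper case letter is converted; everything else is left untouched
lowered <- tolower(c("aLL ChaRacterS in LoweR caSe", "ABCDE"))
lowered
# "all characters in lower case" "abcde"
```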

The opposite function of tolower() is toupper(). As you may guess, this function converts any lower case characters into upper case:
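Again with made-up strings:

```r
# Every lower case letter is converted to upper case
raised <- toupper(c("All ChaRacterS in Upper Case", "abcde"))
raised
# "ALL CHARACTERS IN UPPER CASE" "ABCDE"
```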

The third function for case-folding is casefold(), which is a wrapper for both tolower() and toupper(). Its usage has the following form:

casefold(x, upper = FALSE)

By default, casefold() converts all characters to lower case, but you can use the argument upper = TRUE to indicate the opposite (characters in upper case):
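The two behaviors side by side, on an illustrative string:

```r
# Default behavior: lower case, same result as tolower()
folded_lower <- casefold("aLL ChaRacterS in LoweR caSe")
folded_lower
# "all characters in lower case"

# With upper = TRUE: upper case, same result as toupper()
folded_upper <- casefold("All ChaRacterS in Upper Case", upper = TRUE)
folded_upper
# "ALL CHARACTERS IN UPPER CASE"
```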

I’ve found the case-folding functions to be very helpful when I write functions that take a character input which may be specified in lower or upper case, or perhaps as a mix of both cases. For instance, consider the function temp_convert() that takes a temperature value in degrees Fahrenheit, and a character string indicating the name of the scale to convert to.
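The body of temp_convert() is not reproduced here, so what follows is only a sketch of how such a function might look. The argument name deg, the default values, and the set of supported scales are assumptions; the formulas are the standard conversions from Fahrenheit.

```r
# Hypothetical implementation: convert degrees Fahrenheit to another scale.
# 'deg' is the temperature in Fahrenheit, 'to' is the name of the target scale.
temp_convert <- function(deg = 1, to = "celsius") {
  switch(to,
         "celsius" = (deg - 32) * (5/9),
         "kelvin"  = (deg + 459.67) * (5/9),
         "reaumur" = (deg - 32) * (4/9),
         "rankine" = deg + 459.67)
}
```

Note that switch() matches the string in to exactly, including its case.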

Here is how you call temp_convert() to convert 10 Fahrenheit degrees into celsius degrees:
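The call might look like this. The function body is a hypothetical, shortened version included only so that the block runs on its own:

```r
# Hypothetical temp_convert(), reduced to two scales for brevity
temp_convert <- function(deg = 1, to = "celsius") {
  switch(to,
         "celsius" = (deg - 32) * (5/9),
         "kelvin"  = (deg + 459.67) * (5/9))
}

# 10 degrees Fahrenheit expressed in degrees Celsius
celsius_value <- temp_convert(10, "celsius")
celsius_value
# about -12.22
```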

temp_convert() works fine when the argument to = 'celsius'. But what happens if you try temp_convert(30, 'Celsius') or temp_convert(30, 'CELSIUS')?
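Assuming temp_convert() is built around switch(), as in the hypothetical version below, the answer is that neither call finds a matching scale. Since switch() has no default branch here, the result is NULL rather than a converted temperature:

```r
# Hypothetical temp_convert(), reduced to two scales for brevity
temp_convert <- function(deg = 1, to = "celsius") {
  switch(to,
         "celsius" = (deg - 32) * (5/9),
         "kelvin"  = (deg + 459.67) * (5/9))
}

# Neither "Celsius" nor "CELSIUS" matches the lower case name "celsius"
mixed_case <- temp_convert(30, "Celsius")
upper_case <- temp_convert(30, "CELSIUS")

is.null(mixed_case)
# TRUE
is.null(upper_case)
# TRUE
```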

To have a more flexible function temp_convert() you can apply tolower() to the argument to, and in this way guarantee that the string provided by the user is always in lower case:
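A sketch of the more flexible version, under the same assumptions as before (hypothetical argument name deg, standard conversion formulas):

```r
# Hypothetical, more flexible temp_convert()
temp_convert <- function(deg = 1, to = "celsius") {
  # Normalize the scale name so that its case no longer matters
  to <- tolower(to)
  switch(to,
         "celsius" = (deg - 32) * (5/9),
         "kelvin"  = (deg + 459.67) * (5/9),
         "reaumur" = (deg - 32) * (4/9),
         "rankine" = deg + 459.67)
}
```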

Now all the following three calls are equivalent:
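For instance, with a hypothetical case-insensitive temp_convert() (shortened here so that the block runs on its own):

```r
# Hypothetical case-insensitive temp_convert(), reduced to two scales
temp_convert <- function(deg = 1, to = "celsius") {
  to <- tolower(to)
  switch(to,
         "celsius" = (deg - 32) * (5/9),
         "kelvin"  = (deg + 459.67) * (5/9))
}

# Three spellings of the same scale
lower_call <- temp_convert(30, "celsius")
mixed_call <- temp_convert(30, "Celsius")
upper_call <- temp_convert(30, "CELSIUS")

# All three give the same value, about -1.11
identical(lower_call, mixed_call) && identical(mixed_call, upper_call)
# TRUE
```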

6.5 Translating characters

There’s also the function chartr() which stands for character translation. chartr() takes three arguments: an old string, a new string, and a character vector x:

chartr(old, new, x)

The way chartr() works is by replacing the characters in old that appear in x by those indicated in new, position by position: the first character of old is replaced by the first character of new, and so on. For example, suppose we want to replace the letter "a" (lower case) with "A" (upper case) in the sentence "This is a boring string":
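In code, that translation looks like this:

```r
# Replace every lower case 'a' with an upper case 'A'
translated <- chartr("a", "A", "This is a boring string")
translated
# "This is A boring string"
```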

It is important to note that old and new should have the same number of characters. In particular, if old is longer than new you will get a nasty error message like this one:
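Here is a sketch that provokes the error on purpose, with old two characters long and new only one. I wrap the call in tryCatch() so the script keeps running and the message can be inspected; the strings are made up, and the exact wording of the message may vary between R versions and locales.

```r
# 'old' has two characters but 'new' has only one, so chartr() fails
error_message <- tryCatch(
  chartr("ai", "X", "This is a bad example"),
  error = function(e) conditionMessage(e)
)

# The text of the error raised by chartr()
print(error_message)
```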

Here’s a more interesting example with old = "aei" and new = "#!?". This implies that any 'a' in 'x' will be replaced by '#', any 'e' in 'x' will be replaced by '!', and any 'i' in 'x' will be replaced by '?':
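Applied to an illustrative sentence of my own choosing:

```r
# a -> #, e -> !, i -> ?
scrambled <- chartr("aei", "#!?", "The misfits and the rebels")
scrambled
# "Th! m?sf?ts #nd th! r!b!ls"
```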

6.6 Abbreviating strings

Another useful function for basic manipulation of character strings is abbreviate(). Its usage has the following structure:

abbreviate(names.arg, minlength = 4, dot = FALSE, strict = FALSE,
            method = c("left.kept", "both.sides"))

Although there are several arguments, the main parameter is the character vector (names.arg) which will contain the names that we want to abbreviate:
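A small sketch with made-up names. I do not show the resulting abbreviations because they depend on the abbreviation algorithm; run the code to see them.

```r
day_names <- c("Monday", "Tuesday", "Wednesday")

# Default: abbreviations of at least 4 characters
short_days <- abbreviate(day_names)
short_days

# Ask for longer abbreviations with minlength
longer_days <- abbreviate(day_names, minlength = 6)
longer_days

# By default the result is named after the original strings
names(short_days)
```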

A common use for abbreviate() is when plotting names of objects or variables in a graphic. I will use the built-in data set mtcars to show you a simple example with a scatterplot between variables mpg and disp:
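One possible way to draw such a plot with base graphics, labeling each point with the full name of the car (the plotting choices are mine):

```r
# Full car names are stored as the row names of mtcars
car_names <- rownames(mtcars)

# Empty scatterplot of mpg against disp, then add the names as labels
plot(mtcars$mpg, mtcars$disp, type = "n", xlab = "mpg", ylab = "disp")
text(mtcars$mpg, mtcars$disp, labels = car_names)
```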

The names of the cars are all over the plot. In this situation you may want to consider using abbreviate() to shrink the names of the cars and produce a less “crowded” plot:
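A sketch of the same plot with abbreviated labels; the plotting choices are mine:

```r
# Shorten the car names before using them as labels
short_names <- abbreviate(rownames(mtcars))

plot(mtcars$mpg, mtcars$disp, type = "n", xlab = "mpg", ylab = "disp")
text(mtcars$mpg, mtcars$disp, labels = short_names)
```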

6.7 Replacing strings

One common operation when working with strings is the extraction and replacement of some characters. There are various ways in which characters can be replaced. If the replacement is based on the positions that characters occupy in the string, you can use the functions substr() and substring().

substr() extracts or replaces substrings in a character vector. Its usage has the following form:

substr(x, start, stop)

x is a character vector, start indicates the position of the first character to be extracted or replaced, and stop indicates the position of the last one:
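Two short examples with made-up strings, one for extraction and one for replacement:

```r
# Extraction: characters 2 through 4
word <- "abcdef"
middle <- substr(word, 2, 4)
middle
# "bcd"

# Replacement: overwrite the 2nd character of every element
phrase <- c("may", "the", "force")
substr(phrase, 2, 2) <- "#"
phrase
# "m#y"   "t#e"   "f#rce"
```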

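Closely related to substr() is the function substring(), which also extracts or replaces substrings in a character vector. The main differences are that last has a default value, so it can be omitted, and that first and last can be vectors of positions. Its usage has the following form: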

substring(text, first, last = 1000000L)

text is a character vector, first indicates the position of the first character to be extracted or replaced, and last indicates the position of the last one:
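A few illustrative calls on made-up strings:

```r
letters_text <- "ABCDEF"

# With 'last' omitted, extraction runs to the end of the string
tail_part <- substring(letters_text, 3)
tail_part
# "CDEF"

# Vectors of positions give several substrings at once
windows <- substring(letters_text, 1:3, 3:5)
windows
# "ABC" "BCD" "CDE"

# Replacement works as with substr()
phrase <- c("may", "the", "force")
substring(phrase, 2, 2) <- "#"
phrase
# "m#y"   "t#e"   "f#rce"
```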

6.8 Set Operations

R has dedicated functions for performing set operations on two given vectors. This implies that we can apply functions such as set union, intersection, difference, equality and membership, on "character" vectors.

Function        Description
union()         set union
intersect()     intersection
setdiff()       set difference
setequal()      equal sets
identical()     exact equality
is.element()    is element
%in%            contains
sort()          sorting
6.8.1 Set union with union()

Let’s start our review of set functions with union(). As its name indicates, you can use union() when you want to obtain the elements of the union between two character vectors:
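An illustrative example. The vector names set1 and set2 and their contents are my choice; note that "some" appears twice in set1:

```r
set1 <- c("some", "random", "words", "some")
set2 <- c("some", "many", "none", "few")

# Union of set1 and set2
all_words <- union(set1, set2)
all_words
# "some"   "random" "words"  "many"   "none"   "few"
```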

Notice that union() discards any duplicated values in the provided vectors. In the previous example the word "some" appears twice inside set1 but it appears only once in the union. In fact all the set operation functions will discard any duplicated values.

6.8.2 Set intersection with intersect()

Set intersection is performed with the function intersect(). You can use this function when you wish to get those elements that are common to both vectors:
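For example, with two made-up vectors (the names set3 and set4 are my choice):

```r
set3 <- c("some", "random", "few", "words")
set4 <- c("some", "many", "none", "few")

# Elements present in both vectors
common_words <- intersect(set3, set4)
common_words
# "some" "few"
```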

6.8.3 Set difference with setdiff()

Related to the intersection, you might be interested in getting the difference between two character vectors, that is, the elements of the first vector that do not appear in the second. This can be done with setdiff():
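For example, with two made-up vectors (the names set5 and set6 are my choice). Keep in mind that the order of the arguments matters:

```r
set5 <- c("some", "random", "few", "words")
set6 <- c("some", "many", "none", "few")

# Elements of set5 that are not in set6
only_in_set5 <- setdiff(set5, set6)
only_in_set5
# "random" "words"

# Swapping the arguments gives a different result
only_in_set6 <- setdiff(set6, set5)
only_in_set6
# "many" "none"
```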

6.8.4 Set equality with setequal()

The function setequal() allows you to test the equality of two character vectors. If the vectors contain the same elements, regardless of their order, setequal() returns TRUE (FALSE otherwise):
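An illustrative example with three vectors. Their contents are my choice: set7 and set9 hold the same elements in a different order, while set8 differs.

```r
set7 <- c("some", "random", "strings")
set8 <- c("some", "many", "none", "few")
set9 <- c("strings", "random", "some")

# Different elements
setequal(set7, set8)
# FALSE

# Same elements, different order
setequal(set7, set9)
# TRUE
```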

6.8.5 Exact equality with identical()

Sometimes setequal() is not what we want to use. It might be the case that you want to test whether two vectors are exactly equal (element by element). For instance, testing if set7 is exactly equal to set9. Although both vectors contain the same set of elements, they are not exactly the same vector. Such a test can be performed with the function identical():
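A sketch of that test, assuming set7 and set9 hold the same strings in a different order (the contents below are my choice):

```r
set7 <- c("some", "random", "strings")
set9 <- c("strings", "random", "some")

# A vector is always identical to itself
identical(set7, set7)
# TRUE

# Same elements but in a different order: not identical
identical(set7, set9)
# FALSE
```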

If you consult the help documentation of identical(), you will see that this function is the “safe and reliable way to test two objects for being exactly equal”.

6.8.6 Element contained with is.element()

If you wish to test if an element is contained in a given set of character strings you can do so with is.element():
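For example, with a made-up set of strings (the name set10 is my choice):

```r
set10 <- c("some", "stuff", "to", "play", "with")

# "play" is one of the elements of set10
is.element("play", set10)
# TRUE

# "crazy" is not
is.element("crazy", set10)
# FALSE
```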

Alternatively, you can use the binary operator %in% to test if an element is contained in a given set. The operator %in% returns TRUE if the first operand is contained in the second, and it returns FALSE otherwise:
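The same kind of test written with %in%, on a made-up set of strings (the name set10 is my choice). When the first operand has several elements, you get one logical value per element:

```r
set10 <- c("some", "stuff", "to", "play", "with")

"play" %in% set10
# TRUE

"crazy" %in% set10
# FALSE

# One answer per element of the first operand
membership <- c("play", "crazy") %in% set10
membership
# TRUE FALSE
```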