2 Character Strings in R

2.1 Introduction

This chapter introduces you to the basic concepts for creating character vectors and character strings in R. You will also learn how R treats objects containing characters.

In R, a piece of text is represented as a sequence of characters (letters, numbers, and symbols). The data type R provides for storing sequences of characters is character. Formally, the mode of an object that holds character strings in R is "character".

You express character strings by surrounding text within double quotes, or alternatively within single quotes:
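For instance (the sample text is arbitrary and chosen only for illustration):

```r
# a string delimited by double quotes
"a character string using double quotes"

# a string delimited by single quotes
'a character string using single quotes'
```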

The important thing is that you must match the type of quotes that you are using. A starting double quote must have an ending double quote. Likewise, a string with an opening single quote must be closed with a single quote.

Typing bare strings at the console, as in the examples above, is not very useful. Typically, you will create objects or variables that contain strings. For example, you can create a variable named string that stores some text:
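A minimal sketch (the contents of string are an arbitrary choice). The string is created with single quotes, yet it prints with double quotes:

```r
string <- 'a string created with single quotes'
string
#> [1] "a string created with single quotes"
```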

Notice that when you print a character object, R displays it using double quotes (regardless of whether the string was created using single or double quotes). This allows you to quickly identify when an object contains character values.

When writing strings, you can insert single quotes in a string with double quotes, and vice versa:
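A short illustration, with made-up variable names ex1 and ex2:

```r
# single quotes inside a string delimited by double quotes
ex1 <- "The 'R' project for statistical computing"

# double quotes inside a string delimited by single quotes
ex2 <- 'The "R" project for statistical computing'
```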

However, you cannot directly insert single quotes in a string delimited by single quotes, nor can you insert double quotes in a string delimited by double quotes (don’t do this!):

In both cases R will give you an error due to the unexpected presence of either a double quote within double quotes, or a single quote within single quotes.
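One way to see both failures without halting a script is to keep the faulty code as text and hand it to parse(). This is only a sketch: the helper fails() and the sample strings are hypothetical, introduced here for illustration.

```r
# Faulty code stored as text, so the syntax error can be caught
bad_single <- "'This 'is' not valid'"
bad_double <- '"This "is" not valid"'

# hypothetical helper: TRUE when R cannot parse the code
fails <- function(code) {
  inherits(try(parse(text = code), silent = TRUE), "try-error")
}

fails(bad_single)
#> [1] TRUE
fails(bad_double)
#> [1] TRUE
```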

If you really want to include a double quote as part of the string, you need to escape the double quote using a backslash \ before it:
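A small example, using an arbitrary sentence. Note that print() shows the backslashes, while writeLines() shows the string as it would appear as text:

```r
escaped <- "The \"R\" project for statistical computing"

print(escaped)
#> [1] "The \"R\" project for statistical computing"

writeLines(escaped)
#> The "R" project for statistical computing
```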

We will talk more about escaping characters in the following chapters.

2.2 Common use of strings in R

Perhaps the most common uses of character strings in R have to do with:

  • names of files and directories
  • names of elements in data objects
  • text elements displayed in plots and graphs

When you read a file, for instance a data table stored in a CSV file, you typically use read.table() or one of its relatives, such as read.csv() or read.delim(). Assuming that the file dataset.csv is in your working directory:
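With the file in the working directory, the call would simply be read.csv(file = "dataset.csv"). The sketch below first writes a tiny stand-in file to a temporary directory so that it runs on its own; the file contents and the names csv_path and dat are assumptions made for illustration.

```r
# Create a small stand-in for dataset.csv in a temporary directory
csv_path <- file.path(tempdir(), "dataset.csv")
write.csv(data.frame(id = 1:3, label = c("a", "b", "c")),
          file = csv_path, row.names = FALSE)

# The file argument is a character string holding the path
dat <- read.csv(file = csv_path)
dim(dat)
#> [1] 3 2
```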

The main parameter of the function read.csv() is file, which requires a character string with the pathname of the file.

Another example of a basic use of characters is when you assign names to the elements of some data structure in R. For instance, if you want to name the elements of a (numeric) vector, you can use the function names() as follows:
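For example, with an illustrative vector called num_vec:

```r
num_vec <- 1:5

# assign a character name to each element
names(num_vec) <- c('a', 'b', 'c', 'd', 'e')
num_vec
#> a b c d e 
#> 1 2 3 4 5 
```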

Likewise, many of the parameters in the plotting functions require some sort of input string. Below is a hypothetical example of a scatterplot that includes several graphical elements like the main title (main), subtitle (sub), labels for both x-axis and y-axis (xlab, ylab), the name of the color (col), and the symbol for the point character (pch).
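A sketch of such a call, with invented data and placeholder label text; every argument after x and y is a character string:

```r
# made-up data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

plot(x, y,
     main = "Main Title",    # main title
     sub = "Subtitle",       # subtitle
     xlab = "x-axis label",  # label for the x-axis
     ylab = "y-axis label",  # label for the y-axis
     col = "blue",           # name of the color
     pch = "+")              # point character
```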

2.3 Creating Character Strings

Besides the single quotes '' or double quotes "", R provides the function character() to create character vectors. More specifically, character() is the function that creates vector objects of type "character".

When using character() you just have to specify the length of the vector. The output will be a character vector filled with empty strings:
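For example, a vector of five empty strings (the name char_vector is illustrative):

```r
char_vector <- character(5)
char_vector
#> [1] "" "" "" "" ""
```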

When would you use character()? A typical use case is when you want to initialize an empty character vector of a given length. The idea is to create an object that you will modify later with some computation.

As with any other vector, once an empty character vector has been created, you can add new components to it by simply giving it an index value outside its previous range:
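A minimal sketch, assuming a vector named example that starts out empty:

```r
example <- character(0)
example
#> character(0)
length(example)
#> [1] 0

# index 1 is outside the previous range, so the vector grows
example[1] <- "first"
example
#> [1] "first"
length(example)
#> [1] 1
```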

You can add more elements without the need to follow a consecutive index range:
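For instance, starting from a one-element vector (defined here so the snippet runs on its own) and assigning directly to position 4:

```r
example <- "first"

# skip positions 2 and 3
example[4] <- "fourth"
example
#> [1] "first"  NA       NA       "fourth"
length(example)
#> [1] 4
```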

Notice that the vector example went from containing one element to containing four elements without the second and third elements ever being specified. R fills this gap with missing values (NA).

2.3.1 Empty string

The most basic type of string is the empty string produced by consecutive quotation marks: "". Technically, "" is a string with no characters in it, hence the name “empty string”:
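In code, with an illustrative object named empty_str:

```r
empty_str <- ""
empty_str
#> [1] ""
class(empty_str)
#> [1] "character"
```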

2.3.2 Empty character vector

Another basic string structure is the empty character vector produced by the function character() with its argument length = 0:
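In code, with an illustrative object named empty_chr:

```r
empty_chr <- character(0)
empty_chr
#> character(0)
class(empty_chr)
#> [1] "character"
```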

It is important not to confuse the empty character vector character(0) with the empty string ""; one of the main differences between them is that they have different lengths:
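A quick comparison (both objects are defined here so the snippet runs on its own):

```r
empty_str <- ""
empty_chr <- character(0)

length(empty_str)
#> [1] 1
length(empty_chr)
#> [1] 0
```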

Notice that the empty string empty_str has length 1, while the empty character vector empty_chr has length 0.

Also, character(0) is what you get when you have a character vector with one or more elements and you attempt to retrieve the element in position 0:
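A short sketch with an arbitrary two-element vector:

```r
example <- c("first", "second")

# R indexes from 1, so position 0 selects nothing
example[0]
#> character(0)
```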

2.3.3 Function c()

There is also the generic function c() (concatenate or combine) that you can use to create character vectors. Simply pass any number of character elements separated by commas:
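For example, mixing both quote styles in one call (the elements are arbitrary):

```r
string <- c("sun", 'sky', "clouds")
string
#> [1] "sun"    "sky"    "clouds"
```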

Again, notice that you can use single or double quotes to define the character elements inside c().

2.3.4 is.character() and as.character()

Related to character(), R provides two functions: as.character() and is.character(). The first coerces an object to type "character"; the second tests whether an R object is of type "character". For instance, let’s define two objects a and b as follows:
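One possible pair of definitions (the values are assumptions chosen for illustration): a holds a string and b holds a number.

```r
# a character object
a <- "test me"

# a numeric object
b <- 8 + 9
```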

To test if a and b are of type "character" use the function is.character():
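A sketch, assuming a is a string and b is a number (both defined here so the snippet runs on its own):

```r
a <- "test me"
b <- 8 + 9

is.character(a)
#> [1] TRUE
is.character(b)
#> [1] FALSE
```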

You can also use the function class() to get the class of an object:
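For example, with the same illustrative values of a and b:

```r
a <- "test me"
b <- 8 + 9

class(a)
#> [1] "character"
class(b)
#> [1] "numeric"
```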

The function as.character() is a coercing method. For better or worse, R allows you to convert (i.e. coerce) non-character objects into character strings with the function as.character():
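A minimal illustration, converting an assumed numeric object b:

```r
b <- 8 + 9

# coerce the number to a character string
b <- as.character(b)
b
#> [1] "17"
is.character(b)
#> [1] TRUE
```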

2.4 Behavior of R objects with character strings

Vectors are the main, and most basic, type of object in R. All the values in a vector must be of the same mode. This means that any given vector must be unambiguously either logical, numeric, complex, character or raw. In R we say that vectors are atomic structures, with all their elements having the same type or mode.

So what happens when you mix different types of data in a vector?
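A small experiment that combines integers, a real number, and a string (the object name mixed is illustrative):

```r
mixed <- c(1:5, pi, "text")

mode(mixed)
#> [1] "character"
mixed[1]
#> [1] "1"
```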

As you can tell, the resulting vector from combining integers 1:5, the number pi, and some "text" is a vector with all its elements treated as character strings. In other words, when you combine mixed data in vectors, strings will dominate. This means that the mode of the vector will be "character", even if you mix logical values:
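For instance, adding logical values to the same kind of mix (the object name is illustrative):

```r
mixed_lgl <- c(1:5, TRUE, pi, "text", FALSE)

mode(mixed_lgl)
#> [1] "character"
mixed_lgl[6]
#> [1] "TRUE"
```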

In fact, R follows two basic rules of data type coercion. The strictest rule is: if a character string is present in a vector, everything else in the vector will be converted to character strings. The other coercion rule is: if a vector only has logicals and numbers, then logicals will be converted to numbers; TRUE values become 1, and FALSE values become 0.
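A quick sketch of the second rule. The sum() line shows one common way to take advantage of it, namely counting TRUE values:

```r
# logicals mixed with a number become numbers
c(TRUE, FALSE, 3)
#> [1] 1 0 3

# TRUE counts as 1 and FALSE as 0
sum(c(TRUE, TRUE, FALSE))
#> [1] 2
```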

Keeping these rules in mind will save you from many headaches and frustrating moments. Moreover, you can use them in your favor to manipulate data in very useful ways.

Matrices. Matrices behave like vectors when you mix characters and numbers. Again, everything will be treated as characters:
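For example, binding a row of numbers to a row of letters (the matrix name m is illustrative):

```r
m <- rbind(1:3, c("a", "b", "c"))

mode(m)
#> [1] "character"

# the numbers in the first row are now strings
m[1, ]
#> [1] "1" "2" "3"
```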

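Data frames. With data frames, things are a bit different. In R versions before 4.0.0, character strings inside a data frame were converted to factors by default; since R 4.0.0 they are kept as characters unless you pass stringsAsFactors = TRUE:

To turn off the conversion of strings into factors in data.frame() (required only in those older versions, where it was the default), use the argument stringsAsFactors = FALSE:

Even though df1 (created with the conversion to factors) and df2 (created with stringsAsFactors = FALSE) are identically displayed, their structure is different. While df1$letters is stored as a "factor", df2$letters is stored as a "character".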
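The sketch below sets stringsAsFactors explicitly in both calls, so it gives the same result regardless of the R version. The column contents are assumptions made for illustration.

```r
# strings converted to factors
df1 <- data.frame(numbers = 1:5, letters = letters[1:5],
                  stringsAsFactors = TRUE)

# strings kept as characters
df2 <- data.frame(numbers = 1:5, letters = letters[1:5],
                  stringsAsFactors = FALSE)

class(df1$letters)
#> [1] "factor"
class(df2$letters)
#> [1] "character"
```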

Lists. With lists, you can combine any type of data objects. The type of data in each element of the list will maintain its corresponding mode:
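For example, a list with three elements of different modes (the contents are arbitrary):

```r
lst <- list(1:5, letters[1:5], c(TRUE, FALSE))

mode(lst[[1]])
#> [1] "numeric"
mode(lst[[2]])
#> [1] "character"
mode(lst[[3]])
#> [1] "logical"
```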

2.5 The workhorse function paste()

The function paste() is perhaps one of the most important functions that you can use to create and build strings. paste() takes one or more R objects, converts them to "character", and then concatenates (pastes) them to form one or several character strings. Its usage has the following form:

paste(..., sep = " ", collapse = NULL)

The argument ... means that it takes any number of objects. The argument sep is a character string that is used as a separator between the pasted terms. The argument collapse is an optional character string; when it is supplied, the resulting strings are joined into a single string, separated by that value. Here is a simple example with paste():
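A minimal call with three arbitrary words and the default arguments:

```r
paste("I", "love", "R")
#> [1] "I love R"
```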

As you can see, the default separator is a blank space (sep = " "). But you can select another character, for example sep = "-":
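The same arbitrary words, now joined with a hyphen:

```r
paste("I", "love", "R", sep = "-")
#> [1] "I-love-R"
```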

If you give paste() objects of different length, then it will apply a recycling rule. For example, if you paste a single character "X" with the sequence 1:5, and separator sep = ".", this is what you get:
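In code, the single "X" is recycled to match each of the five numbers:

```r
paste("X", 1:5, sep = ".")
#> [1] "X.1" "X.2" "X.3" "X.4" "X.5"
```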

To see the effect of the collapse argument, let’s compare the difference with collapsing and without it:
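A side-by-side sketch with arbitrary inputs:

```r
# without collapse: one string per pair of elements
paste(1:3, c("!", "?", "+"), sep = "")
#> [1] "1!" "2?" "3+"

# with collapse: the three strings are joined into one
paste(1:3, c("!", "?", "+"), sep = "", collapse = "")
#> [1] "1!2?3+"
```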

One of the potential problems with paste() is that it coerces missing values NA into the character "NA":
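For example (the surrounding text is arbitrary), the missing value ends up as ordinary text in the result:

```r
paste("The value is", NA)
#> [1] "The value is NA"
```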

In addition to paste(), there’s also the function paste0(), which is equivalent to

paste(..., sep = "", collapse)
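A short check of that equivalence, using made-up file names:

```r
paste0("file", 1:3, ".csv")
#> [1] "file1.csv" "file2.csv" "file3.csv"

# same result as paste() with an empty separator
identical(paste0("file", 1:3, ".csv"),
          paste("file", 1:3, ".csv", sep = ""))
#> [1] TRUE
```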

2.6 Exercises

  1. What is the difference between the empty character "" and the output of invoking character()?

  2. When you combine logical values, numeric values, and character values in one single vector, what will be the mode of the resulting vector?

  3. Using rep(), how would you obtain the following character vectors:

#> [1] "a" "b" "a" "b" "a" "b" "a" "b"
#> [1] "a" "a" "a" "a" "b" "b" "b" "b"
#> [1] "a" "a" "b" "b" "a" "a" "b" "b"

  4. Given the following vectors go, bears, and bang:
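A plausible set of definitions, inferred from the target strings in this exercise (the exact values are an assumption):

```r
go <- "Go"
bears <- "Bears"
bang <- "!"
```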

how would you use paste() and paste0() to get the following strings:

"Go Bears !"
"GoBears!"
"Go Bears!"
"Go-Bears!"
"Go Bears!!!"
"Go Bears! Go Bears! Go Bears!"