2 Character Strings in R

2.1 Introduction

This chapter introduces you to the basic concepts for creating character vectors and character strings in R. You will also learn how R treats objects containing characters.

2.2 Characters in R

In R, a piece of text is represented as a sequence of characters (letters, numbers, and symbols). The data type R provides for storing sequences of characters is character. Formally, the mode of an object that holds character strings in R is "character".

You express character strings by surrounding text within double quotes:

"a character string using double quotes"

or you can also surround text within single quotes:

'a character string using single quotes'

The important thing is that you must match the type of quotes that your are using. A starting double quote must have an ending double quote. Likewise, a string with an opening single quote must be closed with a single quote.

Typing characters in R like in above examples is not very useful. Typically, you are going to create objects or variables containing some strings. For example, you can create a variable string that stores some string:

string <- 'do more with less'
string
#> [1] "do more with less"

Notice that when you print a character object, R displays it using double quotes (regardless of whether the string was created using single or double quotes). This allows you to quickly identify when an object contains character values.

When writing strings, you can insert single quotes in a string with double quotes, and vice versa:

# single quotes within double quotes
ex1 <- "The 'R' project for statistical computing"

# double quotes within single quotes
ex2 <- 'The "R" project for statistical computing'

However, you cannot directly insert single quotes in a string with single quotes, neither you can insert double quotes in a string with double quotes (Don’t do this!):

ex3 <- "This "is" totally unacceptable"

ex4 <- 'This 'is' absolutely wrong'

In both cases R will give you an error due to the unexpected presence of either a double quote within double quotes, or a single quote within single quotes.

If you really want to include a double quote as part of the string, you need to escape the double quote using a backslash \ before it:

"The \"R\" project for statistical computing"

We will talk more about escaping characters in the following chapters.

2.3 Getting Started with Strings

Perhaps the most common use of character strings in R has to do with:

names of files and directories
names of elements in data objects
text elements displayed in plots and graphs

When you read a file, for instance a data table stored in a csv file, you typically use the read.table() function and friends—e.g. read.csv(), read.delim(). Assuming that the file dataset.csv is in your working directory:

dat <- read.csv(file = 'dataset.csv')

The main parameter for the function read.csv() is file which requires a character string with the pathname of the file.

Another example of a basic use of characters is when you assign names to the elements of some data structure in R. For instance, if you want to name the elements of a (numeric) vector, you can use the function names() as follows:

num_vec <- 1:5
names(num_vec) <- c('uno', 'dos', 'tres', 'cuatro', 'cinco')
num_vec

Likewise, many of the parameters in the plotting functions require some sort of input string. Below is a hypothetical example of a scatterplot that includes several graphical elements like the main title (main), subtitle (sub), labels for both x-axis and y-axis (xlab, ylab), the name of the color (col), and the symbol for the point character (pch).

plot(x, y, 
     main = 'Main Title', 
     sub = 'Subtitle',
     xlab = 'x-axis label', 
     ylab = 'y-axis label',
     col = 'red', 
     pch = 'x')

2.4 Creating Character Strings

Besides the single quotes '' or double quotes "", R provides the function character() to create character vectors. More specifically, character() is the function that creates vector objects of type "character".

When using character() you just have to specify the length of the vector. The output will be a character vector filled of empty strings:

# character vector with 5 empty strings
char_vector <- character(5)
char_vector
#> [1] "" "" "" "" ""

When would you use character()? A typical usage case is when you want to initialize an empty character vector of a given length. The idea is to create an object that you will modify later with some computation.

As with any other vector, once an empty character vector has been created, you can add new components to it by simply giving it an index value outside its previous range:

# another example
example <- character(0)
example
#> character(0)

# check its length
length(example)
#> [1] 0

# add first element
example[1] <- "first"
example
#> [1] "first"

# check its length again
length(example)
#> [1] 1

You can add more elements without the need to follow a consecutive index range:

example[4] <- "fourth"
example
#> [1] "first"  NA       NA       "fourth"
length(example)
#> [1] 4

Notice that the vector example went from containing one-element to contain four-elements without specifying the second and third elements. R fills this gap with missing values NA.

2.4.1 Empty string

The most basic type of string is the empty string produced by consecutive quotation marks: "". Technically, "" is a string with no characters in it, hence the name “empty string”:

# empty string
empty_str <- ""
empty_str
#> [1] ""

# class
class(empty_str)
#> [1] "character"

2.4.2 Empty character vector

Another basic string structure is the empty character vector produced by the function character() and its argument length=0:

# empty character vector
empty_chr <- character(0)
empty_chr
#> character(0)

# class
class(empty_chr)
#> [1] "character"

It is important not to confuse the empty character vector character(0) with the empty string ""; one of the main differences between them is that they have different lengths:

# length of empty string
length(empty_str)
#> [1] 1

# length of empty character vector
length(empty_chr)
#> [1] 0

Notice that the empty string empty_str has length 1, while the empty character vector empty_chr has length 0.

Also, character(0) occurs when you have a character vector with one or more elements, and you attempt to subset the position 0:

string <- c('sun', 'sky', 'clouds')
string
#> [1] "sun"    "sky"    "clouds"

If you try to retrieve the element in position 0 you get:

string[0]
#> character(0)

2.4.3 Function `c()`

There is also the generic function c() (concatenate or combine) that you can use to create character vectors. Simply pass any number of character elements separated by commas:

string <- c('sun', 'sky', 'clouds')
string
#> [1] "sun"    "sky"    "clouds"

Again, notice that you can use single or double quotes to define the character elements inside c()

planets <- c("mercury", 'venus', "mars")
planets
#> [1] "mercury" "venus"   "mars"

2.4.4 `is.character()` and `as.character()`

Related to character() R provides two related functions: as.character() and is.character(). These two functions are methods for coercing objects to type "character", and testing whether an R object is of type "character". For instance, let’s define two objects a and b as follows:

# define two objects 'a' and 'b'
a <- "test me"
b <- 8 + 9

To test if a and b are of type "character" use the function is.character():

# are 'a' and 'b' characters?
is.character(a)
#> [1] TRUE

is.character(b)
#> [1] FALSE

Likewise, you can also use the function class() to get the class of an object:

# classes of 'a' and 'b'
class(a)
#> [1] "character"
class(b)
#> [1] "numeric"

The function as.character() is a coercing method. For better or worse, R allows you to convert (i.e. coerce) non-character objects into character strings with the function as.character():

# converting 'b' as character
b <- as.character(b)
b
#> [1] "17"

2.5 Strings and R Objects

Before continuing our discussion on functions for manipulating strings, we need to talk about some important technicalities. R has five main types of objects to store data: vector, factor, matrix (and array), data.frame, and list. We can use each of those objects to store character strings. However, these objects will behave differently depending on whether we store character data with other types of data. Let’s see how R treats objects with different types of data (e.g. character, numeric, logical).

2.5.1 Behavior of R objects with character strings

Vectors. The most basic type of data container are vectors. You can think of vectors as the building blocks for other more complex data structures. R has six types of vectors, technically referred to as atomic types or atomic vectors: logical, integer, double, character, complex, and raw.

Type	Description
`logical`	a vector containing logical values
`integer`	a vector containing integer values
`double`	a vector containing real values
`character`	a vector containing character values
`complex`	a vector containing complex values
`raw`	a vector containing bytes

Vectors are atomic structures because their values must be all of the same type. This means that any given vector must be unambiguously either logical, numeric, complex, character or raw.

So what happens when you mix different types of data in a vector?

# vector with numbers and characters
c(1:5, pi, "text")
#> [1] "1"                "2"                "3"                "4"               
#> [5] "5"                "3.14159265358979" "text"

As you can tell, the resulting vector from combining integers 1:5, the number pi, and some "text" is a vector with all its elements treated as character strings. In other words, when you combine mixed data in vectors, strings will dominate. This means that the mode of the vector will be "character", even if you mix logical values:

# vector with numbers, logicals, and characters
c(1:5, TRUE, pi, "text", FALSE)
#> [1] "1"                "2"                "3"                "4"               
#> [5] "5"                "TRUE"             "3.14159265358979" "text"            
#> [9] "FALSE"

In fact, R follows two basic rules of data types coercion. The most strict rule is: if a character string is present in a vector, everything else in the vector will be converted to character strings. The other coercing rule is: if a vector only has logicals and numbers, then logicals will be converted to numbers; TRUE values become 1, and FALSE values become 0.

Keeping these rules in mind will save you from many headaches and frustrating moments. Moreover, you can use them in your favor to manipulate data in very useful ways.

Matrices. The same behavior of vectors happens when you mix characters and numbers in matrices. Again, everything will be treated as characters:

# matrix with numbers and characters
rbind(1:5, letters[1:5])
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,] "1"  "2"  "3"  "4"  "5" 
#> [2,] "a"  "b"  "c"  "d"  "e"

Data frames. With data frames, things are a bit different. By default, character strings inside a data frame will be converted to factors:

# data frame with numbers and characters
df1 = data.frame(numbers=1:5, letters=letters[1:5])
df1
#>   numbers letters
#> 1       1       a
#> 2       2       b
#> 3       3       c
#> 4       4       d
#> 5       5       e
# examine the data frame structure
str(df1)
#> 'data.frame':    5 obs. of  2 variables:
#>  $ numbers: int  1 2 3 4 5
#>  $ letters: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5

To turn-off the data.frame()’s default behavior of converting strings into factors, use the argument stringsAsFactors = FALSE:

# data frame with numbers and characters
df2 <- data.frame(
  numbers = 1:5, 
  letters = letters[1:5], 
  stringsAsFactors = FALSE)

df2
#>   numbers letters
#> 1       1       a
#> 2       2       b
#> 3       3       c
#> 4       4       d
#> 5       5       e
# examine the data frame structure
str(df2)
#> 'data.frame':    5 obs. of  2 variables:
#>  $ numbers: int  1 2 3 4 5
#>  $ letters: chr  "a" "b" "c" "d" ...

Even though df1 and df2 are identically displayed, their structure is different. While df1$letters is stored as a "factor", df2$letters is stored as a "character".

Lists. With lists, you can combine any type of data objects. The type of data in each element of the list will maintain its corresponding mode:

# list with elements of different mode
list(1:5, letters[1:5], rnorm(5))
#> [[1]]
#> [1] 1 2 3 4 5
#> 
#> [[2]]
#> [1] "a" "b" "c" "d" "e"
#> 
#> [[3]]
#> [1] -0.507  0.192 -1.172 -2.088 -2.106

2.6 The Workhorse Function `paste()`

The function paste() is perhaps one of the most important functions that you can use to create and build strings. paste() takes one or more R objects, converts them to "character", and then it concatenates (pastes) them to form one or several character strings. Its usage has the following form:

paste(..., sep = " ", collapse = NULL)

The argument ... means that it takes any number of objects. The argument sep is a character string that is used as a separator. The argument collapse is an optional string to indicate if you want all the terms to be collapsed into a single string. Here is a simple example with paste():

# paste
PI <- paste("The life of", pi)

PI
#> [1] "The life of 3.14159265358979"

As you can see, the default separator is a blank space (sep = " "). But you can select another character, for example sep = "-":

# paste
IloveR <- paste("I", "love", "R", sep = "-")

IloveR
#> [1] "I-love-R"

If you give paste() objects of different length, then it will apply a recycling rule. For example, if you paste a single character "X" with the sequence 1:5, and separator sep = ".", this is what you get:

# paste with objects of different lengths
paste("X", 1:5, sep = ".")
#> [1] "X.1" "X.2" "X.3" "X.4" "X.5"

To see the effect of the collapse argument, let’s compare the difference with collapsing and without it:

# paste with collapsing
paste(1:3, c("!","?","+"), sep = '', collapse = "")
#> [1] "1!2?3+"

# paste without collapsing
paste(1:3, c("!","?","+"), sep = '')
#> [1] "1!" "2?" "3+"

One of the potential problems with paste() is that it coerces missing values NA into the character "NA":

# with missing values NA
evalue <- paste("the value of 'e' is", exp(1), NA)

evalue
#> [1] "the value of 'e' is 2.71828182845905 NA"

In addition to paste(), there’s also the function paste0() which is the equivalent of

paste(..., sep = "", collapse)

# collapsing with paste0
paste0("let's", "collapse", "all", "these", "words")
#> [1] "let'scollapseallthesewords"

2.7 Getting Text into R

We’ve seen how to express character strings using single quotes '' or double quotes "". But we also need to discuss how to get text into R, that is, how to import and read files that contain character strings. So, how do we get text into R? Well, it basically depends on the type-format of the files we want to read.

We will describe two general situations. One in which the content of the file can be represented in tabular format (i.e. rows and columns). The other one when the content does not have a tabular structure. In this second case, we have characters that are in an unstructured form (i.e. just lines of strings) or at least in a non-tabular format such as html, xml, or other markup language format.

Another function is scan() which allows us to read data in several formats. Usually we use scan() to parse R scripts, but we can also use to import text (characters)

2.7.1 Reading tables

If the data we want to import is in some tabular format (i.e. cells and columns) we can use the set of functions to read tables like read.table() and its sister functions, e.g. read.csv(), read.delim(), read.fwf(). These functions read a file in table format and create a data frame from it, with rows corresponding to cases, and columns corresponding to fields in the file.

Function	Description
`read.table()`	main function to read file in table format
`read.csv()`	reads csv files separated by a comma `","`
`read.csv2()`	reads csv files separated by a semicolon `";"`
`read.delim()`	reads files separated by tabs `"\t"`
`read.delim2()`	similar to `read.delim()`
`read.fwf()`	read fixed width format files

A word of caution about the built-in functions to read data tables: by default they all convert characters into R factors. This means that if there is a column with characters, R will treat this data as qualitative variable. To turn off this behavior, we need to specify the argument stringsAsFactors = FALSE. In this way, all the characters in the imported file will be kept as characters once we read them in R.

Let’s see a simple example reading a file from the Australian radio broadcaster ABC (http://www.abc.net.au/radio/). In particular, we’ll read a csv file that contains data from ABC’s radio stations. Such file is located at:

http://www.abc.net.au/local/data/public/stations/abc-local-radio.csv

To import the file abc-local-radio.csv, we can use either read.table() or read.csv() (just choose the right parameters). Here’s the code to read the file with read.table():

# assembling url
abc <- "http://www.abc.net.au/"
radios <- "local/data/public/stations/abc-local-radio.csv"
abc_radiosl <- paste0(abc, radios)

# read data from URL
radio <- read.table(
  file = abc_radios, 
  header = TRUE, 
  sep = ',', 
  stringsAsFactors = FALSE)

In this case, the location of the file is defined in the object abc which is the first argument passed to read.table(). Then we choose other arguments such as header = TRUE, sep = ",", and stringsAsFactors = FALSE. The argument header = TRUE indicates that the first row of the file contains the names of the columns. The separator (a comma) is specifcied by sep = ",". And finally, to keep the character strings in the file as "character" in the data frame, we use stringsAsFactors = FALSE.

If everything went fine during the file reading operation, the next thing to do is to chek the size of the created data frame using dim():

# size of table in 'radio'
dim(radio)
#> [1] 53 18

Notice that the data frame radio is a table with 53 rows and 18 columns. If we examine ths structure with str() we will get information of each column. The argument vec.len = 1 indicates that we just want the first element of each variable to be displayed:

# structure of columns
str(radio, vec.len = 1)
#> 'data.frame':    53 obs. of  18 variables:
#>  $ State           : chr  "QLD" ...
#>  $ Website.URL     : chr  "http://www.abc.net.au/brisbane/" ...
#>  $ Station         : chr  "ABC Radio Brisbane" ...
#>  $ Town            : chr  " Brisbane " ...
#>  $ Latitude        : num  -27.5 ...
#>  $ Longitude       : num  153 ...
#>  $ Talkback.number : chr  "1300 222 612" ...
#>  $ Enquiries.number: chr  "07 3377 5222" ...
#>  $ Fax.number      : chr  "07 3377 5612" ...
#>  $ Sms.number      : chr  "0467 922 612" ...
#>  $ Street.number   : chr  "114 Grey Street" ...
#>  $ Street.suburb   : chr  "South Brisbane" ...
#>  $ Street.postcode : int  4101 4700 ...
#>  $ PO.box          : chr  "GPO Box 9994" ...
#>  $ PO.suburb       : chr  "Brisbane" ...
#>  $ PO.postcode     : int  4001 4700 ...
#>  $ Twitter         : chr  " abcbrisbane" ...
#>  $ Facebook        : chr  " https://www.facebook.com/abcinbrisbane" ...

As you can tell, most of the 18 variables are in "character" mode. Only $Latitude, $Longitude, $Street.postcode and $PO.postcode have a different mode.

2.7.2 Reading raw text

If what we want is to import text as is (i.e. we want to read raw text) then we need to use the function readLines(). This function is the one we should use if we don’t want R to assume that the data is in any particular form.

The way we work with readLines() is by passing it the name of a file or the name of a URL that we want to read. The output is a character vector with one element for each line of the file or url. The produced vector will contain as many elements as lines in the read file.

Let’s see how to read a text file. For this example we will use a text file from the site TEXTFILES.COM (by Jason Scott) http://www.textfiles.com/music/ . This site contains a section of music related text files. For demonstration purposes let’s consider the “Top 105.3 songs of 1991” according to the “Modern Rock” radio station KITS San Francisco. The corresponding txt file is located at:

http://www.textfiles.com/music/ktop100.txt.

# read 'ktop100.txt' file
top105 <- readLines("http://www.textfiles.com/music/ktop100.txt")

readLines() creates a character vector in which each element represents the lines of the URL we are trying to read. To know how many elements (i.e how many lines) are in top105 we can use the function length(). To inspect the first elements (i.e. first lines of the text file) use head()

# how many lines
length(top105)
#> [1] 123

# inspecting first elements
head(top105)
#> [1] "From: ed@wente.llnl.gov (Ed Suranyi)"
#> [2] "Date: 12 Jan 92 21:23:55 GMT"        
#> [3] "Newsgroups: rec.music.misc"          
#> [4] "Subject: KITS' year end countdown"   
#> [5] ""                                    
#> [6] ""

Looking at the output provided by head() the first four lines contain some information about the subject of the email (KITS’ year end countdown). The fifth and sixth lines are empty lines. If we inspect the next few lines, we’ll see that the list of songs in the top100 starts at line number 11.

# top 5 songs
top105[11:15]
#> [1] "1. NIRVANA                      SMELLS LIKE TEEN SPIRIT"
#> [2] "2. EMF                          UNBELIEVABLE"           
#> [3] "3. R.E.M.                       LOSING MY RELIGION"     
#> [4] "4. SIOUXSIE & THE BANSHEES      KISS THEM FOR ME"       
#> [5] "5. B.A.D. II                    RUSH"

Each line has the ranking number, followed by a dot, followed by a blank space, then the name of the artist/group, followed by a bunch of white spaces, and then the title of the song. As you can see, the number one hit of 1991 was “Smells like teen spirit” by Nirvana.

What about the last songs in KITS’ ranking? In order to get the answer we can use the tail() function to inspect the last n = 10 elements of the file:

# inspecting last 10 elements
tail(top105, n = 10)
#>  [1] "101. SMASHING PUMPKINS          SIVA"                       
#>  [2] "102. ELVIS COSTELLO             OTHER SIDE OF ..."          
#>  [3] "103. SEERS                      PSYCHE OUT"                 
#>  [4] "104. THRILL KILL CULT           SEX ON WHEELZ"              
#>  [5] "105. MATTHEW SWEET              I'VE BEEN WAITING"          
#>  [6] "105.3  LATOUR                   PEOPLE ARE STILL HAVING SEX"
#>  [7] ""                                                           
#>  [8] "Ed"                                                         
#>  [9] "ed@wente.llnl.gov"                                          
#> [10] ""

Note that the last four lines don’t contain information about the songs. Moreover, the number of songs does not stop at 105. In fact the ranking goes till 106 songs (last number being 105.3)

We’ll stop here the discussion of this chapter. However, it is importat to keep in mind that text files come in a great variety of forms, shapes, sizes, and flavors. For more information on how to import files in R, the authoritative document is the guide on R Data Import/Export (by the R Core Team) available at:

http://cran.r-project.org/doc/manuals/r-release/R-data.html