7 Stringr Basics

7.1 Introduction

So far we’ve seen the various functions R provides to perform basic string processing and manipulations of "character" data. Most of the time these functions are enough and they will allow you to get your job done. However, they have some drawbacks. For instance, consider the following example:
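A minimal sketch of this drawback, using a made-up vector text. Recent R releases return NA for missing strings by default, so the call below sets keepNA = FALSE to reproduce the older counting behavior:

```r
# Hypothetical example vector with one missing value
text <- c("one", "two", "three", NA, "five")

# keepNA = FALSE makes nchar() count NA as the two printed characters "NA"
counts <- nchar(text, keepNA = FALSE)
counts
# [1] 3 3 5 2 4
```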

As you can tell, nchar() gives NA a value of 2, as if it were a string formed by two characters. (This was the default behavior before R 3.3.0; newer versions return NA for missing strings unless you set keepNA = FALSE.) Perhaps this may be acceptable in some cases, but taking into account all the operations in R, it would be better to leave NA as a missing value, instead of treating it as a string of two characters.

Another awkward example can be found with paste(). The default separator is a blank space, which more often than not is what you want to use. But that’s secondary. The really annoying thing is when you want to paste things that include zero length arguments (e.g. NULL, character(0)). How does paste() behave in those cases? See below:
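A small sketch of both situations. The strings are only illustrative; the last call is the ugly one:

```r
# Regular use: arguments are joined with a blank space
fine <- paste("University", "of", "California", "Berkeley")
fine
# [1] "University of California Berkeley"

# The ugly one: NULL and character(0) each behave like "",
# so every one of them still contributes a separator
ugly <- paste("The", NULL, character(0), "end")
ugly
# [1] "The   end"
```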

Notice the output from the last example (the ugly one). The objects NULL and character(0) have zero length, yet when included inside paste() they are treated as an empty string "". Wouldn’t it be good if paste() removed zero length arguments? Sadly, there’s nothing we can do to change nchar() and paste(). But fear not. There is a very nice package that solves these problems and provides several functions for carrying out consistent string processing.

7.2 Package stringr

Thanks to Hadley Wickham, we have the package stringr that adds more functionality to the base functions for handling strings in R.

http://cran.r-project.org/web/packages/stringr/index.html

According to the description of the package:

“is a set of simple wrappers that make R’s string functions more consistent, simpler and easier to use. It does this by ensuring that: function and argument names (and positions) are consistent, all functions deal with NA’s and zero length character appropriately, and the output data structures from each function matches the input data structures of other functions.”

To install "stringr" use the function install.packages(). Once installed, load it to your current session with library():
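In practice this amounts to something like the following (the installation line is commented out so the snippet can be re-run without reinstalling):

```r
# Install once from CRAN
# install.packages("stringr")

# Load the package in the current session
library(stringr)
```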

7.3 Basic String Operations

"stringr" provides functions for both 1) basic manipulations and 2) for regular expression operations. In this chapter we cover those functions that have to do with basic manipulations.

The following table contains the stringr functions for basic string operations:

Function      Description                               Similar to
str_c()       string concatenation                      paste()
str_length()  number of characters                      nchar()
str_sub()     extracts substrings                       substring()
str_dup()     duplicates characters                     none
str_trim()    removes leading and trailing whitespace   none
str_pad()     pads a string                             none
str_wrap()    wraps a string paragraph                  strwrap()

Notice that all functions in stringr start with "str_" followed by a term associated with the task they perform. For example, str_length() gives you the number (i.e. length) of characters in a string. In addition, some functions are designed to provide a better alternative to already existing functions. This is the case of str_length(), which is intended to be a substitute for nchar(). Other functions, however, don’t have a corresponding alternative, such as str_dup(), which allows you to duplicate characters.

7.3.1 Concatenating with str_c()

Let’s begin with str_c(). This function is equivalent to paste(), but instead of using a blank space as the default separator, str_c() uses the empty string "", which is a more common separator when pasting strings:
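A short sketch comparing the two defaults, with illustrative strings. Only NULL is used as the zero-length argument here, because the handling of character(0) appears to differ between stringr versions:

```r
library(stringr)

# paste() separates the pieces with a blank space by default
with_paste <- paste("May", "The", "Force", "Be", "With", "You")
with_paste
# [1] "May The Force Be With You"

# str_c() uses the empty string as its default separator
with_str_c <- str_c("May", "The", "Force", "Be", "With", "You")
with_str_c
# [1] "MayTheForceBeWithYou"

# A NULL argument is dropped instead of being treated as ""
with_null <- str_c("May", "The", "Force", NULL, "Be", "With", "You")
with_null
# [1] "MayTheForceBeWithYou"
```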

Observe another major difference between str_c() and paste(): zero length arguments like NULL and character(0) are silently removed by str_c().

If you want to change the default separator, you can do that as usual by specifying the argument sep:
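For instance, a sketch with an underscore as the separator (the strings are illustrative):

```r
library(stringr)

# Join the pieces with "_" instead of the default ""
joined <- str_c("May", "The", "Force", sep = "_")
joined
# [1] "May_The_Force"
```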

7.3.2 Number of characters with str_length()

As we’ve mentioned before, the function str_length() is equivalent to nchar(). Both functions return the number of characters in a string, that is, the length of a string (do not confuse it with the length() of a vector). Compared to nchar(), str_length() has a more consistent behavior when dealing with NA values. Instead of giving NA a length of 2, str_length() preserves missing values just as NAs.
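A minimal sketch with a made-up vector that contains a missing value:

```r
library(stringr)

# Hypothetical example vector with one NA
some_text <- c("one", "two", "three", NA, "five")

# The missing value stays missing instead of being counted as 2
n_chars <- str_length(some_text)
n_chars
# [1]  3  3  5 NA  4
```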

In addition, str_length() has the nice feature that it converts factors to characters, something that nchar() is not able to handle:
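The sketch below illustrates the difference with a made-up factor. The call to nchar() is wrapped in tryCatch() so the script keeps running when nchar() raises its error:

```r
library(stringr)

# Hypothetical factor with labels "good" and "bad"
some_factor <- factor(c(1, 1, 1, 2, 2, 2), labels = c("good", "bad"))

# str_length() works on the labels as if they were characters
factor_lengths <- str_length(some_factor)
factor_lengths
# [1] 4 4 4 3 3 3

# nchar() refuses factors; we capture the error instead of stopping
nchar_result <- tryCatch(nchar(some_factor), error = function(e) "error")
nchar_result
# [1] "error"
```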

7.3.3 Substring with str_sub()

To extract substrings from a character vector stringr provides str_sub() which is equivalent to substring(). The function str_sub() has the following usage form:

str_sub(string, start = 1L, end = -1L)

The three arguments in the function are: a string vector, a start value indicating the position of the first character in substring, and an end value indicating the position of the last character. Here’s a simple example with a single string in which characters from 1 to 5 are extracted:
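A sketch of such an extraction, using an illustrative string:

```r
library(stringr)

lorem <- "Lorem Ipsum"

# Characters in positions 1 to 5
first_five <- str_sub(lorem, start = 1, end = 5)
first_five
# [1] "Lorem"
```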

An interesting feature of str_sub() is its ability to work with negative indices in the start and end positions. When we use a negative position, str_sub() counts backwards from the last character:
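For example, with an illustrative string, position -1 is the last character and -5 is the fifth from the end:

```r
library(stringr)

lorem <- "Lorem Ipsum"

# From the fifth-to-last character up to the last one
last_five <- str_sub(lorem, start = -5, end = -1)
last_five
# [1] "Ipsum"
```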

Similar to substring(), we can also give str_sub() a set of positions which will be recycled over the string. But even better, we can give str_sub() a negative sequence, something that substring() ignores:
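A small sketch with a three-letter string chosen only to keep the output short:

```r
library(stringr)

x <- "abc"

# A vector of start positions is recycled over the single string
forward <- str_sub(x, 1:3)
forward
# [1] "abc" "bc"  "c"

# Negative start positions count from the end of the string
backward <- str_sub(x, -(1:3))
backward
# [1] "c"   "bc"  "abc"

# substring() treats negative starts as position 1
ignored <- substring(x, -(1:3))
ignored
# [1] "abc" "abc" "abc"
```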

We can use str_sub() not only for extracting substrings but also for replacing substrings:
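Replacement uses the assignment form of str_sub(). A sketch with illustrative strings:

```r
library(stringr)

lorem <- "Lorem Ipsum"

# Replace the characters in positions 1 to 5 with a new substring
str_sub(lorem, 1, 5) <- "Nullam"
lorem
# [1] "Nullam Ipsum"
```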

7.3.4 Duplication with str_dup()

A common operation when handling characters is duplication. The problem is that R doesn’t have a specific function for that purpose. But stringr does: str_dup() duplicates and concatenates strings within a character vector. Its usage requires two arguments:

str_dup(string, times)

The first input is the string that you want to duplicate. The second input, times, is the number of times to duplicate each string:
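A sketch with made-up strings, showing a single count, a vector of counts, and a vector of strings:

```r
library(stringr)

# One string, duplicated three times
single <- str_dup("hola", 3)
single
# [1] "holaholahola"

# One string, duplicated 1, 2 and 3 times
several <- str_dup("adios", 1:3)
several
# [1] "adios"           "adiosadios"      "adiosadiosadios"

# Each element of a character vector duplicated twice
words <- c("lorem", "ipsum")
doubled <- str_dup(words, 2)
doubled
# [1] "loremlorem" "ipsumipsum"
```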

7.3.5 Padding with str_pad()

Another handy function that we can find in stringr is str_pad() for padding a string. Its default usage has the following form:

str_pad(string, width, side = "left", pad = " ")

The idea of str_pad() is to take a string and pad it with leading or trailing characters to a specified total width. The default padding character is a space (pad = " "). Padding on the left (side = "left") makes the returned string appear right-aligned, padding on the right (side = "right") makes it appear left-aligned, and padding on both sides (side = "both") makes it appear centered.

Let’s see some examples:
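The strings and widths below are only illustrative. The "both" example uses an even amount of padding so that the split between the two sides is unambiguous:

```r
library(stringr)

# Default: pad with spaces on the left
padded_left <- str_pad("hola", width = 7)
padded_left
# [1] "   hola"

# Pad with spaces on the right
padded_right <- str_pad("hola", width = 7, side = "right")
padded_right
# [1] "hola   "

# Pad on both sides
padded_both <- str_pad("hola", width = 8, side = "both")
padded_both
# [1] "  hola  "

# Use a different padding character
hashtag <- str_pad("hashtag", width = 8, pad = "#")
hashtag
# [1] "#hashtag"
```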

7.3.6 Wrapping with str_wrap()

The function str_wrap() is similar to strwrap() and can be used to wrap a string into formatted paragraphs. The idea of wrapping a (long) string is to first split it into lines that fit the given width, and then add the specified indentation to each line (indent for the first line, exdent for the following lines). Its default usage has the following form:

str_wrap(string, width = 80, indent = 0, exdent = 0)

For instance, consider the following quote (from Douglas Adams) converted into a paragraph:
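The sketch below assumes the quote in question is the well-known line commonly attributed to Douglas Adams. The name some_quote matches the one used in the text, while the way the pieces are assembled is only one possible choice:

```r
# Quote pieces (assumed wording), collapsed into a single long string
some_quote <- c(
  "I may not have gone",
  "where I intended to go,",
  "but I think I have ended up",
  "where I needed to be"
)
some_quote <- paste(some_quote, collapse = " ")
some_quote
```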

Now, say you want to display the text of some_quote within some pre-specified column width (e.g. width of 30). You can achieve this by applying str_wrap() and setting the argument width = 30:
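A sketch of the wrapping step. The quote is redefined here (with assumed wording) so the snippet runs on its own. The exact line breaks are chosen by str_wrap(), so no output is shown:

```r
library(stringr)

# Assumed wording of the quote, as a single string
some_quote <- paste(
  "I may not have gone where I intended to go,",
  "but I think I have ended up where I needed to be"
)

# Wrap to a target width of 30; line breaks are inserted as "\n"
wrapped <- str_wrap(some_quote, width = 30)

# cat() renders the "\n" characters as actual line breaks
cat(wrapped, "\n")
```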

Besides displaying a (long) paragraph into several lines, you may also wish to add some indentation. Here’s how you can indent the first line, as well as the following lines:
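A sketch with illustrative indentation values (2 spaces for the first line, 3 for the rest). The quote is again redefined, with assumed wording, to keep the snippet self-contained:

```r
library(stringr)

# Assumed wording of the quote, as a single string
some_quote <- paste(
  "I may not have gone where I intended to go,",
  "but I think I have ended up where I needed to be"
)

# indent applies to the first line, exdent to the following lines
indented <- str_wrap(some_quote, width = 30, indent = 2, exdent = 3)
cat(indented, "\n")
```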

7.3.7 Trimming with str_trim()

One of the typical tasks of string processing is that of parsing a text into individual words. Usually, you end up with words that have blank spaces, called whitespaces, on either end of the word. In this situation, you can use the str_trim() function to remove any number of whitespaces at the ends of a string. Its usage requires only two arguments:

str_trim(string, side = "both")

The first input is the string to be trimmed, and the second input indicates the side on which the whitespace will be removed.

Consider the following vector of strings, some of which have whitespaces either on the left, on the right, or on both sides. Here’s what str_trim() would do to them under different settings of side:
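A sketch with a made-up vector bad_text:

```r
library(stringr)

# Hypothetical strings with whitespace on different sides
bad_text <- c("This", " example ", "has several   ", "   whitespaces ")

# Remove whitespace on the left only
trim_left <- str_trim(bad_text, side = "left")
trim_left
# [1] "This"           "example "       "has several   " "whitespaces "

# Remove whitespace on the right only
trim_right <- str_trim(bad_text, side = "right")
trim_right
# [1] "This"           " example"       "has several"    "   whitespaces"

# Remove whitespace on both sides (the default)
trim_both <- str_trim(bad_text, side = "both")
trim_both
# [1] "This"        "example"     "has several" "whitespaces"
```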

7.3.8 Word extraction with word()

We end this chapter describing the word() function that is designed to extract words from a sentence:

word(string, start = 1L, end = start, sep = fixed(" "))

The way in which you use word() is by passing it a string, together with a start position of the first word to extract, and an end position of the last word to extract. By default, the separator sep used between words is a single space.

Let’s see some examples:
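The sentences below are only illustrative. Negative positions count words from the end of each sentence:

```r
library(stringr)

# Hypothetical sentences
change <- c("Be the change", "you want to be")

# First word of each sentence
first_word <- word(change, 1)
first_word
# [1] "Be"  "you"

# Last word of each sentence
last_word <- word(change, -1)
last_word
# [1] "change" "be"

# From the second word to the last word
rest <- word(change, 2, -1)
rest
# [1] "the change" "want to be"
```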

stringr has more functions but we’ll discuss them in the chapters about regular expressions.