4 Basic Manipulations with "stringr" Functions

4.1 Introduction

As we saw in the previous chapters, R provides a useful range of functions for basic string processing and manipulations of "character" data. Most of the time these functions are enough to get the job done. Sometimes, however, they behave awkwardly.

As an example, consider the function paste(). The default separator is a blank space, which more often than not is what we want to use. But that’s secondary. The really annoying thing is when we want to paste things that include zero length arguments. How does paste() behave in those cases? See below:

# this works fine
paste("University", "of", "California", "Berkeley")
#> [1] "University of California Berkeley"

# this is weird
paste("University", "of", "California", "Berkeley", NULL)
#> [1] "University of California Berkeley "

# this is ugly
paste("University", "of", "California", "Berkeley", NULL, character(0), 
      "Go Bears!")
#> [1] "University of California Berkeley   Go Bears!"

Notice the output from the last example (the ugly one). The objects NULL and character(0) have zero length, yet when included inside paste() they are treated as an empty string "". Wouldn’t it be good if paste() removed zero-length arguments? Sadly, there’s nothing we can do to change the behavior of base functions such as paste() and nchar(). But fear not. There is a very nice package that solves these problems and provides several functions for carrying out consistent string processing.
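Before turning to that package, it may help to see that the cleanup can be approximated in base R. The following is only a sketch: paste_nonempty() is a hypothetical helper name, not a function from base R or any package, and it simply filters out zero-length arguments before calling paste().

```r
# hypothetical helper (not part of base R): drop zero-length
# arguments, then hand the rest to paste()
paste_nonempty <- function(..., sep = " ") {
  args <- list(...)
  args <- args[lengths(args) > 0]   # keep only non-empty arguments
  do.call(paste, c(args, list(sep = sep)))
}

ugly <- paste_nonempty("University", "of", "California", "Berkeley",
                       NULL, character(0), "Go Bears!")
ugly
#> [1] "University of California Berkeley Go Bears!"
```

Writing a wrapper like this for every base function quickly becomes tedious, which is part of the appeal of a package with consistent behavior.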

4.2 Package "stringr"

Thanks to Hadley Wickham and company, we have the package "stringr" that adds more functionality to the base functions for handling strings in R. According to the description of the package:

http://cran.r-project.org/web/packages/stringr/index.html

"stringr is a set of simple wrappers that make R’s string functions more consistent, simpler and easier to use. It does this by ensuring that: function and argument names (and positions) are consistent, all functions deal with NA’s and zero length character appropriately, and the output data structures from each function matches the input data structures of other functions."

To install "stringr" use the function install.packages(). Once installed, load it to your current session with library():

# installing 'stringr'
install.packages("stringr")

# load 'stringr'
library(stringr)

4.3 Basic String Operations

"stringr" provides functions for both 1) basic manipulations and 2) regular expression operations. In this chapter we cover the functions that deal with basic manipulations.

The following table contains the "stringr" functions for basic string operations:

Function       Description                                Similar to
str_c()        string concatenation                       paste()
str_length()   number of characters                       nchar()
str_sub()      extracts substrings                        substring()
str_dup()      duplicates characters                      none
str_trim()     removes leading and trailing whitespace    none
str_pad()      pads a string                              none
str_wrap()     wraps a string paragraph                   strwrap()

Notice that all functions start with "str_" followed by a term associated with the task they perform. For example, str_length() gives you the number (i.e. length) of characters in a string. In addition, some functions are designed to provide a better alternative to already existing functions. This is the case of str_length(), which is intended to be a substitute for nchar(). Other functions, such as str_dup(), which allows you to duplicate characters, are not modeled on an existing function.

4.3.1 Concatenating with str_c()

Let’s begin with str_c(). This function is the counterpart of paste(), but instead of using a blank space as the default separator, str_c() uses the empty string "", which is a more common separator when pasting strings:

# default usage
str_c("May", "The", "Force", "Be", "With", "You")
#> [1] "MayTheForceBeWithYou"

# removing zero length objects
str_c("May", "The", "Force", NULL, "Be", "With", "You", character(0))
#> [1] "MayTheForceBeWithYou"

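Observe another major difference between str_c() and paste(): zero-length arguments like NULL and character(0) are silently removed by str_c(). (The handling of character(0) may differ between stringr releases, so check the result against your installed version.)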

If you want to change the default separator, you can do that as usual by specifying the argument sep:

# changing separator
str_c("May", "The", "Force", "Be", "With", "You", sep = "_")
#> [1] "May_The_Force_Be_With_You"

# alternative function 'str_glue' (its separator argument is '.sep')
str_glue("May", "The", "Force", "Be", "With", "You", .sep = "_")
#> May_The_Force_Be_With_You

As you can see from the previous examples, an alternative to str_c() is str_glue(), whose separator argument is named .sep rather than sep.
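A visible separator also makes the handling of NULL easier to compare, and it is worth noting that str_glue() is more than a pasting function: it interpolates R expressions written inside braces. The sketch below assumes stringr is installed; the variable names are only illustrative.

```r
library(stringr)

# paste() keeps an empty slot (and its separator) for NULL
with_paste <- paste("a", NULL, "b", sep = "-")
with_paste
#> [1] "a--b"

# str_c() drops NULL before joining
with_str_c <- str_c("a", NULL, "b", sep = "-")
with_str_c
#> [1] "a-b"

# str_c() is vectorized and, like paste(), accepts 'collapse'
pairs <- str_c(c("x", "y"), c("1", "2"), sep = "_")
pairs
#> [1] "x_1" "y_2"
str_c(pairs, collapse = "+")
#> [1] "x_1+y_2"

# str_glue() evaluates the expression inside { }
jedi <- "Luke"
greeting <- str_glue("May the Force be with {jedi}")
greeting
#> May the Force be with Luke
```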

4.3.2 Number of characters with str_length()

As we’ve mentioned before, the function str_length() is the counterpart of nchar(). Both functions return the number of characters in a string, that is, the length of a string (do not confuse it with the length() of a vector). Compared to nchar(), str_length() has a more consistent behavior when dealing with NA values. Older versions of R gave a missing string a length of 2, and nchar(NA) applied to a bare NA still returns 2. In current versions of R, a missing value inside a character vector yields NA, as the output below shows. str_length() always preserves missing values as NAs.

# some text (NA included)
some_text <- c("one", "two", "three", NA, "five")

# compare 'str_length' with 'nchar'
nchar(some_text)
#> [1]  3  3  5 NA  4
str_length(some_text)
#> [1]  3  3  5 NA  4
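The distinction between a bare NA and a missing string is subtle, so a short comparison may help. This sketch reflects the behavior of R 3.3.0 or later and assumes stringr is installed; the variable names are only illustrative.

```r
library(stringr)

# a bare NA is a logical value; nchar() reports 2 for it
n_logical <- nchar(NA)
n_logical
#> [1] 2

# a missing string is reported as NA
n_string <- nchar(NA_character_)
n_string
#> [1] NA

# str_length() returns NA in both cases
s_logical <- str_length(NA)
s_string  <- str_length(NA_character_)
c(s_logical, s_string)
#> [1] NA NA
```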

In addition, str_length() has the nice feature that it converts factors to characters, something that nchar() is not able to handle:

some_factor <- factor(c(1,1,1,2,2,2), labels = c("good", "bad"))
some_factor
#> [1] good good good bad  bad  bad 
#> Levels: good bad

# try 'nchar' on a factor
nchar(some_factor)
#> Error in nchar(some_factor): 'nchar()' requires a character vector

# now compare it with 'str_length'
str_length(some_factor)
#> [1] 4 4 4 3 3 3
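If stringr is not available, the same result can be obtained in base R by converting the factor explicitly. A minimal sketch (factor_lengths is just an illustrative name):

```r
some_factor <- factor(c(1, 1, 1, 2, 2, 2), labels = c("good", "bad"))

# convert to character first, then count
factor_lengths <- nchar(as.character(some_factor))
factor_lengths
#> [1] 4 4 4 3 3 3
```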

4.3.3 Substring with str_sub()

To extract substrings from a character vector stringr provides str_sub() which is equivalent to substring(). The function str_sub() has the following usage form:

str_sub(string, start = 1L, end = -1L)

The three arguments in the function are: a string vector, a start value indicating the position of the first character of the substring, and an end value indicating the position of the last character. Here’s a simple example with a single string in which characters from 1 to 5 are extracted:

lorem <- "Lorem Ipsum"

# apply 'str_sub'
str_sub(lorem, start = 1, end = 5)
#> [1] "Lorem"

# equivalent to 'substring'
substring(lorem, first = 1, last = 5)
#> [1] "Lorem"

# another example
str_sub("adios", 1:3)
#> [1] "adios" "dios"  "ios"

An interesting feature of str_sub() is its ability to work with negative indices in the start and end positions. When we use a negative position, str_sub() counts backwards from the last character:

resto <- c("brasserie", "bistrot", "creperie", "bouchon")

# 'str_sub' with negative positions
str_sub(resto, start = -4, end = -1)
#> [1] "erie" "trot" "erie" "chon"

# compared to substring (useless)
substring(resto, first = -4, last = -1)
#> [1] "" "" "" ""
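To see what the negative positions save us, here is a sketch of the same extraction in base R, where the start position has to be computed from nchar(). It also shows that positive and negative positions can be mixed in str_sub(). The sketch assumes stringr is installed; last_four and inner are illustrative names.

```r
library(stringr)

resto <- c("brasserie", "bistrot", "creperie", "bouchon")

# base R: compute the start of the last four characters by hand
last_four <- substring(resto, first = nchar(resto) - 3, last = nchar(resto))
last_four
#> [1] "erie" "trot" "erie" "chon"

# str_sub: drop the first and the last character
inner <- str_sub(resto, start = 2, end = -2)
inner
#> [1] "rasseri" "istro"   "reperi"  "oucho"
```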

Similar to substring(), we can also give str_sub() a set of positions which will be recycled over the string. But even better, we can give str_sub() a negative sequence, something that substring() ignores:

# extracting sequentially
str_sub(lorem, seq_len(nchar(lorem)))
#>  [1] "Lorem Ipsum" "orem Ipsum"  "rem Ipsum"   "em Ipsum"    "m Ipsum"    
#>  [6] " Ipsum"      "Ipsum"       "psum"        "sum"         "um"         
#> [11] "m"
substring(lorem, seq_len(nchar(lorem)))
#>  [1] "Lorem Ipsum" "orem Ipsum"  "rem Ipsum"   "em Ipsum"    "m Ipsum"    
#>  [6] " Ipsum"      "Ipsum"       "psum"        "sum"         "um"         
#> [11] "m"

# reverse substrings with negative positions
str_sub(lorem, -seq_len(nchar(lorem)))
#>  [1] "m"           "um"          "sum"         "psum"        "Ipsum"      
#>  [6] " Ipsum"      "m Ipsum"     "em Ipsum"    "rem Ipsum"   "orem Ipsum" 
#> [11] "Lorem Ipsum"
substring(lorem, -seq_len(nchar(lorem)))
#>  [1] "Lorem Ipsum" "Lorem Ipsum" "Lorem Ipsum" "Lorem Ipsum" "Lorem Ipsum"
#>  [6] "Lorem Ipsum" "Lorem Ipsum" "Lorem Ipsum" "Lorem Ipsum" "Lorem Ipsum"
#> [11] "Lorem Ipsum"

We can use str_sub() not only for extracting substrings but also for replacing them:

# replacing 'Lorem' with 'Nullam'
lorem <- "Lorem Ipsum"
str_sub(lorem, 1, 5) <- "Nullam"
lorem
#> [1] "Nullam Ipsum"

# replacing with negative positions
lorem <- "Lorem Ipsum"
str_sub(lorem, -1) <- "Nullam"
lorem
#> [1] "Lorem IpsuNullam"

# multiple replacements 
lorem <- "Lorem Ipsum"
str_sub(lorem, c(1,7), c(5,8)) <- c("Nullam", "Enim")
lorem
#> [1] "Nullam Ipsum"  "Lorem Enimsum"
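For comparison, the base R replacement form of substr() never changes the length of the string: it overwrites characters in place and cuts the replacement to fit. A minimal sketch:

```r
lorem <- "Lorem Ipsum"

# only the first five characters of "Nullam" are used
substr(lorem, 1, 5) <- "Nullam"
lorem
#> [1] "Nulla Ipsum"
```

With str_sub(), by contrast, the replacement may be longer or shorter than the part it replaces, as the examples above show.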

4.3.4 Duplication with str_dup()

A common operation when handling characters is duplication. The problem is that R doesn’t have a specific function for that purpose. But stringr does: str_dup() duplicates and concatenates strings within a character vector. Its usage requires two arguments:

str_dup(string, times)

The first input is the string that you want to duplicate. The second input, times, is the number of times to duplicate each string:

# default usage
str_dup("hola", 3)
#> [1] "holaholahola"

# use with different 'times'
str_dup("adios", 1:3)
#> [1] "adios"           "adiosadios"      "adiosadiosadios"

# use with a string vector
words <- c("lorem", "ipsum", "dolor", "sit", "amet")
str_dup(words, 2)
#> [1] "loremlorem" "ipsumipsum" "dolordolor" "sitsit"     "ametamet"

str_dup(words, 1:5)
#> [1] "lorem"                "ipsumipsum"           "dolordolordolor"     
#> [4] "sitsitsitsit"         "ametametametametamet"
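One practical use of str_dup() is drawing a rule whose width depends on another string. The sketch below underlines a title; the variable names are only illustrative and stringr is assumed to be installed.

```r
library(stringr)

title <- "Basic String Operations"

# one "=" per character in the title
underline <- str_dup("=", str_length(title))

cat(title, underline, sep = "\n")
#> Basic String Operations
#> =======================
```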

4.3.5 Padding with str_pad()

Another handy function that we can find in stringr is str_pad() for padding a string. Its default usage has the following form:

str_pad(string, width, side = "left", pad = " ")

The idea of str_pad() is to take a string and pad it with leading or trailing characters to a specified total width. The default padding character is a space (pad = " "), and consequently the returned string will appear to be either right-aligned (side = "left", where the padding goes on the left), left-aligned (side = "right"), or centered (side = "both").

Let’s see some examples:

# default usage
str_pad("hola", width = 7)
#> [1] "   hola"

# pad both sides
str_pad("adios", width = 7, side = "both")
#> [1] " adios "

# left padding with '#'
str_pad("hashtag", width = 8, pad = "#")
#> [1] "#hashtag"

# pad both sides with '-'
str_pad("hashtag", width = 9, side = "both", pad = "-")
#> [1] "-hashtag-"
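A common use of str_pad() is giving identifiers a fixed width with leading zeros. Note also that str_pad() never truncates: a string that is already longer than width is returned unchanged. A sketch, assuming stringr is installed (the ids vector is made up for illustration):

```r
library(stringr)

ids <- c("7", "42", "365")

# left-pad with zeros up to three characters
padded <- str_pad(ids, width = 3, pad = "0")
padded
#> [1] "007" "042" "365"

# a string longer than 'width' is left as it is
too_long <- str_pad("hashtag", width = 3)
too_long
#> [1] "hashtag"
```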

4.3.6 Wrapping with str_wrap()

The function str_wrap() is the counterpart of strwrap(), which can be used to wrap a string to format paragraphs. The idea of wrapping a (long) string is to first split it into lines according to the given width, and then add the specified indentation to each line (first line with indent, following lines with exdent). Its default usage has the following form:

str_wrap(string, width = 80, indent = 0, exdent = 0)

For instance, consider the following quote (from Douglas Adams) converted into a paragraph:

# quote (by Douglas Adams)
some_quote <- c(
  "I may not have gone",
  "where I intended to go,", 
  "but I think I have ended up",
  "where I needed to be")

# some_quote in a single paragraph
some_quote <- paste(some_quote, collapse = " ")

Now, say you want to display the text of some_quote within some pre-specified column width (e.g. width of 30). You can achieve this by applying str_wrap() and setting the argument width = 30:

# display paragraph with width=30
cat(str_wrap(some_quote, width = 30))
#> I may not have gone where I
#> intended to go, but I think I
#> have ended up where I needed
#> to be

Besides splitting a (long) paragraph across several lines, you may also wish to add some indentation. Here’s how you can indent the first line, as well as the following lines:

# display paragraph with first line indentation of 2
cat(str_wrap(some_quote, width = 30, indent = 2), "\n")
#>   I may not have gone where I
#> intended to go, but I think I
#> have ended up where I needed
#> to be

# display paragraph with following lines indentation of 3
cat(str_wrap(some_quote, width = 30, exdent = 3), "\n")
#> I may not have gone where I
#>    intended to go, but I think I
#>    have ended up where I needed
#>    to be
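One difference from strwrap() is worth spelling out: strwrap() returns one vector element per line, whereas str_wrap() returns a single string with newline characters inserted, which is why the examples above print it with cat(). A sketch, assuming stringr is installed; the variable names are only illustrative.

```r
library(stringr)

some_quote <- paste("I may not have gone where I intended to go,",
                    "but I think I have ended up where I needed to be")

wrapped_base    <- strwrap(some_quote, width = 30)
wrapped_stringr <- str_wrap(some_quote, width = 30)

# strwrap(): several elements, one per line
length(wrapped_base) > 1
#> [1] TRUE

# str_wrap(): one element containing "\n"
length(wrapped_stringr)
#> [1] 1
grepl("\n", wrapped_stringr, fixed = TRUE)
#> [1] TRUE
```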

4.3.7 Trimming with str_trim()

One of the typical tasks of string processing is that of parsing a text into individual words. Usually, you end up with words that have blank spaces, called whitespaces, on either end of the word. In this situation, you can use the str_trim() function to remove any number of whitespaces at the ends of a string. Its usage requires only two arguments:

str_trim(string, side = "both")

The first input is the string to be trimmed, and the second input indicates the side on which the whitespace will be removed.

Consider the following vector of strings, some of which have whitespaces either on the left, on the right, or on both sides. Here’s what str_trim() would do to them under different settings of side:

# text with whitespaces
bad_text <- c("This", " example ", "has several   ", "   whitespaces ")

# remove whitespaces on the left side
str_trim(bad_text, side = "left")
#> [1] "This"           "example "       "has several   " "whitespaces "

# remove whitespaces on the right side
str_trim(bad_text, side = "right")
#> [1] "This"           " example"       "has several"    "   whitespaces"

# remove whitespaces on both sides
str_trim(bad_text, side = "both")
#> [1] "This"        "example"     "has several" "whitespaces"
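str_trim() only touches the ends of a string. If runs of spaces inside the string should be collapsed as well, stringr also offers str_squish(), which is not covered in this chapter. A brief sketch, assuming a stringr version that includes str_squish(); the messy string is made up for illustration.

```r
library(stringr)

messy <- "  has   several    spaces  "

# trimming leaves the inner runs of spaces alone
trimmed <- str_trim(messy)
trimmed
#> [1] "has   several    spaces"

# squishing also reduces inner runs to a single space
squished <- str_squish(messy)
squished
#> [1] "has several spaces"
```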

4.3.8 Word extraction with word()

We end this chapter by describing the word() function, which is designed to extract words from a sentence:

word(string, start = 1L, end = start, sep = fixed(" "))

The way in which you use word() is by passing it a string, together with a start position of the first word to extract, and an end position of the last word to extract. By default, the separator sep used between words is a single space.

Let’s see some examples:

# some sentence
change <- c("Be the change", "you want to be")

# extract first word
word(change, 1)
#> [1] "Be"  "you"

# extract second word
word(change, 2)
#> [1] "the"  "want"

# extract last word
word(change, -1)
#> [1] "change" "be"

# extract everything but the first word
word(change, 2, -1)
#> [1] "the change" "want to be"
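The sep argument makes word() useful for delimited values that are not sentences, and the start position can itself be a vector. In the sketch below the date string is only an illustration; stringr is assumed to be installed.

```r
library(stringr)

date <- "2023-07-15"

# first and last field of a dash-separated value
year <- word(date, 1, sep = fixed("-"))
year
#> [1] "2023"

day <- word(date, -1, sep = fixed("-"))
day
#> [1] "15"

# 'start' is vectorized, so several words can be pulled from one string
first_three <- word("Be the change you want to be", 1:3)
first_three
#> [1] "Be"     "the"    "change"
```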

"stringr" has more functions but we’ll discuss them in the chapters about regular expressions.