As we saw in the previous chapters, R provides a useful range of functions for basic string processing and manipulation of "character" data. Most of the time these functions are enough and they allow us to get our job done. Sometimes, however, they behave awkwardly.
As an example, consider the function paste(). The default separator is a blank space, which more often than not is what we want to use. But that’s secondary. The really annoying thing happens when we paste things that include zero-length arguments. How does paste() behave in those cases? See below:
# this works fine
paste("University", "of", "California", "Berkeley")
[1] "University of California Berkeley"
# this works fine too
paste("University", "of", "California", "Berkeley")
[1] "University of California Berkeley"
# this is weird
paste("University", "of", "California", "Berkeley", NULL)
[1] "University of California Berkeley "
# this is ugly
paste("University", "of", "California", "Berkeley", NULL, character(0), "Go Bears!")
[1] "University of California Berkeley   Go Bears!"
Notice the output from the last example (the ugly one). The objects NULL and character(0) have zero length, yet when included inside paste() they are treated as an empty string "", so each one still contributes a separator. Wouldn’t it be good if paste() removed zero-length arguments? Sadly, there’s nothing we can do to change nchar() and paste(). But fear not. There is a very nice package that solves these problems and provides several functions for carrying out consistent string processing.
4.1 Package "stringr"
Thanks to Hadley Wickham and company, we have the package "stringr" that adds more functionality to the base functions for handling strings in R. According to the description of the package:
“stringr is a set of simple wrappers that make R’s string functions more consistent, simpler and easier to use. It does this by ensuring that: function and argument names (and positions) are consistent, all functions deal with NA’s and zero length character appropriately, and the output data structures from each function matches the input data structures of other functions.”
To install "stringr" use the function install.packages(). Once installed, load it into your current session with library():
# installing 'stringr'
install.packages("stringr")
# load 'stringr'
library(stringr)
4.2 Basic String Operations
"stringr" provides functions for both 1) basic manipulations and 2) regular expression operations. In this chapter we cover the functions that have to do with basic manipulations.
The following table contains the "stringr" functions for basic string operations:
Function        Description                                Similar to
str_c()         string concatenation                       paste()
str_length()    number of characters                       nchar()
str_sub()       extracts substrings                        substring()
str_dup()       duplicates characters                      strrep()
str_trim()      removes leading and trailing whitespace    trimws()
str_pad()       pads a string                              none
str_wrap()      wraps a string paragraph                   strwrap()
Notice that all functions start with "str_" followed by a term associated with the task they perform. For example, str_length() gives you the number (i.e. length) of characters in a string. In addition, some functions are designed to provide a better alternative to already existing functions. This is the case of str_length(), which is intended to be a substitute for nchar(). Other functions, such as str_pad(), which pads a string to a given width, have no dedicated counterpart in base R.
4.2.1 Concatenating with str_c()
Let’s begin with str_c(). This function is equivalent to paste() but instead of using a blank space as the default separator, str_c() uses the empty string "", which is a more common separator when pasting strings. As with paste(), you can change the separator by specifying the argument sep.
An alternative to str_c() is str_glue(), which takes its separator through the argument .sep.
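The following is a minimal sketch of the behavior described above, not an excerpt from the package documentation. The example strings and variable names are illustrative assumptions, and the sketch assumes a version of "stringr" recent enough to include str_glue(). It compares the default separators of paste() and str_c(), shows that a NULL argument is dropped by str_c(), and joins the same pieces with str_glue() and .sep:

```r
library(stringr)

# paste() separates the pieces with a blank space by default
pasted <- paste("May", "The", "Force")

# str_c() joins them with the empty string ""
joined <- str_c("May", "The", "Force")

# a NULL argument is dropped instead of being treated as ""
no_null <- str_c("May", "The", NULL, "Force", sep = " ")

# str_glue() takes its separator through the argument .sep
glued <- str_glue("May", "The", "Force", .sep = " ")

pasted    # "May The Force"
joined    # "MayTheForce"
no_null   # "May The Force"
glued     # May The Force
```

Only NULL is used here on purpose: how other zero-length inputs such as character(0) are handled may depend on the installed version of "stringr", so it is worth checking on your own setup.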
4.2.2 Number of characters with str_length()
As we’ve mentioned before, the function str_length() is equivalent to nchar(). Both functions return the number of characters in a string, that is, the length of a string (do not confuse it with the length() of a vector). Older versions of nchar() gave NA a length of 2. As the output below shows, current versions of R return NA for a missing string in a character vector, and str_length() likewise preserves missing values as NA.
# some text (NA included)
some_text <- c("one", "two", "three", NA, "five")
# compare 'str_length' with 'nchar'
nchar(some_text)
[1] 3 3 5 NA 4
str_length(some_text)
[1] 3 3 5 NA 4
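To make the distinction between the length of a string and the length() of a vector concrete, here is a small sketch that reuses the some_text vector from above. The variable names n_elements and n_chars are introduced only for illustration:

```r
library(stringr)

some_text <- c("one", "two", "three", NA, "five")

# length() counts the elements of the vector
n_elements <- length(some_text)

# str_length() counts the characters of each element, keeping NA as NA
n_chars <- str_length(some_text)

n_elements   # 5
n_chars      # 3 3 5 NA 4
```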
In addition, str_length() has the nice feature that it converts factors to characters, something that nchar() is not able to handle:
Error in nchar(some_factor): 'nchar()' requires a character vector
# now compare it with 'str_length'
str_length(some_factor)
[1] 4 4 4 3 3 3
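The comparison can be reproduced with the sketch below. The definition of some_factor is an assumption: the labels "good" and "bad" are hypothetical values chosen only because they give the character counts shown above. The call to nchar() is wrapped in tryCatch() so that the script keeps running after the error:

```r
library(stringr)

# hypothetical factor (labels chosen to match the lengths 4 and 3)
some_factor <- factor(c("good", "good", "good", "bad", "bad", "bad"))

# nchar() refuses factors; capture the error message instead of stopping
nchar_result <- tryCatch(
  nchar(some_factor),
  error = function(e) conditionMessage(e)
)

# str_length() converts the factor to character first
factor_lengths <- str_length(some_factor)

nchar_result     # the error message raised by nchar()
factor_lengths   # 4 4 4 3 3 3
```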
4.2.3 Substring with str_sub()
To extract substrings from a character vector, stringr provides str_sub(), which is equivalent to substring(). The function str_sub() has the following usage form:
str_sub(string, start = 1L, end = -1L)
The three arguments in the function are: a string vector, a start value indicating the position of the first character in the substring, and an end value indicating the position of the last character. Here’s a simple example with a single string in which characters from 1 to 5 are extracted:
lorem <- "Lorem Ipsum"
# apply 'str_sub'
str_sub(lorem, start = 1, end = 5)
[1] "Lorem"
# equivalent to 'substring'
substring(lorem, first = 1, last = 5)
[1] "Lorem"
# another example
str_sub("adios", 1:3)
[1] "adios" "dios" "ios"
An interesting feature of str_sub() is its ability to work with negative indices in the start and end positions. When we use a negative position, str_sub() counts backwards from the last character:
resto <- c("brasserie", "bistrot", "creperie", "bouchon")
# 'str_sub' with negative positions
str_sub(resto, start = -4, end = -1)
[1] "erie" "trot" "erie" "chon"
# compared to substring (useless)
substring(resto, first = -4, last = -1)
[1] "" "" "" ""
Similar to substring(), we can also give str_sub() a set of positions which will be recycled over the string. But even better, we can give str_sub() a negative sequence, something that substring() does not interpret as counting from the end.
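A short sketch of both cases, using the string "adios" from the earlier example. The variable names are illustrative assumptions. The end position is left at its default of -1, so every substring runs to the last character:

```r
library(stringr)

# positive start positions, recycled over a single string
heads <- str_sub("adios", start = 1:3)

# negative start positions count backwards from the last character
tails <- str_sub("adios", start = -(1:3))

heads   # "adios" "dios" "ios"
tails   # "s" "os" "ios"

# the same negative positions passed to substring(), for comparison
substring("adios", first = -(1:3))
```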
4.2.4 Duplication with str_dup()
A common operation when handling characters is duplication. Base R long lacked a dedicated function for that purpose (strrep() was only added in R 3.3.0), but stringr has one: str_dup() duplicates and concatenates strings within a character vector. Its usage requires two arguments:
str_dup(string, times)
The first input is the string that you want to duplicate. The second input, times, is the number of times to duplicate each string:
# default usage
str_dup("hola", 3)
[1] "holaholahola"
# use with different 'times'
str_dup("adios", 1:3)
[1] "adios" "adiosadios" "adiosadiosadios"
# use with a string vector
words <- c("lorem", "ipsum", "dolor", "sit", "amet")
str_dup(words, 2)
[1] "loremlorem" "ipsumipsum" "dolordolor" "sitsit" "ametamet"
4.2.5 Padding with str_pad()
Another handy function that we can find in stringr is str_pad() for padding a string. Its default usage has the following form:
str_pad(string, width, side = "left", pad = " ")
The idea of str_pad() is to take a string and pad it with leading or trailing characters to a specified total width. The default padding character is a space (pad = " "), and consequently the returned string will appear to be right-aligned (side = "left", padding added before the text), left-aligned (side = "right", padding added after the text), or centered (side = "both").
Let’s see some examples:
# default usage
str_pad("hola", width = 7)
[1] "   hola"
# pad both sides
str_pad("adios", width = 7, side = "both")
[1] " adios "
# left padding with '#'
str_pad("hashtag", width = 8, pad = "#")
[1] "#hashtag"
# pad both sides with '-'
str_pad("hashtag", width = 9, side = "both", pad = "-")
[1] "-hashtag-"
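One typical use of padding is giving identifiers a fixed width. The sketch below is only an illustration: the ids vector is hypothetical data, and the zero character is just one possible choice for pad. It also shows that a string already longer than width is returned unchanged:

```r
library(stringr)

# hypothetical identifiers of different lengths
ids <- c("7", "42", "123")

# left-pad with zeros to a total width of 5
padded <- str_pad(ids, width = 5, pad = "0")

# a string longer than 'width' is left as it is
too_long <- str_pad("hashtag", width = 3)

padded     # "00007" "00042" "00123"
too_long   # "hashtag"
```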
4.2.6 Wrapping with str_wrap()
The function str_wrap() is similar to strwrap(), which can be used to wrap a string to format paragraphs. The idea of wrapping a (long) string is to first break it into lines according to the given width, and then add the specified indentation to each line (first line with indent, following lines with exdent). Its default usage has the following form:
str_wrap(string, width = 80, indent = 0, exdent = 0)
For instance, consider the following quote (from Douglas Adams) converted into a paragraph:
# quote (by Douglas Adams)
some_quote <- c("I may not have gone", "where I intended to go,", "but I think I have ended up", "where I needed to be")
# some_quote in a single paragraph
some_quote <- paste(some_quote, collapse = " ")
Now, say you want to display the text of some_quote within some pre-specified column width (e.g. width of 30). You can achieve this by applying str_wrap() and setting the argument width = 30:
# display paragraph with width=30
cat(str_wrap(some_quote, width = 30))
I may not have gone where I
intended to go, but I think I
have ended up where I needed
to be
Besides displaying a (long) paragraph into several lines, you may also wish to add some indentation. Here’s how you can indent the first line, as well as the following lines:
# display paragraph with first line indentation of 2
cat(str_wrap(some_quote, width = 30, indent = 2), "\n")
  I may not have gone where I
intended to go, but I think I
have ended up where I needed
to be
# display paragraph with following lines indentation of 3
cat(str_wrap(some_quote, width = 30, exdent = 3), "\n")
I may not have gone where I
   intended to go, but I think
   I have ended up where I
   needed to be
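One practical difference from strwrap() is the shape of the result. The sketch below rebuilds the quote used above and compares the two. The variable names wrapped and base_lines are illustrative assumptions:

```r
library(stringr)

some_quote <- paste(
  "I may not have gone where I intended to go,",
  "but I think I have ended up where I needed to be"
)

# str_wrap() returns a single string with embedded line breaks
wrapped <- str_wrap(some_quote, width = 30)

# strwrap() returns a character vector with one element per line
base_lines <- strwrap(some_quote, width = 30)

length(wrapped)      # 1
length(base_lines)   # more than 1

# both can be printed as a paragraph
cat(wrapped, "\n")
cat(base_lines, sep = "\n")
```

This is why the examples in this section pass the result of str_wrap() straight to cat(), while the output of strwrap() needs sep = "\n" to be displayed line by line.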
4.2.7 Trimming with str_trim()
One of the typical tasks of string processing is that of parsing a text into individual words. Usually, you end up with words that have blank spaces, called whitespaces, on either end of the word. In this situation, you can use the str_trim() function to remove any number of whitespaces at the ends of a string. Its usage requires only two arguments:
str_trim(string, side = "both")
The first input is the string to be trimmed, and the second input indicates the side on which the whitespace will be removed.
Consider the following vector of strings, some of which have whitespaces either on the left, on the right, or on both sides. Here’s what str_trim() would do to them under different settings of side:
# text with whitespaces
bad_text <- c("This", " example ", "has several ", " whitespaces ")
# remove whitespaces on the left side
str_trim(bad_text, side = "left")
[1] "This" "example " "has several " "whitespaces "
# remove whitespaces on the right side
str_trim(bad_text, side = "right")
[1] "This" " example" "has several" " whitespaces"
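The remaining setting, side = "both", is the default. A short sketch with the same bad_text vector (the variable names trimmed and trimmed_default are illustrative):

```r
library(stringr)

bad_text <- c("This", " example ", "has several ", " whitespaces ")

# remove whitespaces on both sides
trimmed <- str_trim(bad_text, side = "both")

# 'both' is the default, so the argument can be omitted
trimmed_default <- str_trim(bad_text)

trimmed   # "This" "example" "has several" "whitespaces"
```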
4.2.8 Word extraction with word()
The last function in this chapter is word(), which extracts words from a sentence. The way to use word() is to pass it a string, together with the start position of the first word to extract and the end position of the last word to extract. By default, the separator sep used between words is a single space.
Let’s see some examples:
# some sentence
change <- c("Be the change", "you want to be")
# extract first word
word(change, 1)
[1] "Be" "you"
# extract second word
word(change, 2)
[1] "the" "want"
# extract last word
word(change, -1)
[1] "change" "be"
# extract all but the first words
word(change, 2, -1)
[1] "the change" "want to be"
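The separator can be changed through sep. As a small sketch, the dates vector below is hypothetical data whose pieces are separated by hyphens instead of spaces, and fixed() marks the separator as a literal string:

```r
library(stringr)

# hypothetical strings whose "words" are separated by hyphens
dates <- c("2024-01-15", "2023-12-31")

# first piece of each string
years <- word(dates, 1, sep = fixed("-"))

# last piece of each string
days <- word(dates, -1, sep = fixed("-"))

years   # "2024" "2023"
days    # "15" "31"
```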
"stringr" has more functions but we’ll discuss them in the chapters about regular expressions.