1 Introductory Appetizer

To give you an idea of some of the things we can do in R with string processing, let’s play a bit with a simple example.

1.1 A Toy Example

For this quick, informal introduction, we’ll use the data frame USArrests, which comes with R. Use the function head() to take a peek at the data:

# take a peek at USArrests
head(USArrests)
#>            Murder Assault UrbanPop Rape
#> Alabama      13.2     236       58 21.2
#> Alaska       10.0     263       48 44.5
#> Arizona       8.1     294       80 31.0
#> Arkansas      8.8     190       50 19.5
#> California    9.0     276       91 40.6
#> Colorado      7.9     204       78 38.7

The row labels, such as Alabama or Alaska, are strings. Likewise, the column labels (Murder, Assault, UrbanPop and Rape) are also strings.

1.1.1 Abbreviating strings

Suppose we want to abbreviate the names of the states using the first four characters of each name. One way to do that is with the function substr(), which extracts substrings from a character vector. We just need to indicate the start = 1 and stop = 4 positions:

# names of states
states <- rownames(USArrests)

# substr
substr(x = states, start = 1, stop = 4)
#>  [1] "Alab" "Alas" "Ariz" "Arka" "Cali" "Colo" "Conn" "Dela" "Flor" "Geor"
#> [11] "Hawa" "Idah" "Illi" "Indi" "Iowa" "Kans" "Kent" "Loui" "Main" "Mary"
#> [21] "Mass" "Mich" "Minn" "Miss" "Miss" "Mont" "Nebr" "Neva" "New " "New "
#> [31] "New " "New " "Nort" "Nort" "Ohio" "Okla" "Oreg" "Penn" "Rhod" "Sout"
#> [41] "Sout" "Tenn" "Texa" "Utah" "Verm" "Virg" "Wash" "West" "Wisc" "Wyom"

This may not be the best solution. Note that four states share the abbreviation "New " (New Hampshire, New Jersey, New Mexico, New York). Likewise, Mississippi and Missouri both become "Miss", North Carolina and North Dakota share "Nort", and South Carolina and South Dakota share "Sout".
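Collisions like these are easy to miss by eye. As a minimal sketch (the names abbr4 and dups are introduced here only for illustration), base R’s duplicated() can flag the abbreviations that occur more than once:

```r
# names of states and their four-character prefixes
states <- rownames(USArrests)
abbr4 <- substr(states, start = 1, stop = 4)

# abbreviations that appear more than once
dups <- unique(abbr4[duplicated(abbr4)])
dups

# states affected by each repeated abbreviation
split(states[abbr4 %in% dups], abbr4[abbr4 %in% dups])
```

The split() call groups the full names by their shared prefix, which makes it clear which states would be confused with each other.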

A better way to abbreviate the names of the states is to use the function abbreviate(), like so:

# abbreviate state names
states2 <- abbreviate(states)

# remove vector names (for convenience)
names(states2) <- NULL
states2
#>  [1] "Albm" "Alsk" "Arzn" "Arkn" "Clfr" "Clrd" "Cnnc" "Dlwr" "Flrd" "Gerg"
#> [11] "Hawa" "Idah" "Illn" "Indn" "Iowa" "Knss" "Kntc" "Losn" "Main" "Mryl"
#> [21] "Mssc" "Mchg" "Mnns" "Msss" "Mssr" "Mntn" "Nbrs" "Nevd" "NwHm" "NwJr"
#> [31] "NwMx" "NwYr" "NrtC" "NrtD" "Ohio" "Oklh" "Orgn" "Pnns" "RhdI" "SthC"
#> [41] "SthD" "Tnns" "Texs" "Utah" "Vrmn" "Vrgn" "Wshn" "WstV" "Wscn" "Wymn"

If we decide to try an abbreviation with five letters, we simply set the argument minlength = 5. Names with fewer than five characters, such as Iowa, are left as they are:

# abbreviate state names with 5 letters
abbreviate(states, minlength = 5)
#>        Alabama         Alaska        Arizona       Arkansas     California 
#>        "Alabm"        "Alask"        "Arizn"        "Arkns"        "Clfrn" 
#>       Colorado    Connecticut       Delaware        Florida        Georgia 
#>        "Colrd"        "Cnnct"        "Delwr"        "Flord"        "Georg" 
#>         Hawaii          Idaho       Illinois        Indiana           Iowa 
#>        "Hawai"        "Idaho"        "Illns"        "Indin"         "Iowa" 
#>         Kansas       Kentucky      Louisiana          Maine       Maryland 
#>        "Kanss"        "Kntck"        "Lousn"        "Maine"        "Mryln" 
#>  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
#>        "Mssch"        "Mchgn"        "Mnnst"        "Mssss"        "Missr" 
#>        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
#>        "Montn"        "Nbrsk"        "Nevad"        "NwHmp"        "NwJrs" 
#>     New Mexico       New York North Carolina   North Dakota           Ohio 
#>        "NwMxc"        "NwYrk"        "NrthC"        "NrthD"         "Ohio" 
#>       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
#>        "Oklhm"        "Oregn"        "Pnnsy"        "RhdIs"        "SthCr" 
#>   South Dakota      Tennessee          Texas           Utah        Vermont 
#>        "SthDk"        "Tnnss"        "Texas"         "Utah"        "Vrmnt" 
#>       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
#>        "Virgn"        "Wshng"        "WstVr"        "Wscns"        "Wymng"
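One quick way to see why abbreviate() is the safer choice here is to check its output for repeated values. The following is a small sketch, where abbr4 and abbr5 are illustrative names; anyDuplicated() returns 0 when every element is unique:

```r
states <- rownames(USArrests)

abbr4 <- abbreviate(states)                 # default minlength = 4
abbr5 <- abbreviate(states, minlength = 5)

# 0 means that no abbreviation is repeated
anyDuplicated(abbr4)
anyDuplicated(abbr5)

# the result is a named vector, so it also works as a lookup table
abbr5[c("New York", "Ohio")]
```

Keeping the names (instead of setting them to NULL) is convenient when you need to map a full state name to its abbreviation.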

1.1.2 Getting the longest name

Now let’s imagine that we need to find the longest name. This implies that we need to count the number of characters in each name. The function nchar() comes in handy for that purpose. Here’s how we could do it:

# size (in characters) of each name
state_chars <- nchar(states)
state_chars
#>  [1]  7  6  7  8 10  8 11  8  7  7  6  5  8  7  4  6  8  9  5  8 13  8  9 11  8
#> [26]  7  8  6 13 10 10  8 14 12  4  8  6 12 12 14 12  9  5  4  7  8 10 13  9  7

# longest name
states[which(state_chars == max(state_chars))]
#> [1] "North Carolina" "South Carolina"
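A common shortcut for this kind of query is which.max(), but it returns only the first position of the maximum, so tied names are dropped. The sketch below contrasts the two approaches:

```r
states <- rownames(USArrests)
state_chars <- nchar(states)

# which.max() stops at the first maximum
states[which.max(state_chars)]

# comparing against max() keeps every tied name
states[state_chars == max(state_chars)]
```

Note that a logical comparison can index the vector directly; wrapping it in which() is optional in this case.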

1.1.3 Selecting states

To make things more interesting, let’s assume that we wish to select those states containing the letter "k". How can we do that? We just need to use the function grep(), which searches a character vector for matches of a pattern (a regular expression). Simply indicate the pattern = "k" as follows:

# get states names with 'k'
grep(pattern = "k", x = states, value = TRUE)
#> [1] "Alaska"       "Arkansas"     "Kentucky"     "Nebraska"     "New York"    
#> [6] "North Dakota" "Oklahoma"     "South Dakota"
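With value = TRUE, grep() returns the matching strings themselves. As a brief sketch of the other two common forms (the names idx_k, has_k and k_rows are illustrative), grep() without value = TRUE returns positions, and grepl() returns a logical vector; either one can be used to subset the data frame:

```r
states <- rownames(USArrests)

# without value = TRUE, grep() returns the positions of the matches
idx_k <- grep(pattern = "k", x = states)

# grepl() returns one TRUE/FALSE per element
has_k <- grepl(pattern = "k", x = states)

# either result can be used to subset the rows of the data frame
k_rows <- USArrests[has_k, ]
rownames(k_rows)
```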

Instead of grabbing those names containing "k", say we wish to select those states containing the letter "w". Again, this can be done with grep():

# get states names with 'w'
grep(pattern = "w", x = states, value = TRUE)
#> [1] "Delaware"      "Hawaii"        "Iowa"          "New Hampshire"
#> [5] "New Jersey"    "New Mexico"    "New York"

Notice that we only selected those states with a lowercase "w". But what about those states with an uppercase "W"? There are several ways to handle this. One option is to specify the pattern as a character class "[wW]":

# get states names with 'w' or 'W'
grep(pattern = "[wW]", x = states, value = TRUE)
#>  [1] "Delaware"      "Hawaii"        "Iowa"          "New Hampshire"
#>  [5] "New Jersey"    "New Mexico"    "New York"      "Washington"   
#>  [9] "West Virginia" "Wisconsin"     "Wyoming"

Another solution is to first convert the state names to lower case, and then look for the character "w". Note that the returned names are then in lower case as well:

# get states names with 'w' (after converting to lower case)
grep(pattern = "w", x = tolower(states), value = TRUE)
#>  [1] "delaware"      "hawaii"        "iowa"          "new hampshire"
#>  [5] "new jersey"    "new mexico"    "new york"      "washington"   
#>  [9] "west virginia" "wisconsin"     "wyoming"

Alternatively, instead of converting the state names to lower case, we could do the opposite (convert to upper case), and then look for the character "W". In this case the returned names are in upper case:

# get states names with 'W' (after converting to upper case)
grep(pattern = "W", x = toupper(states), value = TRUE)
#>  [1] "DELAWARE"      "HAWAII"        "IOWA"          "NEW HAMPSHIRE"
#>  [5] "NEW JERSEY"    "NEW MEXICO"    "NEW YORK"      "WASHINGTON"   
#>  [9] "WEST VIRGINIA" "WISCONSIN"     "WYOMING"

A third solution is to specify the argument ignore.case = TRUE inside grep(), which leaves the names themselves unchanged:

# get states names with 'w' or 'W' (ignoring case)
grep(pattern = "w", x = states, value = TRUE, ignore.case = TRUE)
#>  [1] "Delaware"      "Hawaii"        "Iowa"          "New Hampshire"
#>  [5] "New Jersey"    "New Mexico"    "New York"      "Washington"   
#>  [9] "West Virginia" "Wisconsin"     "Wyoming"
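Since these approaches differ only in how they treat case, they should all select the same positions. The sketch below checks this by comparing indices instead of values, which also sidesteps the fact that tolower() and toupper() change the returned names (the idx_* names are illustrative):

```r
states <- rownames(USArrests)

idx_class <- grep("[wW]", states)
idx_lower <- grep("w", tolower(states))
idx_upper <- grep("W", toupper(states))
idx_icase <- grep("w", states, ignore.case = TRUE)

# all four approaches pick the same positions
identical(idx_class, idx_lower)
identical(idx_class, idx_upper)
identical(idx_class, idx_icase)

# indexing the original vector preserves the original capitalization
states[idx_lower]
```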

1.1.4 Some computations

Besides manipulating strings and performing pattern matching operations, we can also do some computations. For instance, we could ask for the distribution of the lengths of the state names. To find the answer we can use nchar(). Furthermore, we can plot a histogram of that distribution:

summary(nchar(states))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    4.00    7.00    8.00    8.44   10.00   14.00

# histogram
hist(nchar(states), las = 1, col = "gray80", main = "Histogram", 
     xlab = "number of characters in US State names")
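When a plot is not convenient, the same distribution can be tabulated as text. Here is a minimal sketch using table(), where len_counts and most_common are illustrative names:

```r
states <- rownames(USArrests)

# frequency of each name length
len_counts <- table(nchar(states))
len_counts

# names with the most common length
most_common <- as.integer(names(len_counts)[which.max(len_counts)])
states[nchar(states) == most_common]
```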

Let’s ask a more interesting question. What is the distribution of the vowels in the names of the states? For instance, let’s start with the number of a’s in each name. There’s a very useful function for this purpose: gregexpr(). It returns the positions of every match of a pattern in each element of a character vector, and from those positions we can derive the number of matches. When there is no match, we get a value of -1.

# position of a's
positions_a <- gregexpr(pattern="a", text=states, ignore.case = TRUE)

# how many a's?
num_a <- sapply(positions_a, function(x) ifelse(x[1]>0, length(x), 0))
num_a
#>  [1] 4 3 2 3 2 1 0 2 1 1 2 1 0 2 1 2 0 2 1 2 2 1 1 0 0 2 2 2 1 0 0 0 2 2 0 2 0 2
#> [39] 1 2 2 0 1 1 0 1 1 1 0 0

If you inspect positions_a you’ll see that some of its elements are equal to -1, which means that the corresponding name contains no letter a. To get the number of occurrences of a’s we take a shortcut with sapply(): if the first position is positive we count the positions, otherwise we return 0.
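To make the structure of the gregexpr() result more concrete, the sketch below looks at just two names, one with matches and one without. It also shows an alternative way of counting, based on regmatches() and lengths() from base R; this is offered as one possible idiom, not as the method used in the text.

```r
# positions of "a" (ignoring case) in one name with matches and one without
x <- c("Alabama", "Ohio")
pos <- gregexpr(pattern = "a", text = x, ignore.case = TRUE)

as.integer(pos[[1]])   # match positions in "Alabama"
as.integer(pos[[2]])   # -1 signals that "Ohio" has no match

# alternative count: extract the matched substrings and count them
num_a <- lengths(regmatches(x, pos))
num_a
```

Because regmatches() returns an empty character vector when there is no match, the -1 case needs no special handling in this variant.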

A similar count can be obtained with the function str_count() from the package "stringr".

# load stringr (remember to install it first)
library(stringr)

# total number of a's
str_count(states, "a")
#>  [1] 3 2 1 2 2 1 0 2 1 1 2 1 0 2 1 2 0 2 1 2 2 1 1 0 0 2 2 2 1 0 0 0 2 2 0 2 0 2
#> [39] 1 2 2 0 1 1 0 1 1 1 0 0

Notice that we are only getting the number of a’s in lower case. Since str_count() has no ignore.case argument, one simple approach is to transform all letters to lower case first, and then count the number of a’s like this:

# total number of a's
str_count(tolower(states), "a")
#>  [1] 4 3 2 3 2 1 0 2 1 1 2 1 0 2 1 2 0 2 1 2 2 1 1 0 0 2 2 2 1 0 0 0 2 2 0 2 0 2
#> [39] 1 2 2 0 1 1 0 1 1 1 0 0
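If you prefer to leave the input untouched, stringr also lets you wrap the pattern in regex() with ignore_case = TRUE. The sketch below assumes that stringr is installed and simply checks that both routes agree (count_regex and count_lower are illustrative names):

```r
library(stringr)

states <- rownames(USArrests)

# case-insensitive count via a pattern modifier
count_regex <- str_count(states, regex("a", ignore_case = TRUE))

# case-insensitive count via lower-casing the input
count_lower <- str_count(tolower(states), "a")

identical(count_regex, count_lower)
```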

Once we know how to do it for one vowel, we can do the same for all the vowels:

# calculate the total number of each vowel across all names
vowels <- c("a", "e", "i", "o", "u")
num_vowels <- vector(mode = "integer", length = 5)

for (j in seq_along(vowels)) {
  num_aux <- str_count(tolower(states), vowels[j])
  num_vowels[j] <- sum(num_aux)
}

# sort them in decreasing order
names(num_vowels) <- vowels
sort(num_vowels, decreasing = TRUE)
#>  a  i  o  e  u 
#> 61 44 36 28  8

# barplot
barplot(num_vowels, main = "Number of vowels in USA States names", 
        border = NA, xlim = c(0, 80), las = 1, horiz = TRUE)
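The loop above can also be written without explicit indexing. As a sketch of one alternative (not the only idiomatic one), sapply() applies the same count to each vowel and names the result automatically; count_vowel is a hypothetical helper introduced here for illustration.

```r
library(stringr)

states <- rownames(USArrests)
vowels <- c("a", "e", "i", "o", "u")

# hypothetical helper: total occurrences of one vowel across all names
count_vowel <- function(v) sum(str_count(tolower(states), v))

# sapply() over a character vector uses its values as names
num_vowels <- sapply(vowels, count_vowel)
sort(num_vowels, decreasing = TRUE)
```

This avoids pre-allocating the result vector and assigning the names in a separate step.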