14 Quantifiers

by Chitra Venkatesh

Quantifiers quantify the number of instances of a character, group or character class, denoted as P in the table below. Your quantifier should be placed after the character/group/character class that is being quantified. We will also explain what groups are in this context.

Quantifier Description
P* 0 or more instances of P
P+ 1 or more instances of P
P? 0 or 1 instance of P
P{m} Exactly m instances of P
P{m,} At least m instances of P
P{m,n} Between m and n instances of P

In the following example, let us try to extract all those names that contain more than 4 characters and less than 7 characters.

student_names <- c("Lee", "Carol", "Sameer", "Luca", "Rajan", "George Jr.")

str_extract(student_names, regex("^[A-z]{5,7}$"))
#> [1] NA       "Carol"  "Sameer" NA       "Rajan"  NA

In the above example, we used anchors ^ and $ to indicate an exact match. In absence of which a substring of George Jr. also gets displayed.

Let’s try to detect names of those individuals with one or more e or u.

str_detect(student_names, regex("[eu]+"))
#> [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE

In the last example, if we want to extract names that contain e or u we could follow this simple implementation . Points to note here:

  • Character set [eu] could appear 1 or more times so we use quantifier +.

  • .* matches 0 or any number of characters where . is a wildcard dot and * represents the quantifier 0 or many

  • Pattern .*[eu]+.* looks for 1 or more numbers of [eu] that can be preceeded/followed by any number of other characters.

student_names <- c("Lee", "Carol", "Sameer", "Luca", "Rajan", "George Jr.")

str_extract(student_names, regex(".*[eu]+.*"))
#> [1] "Lee"        NA           "Sameer"     "Luca"       NA          
#> [6] "George Jr."

14.0.1 What do groups mean in Regex?

We visited character classes in one of the sections. For situations where we would like to group character classes or regex pattern before using a quantifier, we indicate grouping using paranthesis.

Consider an example where we would like to extract only strings with two names separated by a whitespace. For illustrative purpose, the strings end with a whitespace.

student_names <- c(
  "Lee Zhang ", 
  "Carol Roberts ", 
  "Sameer ", 
  "Luca ", 
  "Rajan ", 
  "George Smith ")

str_extract(student_names, regex("([A-z]+[ ]){2}"))
#> [1] "Lee Zhang "     "Carol Roberts " NA               NA              
#> [5] NA               "George Smith "

We could also use pre-built class [:alpha:] in the above example.

str_extract(student_names, regex("([:alpha:]+[ ]){2}"))
#> [1] "Lee Zhang "     "Carol Roberts " NA               NA              
#> [5] NA               "George Smith "