11  Quantifiers

Another important set of regex metacharacters is the quantifiers. We use them when we want to match a specific number of characters that meet certain criteria.

11.1 Quantifier Metacharacters

As the name indicates, quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found.

The following table shows the regex quantifiers. The quantifier should be placed after the character, group or character class that is being quantified, denoted as c in the table below.

Quantifier   Description
c?           The preceding item is optional and will be matched at most once
c*           The preceding item will be matched zero or more times
c+           The preceding item will be matched one or more times
c{n}         The preceding item is matched exactly n times
c{n,}        The preceding item is matched n or more times
c{n,m}       The preceding item is matched at least n times, but no more than m times
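
As a quick check of the table, the sketch below applies each quantifier to the letter b in a small made-up vector. The vector x and its strings are illustrative assumptions, not part of the examples that follow, and the code assumes the stringr package is installed.

```r
library(stringr)

# Hypothetical test strings: "a", then zero to three "b", then "c"
x <- c("ac", "abc", "abbc", "abbbc")

# Each pattern quantifies only the "b"; the anchors force a whole-string match
optional     <- str_detect(x, "^ab?c$")      # b zero or one time
zero_or_more <- str_detect(x, "^ab*c$")      # b zero or more times
one_or_more  <- str_detect(x, "^ab+c$")      # b one or more times
exactly_two  <- str_detect(x, "^ab{2}c$")    # b exactly two times
two_or_more  <- str_detect(x, "^ab{2,}c$")   # b two or more times
one_to_two   <- str_detect(x, "^ab{1,2}c$")  # b one or two times

optional
# [1]  TRUE  TRUE FALSE FALSE
one_or_more
# [1] FALSE  TRUE  TRUE  TRUE
one_to_two
# [1] FALSE  TRUE  TRUE FALSE
```

Note that the quantifier applies only to the item immediately before it (here the single letter b), not to the whole pattern.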

For illustration purposes, let’s create a vector people:

people <- c(
  "rori", 
  "emilia", 
  "matteo", 
  "mehmet", 
  "filipe", 
  "ana", 
  "victoria")

people
[1] "rori"     "emilia"   "matteo"   "mehmet"   "filipe"   "ana"      "victoria"

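We start with a simple example: extracting the names that contain at least five but no more than seven characters. To do this, we define a pattern formed by the start anchor ^, followed by the character range [A-z], followed by the repetition quantifier {5,7}, followed by the end anchor $. Note that [A-z] covers the upper and lower case letters but also the six punctuation characters that sit between Z and a in ASCII ([, \, ], ^, _ and `); the stricter set [A-Za-z] matches letters only, and either one works for the names used here.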

str_extract(people, "^[A-z]{5,7}$")
[1] NA       "emilia" "matteo" "mehmet" "filipe" NA       NA      

We use the anchors ^ and $ so that the pattern has to match the whole string, not just a part of it.
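
To see what the anchors contribute, the following sketch runs the same quantifier with and without them. It uses the stricter set [A-Za-z] instead of [A-z], which is an assumption of this example; the variable names anchored and unanchored are illustrative.

```r
library(stringr)

people <- c("rori", "emilia", "matteo", "mehmet", "filipe", "ana", "victoria")

# With anchors: the whole name must be 5 to 7 letters long
anchored <- str_extract(people, "^[A-Za-z]{5,7}$")

# Without anchors: any run of 5 to 7 letters inside the name is enough
unanchored <- str_extract(people, "[A-Za-z]{5,7}")

anchored
# [1] NA       "emilia" "matteo" "mehmet" "filipe" NA       NA
unanchored
# [1] NA        "emilia"  "matteo"  "mehmet"  "filipe"  NA        "victori"
```

Without the anchors, victoria is no longer rejected: the pattern is satisfied by its first seven letters.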

Let’s try to detect names of those individuals with one or more a or e.

str_detect(people, "[ae]+")
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
people[str_detect(people, "[ae]+")]
[1] "emilia"   "matteo"   "mehmet"   "filipe"   "ana"      "victoria"
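
Two small observations about this example can be checked with the sketch below, which again assumes stringr is loaded. For detection alone, the + is not strictly needed, because a single a or e is already enough to produce a match; and stringr's str_subset() performs the detect-then-subset step in one call.

```r
library(stringr)

people <- c("rori", "emilia", "matteo", "mehmet", "filipe", "ana", "victoria")

# For a yes/no answer, "[ae]+" and "[ae]" flag exactly the same names
with_plus    <- str_detect(people, "[ae]+")
without_plus <- str_detect(people, "[ae]")
identical(with_plus, without_plus)
# [1] TRUE

# str_subset() keeps only the elements that match the pattern
matching <- str_subset(people, "[ae]")
matching
# [1] "emilia"   "matteo"   "mehmet"   "filipe"   "ana"      "victoria"
```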

In the last example, we only detected which names contain a or e. If we want to extract those names in full, we can use the pattern .*[ae]+.* instead. Points to note here:

  • The character set [ae] may appear one or more times, so we use the quantifier +.

  • .* matches zero or more characters of any kind: . is the wildcard dot and * is the zero-or-more quantifier.

  • The pattern .*[ae]+.* looks for one or more occurrences of [ae] that can be preceded and followed by any number of other characters.

people <- c(
  "rori", 
  "emilia", 
  "matteo", 
  "mehmet", 
  "filipe", 
  "ana", 
  "victoria")

str_extract(people, regex(".*[ae]+.*"))
[1] NA         "emilia"   "matteo"   "mehmet"   "filipe"   "ana"      "victoria"
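
The role of the surrounding .* becomes clearer when it is left out. The sketch below (variable names bare and padded are illustrative) compares the two patterns: on its own, [ae]+ extracts only the first run of a or e, not the whole name.

```r
library(stringr)

people <- c("rori", "emilia", "matteo", "mehmet", "filipe", "ana", "victoria")

# Only the first run of "a" or "e" is returned
bare <- str_extract(people, "[ae]+")

# The leading and trailing ".*" extend the match to the whole name
padded <- str_extract(people, ".*[ae]+.*")

bare
# [1] NA  "e" "a" "e" "e" "a" "a"
padded
# [1] NA         "emilia"   "matteo"   "mehmet"   "filipe"   "ana"      "victoria"
```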

11.1.1 What do groups mean in Regex?

We covered character classes in an earlier section. When we want to treat several character classes, or a whole regex pattern, as a single unit before applying a quantifier, we group them using parentheses.

Consider an example where we would like to extract only the strings that contain two names, each followed by a whitespace. For illustrative purposes, some of the strings end with a whitespace.

people <- c(
  "rori rholfs", 
  "emilia huerta ", 
  "matteo fumagalli ", 
  "mehmet ", 
  "filipe vieira", 
  "ana chen", 
  "victoria kim ")

str_extract(people, "([A-z]+[ ]){2}")
[1] NA                  "emilia huerta "    "matteo fumagalli "
[4] NA                  NA                  NA                 
[7] "victoria kim "    
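
The effect of the parentheses is easiest to see on a minimal case. In the sketch below (the vector x is a made-up example), the quantifier {2} applies to the whole group in one pattern and only to the last character in the other.

```r
library(stringr)

# Hypothetical test strings
x <- c("abab", "abb")

# Grouped: the unit "ab" must occur twice
grouped <- str_detect(x, "^(ab){2}$")

# Ungrouped: "a" followed by "b" exactly twice
ungrouped <- str_detect(x, "^ab{2}$")

grouped
# [1]  TRUE FALSE
ungrouped
# [1] FALSE  TRUE
```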

We could also use the pre-built class [:alpha:] in the above example.

str_extract(people, regex("([:alpha:]+[ ]){2}"))
[1] NA                  "emilia huerta "    "matteo fumagalli "
[4] NA                  NA                  NA                 
[7] "victoria kim "    
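
The two patterns give the same result on these names, but they are not equivalent in general. The sketch below tests a few single characters (a made-up vector) against the range [A-z], the explicit set [A-Za-z], and the POSIX class written in its bracketed form [[:alpha:]]; the bracketed spelling is a choice made for this example.

```r
library(stringr)

# Hypothetical single-character inputs
chars <- c("a", "Z", "_", "^", "5")

# [A-z] spans every code point from "A" to "z", including some punctuation
range_Az <- str_detect(chars, "^[A-z]$")

# These two match letters only
letters_only <- str_detect(chars, "^[A-Za-z]$")
posix_alpha  <- str_detect(chars, "^[[:alpha:]]$")

range_Az
# [1]  TRUE  TRUE  TRUE  TRUE FALSE
letters_only
# [1]  TRUE  TRUE FALSE FALSE FALSE
posix_alpha
# [1]  TRUE  TRUE FALSE FALSE FALSE
```

When the input may contain underscores or similar characters, the letter-only forms are usually the safer choice.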

11.2 Greedy vs Lazy Match

As you might have noticed in the previous examples, regex quantifiers tend to be greedy, i.e., they return the longest match possible for a given expression. Let’s explore this idea further and see if we can force them to be lazy.

Consider again one of our previous examples, but this time extracting the names that contain at least four but no more than six characters:

people <- c(
  "rori", 
  "emilia", 
  "matteo", 
  "mehmet", 
  "filipe", 
  "ana", 
  "victoria")

str_extract(people, "^[A-z]{4,6}$")
[1] "rori"   "emilia" "matteo" "mehmet" "filipe" NA       NA      

With the anchors in place, the quantifier {4,6} has no freedom: the pattern must cover the whole string, so every match is simply the full name and we cannot yet tell whether the quantifier is greedy.

Let us remove the anchors to see whether it is indeed greedy. Without the anchors, the pattern returns up to the first six characters of each name, provided the name has at least four characters.

str_extract(people, regex("[A-z]{4,6}"))
[1] "rori"   "emilia" "matteo" "mehmet" "filipe" NA       "victor"

We can make the quantifier lazy by adding a ? after it. For the names emilia, matteo, mehmet, filipe and victoria, it now returns only the first four characters.

str_extract(people, regex("[A-z]{4,6}?"))
[1] "rori" "emil" "matt" "mehm" "fili" NA     "vict"
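
A common situation where the difference matters is text with repeated delimiters. The sketch below uses a made-up string with HTML-like tags (the variable html is illustrative): the greedy pattern runs to the last >, while the lazy one stops at the first. It also checks that laziness has no visible effect once anchors force a whole-string match.

```r
library(stringr)

# Hypothetical input with two pairs of tags
html <- "<b>bold</b> and <i>italic</i>"

greedy <- str_extract(html, "<.+>")   # longest possible match
lazy   <- str_extract(html, "<.+?>")  # shortest possible match

greedy
# [1] "<b>bold</b> and <i>italic</i>"
lazy
# [1] "<b>"

# With anchors, greedy and lazy quantifiers return the same result
some_names <- c("rori", "emilia", "victoria")
anchored_greedy <- str_extract(some_names, "^[A-Za-z]{4,6}$")
anchored_lazy   <- str_extract(some_names, "^[A-Za-z]{4,6}?$")
identical(anchored_greedy, anchored_lazy)
# [1] TRUE
```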

Similarly, we could make other quantifiers lazy.

Original Quantifier (Greedy)   Lazy Version
c?                             c??
c*                             c*?
c+                             c+?
c{n}                           c{n}?
c{n,}                          c{n,}?
c{n,m}                         c{n,m}?
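
The table can be tried out on a single made-up string. In the sketch below, every quantifier is applied to the letter a in "baaa", once in its greedy form and once in its lazy form; the string and the variable names are assumptions of this example. It relies on str_extract() being vectorised over the pattern argument.

```r
library(stringr)

# Hypothetical input: one "b" followed by three "a"
x <- "baaa"

greedy_patterns <- c("ba?", "ba*", "ba+", "ba{2}", "ba{2,}", "ba{1,2}")
lazy_patterns   <- paste0(greedy_patterns, "?")  # append ? to make each lazy

greedy <- str_extract(x, greedy_patterns)
lazy   <- str_extract(x, lazy_patterns)

greedy
# [1] "ba"   "baaa" "baaa" "baa"  "baaa" "baa"
lazy
# [1] "b"   "b"   "ba"  "baa" "baa" "ba"
```

The lazy form of {n} returns the same match as the greedy one, since an exact count leaves nothing to shorten.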