16 Data Cleaning

16.1 Introduction

In this application we are going to work with the Men’s Long Jump World Record Progression data from wikipedia (see screenshot below).


Men's Long Jump World Record Progression

Figure 16.1: Men’s Long Jump World Record Progression

16.2 Import Data

To import the data of the Record Progression table you can use a couple of functions from the package rvest.

The function read_html() reads the html file of the wikipedia page. This will produce an object of type "xml_document" which we can further manipulate with other functions in "rvest".

Because the Record progression data is in an html table node, you can use html_node() to locate such table in the XML document. And then extract it with html_table().

As you can tell, the extracted table tbl is a data frame with 18 rows and 5 columns.

16.3 Extracting Meters

The first task consists of looking at the values in column Mark, and find how to retrieve the distance values expressed in meters. For example, the first element in Mark is:

The goal is to obtain the number 7.61. One way to achieve this task is via the substr() function.

We can do that for the entire vector:

Notice that the meter values are not really numeric but character. In order to have meters as numbers, we should coerce them with as.numeric()

Extracting Meters with Regular Expressions

Instead of using the function substr() to obtain the distance values, let’s see how to achieve the same task using regular expressions. To do this we must determine a pattern to be matched. So, what is the pattern that all the distance values (in meters) have in common?

If you look at the Mark content, you will notice that the target pattern is formed by: a digit, followed by a dot, followed by two digits. Such pattern can be codified as: "[0-9]\\.[0-9][0-9]". So let’s test it and see if there’s match:

To extract the distance pattern we use str_extract()

And then we can apply it on the entire column to get:

16.5 Cleaning Dates

Now let’s consider tha values in column Date:

Notice that all the date values are formed by the day-number, the name of the month, the year, and then the characters [1]. Obviously we don’t need those last three characters [1].

First let’s see how to match the pattern [1]. Perhaps the first option that an inexperience user would try is:

According to str_detect(), there’s is a match, so let’s see what exactly "[1]" is matching:

Mmmm, not quite right. We are matching the character "1" but not "[1]". Why? Because brackets are metacharacters. So in order to match brackets as brackets we need to escape them:

Now we are talking. The next step involves using str_replace() to match the pattern "\\[1\\]" and replace it with an empty string "":

Then, we can get an entire vector of clean dates:

16.6 Month and Day

We can further manipulate the dates. For example, say we are interested in extracting the name of the month. In the first date, this corresponds to extracting "August":

How can we do that? Several approaches can be applied in this case. For example, let’s inspect the format of the month names:

They all begin with an upper case letter, followed by the rest of the characters in lower case. If we want to match month names formed by four letters (e.g. June, July), we could look for the pattern "[A-Z][a-z][a-z][a-z]"

The previous pattern "[A-Z][a-z][a-z][a-z]" not only matches "June" and "July" but also "Augu", "Sept", "Octo". In addition, we have some missing values.

Because the month names have variable lengths, we can use a repetition or quantifier operator. More specifically, we could look for the pattern "[A-Z][a-z]+", that is: an upper case letter, followed by a lower case letter, repeated one or more times. The plus + tells the regex engine to attempt to match the preceding token once or more:

Having extracted the name of the months, we can take advantage of a similar pattern to extract the days. How? Using a pattern formed by one digit range and the plus sign: "[0-9]+"

16.7 Extracting Year

What about extracting the year number?

One option that we have discussed already is to use substr() or str_sub()

or simply indicate a negative starting position (to counting from the end of the string):

Another option consists in using a pattern formed by four digits: "[0-9][0-9][0-9][0-9]":

An additional option consists in using an end of string anchor with the metacharacter "$" (dollar sign), and combine with a repetition operator "+" like: "[0-9]+$":

What is this pattern doing? The part of the pattern "[0-9]+" indicates that we want to match one or more digits. In order to tell the engine to match the pattern at the end of the string, we must use the anchor "$".

The same task can be achieved with a digit character class \\d and the repetition operator +:

16.8 Athlete Names

Now let’s try to extract the athletes’ first and last names. We could specify a regex pattern for the first name [A-Z][a-z][A-Z]?[a-z]+, followed by a space, followed by an uper case letter, and one or more lower case letters [A-Z][a-z]+:

What about the first athlete Peter O’Connor? The previous pattern does not include the apostrophe.

What about this other pattern?

Recall that the quantifier (or repetition) operators have an effect on the preceding token. So, the pattern "[A-Z]'[A-Z]?[a-z]+" means: an upper case letter, followed by an apostrophe, followed by an optional upper case, followed by one or more lower case letters. In other words, the quantifier "?" only has an effect on the second upper case letter.

In reality, we want both the apostrophe and the second upper case letters to be optional, so we need to add quantifiers "?" to both of them:

If you want to treat a set of characters as a single unit, you must wrap them inside parentheses:

We still have an issue with athlete Igor Ter-Ovanesyan. The patterns used so far are only matching the the characters in his last name before the hyphen. We can start by adding a escaped hyphen inside the character set "[a-z\\-]" at the end of the pattern:

Notice that this pattern does match the hyphen but fails to match the second part of the last name (the one after the hyphen). This is because our token is only matching lower case letters. So we also need to include upper case letters in the character set: "[a-zA-Z\\-]"

The regex patterns that involve a set such as "[a-zA-Z]" can be simplified with a repeated word character class "\\w+" (recall that "\\w+" is equivalent to "[0-9A-Za-z_]"). We can try to use two repeated word classes:

As you know, we also need to include an apostrphe and the hyphen. In this case, we can include them inside parentheses and separating them with the OR operator "|":

Make a donation

If you find this resource useful, please consider making a one-time donation in any amount. Your support really matters.