14 Data Log File

In this example, we’ll be using the text file logfile.txt, located in the data/ folder of the book’s GitHub repository.


This file is a server log that records the events taking place on a web server. Its content follows a standard layout known as the Common Log Format. According to Wikipedia:

“The Common Log Format is a standardized text file format used by web servers when generating server log files.”

Here’s an example of a log record. In the file it occupies a single line, but I’ve split it into two lines for readability:

pd9049dac.dip.t-dialin.net - - [01/May/2001:01:51:25 -0700] 
"GET /accesswatch/accesswatch-1.33/ HTTP/1.0" 200 1004
  • A "-" in a field indicates missing data.
  • pd9049dac.dip.t-dialin.net is the hostname (or IP address) of the client (remote host) that made the request to the server.
  • [01/May/2001:01:51:25 -0700] is the date, time, and time zone at which the request was received, by default in the strftime format %d/%b/%Y:%H:%M:%S %z.
  • "GET /accesswatch/accesswatch-1.33/ HTTP/1.0" is the request line from the client.
  • GET is the method, /accesswatch/accesswatch-1.33/ is the resource requested, and HTTP/1.0 is the HTTP protocol version.
  • 200 is the HTTP status code returned to the client.
    • 2xx is a successful response
    • 3xx a redirection
    • 4xx a client error, and
    • 5xx a server error
  • 1004 is the size of the object returned to the client, measured in bytes.
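The field layout described above can be checked by splitting a record with a single regular expression. The following is only a sketch: the pattern `clf_pattern` and the field names are illustrative choices, not part of the Common Log Format definition, and the pattern assumes a well-formed record.

```r
# Sample record from the text above, joined back into a single line
record <- 'pd9049dac.dip.t-dialin.net - - [01/May/2001:01:51:25 -0700] "GET /accesswatch/accesswatch-1.33/ HTTP/1.0" 200 1004'

# Illustrative pattern with one capture group per field:
# host, identity, user, timestamp, request line, status code, size
clf_pattern <- '^(\\S+) (\\S+) (\\S+) \\[([^]]+)\\] "([^"]*)" (\\d{3}) (\\S+)$'

# regexec() reports the full match followed by each captured group
parts <- regmatches(record, regexec(clf_pattern, record, perl = TRUE))[[1]]

# Drop the full match and label the seven fields (names are hypothetical)
fields <- setNames(parts[-1],
                   c("host", "ident", "user", "time", "request", "status", "size"))
fields
```

The two "-" entries end up in `ident` and `user`, which is where this record has missing data.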

If you want to download a copy of the text file to your working directory, run the following code:
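The address of the file is not reproduced here, so the snippet below is only a sketch built around a hypothetical helper, `download_log()`. It wraps base R's `download.file()`, and you must supply the actual raw-file URL yourself.

```r
# Hypothetical helper: saves a remote log file into the working directory.
# `url` must be the raw-file address of logfile.txt (not given in this text).
download_log <- function(url, destfile = "logfile.txt") {
  download.file(url = url, destfile = destfile)
  invisible(destfile)
}

# Usage (replace the placeholder with the real address before running):
# download_log("<raw URL of data/logfile.txt>")
```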

14.1 Reading the text file

The first step is to read the data into R. One option is the readLines() function, which reads a text file into a character vector with one element per line:
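As a self-contained sketch, the code below writes two records to a temporary file and reads them back. The temporary file and the second record are made up for illustration; with the real data you would call `readLines("logfile.txt")`, assuming the file sits in your working directory.

```r
# Hypothetical stand-in for logfile.txt: two records in a temporary file
sample_file <- tempfile(fileext = ".txt")
writeLines(c(
  'pd9049dac.dip.t-dialin.net - - [01/May/2001:01:51:25 -0700] "GET /accesswatch/accesswatch-1.33/ HTTP/1.0" 200 1004',
  'host1.example.com - - [01/May/2001:02:10:03 -0700] "GET /images/photo.jpg HTTP/1.0" 200 5120'
), sample_file)

# readLines() returns one character string per line of the file
logs <- readLines(sample_file)
class(logs)
length(logs)
```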

Let’s take a peek at the content of the vector logs:

Because the file contains 26033 lines (or elements), let’s get a subset by taking a random sample of size 50:
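A random subset can be drawn with `sample()`. In this sketch, `logs` is a synthetic vector of 200 made-up records standing in for the real file, and the name `sublog` is an illustrative choice; `set.seed()` is there only to make the sample reproducible.

```r
# Synthetic stand-in for the real log vector (200 made-up records)
logs <- sprintf(
  'host%d.example.com - - [01/May/2001:01:51:25 -0700] "GET /page%d.html HTTP/1.0" 200 1004',
  1:200, 1:200
)

# Fix the random seed so the same sample is drawn on every run
set.seed(2001)

# Draw 50 lines without replacement
sublog <- sample(logs, size = 50)
length(sublog)
```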

14.1.1 JPG File Requests

To begin our regex experiments, let’s find out how many requests involved a JPG file. One way to answer this question is to count the number of lines containing the pattern "jpg". We can use grep() to detect this pattern:
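A minimal sketch of the counting step, using a three-line vector in which only the first record comes from the text and the other two are made up:

```r
# Illustrative log lines (the last two are hypothetical)
logs <- c(
  'pd9049dac.dip.t-dialin.net - - [01/May/2001:01:51:25 -0700] "GET /accesswatch/accesswatch-1.33/ HTTP/1.0" 200 1004',
  'host1.example.com - - [01/May/2001:02:10:03 -0700] "GET /images/photo.jpg HTTP/1.0" 200 5120',
  'host2.example.com - - [01/May/2001:02:11:47 -0700] "GET /images/logo.png HTTP/1.0" 200 2048'
)

# grep() returns the positions of the elements that contain "jpg"
jpg_lines <- grep("jpg", logs)
jpg_lines

# The number of matching lines answers the question
length(jpg_lines)
```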

We can be more specific by searching for ".jpg", where the . stands for the literal dot character. Because an unescaped dot is a metacharacter that matches any character, we need to escape it as "\\.":
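The difference between the escaped and unescaped dot shows up on two made-up resource paths, one of which contains the letters "jpg" without a preceding dot:

```r
# Hypothetical paths: only the first one has a real .jpg extension
paths <- c("/images/photo.jpg", "/docs/notjpg.html")

# Unescaped dot: any character followed by "jpg", so "tjpg" also matches
unescaped <- grepl(".jpg", paths)
unescaped   # TRUE TRUE

# Escaped dot: a literal "." followed by "jpg"
escaped <- grepl("\\.jpg", paths)
escaped     # TRUE FALSE
```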

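A similar result can be obtained with str_detect() from the stringr package. Whereas grep() returns the positions of the matching elements, str_detect() returns a logical vector indicating which elements contain a match to the specified pattern: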
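The sketch below compares the two functions on a few made-up request lines. It assumes the stringr package is installed.

```r
library(stringr)

# Hypothetical request lines
logs <- c(
  '"GET /accesswatch/accesswatch-1.33/ HTTP/1.0" 200 1004',
  '"GET /images/photo.jpg HTTP/1.0" 200 5120',
  '"GET /images/logo.png HTTP/1.0" 200 2048'
)

# One TRUE/FALSE value per element
has_jpg <- str_detect(logs, "\\.jpg")
has_jpg        # FALSE TRUE FALSE

# Summing the logical vector counts the matches
sum(has_jpg)   # 1

# which() converts the logical vector into grep()-style positions
which(has_jpg)
```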

We can do the same for PNG extensions (or for GIF or ICO):
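For instance, with base R's `grepl()` and a handful of made-up request lines, the counts for the other extensions could be obtained along these lines:

```r
# Hypothetical request lines
logs <- c(
  '"GET /images/photo.jpg HTTP/1.0" 200 5120',
  '"GET /images/logo.png HTTP/1.0" 200 2048',
  '"GET /favicon.ico HTTP/1.0" 404 -',
  '"GET /icons/arrow.gif HTTP/1.0" 304 -'
)

# Count the lines matching each extension, escaping the dot every time
n_png <- sum(grepl("\\.png", logs))
n_gif <- sum(grepl("\\.gif", logs))
n_ico <- sum(grepl("\\.ico", logs))
c(png = n_png, gif = n_gif, ico = n_ico)
```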

14.1.2 Extracting file extensions

Another common task when working with regular expressions is pattern extraction. For this purpose, we can use str_extract():

str_extract() lets us confirm that we are matching the desired patterns. Notice that when there is no match, str_extract() returns a missing value NA.
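A short sketch of this behavior on made-up request lines, assuming the stringr package is installed:

```r
library(stringr)

# Hypothetical request lines: only the second one involves a JPG file
logs <- c(
  '"GET /accesswatch/accesswatch-1.33/ HTTP/1.0" 200 1004',
  '"GET /images/photo.jpg HTTP/1.0" 200 5120',
  '"GET /images/logo.png HTTP/1.0" 200 2048'
)

# The matched text is returned where the pattern is found, NA elsewhere
jpg_ext <- str_extract(logs, "\\.jpg")
jpg_ext   # NA ".jpg" NA
```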

14.1.4 How to match image files with one regex pattern?

We can use character sets to define a more generic pattern. For instance, to match "jpg" or "png", we could join three character sets: "[jp][pn][g]". The first set [jp] looks for j or p, the second set [pn] looks for p or n, and the third set simply looks for g.

Including the dot, we can use: "\\.[jp][pn][g]"
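Here is a quick check of this pattern on a few made-up resource paths, using base R's `grepl()`:

```r
# Hypothetical resource paths
paths <- c("/images/photo.jpg", "/images/logo.png",
           "/icons/arrow.gif", "/index.html")

# A literal dot followed by one character from each of the three sets
jp_pn_g <- grepl("\\.[jp][pn][g]", paths)
jp_pn_g   # TRUE TRUE FALSE FALSE
```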

We could generalize the pattern to include the GIF and ICO extensions:

To confirm that we are actually matching jpg, png, gif and ico, let’s use str_extract():
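The generalized pattern itself is not shown in this text. The sketch below assumes one plausible version, "\\.[jpgi][pnic][gfo]", built by adding the letters of gif and ico to each character set, and applies it to made-up paths. It requires the stringr package.

```r
library(stringr)

# Hypothetical resource paths; the last two are not image files
paths <- c("/images/photo.jpg", "/images/logo.png", "/icons/arrow.gif",
           "/favicon.ico", "/docs/setup.inf", "/blog/my.ingredients.html")

# Assumed generalization: one character set per position of the extension
extracted <- str_extract(paths, "\\.[jpgi][pnic][gfo]")
extracted   # ".jpg" ".png" ".gif" ".ico" ".inf" ".ing"
```

Each set is checked independently, so any combination of one letter per position is accepted, not only the four intended extensions.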

The previous pattern does not really work as expected: it also matches "ing" and "inf", which do not correspond to image file extensions.

An alternative way to detect JPG and PNG is by grouping patterns inside parentheses, and separating them with the metacharacter "|" which means OR:
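A minimal sketch of the alternation approach with base R's `grepl()`. The exact pattern is an assumption; it places each extension, dot included, in its own group.

```r
# Hypothetical resource paths
paths <- c("/images/photo.jpg", "/images/logo.png",
           "/icons/arrow.gif", "/docs/setup.inf")

# Assumed pattern: either ".jpg" or ".png"
jpg_or_png <- grepl("(\\.jpg)|(\\.png)", paths)
jpg_or_png   # TRUE TRUE FALSE FALSE
```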

Here’s how to detect all the extensions in a single pattern:

To make sure our regex operation is successful, let’s see the output of str_extract():
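As a sketch, the four-way alternation can be checked like this. The pattern is an assumed extension of the two-way version, the paths are made up, and the stringr package is required.

```r
library(stringr)

# Hypothetical resource paths; the last two are not image files
paths <- c("/images/photo.jpg", "/images/logo.png", "/icons/arrow.gif",
           "/favicon.ico", "/docs/setup.inf", "/blog/my.ingredients.html")

# Assumed pattern: one group per extension, each with its own escaped dot
all_ext <- "(\\.jpg)|(\\.png)|(\\.gif)|(\\.ico)"
extracted <- str_extract(paths, all_ext)
extracted   # ".jpg" ".png" ".gif" ".ico" NA NA
```

Unlike the character-set version, only the four listed extensions are matched.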

There’s some repetition with the dot character; we can modify our previous pattern by placing the dot "\\." at the beginning:

Notice that the dot only appears next to ".jpg" but not with the other types of extensions. What we need to do is group the file extensions by surrounding them with parentheses:
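The effect of the parentheses can be seen by comparing the two variants side by side. Both patterns are assumptions reconstructed from the description, the paths are made up, and the stringr package is required.

```r
library(stringr)

# Hypothetical resource paths
paths <- c("/images/photo.jpg", "/images/logo.png",
           "/pics/arrow.gif", "/favicon.ico")

# Without grouping, the dot belongs only to the first alternative
ungrouped <- str_extract(paths, "\\.jpg|png|gif|ico")
ungrouped   # ".jpg" "png" "gif" "ico"

# With grouping, the dot applies to every alternative
grouped <- str_extract(paths, "\\.(jpg|png|gif|ico)")
grouped     # ".jpg" ".png" ".gif" ".ico"
```

Because the ungrouped alternatives carry no dot, they can match anywhere in a line, for example the letters "ico" inside "favicon".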

Now let’s apply the pattern to the entire log file to count the number of files of each type:
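One way to obtain the counts is to extract the extension from every line and tabulate the result. In this sketch `logs` is a small made-up vector standing in for the full file, and the grouped pattern is the assumed one from the previous step; stringr is required.

```r
library(stringr)

# Hypothetical request lines standing in for the full log vector
logs <- c(
  '"GET /accesswatch/accesswatch-1.33/ HTTP/1.0" 200 1004',
  '"GET /images/photo.jpg HTTP/1.0" 200 5120',
  '"GET /images/banner.jpg HTTP/1.0" 200 7300',
  '"GET /images/logo.png HTTP/1.0" 200 2048',
  '"GET /favicon.ico HTTP/1.0" 404 -',
  '"GET /pics/arrow.gif HTTP/1.0" 304 -'
)

# Extract the image extension of each line (NA when there is none)
exts <- str_extract(logs, "\\.(jpg|png|gif|ico)")

# table() drops NA by default, leaving one count per extension
ext_counts <- table(exts)
ext_counts
```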