20 Importing Data Tables

The most common type of data-import operation in R has to do with importing data tables. Because of this, I have decided to exclusively dedicate this chapter to review this type of operation. In a nutshell, I describe various standard ways to import tabular data. By “standard” I mean using base functions such as read.table() and friends. This is in contrast to alternative functions from packages such as "readr", and other packages.

20.1 Motivation

Suppose we have data about the accounts of an individual’s portfolio (see table below). There are five different accounts, the type of bank (brick-and-mortar or online) associated to the account, the annual rate of return, and the balance amount (in dollars).

account bank rate balance
savings brick-n-mortar 0.020 1000
money market online 0.025 2000
certificate brick-n-mortar 0.030 3000
brokerage online 0.070 5000
retirement online 0.050 9000

A data table like this one could be stored in various types of files. For example, it could be stored in a Comma Separated Value (CSV) file, which is a very common kind of text file format used to store tables. Another possible way in which this table could be stored is in a spreadsheet (e.g. Excel, Google Sheets, Mac Numbers).

20.1.1 Data in Text Files

Let’s assume that the table is stored in a text file. Because the term text file is used in slightly different ways, let me be a bit more specific about what I mean by text file:

  • By text file I mean a plain text file that can be read and manipulated with a text editor (not to be confused with a word-processor)

  • Plain text as an umbrella term for any file that is in a human-readable form, with comon file extensions such as .txt, .csv, .xml, .html.

  • Text file implying that its content is stored as a sequence of characters

  • Each character stored as a single byte of data

  • Data is arranged in rows, with several values stored on each row

20.2 Character Delimited Text Files

A common way to store data tables is via text files. But how is this actually done? How can a data table be stored in a text file?

How are data tables stored in text files?

Figure 20.1: How are data tables stored in text files?

You may not have thought about this before. This was certainly the case for me back when I was an undergraduate student. I could perfectly picture a data table in my head, with a bunch of rows and columns forming its grid-like structure. But I never considered a table’s storage format in a text file.

Think about it: a table is made of cells that come from the intersection of rows and columns. So the two fundamental questions when storing data in a text file are: 1) how to represent rows?, and 2) how to represent columns? Or in other words: how to convey the notion of cells in a text file?

The answer to this question comes from realizing what a data cell does. A cell is basically a placeholder for a data value. Cells let us separate or delimit one data value from another. So all we have to do is to come up with a delimiter or separator to differentiate from one data point to the next one. The way this is done is by choosing a certain character as the delimiter. As for the notion of rows, we can simply use newlines.

Here is an example of how the accounts data table can be store in a text file, using commas "," as the character to separate values:

account,bank,rate,balance
savings,bricknmor,0.020,1000
monmarket,online,0.025,2000
certificate,bricknmor,0.030,3000
brokerage,online,0.070,5000
retirement,online,0.050,9000

20.2.1 Common Delimiters

As you can tell, the idea of storing data tables via text files is very simple and clever. More formally, this storage option is commonly referred to as field-delimiter formats. The term “field” is a synonym of variable or column. The term “delimiter” indicates that a certain character is used to delimit the values of every column.

Common examples of such delimiters are blank space characters " ", tab characters "\t", or comma characters ",".

Delimiter Description
" " white space
"," comma
"\t" tab
";" semicolon

As a matter of fact, you can come up with your own delimiter. For instance, I can use vertical bars "|" as delimiters:

account|bank|rate|balance
savings|bricknmor|0.020|1000
monmarket|online|0.025|2000
certificate|bricknmor|0.030|3000
brokerage|online|0.070|5000
retirement|online|0.050|9000

Or even use double colons as delimiters:

account::bank::rate::balance
savings::bricknmor::0.020::1000
monmarket::online::0.025::2000
certificate::bricknmor::0.030::3000
brokerage::online::0.070::5000
retirement::online::0.050::9000

To be honest, if you have to choose a certain character for delimiting purposes, I would recommend not to be creative, and instead to stick with one of the conventional delimiters. Below are some examples of what the content of a text file—storing the accounts table—could look like when using different delimiters.

Space Delimited

Example of a space delimited file (common file extension .txt)

account bank rate balance
savings bricknmor 0.020 1000
monmarket online 0.025 2000
certificate bricknmor 0.030 3000
brokerage online 0.070 5000
retirement online 0.050 9000

Tab Delimited

Example of a tab delimited file (common file extensions .txt or .tsv)

account   bank    rate    balance
savings   bricknmor   0.020   1000
monmarket   online    0.025   2000
certificate   bricknmor   0.030   3000
brokerage   online    0.070   5000
retirement    online    0.050   9000

Semicolon Delimited

Example of a semicolon delimited file (common file extension .csv)

account;bank;rate;balance
savings;bricknmor;0.020;1000
monmarket;online;0.025;2000
certificate;bricknmor;0.030;3000
brokerage;online;0.070;5000
retirement;online;0.050;9000

20.2.2 Delimiters in Data Values

What happens if the chosen delimiter is actually part of a data value? As an example of this type of situation, let’s modify the accounts data table. In particular, suppose that the values in column balance contain a comma to better distinguish numbers in thousands of dollars:

account bank rate balance
savings brick-n-mortar 0.020 1,000
money market online 0.025 2,000
certificate brick-n-mortar 0.030 3,000
brokerage online 0.070 5,000
retirement online 0.050 9,000

If commas are also used as delimiters, we will have a conflict because the commas in balance values are not intended to be delimiters:

account,bank,rate,balance
savings,bricknmor,0.020,1,000
monmarket,online,0.025,2,000
certificate,bricknmor,0.030,3,000
brokerage,online,0.070,5,000
retirement,online,0.050,9,000

When something like this happens, to avoid a conflict between the delimiting character and its usage as part of a data value, the convention is to surround the values within quotes (typically double quotes), like this:

account,bank,rate,balance
savings,bricknmor,0.020,"1,000"
monmarket,online,0.025,"2,000"
certificate,bricknmor,0.030,"3,000"
brokerage,online,0.070,"5,000"
retirement,online,0.050,"9,000"

20.2.3 Fixed-Width Format Files

In addition to character-delimited files, there is another kind of plain text format that can also be used to store data tables: the so-called fixed-width formats or fwf for short.

In a fixed-width format file, the way in which values are separated is not with a character. Instead, values in a column are given a fixed width expressed in a certain number of characters.

Here’s an example of a fixed width delimited file (common file extension .txt). For sake of illustration, I’m adding a sequence of digits at the top. In real life, the first line of digits is not supposed to be part of the file.

12345678901234567890123456789012345
account     bank      rate  balance
savings     bricknmor 0.020 1000
monmarket   online    0.025 2000
certificate bricknmor 0.030 3000
brokerage   online    0.070 5000
retirement  online    0.050 9000

All the data of the first column account is given a width of 12 characters (i.e. digits 123456789012). In turn, all the data in column bank fits within a width of 10 characters (i.e. digits 3456789012), and so on.

20.2.4 Summary

To summarize:

  • A common way to store data in tabular form is via text files

  • To store the data we need a way to separate data values

  • Each line represents a “row”

  • The idea of “columns” is conveyed with delimiters

  • Fields within each line are separated by the delimiter

  • Quotation marks are used when the delimiter character occurs within one of the fields

20.3 Importing Data Tables

Now that we have talked about how data tables are stored in text files, and the various formats in which this is done, let’s review the set of functions to import tables in R, as well as important considerations to keep in mind when importing data in general.

Considerations before importing a data table in R:

  • What is the character used as field delimiter?

  • Does the file contain names of columns?

  • Does the file contain a column for row names?

  • Are there any missing values? If yes, how are missing values encoded?

  • Do you want to read in all rows, or just some of them?

  • Do you want to read in all columns, or just some of them?

  • Do you need to convert delimiter characters? (e.g. from space to comma)

  • Can you determine the data-type of each column?

  • Are there any uninformative numbers?

  • Can you convert those uninformative numbers to informative labels?

20.3.1 Function read.table() and friends

The most common way to read and import tables in R is by using read.table() and friends such as read.csv(), read.delim(), etc. If you take a look at the arguments of read.table() you’ll see that this function has more than 20 of them. Actually, if you inspect the help documentation, read.table() comes with the following usage:

read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", row.names, col.names,
           as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE,
           fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text,
           skipNul = FALSE)

As I mention in the foregoing chapter, the mandatory input for read.table() and friends is file which is the name of the file which the data are to be read from. On the more technical side, recall that there is a connection() function used by R related to the file argument.

The following table lists some read.table() arguments and their meaning:

Argument Description
file Name of file
header Whether column names are in 1st line
sep Character used as field separator
quote Quoting characters
dec Character for decimal point
row.names Optional vector of row names
col.names Optional vector of column names
na.strings Characters treated as missing values
colClasses Optional vector of data types for columns
nrows Maximum number of rows to read in
skip Number of lines to skip before reading data
check.names Check valid column names
stringsAsFactors Should characters be converted to factors

In versions of R < 4.0.0, read.table() and friends convert character strings into factors by default.

In addition to read.table(), there’s a handful of sibling functions to read other types of text files (see table below). By the way, all these functions are actually wrappers of read.table():

Function Description
read.csv() import comma separated values
read.csv2() import semicolon separated values (Europe)
read.delim() import tab separated values
read.delim2() import tab separated values (Europe)

20.3.2 Reading space-separated files

Let’s review some examples that illustrate the use of the read-table functions. For simplicity’s sake, we’ll assume that all data files are located in your working directory.

Suppose you have the following data in a file: accounts.txt

account bank rate balance
savings bricknmor 0.020 1000
monmarket online 0.025 2000
certificate bricknmor 0.030 3000
brokerage online 0.070 5000
retirement online 0.050 9000

Example 1. Importing table in blank separated file

# using read.table()
dat <- read.table(
  file = "accounts.txt", 
  header = TRUE)

The imported data frame dat will be:

>        account       bank   rate  balance
> 1      savings  bricknmor  0.020     1000
> 2    monmarket     online  0.025     2000
> 3  certificate  bricknmor  0.030     3000
> 4    brokerage     online  0.070     5000
> 5   retirement     online  0.050     9000

Example 2. Limit the number of rows to read in (first 2 rows):

dat <- read.table(
  file = "accounts.txt", 
  header = TRUE,
  nrows = 2)

The imported data frame dat will be:

>      account       bank   rate  balance
> 1    savings  bricknmor  0.020     1000
> 2  monmarket     online  0.025     2000

Example 3. Skip the first row (no header) and limit the number of rows to read in (4 rows)

dat <- read.table(
  file = "accounts.txt", 
  header = FALSE,
  skip = 1,
  nrows = 4)

The imported data frame dat will be:

>             V1         V2     V3    V4
> 1      savings  bricknmor  0.020  1000
> 2    monmarket     online  0.025  2000
> 3  certificate  bricknmor  0.030  3000
> 4    brokerage     online  0.070  5000

Example 4. Skip importing the second and third columns

dat <- read.table(
  file = "accounts.txt", 
  header = TRUE, 
  colClasses = c(
    "character",
    "NULL",
    "NULL",
    "numeric"))

The imported data frame dat will be:

>        account  balance
> 1      savings     1000
> 2    monmarket     2000
> 3  certificate     3000
> 4    brokerage     5000
> 5   retirement     9000

20.3.3 Reading comma-separated files with read.csv()

Let’s now consider a data table stored in a CSV file: accounts.csv

account,bank,rate,balance
savings,bricknmor,0.020,1000
monmarket,online,0.025,2000
certificate,bricknmor,0.030,3000
brokerage,online,0.070,5000
retirement,online,0.050,9000

Example 1. Data in comma separated value (CSV) file

We can use read.table()

# using read.table()
dat <- read.table(
  file = "accounts.csv",
  header = TRUE,
  sep = ",")

Or more conveniently, we can use read.csv()

# using read.csv()
dat <- read.csv(file = "accounts.csv")

The imported data frame dat will be:

>        account       bank   rate  balance
> 1      savings  bricknmor  0.020     1000
> 2    monmarket     online  0.025     2000
> 3  certificate  bricknmor  0.030     3000
> 4    brokerage     online  0.070     5000
> 5   retirement     online  0.050     9000

20.4 Importing with scan()

To finish this chapter, I want to briefly talk to you about the scan() function. This is another function that you can use to read data into a vector or list from the console or from a file. As a matter of fact, scan() is under the hood of read.table() and friends. Because of this, I often refer to scan() as a low-level function.

Let’s see an example for how to use scan() with a space delimited file. Suppose that we have the data in the file accounts.txt, located in our working directory.

account bank rate balance
savings bricknmor 0.020 1000
monmarket online 0.025 2000
certificate bricknmor 0.030 3000
brokerage online 0.070 5000
retirement online 0.050 9000

scan() is what I call a “low-level function”. What I mean by that is that we typically need to write more code when using this kind of functions. To be more precise, we usually have to specify more arguments when calling scan() than when calling other high-level functions like read.table() and friends.

Like all other data-import functions, the first argument of scan() is the the name of the file (i.e. file path) to be imported. The second argument is called what, and this has to do with the type of data to be read.

If you look at the content of accounts.txt, the first line contains the column names: account bank rate balance. Because the data in this first line is to be used as names, their data-type is "character".

The rest of the lines have to do with the actual data of the accounts. Each line contains four pieces of data:

  • the kind of account, to be encoded as "character"

  • the kind of bank, to be encoded also as "character"

  • the rate of return, to be encoded as "double" (or "numeric")

  • the balance amount, to be encoded as "integer" or if you prefer as "double" (or generically as "numeric")

In order to import the accounts.txt data into R, we must use a two-step process:

  1. first we import the column names into a character vector

  2. then we import the rest of the lines into a list

Here is how to do it:

# step 1) scan column names
# ins a character vector
header = scan(
  file = "accounts.txt", 
  what = list("", "", "", ""), 
  nmax = 1)

header
> [1] "account" "bank"    "rate"    "balance"
# step 2) scan the rest of the lines
# into a list
dat_list = scan(
  file = "accounts.txt", 
  what = list(character(), character(), double(), integer()),
  skip = 1)

dat_list
> [[1]]
> [1] "savings"     "monmarket"   "certificate" "brokerage"  
> [5] "retirement" 
> 
> [[2]]
> [1] "bricknmor" "online"    "bricknmor" "online"   
> [5] "online"   
> 
> [[3]]
> [1] 0.020 0.025 0.030 0.070 0.050
> 
> [[4]]
> [1] 1000 2000 3000 5000 9000

Once we have the vector of column names header and the list with the data for each of the columns, we can then assemble the data into a data table using the as.data.frame() function as follows:

dat = as.data.frame(dat_list, col.names = header)
dat
>       account      bank  rate balance
> 1     savings bricknmor 0.020    1000
> 2   monmarket    online 0.025    2000
> 3 certificate bricknmor 0.030    3000
> 4   brokerage    online 0.070    5000
> 5  retirement    online 0.050    9000

20.5 In Summary

When importing a data table, you should keep in mind the following considerations.

What is the field separator?

  • space " "

  • tab "\t"

  • comma ","

  • semicolon ";"

  • other?

Does the data file contains:

  • row names?

  • column names?

  • missing values?

  • special characters?

So far …

  • There are multiple ways to import data tables

  • The workhorse function is read.table()

  • But you can use the other wrappers, e.g. read.csv()

  • The output is a "data.frame" object