17 Importing Data Tables
The most common type of data-import operation in R has to do with importing data tables. Because of this, I have decided to exclusively dedicate this chapter to review this type of operation. In a nutshell, I describe various standard ways to import tabular data. By “standard” I mean using base functions such as read.table()
and friends. This is in contrast to alternative functions from packages such as "readr"
, and other packages.
17.1 Motivation
Suppose we have data about the accounts of an individual’s portfolio (see table below). There are five different accounts, the type of bank (brick-and-mortar or online) associated to the account, the annual rate of return, and the balance amount (in dollars).
account | bank | rate | balance |
---|---|---|---|
savings | brick-n-mortar | 0.020 | 1000 |
money market | online | 0.025 | 2000 |
certificate | brick-n-mortar | 0.030 | 3000 |
brokerage | online | 0.070 | 5000 |
retirement | online | 0.050 | 9000 |
A data table like this one could be stored in various types of files. For example, it could be stored in a Comma Separated Value (CSV) file, which is a very common kind of text file format used to store tables. Another possible way in which this table could be stored is in a spreadsheet (e.g. Excel, Google Sheets, Mac Numbers).
17.1.1 Data in Text Files
Let’s assume that the table is stored in a text file. Because the term text file is used in slightly different ways, let me be a bit more specific about what I mean by text file:
By text file I mean a plain text file that can be read and manipulated with a text editor (not to be confused with a word-processor)
Plain text as an umbrella term for any file that is in a human-readable form, with comon file extensions such as
.txt
,.csv
,.xml
,.html
.Text file implying that its content is stored as a sequence of characters
Each character stored as a single byte of data
Data is arranged in rows, with several values stored on each row
17.2 Character Delimited Text Files
A common way to store data tables is via text files. But how is this actually done? How can a data table be stored in a text file?
You may not have thought about this before. This was certainly the case for me back when I was an undergraduate student. I could perfectly picture a data table in my head, with a bunch of rows and columns forming its grid-like structure. But I never considered a table’s storage format in a text file.
Think about it: a table is made of cells that come from the intersection of rows and columns. So the two fundamental questions when storing data in a text file are: 1) how to represent rows?, and 2) how to represent columns? Or in other words: how to convey the notion of cells in a text file?
The answer to this question comes from realizing what a data cell does. A cell is basically a placeholder for a data value. Cells let us separate or delimit one data value from another. So all we have to do is to come up with a delimiter or separator to differentiate from one data point to the next one. The way this is done is by choosing a certain character as the delimiter. As for the notion of rows, we can simply use newlines.
Here is an example of how the accounts data table can be store in a text file, using commas ","
as the character to separate values:
account,bank,rate,balance
savings,bricknmor,0.020,1000
monmarket,online,0.025,2000
certificate,bricknmor,0.030,3000
brokerage,online,0.070,5000
retirement,online,0.050,9000
17.2.1 Common Delimiters
As you can tell, the idea of storing data tables via text files is very simple and clever. More formally, this storage option is commonly referred to as field-delimiter formats. The term “field” is a synonym of variable or column. The term “delimiter” indicates that a certain character is used to delimit the values of every column.
Common examples of such delimiters are blank space characters " "
, tab characters "\t"
, or comma characters ","
.
Delimiter | Description |
---|---|
" " |
white space |
"," |
comma |
"\t" |
tab |
";" |
semicolon |
As a matter of fact, you can come up with your own delimiter. For instance, I can use vertical bars "|"
as delimiters:
account|bank|rate|balance
savings|bricknmor|0.020|1000
monmarket|online|0.025|2000
certificate|bricknmor|0.030|3000
brokerage|online|0.070|5000
retirement|online|0.050|9000
Or even use double colons as delimiters:
account::bank::rate::balance
savings::bricknmor::0.020::1000
monmarket::online::0.025::2000
certificate::bricknmor::0.030::3000
brokerage::online::0.070::5000
retirement::online::0.050::9000
To be honest, if you have to choose a certain character for delimiting purposes, I would recommend not to be creative, and instead to stick with one of the conventional delimiters. Below are some examples of what the content of a text file—storing the accounts table—could look like when using different delimiters.
Space Delimited
Example of a space delimited file (common file extension .txt
)
account bank rate balance
savings bricknmor 0.020 1000
monmarket online 0.025 2000
certificate bricknmor 0.030 3000
brokerage online 0.070 5000
retirement online 0.050 9000
Tab Delimited
Example of a tab delimited file (common file extensions .txt
or .tsv
)
account bank rate balance
savings bricknmor 0.020 1000
monmarket online 0.025 2000
certificate bricknmor 0.030 3000
brokerage online 0.070 5000
retirement online 0.050 9000
Semicolon Delimited
Example of a semicolon delimited file (common file extension .csv
)
account;bank;rate;balance
savings;bricknmor;0.020;1000
monmarket;online;0.025;2000
certificate;bricknmor;0.030;3000
brokerage;online;0.070;5000
retirement;online;0.050;9000
17.2.2 Delimiters in Data Values
What happens if the chosen delimiter is actually part of a data value? As an example of this type of situation, let’s modify the accounts data table. In particular, suppose that the values in column balance
contain a comma to better distinguish numbers in thousands of dollars:
account | bank | rate | balance |
---|---|---|---|
savings | brick-n-mortar | 0.020 | 1,000 |
money market | online | 0.025 | 2,000 |
certificate | brick-n-mortar | 0.030 | 3,000 |
brokerage | online | 0.070 | 5,000 |
retirement | online | 0.050 | 9,000 |
If commas are also used as delimiters, we will have a conflict because the commas in balance values are not intended to be delimiters:
account,bank,rate,balance
savings,bricknmor,0.020,1,000
monmarket,online,0.025,2,000
certificate,bricknmor,0.030,3,000
brokerage,online,0.070,5,000
retirement,online,0.050,9,000
When something like this happens, to avoid a conflict between the delimiting character and its usage as part of a data value, the convention is to surround the values within quotes (typically double quotes), like this:
account,bank,rate,balance
savings,bricknmor,0.020,"1,000"
monmarket,online,0.025,"2,000"
certificate,bricknmor,0.030,"3,000"
brokerage,online,0.070,"5,000"
retirement,online,0.050,"9,000"
17.2.3 Fixed-Width Format Files
In addition to character-delimited files, there is another kind of plain text format that can also be used to store data tables: the so-called fixed-width formats or fwf for short.
In a fixed-width format file, the way in which values are separated is not with a character. Instead, values in a column are given a fixed width expressed in a certain number of characters.
Here’s an example of a fixed width delimited file (common file extension .txt
). For sake of illustration, I’m adding a sequence of digits at the top. In real life, the first line of digits is not supposed to be part of the file.
12345678901234567890123456789012345
account bank rate balance
savings bricknmor 0.020 1000
monmarket online 0.025 2000
certificate bricknmor 0.030 3000
brokerage online 0.070 5000
retirement online 0.050 9000
All the data of the first column account
is given a width of 12 characters (i.e. digits 123456789012
). In turn, all the data in column bank
fits within a width of 10 characters (i.e. digits 3456789012
), and so on.
17.2.4 Summary
To summarize:
A common way to store data in tabular form is via text files
To store the data we need a way to separate data values
Each line represents a “row”
The idea of “columns” is conveyed with delimiters
Fields within each line are separated by the delimiter
Quotation marks are used when the delimiter character occurs within one of the fields
17.3 Importing Data Tables
Now that we have talked about how data tables are stored in text files, and the various formats in which this is done, let’s review the set of functions to import tables in R, as well as important considerations to keep in mind when importing data in general.
Considerations before importing a data table in R:
What is the character used as field delimiter?
Does the file contain names of columns?
Does the file contain a column for row names?
Are there any missing values? If yes, how are missing values encoded?
Do you want to read in all rows, or just some of them?
Do you want to read in all columns, or just some of them?
Do you need to convert delimiter characters? (e.g. from space to comma)
Can you determine the data-type of each column?
Are there any uninformative numbers?
Can you convert those uninformative numbers to informative labels?
17.3.1 Function read.table()
and friends
The most common way to read and import tables in R is by using read.table()
and friends such as read.csv()
, read.delim()
, etc. If you take a look at the arguments of read.table()
you’ll see that this function has more than 20 of them. Actually, if you inspect the help documentation, read.table()
comes with the following usage:
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", row.names, col.names,
as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE,
fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text,
skipNul = FALSE)
As I mention in the foregoing chapter, the mandatory input for read.table()
and friends is file
which is the name of the file which the data are to be read from. On the more technical side, recall that there is a connection()
function used by R related to the file
argument.
The following table lists some read.table()
arguments and their meaning:
Argument | Description |
---|---|
file |
Name of file |
header |
Whether column names are in 1st line |
sep |
Character used as field separator |
quote |
Quoting characters |
dec |
Character for decimal point |
row.names |
Optional vector of row names |
col.names |
Optional vector of column names |
na.strings |
Characters treated as missing values |
colClasses |
Optional vector of data types for columns |
nrows |
Maximum number of rows to read in |
skip |
Number of lines to skip before reading data |
check.names |
Check valid column names |
stringsAsFactors |
Should characters be converted to factors |
In versions of R < 4.0.0
, read.table()
and friends convert character strings into factors by default.
In addition to read.table()
, there’s a handful of sibling functions to read other types of text files (see table below). By the way, all these functions are actually wrappers of read.table()
:
Function | Description |
---|---|
read.csv() |
import comma separated values |
read.csv2() |
import semicolon separated values (Europe) |
read.delim() |
import tab separated values |
read.delim2() |
import tab separated values (Europe) |
17.3.2 Reading space-separated files
Let’s review some examples that illustrate the use of the read-table functions. For simplicity’s sake, we’ll assume that all data files are located in your working directory.
Suppose you have the following data in a file: accounts.txt
account bank rate balance
savings bricknmor 0.020 1000
monmarket online 0.025 2000
certificate bricknmor 0.030 3000
brokerage online 0.070 5000
retirement online 0.050 9000
Example 1. Importing table in blank separated file
# using read.table()
<- read.table(
dat file = "accounts.txt",
header = TRUE)
The imported data frame dat
will be:
account bank rate balance
1 savings bricknmor 0.020 1000
2 monmarket online 0.025 2000
3 certificate bricknmor 0.030 3000
4 brokerage online 0.070 5000
5 retirement online 0.050 9000
Example 2. Limit the number of rows to read in (first 2 rows):
<- read.table(
dat file = "accounts.txt",
header = TRUE,
nrows = 2)
The imported data frame dat
will be:
account bank rate balance
1 savings bricknmor 0.020 1000
2 monmarket online 0.025 2000
Example 3. Skip the first row (no header) and limit the number of rows to read in (4 rows)
<- read.table(
dat file = "accounts.txt",
header = FALSE,
skip = 1,
nrows = 4)
The imported data frame dat
will be:
V1 V2 V3 V4
1 savings bricknmor 0.020 1000
2 monmarket online 0.025 2000
3 certificate bricknmor 0.030 3000
4 brokerage online 0.070 5000
Example 4. Skip importing the second and third columns
<- read.table(
dat file = "accounts.txt",
header = TRUE,
colClasses = c(
"character",
"NULL",
"NULL",
"numeric"))
The imported data frame dat
will be:
account balance
1 savings 1000
2 monmarket 2000
3 certificate 3000
4 brokerage 5000
5 retirement 9000
17.3.3 Reading comma-separated files with read.csv()
Let’s now consider a data table stored in a CSV file: accounts.csv
account,bank,rate,balance
savings,bricknmor,0.020,1000
monmarket,online,0.025,2000
certificate,bricknmor,0.030,3000
brokerage,online,0.070,5000
retirement,online,0.050,9000
Example 1. Data in comma separated value (CSV) file
We can use read.table()
# using read.table()
<- read.table(
dat file = "accounts.csv",
header = TRUE,
sep = ",")
Or more conveniently, we can use read.csv()
# using read.csv()
<- read.csv(file = "accounts.csv") dat
The imported data frame dat
will be:
account bank rate balance
1 savings bricknmor 0.020 1000
2 monmarket online 0.025 2000
3 certificate bricknmor 0.030 3000
4 brokerage online 0.070 5000
5 retirement online 0.050 9000
17.4 Importing with scan()
To finish this chapter, I want to briefly talk to you about the scan()
function. This is another function that you can use to read data into a vector or list from the console or from a file. As a matter of fact, scan()
is under the hood of read.table()
and friends. Because of this, I often refer to scan()
as a low-level function.
Let’s see an example for how to use scan()
with a space delimited file. Suppose that we have the data in the file accounts.txt
, located in our working directory.
account bank rate balance
savings bricknmor 0.020 1000
monmarket online 0.025 2000
certificate bricknmor 0.030 3000
brokerage online 0.070 5000
retirement online 0.050 9000
scan()
is what I call a “low-level function”. What I mean by that is that we typically need to write more code when using this kind of functions. To be more precise, we usually have to specify more arguments when calling scan()
than when calling other high-level functions like read.table()
and friends.
Like all other data-import functions, the first argument of scan()
is the the name of the file (i.e. file path) to be imported. The second argument is called what
, and this has to do with the type of data to be read.
If you look at the content of accounts.txt
, the first line contains the column names: account bank rate balance
. Because the data in this first line is to be used as names, their data-type is "character"
.
The rest of the lines have to do with the actual data of the accounts. Each line contains four pieces of data:
the kind of
account
, to be encoded as"character"
the kind of
bank
, to be encoded also as"character"
the
rate
of return, to be encoded as"double"
(or"numeric"
)the
balance
amount, to be encoded as"integer"
or if you prefer as"double"
(or generically as"numeric"
)
In order to import the accounts.txt
data into R, we must use a two-step process:
first we import the column names into a character vector
then we import the rest of the lines into a list
Here is how to do it:
# step 1) scan column names
# ins a character vector
= scan(
header file = "accounts.txt",
what = list("", "", "", ""),
nmax = 1)
header
[1] "account" "bank" "rate" "balance"
# step 2) scan the rest of the lines
# into a list
= scan(
dat_list file = "accounts.txt",
what = list(character(), character(), double(), integer()),
skip = 1)
dat_list
[[1]]
[1] "savings" "monmarket" "certificate" "brokerage" "retirement"
[[2]]
[1] "bricknmor" "online" "bricknmor" "online" "online"
[[3]]
[1] 0.020 0.025 0.030 0.070 0.050
[[4]]
[1] 1000 2000 3000 5000 9000
Once we have the vector of column names header
and the list with the data for each of the columns, we can then assemble the data into a data table using the as.data.frame()
function as follows:
= as.data.frame(dat_list, col.names = header)
dat dat
account bank rate balance
1 savings bricknmor 0.020 1000
2 monmarket online 0.025 2000
3 certificate bricknmor 0.030 3000
4 brokerage online 0.070 5000
5 retirement online 0.050 9000
17.5 In Summary
When importing a data table, you should keep in mind the following considerations.
What is the field separator?
space
" "
tab
"\t"
comma
","
semicolon
";"
other?
Does the data file contains:
row names?
column names?
missing values?
special characters?
So far …
There are multiple ways to import data tables
The workhorse function is
read.table()
But you can use the other wrappers, e.g.
read.csv()
The output is a
"data.frame"
object