Visually Enforced

a blog by Gaston Sanchez


Failing to read tables in R

Posted on April 20, 2014

Badly disappointed with R by finding out that read.table() cannot read a file with a very simple structure… that Python is able to read.

A very simple file structure

Imagine you have some data in a text file having a very simple structure like the following one:

1::Leia::female::princess
2::Luke::male::jedi
3::R2D2::unknown::robot
4::Anakin::male::jedi
5::Amidala::female::queen

As you can tell, we only have 5 lines each one containing 4 fields: a number, a name, a gender, and a class. Notice how the fields are separated with 2 colons ::. Here’s a question for you: How would you read this data in R?


Some background

Before we start solving the previous question, let me tell you a bit about how I discovered this challenging problem.

It all started when I was reading the book Python for Data Analysis by Wes McKinney. Being an enthusiast useR, I naturally felt tempted to replicate in R some of the case studies described in Wes’ book, in particular the introductory case studies described in chapter 2.

One of the python examples required reading data from a text file that uses double colons (::) as a separator. For illustrating purposes I will use the made-up toy dataset already shown above:

1::Leia::female::princess
2::Luke::male::jedi
3::R2D2::unknown::robot
4::Anakin::male::jedi
5::Amidala::female::queen

In Python we can use the pandas library to import that kind of data without any problems. We just need to use the function read_table() and specify the separator sep='::'. But what about R?

Heartbroken by read.table()

If we try reading the data using one of the standard ways (e.g. read.table()) I’m afraid we will be terribly dissapointed:

# R doesn't like the separator '::'
bad_separator = read.table(
header = FALSE, 
text = "
1::Leia::female::princess
2::Luke::male::jedi
3::R2D2::unknown::robot
4::Anakin::male::jedi
5::Amidala::female::queen
",
col.names = c("num", "name", "gender", "class"),
sep = "::")

If you prefer, you can try reading the file fakedata.dat, which has exactly the same content as the previous example, like so:

read.table(file = "http://gastonsanchez.com/data/fakedata.dat",
           header = FALSE, 
           col.names = c("num", "name", "gender", "class"),
           sep = "::")

In both cases, you should get a heartbreaking error like this one:

"Error in scan(file, what = "", sep = sep, quote = quote, nlines = 1:
  invalid 'sep' value: must be one byte"

One byte separtors

If we had just one colon : as a separator, then R would not complain whatsoever:

# R is ok with the separator ":"
good_separator = read.table(
header = FALSE, 
text = "
1:John:26:male
2:Dan:30:male
3:Rori:31:female
4:Tracy:40:female
5:Luke:30:male
",
col.names = c("num", "name", "age", "gender"),
sep = ":")

# look at the data
good_separator
##   num  name age gender
## 1   1  John  26   male
## 2   2   Dan  30   male
## 3   3  Rori  31 female
## 4   4 Tracy  40 female
## 5   5  Luke  30   male

But the problem is that we do have two colons ::, and R accepts only one byte separators. Of course, we could use readLines() to read the data as a character vector, do some transformations to get rid of the colons, and then shape the data into a data frame. In some cases this option could be acceptable. But with the type of data like the one described in Wes’s book, using readLines() was not an elegant option for me.


My solution

The solution I found was to do the job outside of R. Yes, I know this is kind of dissapointing but it’s part of the data analysis game. Sometimes you just have go outside R to get the job done.

Using a text editor

One solution is to use a text editor (e.g. vim, emacs, text wrangler, notepad ++, sublime text, etc). You just need to use the find-replace functionality to substitute all double colons into commas:

Using sed

In my case I work on a mac so I decided to use sed (stream editor). The idea is the same as if you were using a text editor: to substitute all the double colons by a comma.

Here’s how: open the terminal, go to the directory where the data file is, and simply run the following command:

sed 's/::/,/g' fakedata.dat > newfakedata.txt

Alternatively, you can use the system() function within R to pass it the same command:

# 'sed' command from R
system("sed 's/::/,/g' fakedata.dat > newfakedata.txt")

What we’re doing is substituting all the :: by , in the file fakedata.dat and then saving the results in a new file newfakedata.dat. If we open the new file we should be able to see nice comma-separated fields:

1,Leia,female,princess
2,Luke,male,jedi
3,R2D2,unknown,robot
4,Anakin,male,jedi
5,Amidala,female,queen

which is perfectly readable in R with read.table() and/or read.csv().


Published in categories how-to  Tagged with file  read  data