Failing to read tables in R
Posted on April 20, 2014
Badly disappointed with R after finding out that
read.table() cannot read a file with a very simple structure… one that Python reads without any trouble.
A very simple file structure
Imagine you have some data in a text file having a very simple structure like the following one:
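For concreteness, here is one plausible version of such a file (the names and values are hypothetical; any five rows of four `::`-separated fields would do). The snippet below writes it to fakedata.dat, the filename used later in this post:

```shell
# Write a toy '::'-separated file (hypothetical values)
cat > fakedata.dat << 'EOF'
1::Alice::F::A
2::Bob::M::B
3::Carol::F::A
4::Dave::M::C
5::Eve::F::B
EOF
```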
As you can tell, we only have 5 lines, each one containing 4 fields: a number, a name, a gender, and a class. Notice how the fields are separated with two colons
::. Here’s a question for you: how would you read this data in R?
Before we start solving the previous question, let me tell you a bit about how I discovered this challenging problem.
It all started when I was reading the book Python for Data Analysis by Wes McKinney. Being an enthusiastic useR, I naturally felt tempted to replicate in R some of the case studies described in Wes’ book, in particular the introductory case studies in chapter 2.
One of the Python examples required reading data from a text file that uses double colons (
::) as a separator. For illustration purposes I will use the made-up toy dataset already shown above:
In Python we can use the pandas library to import that kind of data without any problems. We just need to use the function
read_table() and specify the separator
sep='::'. But what about R?
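Here is a minimal sketch of the pandas approach, using hypothetical inline data. Note that a multi-character separator is not supported by pandas’ fast C engine, so we request the python engine explicitly:

```python
from io import StringIO

import pandas as pd

# Toy data with '::' separators (hypothetical values)
raw = """1::Alice::F::A
2::Bob::M::B
3::Carol::F::A
4::Dave::M::C
5::Eve::F::B"""

# A multi-character sep is not handled by the C engine,
# so ask for the python engine explicitly to avoid a warning
df = pd.read_table(StringIO(raw), sep="::", header=None, engine="python")
print(df.shape)  # (5, 4)
```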
Heartbroken by read.table()
If we try reading the data using one of the standard ways (e.g.
read.table()), I’m afraid we will be terribly disappointed:
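The failing attempt could be sketched like this (assuming the toy data sits in fakedata.dat):

```r
# Try to read '::'-separated data directly -- this throws an error
df <- read.table("fakedata.dat", sep = "::")
```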
If you prefer, you can try reading the file
fakedata.dat, which has exactly the same content as the previous example, like so:
In both cases, you should get a heartbreaking error like this one:
"Error in scan(file, what = "", sep = sep, quote = quote, nlines = 1, : invalid 'sep' value: must be one byte"
One-byte separators
If we had just one colon
: as a separator, then R would not complain whatsoever:
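For instance, a single-colon version of a row parses without complaint (a sketch with hypothetical values):

```r
# One-byte separator: works fine
read.table(text = "1:Alice:F:A", sep = ":")
```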
But the problem is that we do have two colons
::, and R accepts only one byte separators. Of course, we could use
readLines() to read the data as a character vector, do some transformations to get rid of the colons, and then shape the data into a data frame. In some cases this option could be acceptable. But with data like that described in Wes’ book, using
readLines() was not an elegant option for me.
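For completeness, the readLines() workaround I’m dismissing might look roughly like this:

```r
# Read raw lines, swap '::' for commas, then parse as CSV
lines <- readLines("fakedata.dat")
df <- read.csv(text = gsub("::", ",", lines, fixed = TRUE), header = FALSE)
```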
The solution I found was to do the job outside of R. Yes, I know this is kind of disappointing, but it’s part of the data analysis game. Sometimes you just have to go outside R to get the job done.
Using a text editor
One solution is to use a text editor (e.g. vim, emacs, TextWrangler, Notepad++, Sublime Text, etc.). You just need to use the find-and-replace functionality to substitute all the double colons with commas.
In my case I work on a Mac, so I decided to use sed (a stream editor). The idea is the same as with a text editor: substitute all the double colons with commas.
Here’s how: open the terminal, go to the directory where the data file is, and simply run the following command:
sed 's/::/,/g' fakedata.dat > newfakedata.txt
Alternatively, you can use the
system() function within R to pass it the same command:
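That call could be as simple as the following (the same substitution as above, launched from the R console):

```r
# Run sed from within R; writes the converted file next to the original
system("sed 's/::/,/g' fakedata.dat > newfakedata.txt")
```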
What we’re doing is substituting all the
:: with
, in the file
fakedata.dat and then saving the result in a new file
newfakedata.txt. If we open the new file we should see nice comma-separated fields, which are perfectly readable in R with read.csv().
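A sketch of that final step (assuming the sed command above produced newfakedata.txt):

```r
# Comma-separated now, so the default tools work
df <- read.csv("newfakedata.txt", header = FALSE)
```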