4 ways to read a file in R... by columns
Posted on June 23, 2012
Ever wonder how to read a file in R by columns? This question comes to mind when your analysis doesn’t require to import all the data in R, especially if the file is huge.
Sometimes you just want to read some columns, do some data manipulation, and plot some graphics. How can you do that in R? I’ll show you four different ways to do that without having to use a data base management system (DBMS) and SQL queries.
Toy example
For this post let’s consider a toy dataset of 12 rows and 7 columns in csv (comma-separated value) format. For instance, a dataset like the following one:
# create toy dataset
dataset = data.frame(
   Id = paste("id", 1:12, sep = ""),
   Name = toupper(letters[1:12]),
   Month = month.name,
   Num = 11:22,
   Qty = sample(0:3, 12, replace = TRUE),
   Val = 10 * runif(12, 0.5, 1),
   Col = colors()[seq(1, 120, 10)]
)
# save data in csv file
write.csv(dataset, "some_file.csv", row.names = FALSE)##  Id   Name Month     Num Qty Val      Col
##  id1  A    January   11  1   6.034408 white
##  id2  B    February  12  1   5.242792 aquamarine3
##  id3  C    March     13  3   7.296390 bisque2
##  id4  D    April     14  2   8.792603 blueviolet
##  id5  E    May       15  1   5.161514 burlywood4
##  id6  F    June      16  3   8.153676 chartreuse4
##  id7  G    July      17  0   9.788158 coral4
##  id8  H    August    18  0   7.106729 cyan3
##  id9  I    September 19  0   9.969363 darkgreen
##  id10 J    October   20  2   7.117246 darkorange1
##  id11 K    November  21  2   7.536335 darksalmon
##  id12 L    December  22  1   6.374395 darkslategray3Option 1: cut and system
The first option consists of using a cut 
command with the desired columns, and calling this command within the system() 
function. The only “problem” is that the data will be stored in a vector. It is not 
the best solution if what you want is a data frame, but it can do the trick if you 
want to quickly inspect the columns.
# collect columns 1, 3 and 7
tmp1 = system("cut -f1,3,7 -d',' some_file.csv", intern = TRUE)
# remove extra quotation marks
tmp1 = gsub("\"", "", tmp1)
# split by commas (this returns a list)
list1 = strsplit(tmp1, ",")Option 2: cut and pipe
The second option is similar to the first one. It consists of calling a cut 
command but this time from the pipe() function, which in turn is contained 
inside a read.csv() function.
# read columns 1 to 3
df2a = read.csv(pipe("cut -f1-3 -d',' some_file.csv"))
# read columns 1, 3 and 7
df2b = read.csv(pipe("cut -f1,3,7 -d',' some_file.csv"))Option 3: package colbycol
The third option consists of using the very handy function cbc.read.table() that 
comes with the package "colbycol" (by Carlos Gil)
# remember to install colbycol
require(colbycol)
# read desired columns
tmp3 = cbc.read.table("some_file.csv", just.read = c(1, 3, 7), sep = ",")
# convert to a data.frame
df3 = as.data.frame(tmp3)Option 4: package limma
The last option consists of using the function read.columns() that comes with 
the "limma" package (by Gordon Smyth et al). Just a small detail: "limma" 
is in Bioconductor, 
not in CRAN. In this case, you need to specify the names of the columns to be read.
# install limma from bioconductor
source("http://bioconductor.org/biocLite.R")
biocLite("limma")
# load limma
library(limma)
# read columns 1=Id, 3=Name, and 7=Col
df4 = read.columns("some_file.csv", c("Id", "Name", "Col"), sep = ",")Happy data analysis!