Visually Enforced

a blog by Gaston Sanchez


4 ways to read a file in R... by columns

Posted on June 23, 2012

Ever wonder how to read a file in R by columns? This question comes to mind when your analysis doesn’t require to import all the data in R, especially if the file is huge.

Sometimes you just want to read some columns, do some data manipulation, and plot some graphics. How can you do that in R? I’ll show you four different ways to do that without having to use a data base management system (DBMS) and SQL queries.

Toy example

For this post let’s consider a toy dataset of 12 rows and 7 columns in csv (comma-separated value) format. For instance, a dataset like the following one:

# create toy dataset
dataset = data.frame(
   Id = paste("id", 1:12, sep = ""),
   Name = toupper(letters[1:12]),
   Month = month.name,
   Num = 11:22,
   Qty = sample(0:3, 12, replace = TRUE),
   Val = 10 * runif(12, 0.5, 1),
   Col = colors()[seq(1, 120, 10)]
)

# save data in csv file
write.csv(dataset, "some_file.csv", row.names = FALSE)
##  Id   Name Month     Num Qty Val      Col
##  id1  A    January   11  1   6.034408 white
##  id2  B    February  12  1   5.242792 aquamarine3
##  id3  C    March     13  3   7.296390 bisque2
##  id4  D    April     14  2   8.792603 blueviolet
##  id5  E    May       15  1   5.161514 burlywood4
##  id6  F    June      16  3   8.153676 chartreuse4
##  id7  G    July      17  0   9.788158 coral4
##  id8  H    August    18  0   7.106729 cyan3
##  id9  I    September 19  0   9.969363 darkgreen
##  id10 J    October   20  2   7.117246 darkorange1
##  id11 K    November  21  2   7.536335 darksalmon
##  id12 L    December  22  1   6.374395 darkslategray3

Option 1: cut and system

The first option consists of using a cut command with the desired columns, and calling this command within the system() function. The only “problem” is that the data will be stored in a vector. It is not the best solution if what you want is a data frame, but it can do the trick if you want to quickly inspect the columns.

# collect columns 1, 3 and 7
tmp1 = system("cut -f1,3,7 -d',' some_file.csv", intern = TRUE)

# remove extra quotation marks
tmp1 = gsub("\"", "", tmp1)

# split by commas (this returns a list)
list1 = strsplit(tmp1, ",")

Option 2: cut and pipe

The second option is similar to the first one. It consists of calling a cut command but this time from the pipe() function, which in turn is contained inside a read.csv() function.

# read columns 1 to 3
df2a = read.csv(pipe("cut -f1-3 -d',' some_file.csv"))

# read columns 1, 3 and 7
df2b = read.csv(pipe("cut -f1,3,7 -d',' some_file.csv"))

Option 3: package colbycol

The third option consists of using the very handy function cbc.read.table() that comes with the package "colbycol" (by Carlos Gil)

# remember to install colbycol
require(colbycol)

# read desired columns
tmp3 = cbc.read.table("some_file.csv", just.read = c(1, 3, 7), sep = ",")

# convert to a data.frame
df3 = as.data.frame(tmp3)

Option 4: package limma

The last option consists of using the function read.columns() that comes with the "limma" package (by Gordon Smyth et al). Just a small detail: "limma" is in Bioconductor, not in CRAN. In this case, you need to specify the names of the columns to be read.

# install limma from bioconductor
source("http://bioconductor.org/biocLite.R")
biocLite("limma")

# load limma
library(limma)

# read columns 1=Id, 3=Name, and 7=Col
df4 = read.columns("some_file.csv", c("Id", "Name", "Col"), sep = ",")

Happy data analysis!


Published in categories how-to  Tagged with file  read  columns