Visually Enforced

a blog by Gaston Sanchez


Using tapply, ddply, and sqldf for the same task

Posted on June 08, 2012

Usually there’s more than one way to do things when programming with data. An interesting application is when we want to do some calculation on a quantitative variable that is associated to a qualitative variable. For instance, you have the prices (quantitative) and brands (qualitative) of different products, and you want to get the average price of those products by brand. To make things clearer let’s generate some imaginary data of commercial products:

# here's the data
item = toupper(letters[1:15])

set.seed(321)
brand = sample(c("Alpha", "Beta"), length(item), replace=TRUE)
price = round(10 * runif(length(item)), 2)
units = sample(1:3, length(item), replace=TRUE)
some_data = data.frame(item, brand, price, units)

You should get a table like this one

     item   brand   price   units
1       A    Beta    2.02       1
2       B    Beta    6.33       3
3       C   Alpha    4.04       3
4       D   Alpha    2.91       3
5       E   Alpha    6.40       2
6       F   Alpha    6.36       2
7       G   Alpha    9.90       3
8       H   Alpha    9.31       1
9       I   Alpha    4.86       3
10      J    Beta    5.75       2
11      K    Beta    7.51       3
12      L   Alpha    9.93       1
13      M    Beta    4.31       2
14      N   Alpha    1.24       3
15      O    Beta    5.94       3

How can we obtain the average price by brand? Well, it depends on what do you want to use. Here are some options:

  • using a for loop (not recommendable)
  • using boolean selection
  • using tapply()
  • using ddply() from package "plyr"
  • using a sql query (with package "sqldf")

Option 1: using a for loop

# Naive option: don't do this!!!
n_alpha = 0
n_beta = 0
avg_alpha = 0
avg_beta = 0
for (i in 1:nrow(some_data)) {
    if (some_data$brand[i] == "Alpha") {
        avg_alpha = avg_alpha + some_data$price[i]
        n_alpha = n_alpha + 1
    } else {
        avg_beta = avg_beta + some_data$price[i]
        n_beta = n_beta + 1
    }
}
cat("Average Alpha:", avg_alpha / n_alpha,
"\nAverage Beta:", avg_beta / n_beta)
Average Alpha: 6.105556
Average Beta: 5.31

Option 2: using boolean selection

# using boolean selection
with(some_data, mean(price[brand=="Alpha"]))
with(some_data, mean(price[brand=="Beta"]))
6.105556
5.31

Option 3: using tapply()

# using tapply
with(some_data, tapply(price, brand, mean))
  Alpha Beta
6.105556 5.310000

Option 4: using ddply</strong>

# using ddply
require(plyr)
ddply(some_data, .(brand), summarise, mean_price=mean(price))
  brand mean_price
1 Alpha 6.105556
2 Beta 5.310000

Option 5: using an SQL query

# Using an SQL query
require(sqldf)
sqldf("SELECT brand, AVG(price) AS mean_price FROM some_data GROUP BY brand")
  brand mean_price
1 Alpha 6.105556
2 Beta 5.310000

Published in categories how-to  Tagged with ddply  factor  query  sqldf  tapply