Using tapply, ddply, and sqldf for the same task
Posted on June 08, 2012
Usually there’s more than one way to do things when programming with data. An interesting application is when we want to do some calculation on a quantitative variable that is associated to a qualitative variable. For instance, you have the prices (quantitative) and brands (qualitative) of different products, and you want to get the average price of those products by brand. To make things clearer let’s generate some imaginary data of commercial products:
# here's the data
item = toupper(letters[1:15])
set.seed(321)
brand = sample(c("Alpha", "Beta"), length(item), replace=TRUE)
price = round(10 * runif(length(item)), 2)
units = sample(1:3, length(item), replace=TRUE)
some_data = data.frame(item, brand, price, units)You should get a table like this one
     item   brand   price   units
1       A    Beta    2.02       1
2       B    Beta    6.33       3
3       C   Alpha    4.04       3
4       D   Alpha    2.91       3
5       E   Alpha    6.40       2
6       F   Alpha    6.36       2
7       G   Alpha    9.90       3
8       H   Alpha    9.31       1
9       I   Alpha    4.86       3
10      J    Beta    5.75       2
11      K    Beta    7.51       3
12      L   Alpha    9.93       1
13      M    Beta    4.31       2
14      N   Alpha    1.24       3
15      O    Beta    5.94       3How can we obtain the average price by brand? Well, it depends on what do you want to use. Here are some options:
- using a forloop (not recommendable)
- using boolean selection
- using tapply()
- using ddply()from package"plyr"
- using a sql query (with package "sqldf")
Option 1: using a for loop
# Naive option: don't do this!!!
n_alpha = 0
n_beta = 0
avg_alpha = 0
avg_beta = 0
for (i in 1:nrow(some_data)) {
    if (some_data$brand[i] == "Alpha") {
        avg_alpha = avg_alpha + some_data$price[i]
        n_alpha = n_alpha + 1
    } else {
        avg_beta = avg_beta + some_data$price[i]
        n_beta = n_beta + 1
    }
}
cat("Average Alpha:", avg_alpha / n_alpha,
"\nAverage Beta:", avg_beta / n_beta)Average Alpha: 6.105556
Average Beta: 5.31Option 2: using boolean selection
# using boolean selection
with(some_data, mean(price[brand=="Alpha"]))
with(some_data, mean(price[brand=="Beta"]))6.105556
5.31Option 3: using tapply()
# using tapply
with(some_data, tapply(price, brand, mean))  Alpha Beta
6.105556 5.310000Option 4: using ddply</strong>
# using ddply
require(plyr)
ddply(some_data, .(brand), summarise, mean_price=mean(price))  brand mean_price
1 Alpha 6.105556
2 Beta 5.310000Option 5: using an SQL query
# Using an SQL query
require(sqldf)
sqldf("SELECT brand, AVG(price) AS mean_price FROM some_data GROUP BY brand")  brand mean_price
1 Alpha 6.105556
2 Beta 5.310000