5 functions to do Correspondence Analysis in R

Posted on July 19, 2012

In a previous post, I talked about five different ways to do Principal Components Analysis in R.

PCA is very useful and is one of the most applied multivariate techniques. However, PCA is limited to quantitative information. But what if our data comes in the form of qualitative information such as categorical data? The solution: Correspondence Analysis.

Correspondence Analysis, or CA for short, is one of the cousins of Principal Component Analysis. Both CA and PCA are multivariate techniques that help us summarize the systematic patterns of variation in the data. The difference is that CA applies to categorical (i.e. qualitative) data instead of continuous (i.e. quantitative) data. More specifically, CA applies to categorical data in the form of contingency tables (aka cross-tabulations). Since CA is conceptually similar to PCA, we can use it, among other things, for visualizing multidimensional data in a lower-dimensional space.
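To make the idea of a contingency table concrete, here is a minimal sketch in base R. The hair and eye colours are made-up values used only for illustration; they are not part of the data analyzed in this post.

```r
# hypothetical data: hair and eye colour of eight people (made up)
hair = c("dark", "dark", "fair", "fair", "dark", "red", "fair", "dark")
eyes = c("brown", "brown", "blue", "blue", "blue", "green", "brown", "brown")

# cross-tabulate the two categorical variables
tab = table(hair, eyes)
tab

# drop the "table" class to get a plain matrix of counts
counts = unclass(tab)
dim(counts)
```

Each cell counts how many observations share a given pair of categories, and a table of counts like this is the kind of input CA works on.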

CA in R

In R, there are several functions from different packages that allow us to apply Correspondence Analysis. In this post I’ll show you 5 different ways to perform CA using the following functions (with their corresponding packages in parentheses):

ca() (ca)
CA() (FactoMineR)
dudi.coa() (ade4)
afc() (amap)
corresp() (MASS)

As in PCA, no matter what function you decide to use for CA, the typical results should consist of a set of eigenvalues, a table with the row coordinates, and a table with the column coordinates. The eigenvalues provide information about the variability in the data. The row coordinates provide information about the structure of the rows in the analyzed table. The column coordinates provide information about the structure of the columns in the analyzed table.
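To see where these three pieces come from, here is a sketch of CA computed by hand with a singular value decomposition in base R. The small table N is made up, and the variable names (rmass, row_std, row_pri, and so on) are my own; this follows the textbook formulation and is not the internal code of any of the packages listed above.

```r
# small made-up contingency table (3 rows x 4 columns)
N = matrix(c(20, 10,  5, 15,
             10, 25, 10,  5,
              5, 10, 30, 10),
           nrow = 3, byrow = TRUE,
           dimnames = list(c("r1", "r2", "r3"), c("c1", "c2", "c3", "c4")))

P = N / sum(N)        # correspondence matrix (relative frequencies)
rmass = rowSums(P)    # row masses
cmass = colSums(P)    # column masses

# standardized residuals from the independence model
S = diag(1 / sqrt(rmass)) %*% (P - rmass %o% cmass) %*% diag(1 / sqrt(cmass))

# singular value decomposition, keeping the non-trivial dimensions
dec = svd(S)
k = min(nrow(N), ncol(N)) - 1
sv = dec$d[1:k]       # singular values
eig = sv^2            # eigenvalues (principal inertias)

# standard coordinates of rows and columns
row_std = diag(1 / sqrt(rmass)) %*% dec$u[, 1:k]
col_std = diag(1 / sqrt(cmass)) %*% dec$v[, 1:k]

# principal coordinates: standard coordinates scaled by the singular values
row_pri = sweep(row_std, 2, sv, "*")
col_pri = sweep(col_std, 2, sv, "*")

# total inertia (chi-square statistic divided by the grand total)
sum(eig)
```

The distinction between standard and principal coordinates matters below: some functions report one scaling and some report the other, which is why the printed coordinates differ in magnitude from one option to the next.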

The Data

We’ll use the dataset author that already comes with the R package "ca". It’s a data matrix containing the counts of the 26 letters of the alphabet (columns of matrix) for 12 different novels (rows of matrix). Each row contains letter counts in a sample of text from each work, excluding proper nouns.

Option 1: using ca

The function ca() comes in the package of the same name, ca, by Michael Greenacre and Oleg Nenadic. I personally like this package because of Greenacre’s work and books about CA. In addition, it has a very nice function to plot results in 3D (plot3d.ca()).

# CA with function ca
library(ca)

# apply ca
ca1 = ca(author)

# singular values (sqrt of eigenvalues)
ca1$sv

##  [1] 0.08754 0.06073 0.04910 0.03719 0.03165 0.02689 0.02566 0.02133
##  [9] 0.01934 0.01622 0.01064

# row coordinates
head(ca1$rowcoord)

##          [,1]     [,2]    [,3]    [,4]     [,5]    [,6]    [,7]    [,8]
## [1,] -0.09539 -0.79500  1.0285  0.6472  1.13597 -0.3149 -0.9908  0.7942
## [2,]  0.40570 -0.40556 -0.9304  0.6344 -0.08487 -1.7388 -1.0871  0.9097
## [3,]  1.15780 -0.02311  0.3551  1.0639 -1.33880  1.0157  0.3715  1.3790
## [4,] -0.17390  0.43444  2.0871 -0.2991  0.96081 -0.5655  0.3561  0.3814
## [5,] -0.83189 -0.13648 -0.9437  1.2758  1.11671  2.2559 -0.2008 -0.1684
## [6,]  0.30203  2.70760 -0.1815  0.7055  0.30494 -0.4697 -0.3644 -1.3977
##          [,9]   [,10]   [,11]
## [1,] -1.64567  1.6411  0.4328
## [2,] -0.36151 -1.8976 -1.0833
## [3,]  1.15739  0.9237 -1.0610
## [4,]  1.53050 -1.0166  0.7930
## [5,]  0.03299 -1.1202  0.5232
## [6,] -0.20322  0.6567 -0.5689

# column coordinates
head(ca1$colcoord)

##          [,1]    [,2]    [,3]    [,4]    [,5]    [,6]     [,7]     [,8]
## [1,]  0.01762 -0.3203 -0.2704  0.6478 -0.4436  0.7176  0.27348 -1.09591
## [2,]  0.98446 -0.3980  1.1490 -1.6202 -0.4042 -1.8570 -1.88405 -1.70703
## [3,]  2.11503 -1.3734 -1.1309 -0.9539 -1.0867 -0.2066  0.57144 -2.03364
## [4,] -1.92563 -1.1354  0.2934 -0.2390 -0.1604 -0.5917  0.41892  1.52674
## [5,]  0.08672 -0.6848  1.0038  0.3656 -0.1529  0.3979  0.07591  0.37646
## [6,]  1.27653 -0.7330 -0.3865 -0.4409  1.5577 -3.0251  0.28235 -0.01798
##         [,9]   [,10]    [,11]
## [1,]  0.3149  0.1593  0.05681
## [2,]  1.0259  0.8911 -1.32001
## [3,] -1.2373 -0.5709 -0.84705
## [4,] -1.6798 -0.1750 -0.93757
## [5,] -1.0090 -0.1227  0.40216
## [6,]  1.6629 -2.2692 -2.62764

# plot
plot(ca1)

Option 2: using CA

One of my favorite options is the CA() function from the package FactoMineR. What I like is that this function provides much more detailed results and assessment tools. It also comes with a number of parameters that allow you to tweak the analysis in a very nice way.

# CA with function CA
library(FactoMineR)

# apply CA
ca2 = CA(author, graph = FALSE)

# matrix with eigenvalues
ca2$eig

##        eigenvalue percentage of variance cumulative percentage of variance
## dim 1   0.0076639                40.9070                             40.91
## dim 2   0.0036883                19.6870                             60.59
## dim 3   0.0024112                12.8702                             73.46
## dim 4   0.0013828                 7.3811                             80.85
## dim 5   0.0010017                 5.3465                             86.19
## dim 6   0.0007233                 3.8609                             90.05
## dim 7   0.0006586                 3.5154                             93.57
## dim 8   0.0004548                 2.4278                             96.00
## dim 9   0.0003739                 1.9958                             97.99
## dim 10  0.0002631                 1.4041                             99.40
## dim 11  0.0001132                 0.6042                            100.00

# column coordinates (letters)
head(ca2$col$coord)

##       Dim 1    Dim 2    Dim 3     Dim 4     Dim 5
## a  0.001543 -0.01945  0.01328 -0.024088  0.014040
## b  0.086183 -0.02417 -0.05642  0.060250  0.012792
## c  0.185157 -0.08341  0.05553  0.035473  0.034392
## d -0.168576 -0.06895 -0.01441  0.008886  0.005076
## e  0.007592 -0.04159 -0.04929 -0.013595  0.004841
## f  0.111752 -0.04451  0.01898  0.016396 -0.049299

# row coordinates (novels)
head(ca2$row$coord)

##                                  Dim 1     Dim 2     Dim 3    Dim 4
## three daughters (buck)       -0.008351 -0.048282 -0.050505 -0.02407
## drifters (michener)           0.035516 -0.024630  0.045687 -0.02359
## lost world (clark)            0.101358 -0.001404 -0.017436 -0.03956
## east wind (buck)             -0.015224  0.026384 -0.102483  0.01112
## farewell to arms (hemingway) -0.072826 -0.008289  0.046339 -0.04744
## sound and fury 7 (faulkner)   0.026440  0.164437  0.008911 -0.02624
##                                  Dim 5
## three daughters (buck)       -0.035952
## drifters (michener)           0.002686
## lost world (clark)            0.042372
## east wind (buck)             -0.030409
## farewell to arms (hemingway) -0.035343
## sound and fury 7 (faulkner)  -0.009651

# plot
plot(ca2)
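The coordinates from CA() are much smaller than those from ca() because the two functions report different scalings: ca() returns standard coordinates, while CA() returns principal coordinates. The following sketch checks this using only numbers copied from the outputs printed above (first novel, first five dimensions); the variable names are mine.

```r
# values copied from the printed outputs above
sv = c(0.08754, 0.06073, 0.04910, 0.03719, 0.03165)        # ca1$sv
std_row1 = c(-0.09539, -0.79500, 1.0285, 0.6472, 1.13597)  # ca1$rowcoord[1, 1:5]
pri_row1 = c(-0.008351, -0.048282, -0.050505,
             -0.02407, -0.035952)                          # ca2$row$coord[1, ]

# squared singular values give the eigenvalues reported in ca2$eig
round(sv^2, 5)

# standard coordinates times singular values give principal coordinates;
# compare absolute values because the sign of each axis is arbitrary
round(abs(std_row1 * sv), 4)
round(abs(pri_row1), 4)
```

The two rounded vectors agree, up to the sign of some axes. With the packages loaded, something like sweep(ca1$rowcoord, 2, ca1$sv, "*") should give the same rescaling for all rows at once.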

Option 3: using dudi.coa

Another option to perform CA is by using the function dudi.coa() that comes with the package ade4 (remember to install the package first).

# CA with function dudi.coa
library(ade4)

# apply ca
ca3 = dudi.coa(author, nf = 5, scannf = FALSE)

# eigenvalues
ca3$eig

##  [1] 0.0076639 0.0036883 0.0024112 0.0013828 0.0010017 0.0007233 0.0006586
##  [8] 0.0004548 0.0003739 0.0002631 0.0001132

# row coordinates
head(ca3$li)

##                                  Axis1     Axis2     Axis3    Axis4
## three daughters (buck)        0.008351  0.048282  0.050505 -0.02407
## drifters (michener)          -0.035516  0.024630 -0.045687 -0.02359
## lost world (clark)           -0.101358  0.001404  0.017436 -0.03956
## east wind (buck)              0.015224 -0.026384  0.102483  0.01112
## farewell to arms (hemingway)  0.072826  0.008289 -0.046339 -0.04744
## sound and fury 7 (faulkner)  -0.026440 -0.164437 -0.008911 -0.02624
##                                  Axis5
## three daughters (buck)       -0.035952
## drifters (michener)           0.002686
## lost world (clark)            0.042372
## east wind (buck)             -0.030409
## farewell to arms (hemingway) -0.035343
## sound and fury 7 (faulkner)  -0.009651

# column coordinates
head(ca3$co)

##       Comp1   Comp2    Comp3     Comp4     Comp5
## a -0.001543 0.01945 -0.01328 -0.024088  0.014040
## b -0.086183 0.02417  0.05642  0.060250  0.012792
## c -0.185157 0.08341 -0.05553  0.035473  0.034392
## d  0.168576 0.06895  0.01441  0.008886  0.005076
## e -0.007592 0.04159  0.04929 -0.013595  0.004841
## f -0.111752 0.04451 -0.01898  0.016396 -0.049299
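Unlike CA(), the eigenvalues from dudi.coa() come as a bare vector, without percentages. A short sketch of how to get the percentage of inertia per dimension, using the eigenvalues copied from the output above (the data frame and its column names are my own choice):

```r
# eigenvalues copied from the ca3$eig output above
eig = c(0.0076639, 0.0036883, 0.0024112, 0.0013828, 0.0010017, 0.0007233,
        0.0006586, 0.0004548, 0.0003739, 0.0002631, 0.0001132)

# percentage of inertia explained by each dimension, and its running total
pct = 100 * eig / sum(eig)
inertia = data.frame(eigenvalue = eig,
                     percent = round(pct, 2),
                     cumulative = round(cumsum(pct), 2))
head(inertia, 3)
```

The first two dimensions account for roughly 60% of the total inertia, in line with the table printed by CA() in Option 2.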

Option 4: using afc

Another option is to use the afc() function from the package amap (remember to install it first).

# CA with function afc
library(amap)

# apply CA
ca4 = afc(author)

# eigenvalues
ca4$eig

##  [1] 1.842e-01 1.346e-01 1.036e-01 8.744e-02 6.207e-02 3.470e-02 3.470e-02
##  [8] 2.985e-02 2.985e-02 2.419e-02 1.771e-02 4.288e-03 2.034e-09 4.690e-10
## [15] 4.690e-10 9.408e-10 6.620e-10 6.620e-10 3.303e-10 3.303e-10 7.661e-10
## [22] 3.776e-10 3.776e-10 1.919e-10 1.919e-10 1.337e-10

# row coordinates
head(ca4$scores)

##                               Comp 1   Comp 2  Comp 3  Comp 4  Comp 5
## three daughters (buck)        0.4995  0.79652 -0.3461 0.13097 -0.1440
## drifters (michener)          -0.2646 -0.80085 -0.1714 0.07223 -0.4803
## lost world (clark)           -1.3028 -0.53461 -0.3125 0.22787 -0.3724
## east wind (buck)              0.4863  0.02159 -0.4983 0.01119 -0.2621
## farewell to arms (hemingway)  1.3600  0.59080  0.4792 0.46053  0.4875
## sound and fury 7 (faulkner)   0.9175 -0.36445 -0.9206 0.65150  0.2866
##                                Comp 6   Comp 7   Comp 8   Comp 9  Comp 10
## three daughters (buck)       -0.01648 -0.01648  0.07180  0.07180 -0.22287
## drifters (michener)           0.19286  0.19286 -0.18403 -0.18403  0.24843
## lost world (clark)           -0.12196 -0.12196  0.14200  0.14200 -0.18316
## east wind (buck)             -0.32069 -0.32069  0.35655  0.35655 -0.33860
## farewell to arms (hemingway) -0.05868 -0.05868  0.06305  0.06305 -0.11130
## sound and fury 7 (faulkner)  -0.27843 -0.27843  0.14867  0.14867 -0.07467
##                               Comp 11  Comp 12  Comp 13  Comp 14  Comp 15
## three daughters (buck)       -0.17961 -0.03171 -0.14867  0.01262  0.01262
## drifters (michener)           0.11556  0.11611 -0.04248 -0.09915 -0.09915
## lost world (clark)           -0.09069  0.09037  0.14241  0.10192  0.10192
## east wind (buck)             -0.38149 -0.13807 -0.22816  0.03239  0.03239
## farewell to arms (hemingway) -0.09531 -0.08273 -0.06196  0.01321  0.01321
## sound and fury 7 (faulkner)   0.02458  0.21285  0.28994  0.02993  0.02993
##                               Comp 16   Comp 17   Comp 18  Comp 19
## three daughters (buck)       -0.04379 -0.062066 -0.062066  0.07056
## drifters (michener)          -0.27515 -0.055491 -0.055491  0.10012
## lost world (clark)            0.02816  0.074598  0.074598 -0.09971
## east wind (buck)              0.03484  0.004923  0.004923  0.02259
## farewell to arms (hemingway) -0.05800  0.020237  0.020237 -0.01491
## sound and fury 7 (faulkner)   0.10525  0.136355  0.136355 -0.14655
##                               Comp 20   Comp 21  Comp 22  Comp 23  Comp 24
## three daughters (buck)        0.07056  0.048720  0.10302  0.10302  0.07405
## drifters (michener)           0.10012 -0.071811  0.10479  0.10479  0.09259
## lost world (clark)           -0.09971 -0.011493 -0.03282 -0.03282 -0.02379
## east wind (buck)              0.02259  0.006737  0.06182  0.06182  0.05335
## farewell to arms (hemingway) -0.01491 -0.046687  0.02372  0.02372 -0.01116
## sound and fury 7 (faulkner)  -0.14655 -0.002282 -0.14921 -0.14921 -0.04634
##                               Comp 25  Comp 26
## three daughters (buck)        0.07405 -0.06265
## drifters (michener)           0.09259 -0.07276
## lost world (clark)           -0.02379 -0.05611
## east wind (buck)              0.05335 -0.06700
## farewell to arms (hemingway) -0.01116 -0.03241
## sound and fury 7 (faulkner)  -0.04634  0.03541

# column coordinates
head(ca4$loadings)

##      Comp 1   Comp 2    Comp 3   Comp 4   Comp 5    Comp 6    Comp 7
## a  0.003022 0.020612  0.061492  0.06195  0.05277  0.043411  0.043411
## b -0.092480 0.008901 -0.091345 -0.07467 -0.08660 -0.022569 -0.022569
## c -0.178710 0.055824  0.098704  0.13198  0.15376  0.064302  0.064302
## d  0.073138 0.003039  0.156074 -0.34174 -0.29959  0.269560  0.269560
## e -0.019221 0.012044 -0.002028 -0.02439 -0.18621  0.007124  0.007124
## f -0.106717 0.055966 -0.071859  0.04725 -0.06634 -0.096594 -0.096594
##     Comp 8   Comp 9  Comp 10  Comp 11  Comp 12  Comp 13 Comp 14 Comp 15
## a -0.06499 -0.06499  0.07583  0.06867 -0.09411  0.20568  0.4721  0.4721
## b -0.03962 -0.03962  0.08500  0.10156  0.28316 -0.03667 -0.2147 -0.2147
## c -0.15066 -0.15066  0.26830  0.40201 -0.01905  0.35904  0.1227  0.1227
## d -0.19360 -0.19360  0.08525  0.15203  0.26842  0.03422 -0.2050 -0.2050
## e  0.04881  0.04881 -0.17418 -0.11241 -0.13982 -0.07567  0.1211  0.1211
## f  0.07174  0.07174  0.19381  0.06160  0.28309  0.02290 -0.1100 -0.1100
##    Comp 16 Comp 17 Comp 18 Comp 19 Comp 20  Comp 21 Comp 22 Comp 23
## a -0.28537  0.2721  0.2721 -0.4497 -0.4497 -0.76831 -0.4205 -0.4205
## b  0.11343 -0.1243 -0.1243  0.1913  0.1913  0.22611  0.1029  0.1029
## c  0.01782  0.1082  0.1082 -0.1314 -0.1314 -0.01929 -0.0811 -0.0811
## d -0.34191 -0.3789 -0.3789  0.3926  0.3926  0.11445  0.2280  0.2280
## e -0.07271  0.1265  0.1265 -0.1754 -0.1754 -0.06055 -0.1862 -0.1862
## f -0.38806 -0.2324 -0.2324  0.2252  0.2252  0.04042  0.2155  0.2155
##    Comp 24  Comp 25  Comp 26
## a -0.17789 -0.17789  0.24034
## b  0.01092  0.01092 -0.07016
## c -0.01783 -0.01783  0.11984
## d -0.03367 -0.03367 -0.17324
## e -0.04367 -0.04367  0.09777
## f  0.15196  0.15196 -0.16994

# plot
plot(ca4)

Option 5: using corresp

A fifth possibility is the corresp() function from the package MASS.

# CA with function corresp
library(MASS)

# apply CA
ca5 = corresp(author, nf = 5)

# canonical correlations (sqrt of eigenvalues)
ca5$cor

## [1] 0.08754 0.06073 0.04910 0.03719 0.03165

# row coordinates
head(ca5$rscore)

##                                  [,1]     [,2]    [,3]    [,4]     [,5]
## three daughters (buck)       -0.09539 -0.79500  1.0285  0.6472  1.13597
## drifters (michener)           0.40570 -0.40556 -0.9304  0.6344 -0.08487
## lost world (clark)            1.15780 -0.02311  0.3551  1.0639 -1.33880
## east wind (buck)             -0.17390  0.43444  2.0871 -0.2991  0.96081
## farewell to arms (hemingway) -0.83189 -0.13648 -0.9437  1.2758  1.11671
## sound and fury 7 (faulkner)   0.30203  2.70760 -0.1815  0.7055  0.30494

# column coordinates
head(ca5$cscore)

##       [,1]    [,2]    [,3]    [,4]    [,5]
## a  0.01762 -0.3203 -0.2704  0.6478 -0.4436
## b  0.98446 -0.3980  1.1490 -1.6202 -0.4042
## c  2.11503 -1.3734 -1.1309 -0.9539 -1.0867
## d -1.92563 -1.1354  0.2934 -0.2390 -0.1604
## e  0.08672 -0.6848  1.0038  0.3656 -0.1529
## f  1.27653 -0.7330 -0.3865 -0.4409  1.5577

# plot
plot(ca5)

CA plot

The typical CA graphic displays the data in a two-dimensional space using the first two coordinates of both the rows and the columns. Although we could visualize the rows and the columns separately, the usual approach is to plot both in a single graphic to get an idea of the association between them. As you can tell from the displayed code chunks, most of the CA functions have their own plot command. However, we can also use the nice tools of "ggplot2". In the following example we will also use the package "stringr".

# load ggplot2
library(ggplot2)
library(stringr)

# extract only author names
authors = rownames(author)
authors = unlist(str_extract_all(authors, "\\(\\w+"))
authors = gsub("\\(", "", authors)

# create data frame with row and col coordinates
# from both the authors and the letters
aux = c(rep("authors", 12), rep("letters", 26))
name = c(authors, colnames(author))
auth_lets = data.frame(
  name, aux, rbind(ca1$rowcoord[,1:2], ca1$colcoord[,1:2]))
head(auth_lets)

##        name     aux       X1       X2
## 1      buck authors -0.09539 -0.79500
## 2  michener authors  0.40570 -0.40556
## 3     clark authors  1.15780 -0.02311
## 4      buck authors -0.17390  0.43444
## 5 hemingway authors -0.83189 -0.13648
## 6  faulkner authors  0.30203  2.70760

# plot of authors and letters
ggplot(data = auth_lets, aes(x = X1, y = X2, label = name)) +
  geom_hline(yintercept = 0, colour = "gray75") +
  geom_vline(xintercept = 0, colour = "gray75") +
  geom_text(aes(colour = aux), alpha = 0.8, size = 5) +
  labs(x = "Dim 1", y = "Dim 2") +
  ggtitle("CA plot of authors - letters")
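The plot code above relies on data.frame() naming the coordinate columns X1 and X2. Depending on the installed version of ca, the coordinate matrices may carry their own column names, in which case those names would be used instead. Here is a sketch of a more defensive way to build the data frame, together with a base R alternative for extracting the author names. The matrices rowc and colc are made-up stand-ins for ca1$rowcoord[, 1:2] and ca1$colcoord[, 1:2]; only the novel titles are taken from the output above.

```r
# a few row names copied from the output above
novels = c("three daughters (buck)", "drifters (michener)",
           "farewell to arms (hemingway)", "sound and fury 7 (faulkner)")

# base R alternative to str_extract_all() + gsub():
# capture the word inside the trailing parentheses
authors = sub(".*\\((\\w+)\\)$", "\\1", novels)
authors

# made-up coordinate matrices standing in for the real CA results
rowc = matrix(c(-0.1, 0.4, -0.8, 0.3, -0.8, -0.4, -0.1, 2.7), ncol = 2)
colc = matrix(c(0.0, 1.0, 2.1, -0.3, -0.4, -1.4), ncol = 2)

# name the columns explicitly so the plot code does not depend on
# whatever names data.frame() would otherwise generate
coords = rbind(rowc, colc)
colnames(coords) = c("X1", "X2")
auth_lets = data.frame(
  name = c(authors, c("a", "b", "c")),
  aux = rep(c("authors", "letters"), c(nrow(rowc), nrow(colc))),
  coords)
names(auth_lets)
```

Using nrow() for the group sizes also avoids hard-coding 12 and 26, so the same code keeps working if the table changes.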

Published in categories how-to Tagged with correspondence analysis ca multivariate plot R

Visually Enforced

a blog by Gaston Sanchez