5 functions to do Multiple Correspondence Analysis in R
Posted on October 13, 2012
Today I'd like to talk about five different ways of doing Multiple Correspondence Analysis in R (not to be confused with simple Correspondence Analysis).
Put in very simple terms, Multiple Correspondence Analysis (MCA) is to qualitative data what Principal Component Analysis (PCA) is to quantitative data. Well, maybe I’m oversimplifying a little bit because MCA has some special features that make it mathematically different from PCA, but the two have a lot in common from a data analysis standpoint.
As with PCA and Correspondence Analysis, MCA is just another tool in our kit of multivariate methods, one that allows us to analyze the systematic patterns of variation in categorical data. Keep in mind that MCA applies to tables in which the observations are described by a set of qualitative (i.e. categorical) variables. This means that in R you must have your table in the form of a data frame of factors (observations in the rows, qualitative variables in the columns).
MCA in R
In R, there are several functions from different packages that allow us to apply Multiple Correspondence Analysis. In this post I’ll show you 5 different ways to perform MCA using the following functions (with their corresponding packages in parentheses):
MCA() (FactoMineR)
mca() (MASS)
dudi.acm() (ade4)
mjca() (ca)
homals() (homals)
No matter what function you decide to use for MCA, the typical results should consist of a set of eigenvalues, a table with the row coordinates, and a table with the column coordinates.
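Compared to the eigenvalues obtained from a PCA or a CA, the eigenvalues in an MCA can be much smaller. In the standard indicator-matrix formulation with Q variables, the nontrivial eigenvalues average 1/Q, so each one tends to look small on its own. This is important to know because if you only consider the eigenvalues, you might be tempted to conclude that MCA sucks, which is absolutely false.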
Personally, I think that the real meat and potatoes of MCA lies in its dimension reduction properties, which let us visualize our data, among other things. Besides the eigenvalues, the row coordinates provide information about the structure of the rows in the analyzed table. In turn, the column coordinates provide information about the structure of the analyzed variables and their corresponding categories.
The Data
We’ll use the dataset tea that comes in the R package "FactoMineR". It’s a data frame (of factors) containing the answers to a questionnaire on tea consumption for 300 individuals. Although the data contains 36 columns (i.e. variables), for demonstration purposes I will only consider the following columns:
What kind of tea do you drink (black, green, flavored)
How do you drink it (alone, w/milk, w/lemon, other)
What kind of presentation do you buy (tea bags, loose tea, both)
Do you add sugar (yes, no)
Where do you buy it (supermarket, shops, both)
Do you always drink tea (always, not always)
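Here is a minimal sketch of how that subset could be extracted. The column names "Tea", "How", "how", "sugar", "where" and "always" are my reading of which columns in tea match the six questions above, and newtea and cats are just illustrative object names.

```r
# load the package that ships the tea data
library(FactoMineR)
data(tea)

# keep the six columns assumed to match the questions listed above
newtea <- tea[, c("Tea", "How", "how", "sugar", "where", "always")]

# number of categories per variable (handy later for labelling plots)
cats <- sapply(newtea, nlevels)
cats

# every column should be a factor before running MCA
str(newtea)
```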
Option 1: using MCA()
My preferred function to do multiple correspondence analysis is the MCA() function that comes in the fabulous package "FactoMineR" by Francois Husson, Julie Josse, Sebastien Le, and Jeremy Mazet. If you have seen my other posts you’ll know that this is one of my favorite packages, and I strongly recommend that other users take a serious look at it. It provides the most complete list of results, with different calculations for interpretation and diagnosis.
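A minimal sketch of the call, assuming the six tea columns are named "Tea", "How", "how", "sugar", "where" and "always" (newtea and mca1 are illustrative names):

```r
library(FactoMineR)
data(tea)
newtea <- tea[, c("Tea", "How", "how", "sugar", "where", "always")]

# run the analysis without drawing the default graphs
mca1 <- MCA(newtea, graph = FALSE)

# eigenvalues, with percentage and cumulative percentage of variance
mca1$eig
# column (category) coordinates
head(mca1$var$coord)
# row (individual) coordinates
head(mca1$ind$coord)
```

The three pieces printed here are the eigenvalues, the column coordinates and the row coordinates, in that order.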
We can use the package "ggplot2" to get a nice plot:
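One way the plot of categories could be built is sketched below. It assumes the rows of mca1$var$coord follow the order of the variables in the data frame, so that each category can be tagged with its variable; the object names and the plot title are illustrative.

```r
library(FactoMineR)
library(ggplot2)
data(tea)
newtea <- tea[, c("Tea", "How", "how", "sugar", "where", "always")]
cats <- sapply(newtea, nlevels)
mca1 <- MCA(newtea, graph = FALSE)

# category coordinates plus the variable each category belongs to
# (assumes rows are ordered variable by variable)
mca1_vars_df <- data.frame(mca1$var$coord,
                           Variable = rep(names(cats), cats))
mca1_vars_df$Category <- rownames(mca1_vars_df)

# first two dimensions, labels coloured by variable
p <- ggplot(mca1_vars_df, aes(x = Dim.1, y = Dim.2, label = Category)) +
  geom_hline(yintercept = 0, colour = "gray70") +
  geom_vline(xintercept = 0, colour = "gray70") +
  geom_text(aes(colour = Variable)) +
  ggtitle("MCA plot of variables using FactoMineR")
p
```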
In order to have a more interesting representation, we could superimpose a graphic display of both the observations and the categories. Moreover, since some individuals will be overlapped, we can add some density curves with geom_density2d() to see those zones that are highly concentrated:
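A sketch of such a combined display, under the same assumptions as before about the column names and the ordering of the categories (all object names are illustrative):

```r
library(FactoMineR)
library(ggplot2)
data(tea)
newtea <- tea[, c("Tea", "How", "how", "sugar", "where", "always")]
cats <- sapply(newtea, nlevels)
mca1 <- MCA(newtea, graph = FALSE)

# coordinates of the categories and of the observations
mca1_vars_df <- data.frame(mca1$var$coord,
                           Variable = rep(names(cats), cats))
mca1_vars_df$Category <- rownames(mca1_vars_df)
mca1_obs_df <- data.frame(mca1$ind$coord)

# observations as points with density contours, categories as labels
p <- ggplot(mca1_obs_df, aes(x = Dim.1, y = Dim.2)) +
  geom_hline(yintercept = 0, colour = "gray70") +
  geom_vline(xintercept = 0, colour = "gray70") +
  geom_point(colour = "gray50", alpha = 0.7) +
  geom_density2d(colour = "gray80") +
  geom_text(data = mca1_vars_df,
            aes(x = Dim.1, y = Dim.2, label = Category, colour = Variable)) +
  ggtitle("MCA plot of observations and categories")
p
```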
Option 2: using mca()
Another function for performing MCA is the mca() function that comes in the "MASS" package by Brian Ripley et al.
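A minimal sketch, again assuming the six tea columns named below. The choice of nf = 5 dimensions is arbitrary, and squaring the singular values in d is how I would recover the eigenvalues here.

```r
library(FactoMineR)  # only needed for the tea data
library(MASS)
data(tea)
newtea <- tea[, c("Tea", "How", "how", "sugar", "where", "always")]

# MCA keeping five dimensions
mca2 <- mca(newtea, nf = 5)

# eigenvalues (squared singular values)
mca2$d^2
# column coordinates
head(mca2$cs)
# row coordinates
head(mca2$rs)
```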
We can get an MCA plot of variables:
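A sketch of that plot with ggplot2, pulling the first two columns of cs by position so it does not depend on how they are named. As before, it assumes the categories are ordered variable by variable; the object names are illustrative.

```r
library(FactoMineR)  # only needed for the tea data
library(MASS)
library(ggplot2)
data(tea)
newtea <- tea[, c("Tea", "How", "how", "sugar", "where", "always")]
cats <- sapply(newtea, nlevels)
mca2 <- mca(newtea, nf = 5)

# first two dimensions of the category coordinates
mca2_vars_df <- data.frame(Dim1 = mca2$cs[, 1],
                           Dim2 = mca2$cs[, 2],
                           Category = rownames(mca2$cs),
                           Variable = rep(names(cats), cats))

p <- ggplot(mca2_vars_df, aes(x = Dim1, y = Dim2, label = Category)) +
  geom_hline(yintercept = 0, colour = "gray70") +
  geom_vline(xintercept = 0, colour = "gray70") +
  geom_text(aes(colour = Variable)) +
  ggtitle("MCA plot of variables using MASS")
p
```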
If you prefer not to use "ggplot2", you can stay with the default plots (not for me).
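For instance, a sketch using the plot method that "MASS" provides for its mca objects (same assumed columns as before):

```r
library(FactoMineR)  # only needed for the tea data
library(MASS)
data(tea)
newtea <- tea[, c("Tea", "How", "how", "sugar", "where", "always")]
mca2 <- mca(newtea, nf = 5)

# base-graphics plot of rows and category vertices
plot(mca2)
```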
Option 3: using dudi.acm()
A third option to perform MCA is to use the function dudi.acm() that comes with the package "ade4" by Simon Penel et al. (remember to install the package first).
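A minimal sketch, assuming the same six tea columns. Setting scannf = FALSE skips the interactive scree plot prompt, and nf = 5 is an arbitrary choice.

```r
library(FactoMineR)  # only needed for the tea data
library(ade4)
data(tea)
newtea <- tea[, c("Tea", "How", "how", "sugar", "where", "always")]

# MCA keeping five axes, without the interactive prompt
mca3 <- dudi.acm(newtea, scannf = FALSE, nf = 5)

# eigenvalues
mca3$eig
# column coordinates
head(mca3$co)
# row coordinates
head(mca3$li)
```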
Here’s how to get the MCA plot of variables with ggplot()
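A sketch of that plot, taking the first two columns of co by position and assuming the categories are ordered variable by variable (object names are illustrative):

```r
library(FactoMineR)  # only needed for the tea data
library(ade4)
library(ggplot2)
data(tea)
newtea <- tea[, c("Tea", "How", "how", "sugar", "where", "always")]
cats <- sapply(newtea, nlevels)
mca3 <- dudi.acm(newtea, scannf = FALSE, nf = 5)

# first two components of the category coordinates
mca3_vars_df <- data.frame(Comp1 = mca3$co[, 1],
                           Comp2 = mca3$co[, 2],
                           Category = rownames(mca3$co),
                           Variable = rep(names(cats), cats))

p <- ggplot(mca3_vars_df, aes(x = Comp1, y = Comp2, label = Category)) +
  geom_hline(yintercept = 0, colour = "gray70") +
  geom_vline(xintercept = 0, colour = "gray70") +
  geom_text(aes(colour = Variable)) +
  ggtitle("MCA plot of variables using ade4")
p
```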
Option 4: using mjca()
Another interesting way of carrying out MCA is to use the function mjca() from the package "ca" by Michael Greenacre and Oleg Nenadic.
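A minimal sketch, assuming the same six tea columns. I set lambda = "indicator" so the analysis is based on the indicator matrix (the function's default is an adjusted analysis), and nd = 5 is an arbitrary choice.

```r
library(FactoMineR)  # only needed for the tea data
library(ca)
data(tea)
newtea <- tea[, c("Tea", "How", "how", "sugar", "where", "always")]

# MCA based on the indicator matrix
mca4 <- mjca(newtea, lambda = "indicator", nd = 5)

# eigenvalues (squared singular values)
mca4$sv^2
# column coordinates
head(mca4$colcoord)
# row coordinates
head(mca4$rowcoord)
```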
We’ll use the column coordinates colcoord to make a data frame and pass it to ggplot():
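A sketch of that step. It takes the category labels from mca4$levelnames and assumes they line up with the rows of colcoord, and that the categories are ordered variable by variable; object names are illustrative.

```r
library(FactoMineR)  # only needed for the tea data
library(ca)
library(ggplot2)
data(tea)
newtea <- tea[, c("Tea", "How", "how", "sugar", "where", "always")]
cats <- sapply(newtea, nlevels)
mca4 <- mjca(newtea, lambda = "indicator", nd = 5)

# first two dimensions of the column coordinates
mca4_vars_df <- data.frame(Dim1 = mca4$colcoord[, 1],
                           Dim2 = mca4$colcoord[, 2],
                           Category = mca4$levelnames,
                           Variable = rep(names(cats), cats))

p <- ggplot(mca4_vars_df, aes(x = Dim1, y = Dim2, label = Category)) +
  geom_hline(yintercept = 0, colour = "gray70") +
  geom_vline(xintercept = 0, colour = "gray70") +
  geom_text(aes(colour = Variable)) +
  ggtitle("MCA plot of variables using ca")
p
```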
Option 5: using homals()
A fifth possibility is the homals() function from the package "homals" by Jan de Leeuw and Patrick Mair.
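A minimal sketch, assuming the "homals" package is installed and the same six tea columns are used. The choice of ndim = 5 is arbitrary, and level = "nominal" treats every variable as categorical.

```r
library(FactoMineR)  # only needed for the tea data
library(homals)
data(tea)
newtea <- tea[, c("Tea", "How", "how", "sugar", "where", "always")]

# homogeneity analysis with five dimensions
mca5 <- homals(newtea, ndim = 5, level = "nominal")

# eigenvalues
mca5$eigenvalues
# category coordinates: a list with one matrix per variable
mca5$catscores
# row coordinates
head(mca5$objscores)
```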
In order to get the MCA plot of variables, we first need to unlist the coordinates of the categories before creating the data frames for ggplot():
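A sketch of that unlisting step, assuming "homals" is installed and that catscores holds one matrix per variable, in the order of the columns (object names are illustrative):

```r
library(FactoMineR)  # only needed for the tea data
library(homals)
library(ggplot2)
data(tea)
newtea <- tea[, c("Tea", "How", "how", "sugar", "where", "always")]
cats <- sapply(newtea, nlevels)
mca5 <- homals(newtea, ndim = 5, level = "nominal")

# flatten the per-variable score matrices into two plain vectors
D1 <- unlist(lapply(mca5$catscores, function(x) x[, 1]))
D2 <- unlist(lapply(mca5$catscores, function(x) x[, 2]))

mca5_vars_df <- data.frame(D1 = D1, D2 = D2,
                           Category = names(D1),
                           Variable = rep(names(cats), cats))

p <- ggplot(mca5_vars_df, aes(x = D1, y = D2, label = Category)) +
  geom_hline(yintercept = 0, colour = "gray70") +
  geom_vline(xintercept = 0, colour = "gray70") +
  geom_text(aes(colour = Variable)) +
  ggtitle("MCA plot of variables using homals")
p
```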