# Using Principal Component Analysis to Estimate Socioeconomic Status

## 2020/03/21

Categories: technical Tags: r

## Introduction

Principal Component Analysis (PCA) is a dimension reduction technique that is commonly used in the social sciences to build single dimension estimates of socioeconomic status. In this post I will first walk through the intuition and math behind PCA, and then provide a simple example of its use in R to generate a simple asset index.

## Intuition

Dimension reduction is any process by which we reduce the number of variables in a dataset. For example, in the following plot the figure on the left has three dimensions: x, y, and the color of the points, while the figure on the right reduces the number of dimensions to two by dropping the y variable.

Dimension reduction is quite useful when we would like to summarize or plot data with many dimensions. For example, we may want to summarize the relative weatlh of a household using a dataset of the household’s assets. However, we can clearly see that dimension reduction may be difficult to do without information loss: in the preceeding plot we can see that the x variable alone is not enough to distinguish the two classes.

Ideally we would like to perform dimension reduction on our data while limiting the amount of information lost, if only there were some way to look at the principal components of the data. PCA achieves dimension reduction by taking a dataset and creating a set of uncorrelated components which capture most of the variance. The following plot shows a single dimensional plot of the first principal component, note that the groups are now distinct.

## A Little Bit of Math

Calculating the components in PCA is relatively straightfoward, as each component is just a linear combination of the data itself. To keep the components uncorrelated each component subtracts out the sum of all previous components. That is, the components can be written as:

\begin{align*} \text{pc}_1 &= (w^{(1)}_1x_1 + w^{(1)}_2x_2 + \ldots + w^{(1)}_px_p)^2 \\ \text{pc}_2 &= (w^{(2)}_1(x_1 - w^{(1)}_1x_1) + w^{(2)}_2(x_2 - w^{(1)}_2x_2) + \ldots + w^{(2)}_p(x_p-w^{(1)}_px_p))^2\\ &\cdots \\\ \text{pc}_k &= \left(\left(\mathbf{x} - \sum_{s<k} \text{pc}_s\right) \cdot w^{(k)}\right)^2 \\ \end{align*}

Where each $$\mathbf{w}$$ is required to be a unit vector. The process used to find the actual $$\mathbf{w}$$ values involves is straightforward, but a requires more linear algebra than I want to present here. It is well-explained on Wikipedia if you are comfortable with the concept of eigenvectors.

## In R

First we need some data. I’ll just create a simple dataset containing some simple normalized assets and an indicator for poor and wealthy households.

library(dplyr)
library(ggplot2)
library(ggthemes)

n <- 2000

df <- tibble(
wealthy = rep(c(0, 1), each = n/2),
motorbikes = rbinom(n, 5, 0.25 + 0.2*wealthy),
bicycles = rbinom(n, 5, 0.65 + 0.2*wealthy),
chickens = rbinom(n, 5, 0.75 + 0.2*wealthy),
goats = rbinom(n, 5, 0.5 + 0.2*wealthy),
savings = rnorm(n, 0 + 0.2*wealthy),
land_area = rgamma(n, 1.5 + 0.25*wealthy),
)

We only need a few lines of R to extract the principal components for a dataset.

# Calculate and extract the components
pr <- prcomp(select(df, -wealthy))
pr <- predict(pr, select(df, -wealthy))

# Add the first two components to the data
df$pc1 <- pr[ , 1] df$pc2 <- pr[ , 2]
df\$pc3 <- pr[ , 2]

And that’s it. We can now plot the components to see how well they separate the data.

df %>%
ggplot(aes(pc1, pc2, color = as.factor(wealthy))) +
geom_point() +
scale_color_few()

Note that the first component on its own does a pretty good job separating the groups. Density plots of the principal components can also be quite useful to understand how wealth is distributed; in this case, wealth is pretty normally distributed.

df %>%
select(matches("^pc")) %>%
gather(component, value) %>%
ggplot(aes(value)) +
geom_density() +
facet_wrap(. ~ component)