Dimensionality Reduction is Bananas
Dimensionality reduction methods like principal component analysis (PCA) or singular value decomposition (SVD) take high-dimensional data and project it onto a lower-dimensional space while retaining as much of the original variation as possible. While there are many different dimensionality reduction methods, they all work on similar principles. In this post, I'll focus on PCA, as it's commonly used in bioinformatics analyses.
A simple example of dimensionality reduction is taking a picture of an object. Here I'll use a banana as an example. A banana has three dimensions: length, width, and height. A picture of a banana has only two: length and height. Taking a picture of a banana therefore reduces its dimensionality from three dimensions to two.
PCA is a statistical method that reduces the dimensionality of an input dataset by creating principal components (PCs). PCs are numeric vectors that are an abstract, compressed representation of the original data. The PCs are calculated using equations that compress the original data while retaining the maximum amount of variation. PCs are numbered according to how much variation from the original dataset they explain, with the first principal component explaining the most variation. Variation in a dataset can also be thought of as "patterns" in the data, or "essential features" of the dataset. Variation is what makes a dataset unique and interesting.
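To make this concrete, here's a minimal sketch of PCA using scikit-learn. The data are random numbers standing in for any high-dimensional measurements, so the specific values don't mean anything; the point is the shape of the output and that the PCs come back ordered by how much variation they explain.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 100 samples with 6 measurements each; the values are arbitrary.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

pca = PCA(n_components=2)        # keep only the first two PCs
scores = pca.fit_transform(X)    # each sample compressed to 2 numbers

print(scores.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)  # PCs ordered by variation explained
```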
Another Food-Motivated Example
I'll walk through an example of applying PCA and interpreting the outputs. In the example dataset below, we have records from five individuals on how much of six foods they've eaten in the last week. This dataset could be considered 'high dimensional' data, as we have six measurements (dimensions) for each sample.
A common nomenclature you may have seen for describing datasets is an n x p matrix (one is sketched in code after the list), where:
n = number of samples/individuals (5)
p = number of measurements collected from those samples - 'dimensions' (6)
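Here's that layout in code. The numbers below are invented for illustration (as is the sixth food, 'rice'); they're not the values from the post's figure, but the shape is what matters: five individuals (n) by six foods (p).

```python
import pandas as pd

# Hypothetical food diary: servings of each food eaten in the last week.
# The numbers (and the sixth food, "rice") are invented for illustration.
foods = pd.DataFrame(
    {
        "beef":    [3, 4, 2, 0, 0],
        "chicken": [2, 3, 4, 0, 0],
        "pork":    [1, 2, 3, 0, 0],
        "tofu":    [0, 0, 0, 4, 1],
        "carrots": [2, 2, 2, 2, 2],   # nearly identical for everyone
        "rice":    [1, 1, 2, 1, 2],
    },
    index=["A", "B", "C", "D", "E"],  # n = 5 individuals
)

print(foods.shape)  # (5, 6) -> an n x p matrix
```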
Running PCA on the original n x p data matrix generates two smaller matrices (both appear in the snippet after this list):
the scores matrix (n x p): contains the principal components
the loadings matrix (p x n): contains information about how the original measures contribute to the principal components
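Continuing with the hypothetical foods DataFrame from the earlier snippet (so this is a sketch of the idea, not the post's exact analysis), running PCA produces both matrices:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first so foods measured on larger scales don't dominate.
X = StandardScaler().fit_transform(foods)

pca = PCA()                     # keep every PC
scores = pca.fit_transform(X)   # scores matrix: one row per individual
loadings = pca.components_.T    # loadings matrix: one row per food

print(scores.shape)    # (5, 5): individuals x PCs (one PC per sample here)
print(loadings.shape)  # (6, 5): foods x PCs

# PC1 should separate the meat eaters (A, B, C) from D and E,
# though the sign of the scores is arbitrary.
print(dict(zip(foods.index, scores[:, 0].round(2))))
```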
PC1 will identify the largest source of variation between individuals in the dataset. In this example, that is meat eaters (pink) vs. not (green). Individuals who eat meat (A, B, C) will get a high numeric value for PC1, while individuals who don't (D, E) will get a low, or possibly negative, value. This is illustrated in the scores matrix. In this example, PC1 represents how 'meaty' an individual's diet is.
Principal Component 2
PC2 will identify any 'leftover' sources of variation that weren't captured by PC1. In this example, the next most striking difference between individuals is consumption of tofu (purple) vs. not (gold). PC2 represents how tofu-heavy an individual's diet is. This is a much more subtle pattern than the first.
While in this example it was easy to identify what patterns each PC represented just by looking at the data, this becomes impossible as the size of the data grows. The loadings matrix can help us understand what each PC represents, regardless of dataset size.
The loadings matrix shows how important each of the original dataset's dimensions is in the construction of the PCs. Confirming what we noted above, PC1 is mainly constructed from the beef, chicken, and pork values, and PC2 from the tofu values. This matrix can also be used to find features that aren't very informative, such as carrots in this example.
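Continuing the same sketch, you can wrap the loadings in a labelled table and ask which foods drive each PC. With the made-up numbers above the exact ordering may differ from the post's figure, but the uninformative carrots column should sit near zero everywhere.

```python
import pandas as pd

# Label the loadings: rows are the original foods, columns are the PCs.
loadings_df = pd.DataFrame(
    loadings,
    index=foods.columns,
    columns=[f"PC{i + 1}" for i in range(loadings.shape[1])],
)

# Foods with the largest absolute loadings contribute most to a PC.
print(loadings_df["PC1"].abs().sort_values(ascending=False))
print(loadings_df["PC2"].abs().sort_values(ascending=False))
```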
What About Principal Components 3-5?
PCA will calculate as many PCs as there are samples in a dataset, but the PCs aren't all equally informative. This can be demonstrated using a scree plot, a plot that shows what percent of the total variation in the original dataset each PC explains. A scree plot is a useful tool for deciding how many PCs are required to reasonably approximate your original dataset.
In this example, the first two PCs explain 70% and 16% of the total variation, respectively. PCs 3-5 only explain 14%. This result tells us that our six-dimensional dataset is reasonably approximated by two dimensions: PC1 ('meatiness') and PC2 ('tofu-heaviness').
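Continuing the sketch one more time, a scree plot is just a bar chart of pca.explained_variance_ratio_. The percentages above come from the post's figure; the made-up data here will produce its own numbers, but the shape should be similar, with the first couple of PCs dominating.

```python
import matplotlib.pyplot as plt

# Scree plot: percent of total variation explained by each PC.
pct = pca.explained_variance_ratio_ * 100
pcs = range(1, len(pct) + 1)

plt.bar(pcs, pct)
plt.xlabel("Principal component")
plt.ylabel("% of total variation explained")
plt.title("Scree plot")
plt.show()

print(pct.round(1))  # the first couple of PCs should dominate
```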
Why Use Dimensionality Reduction?
- To better understand sources of variability in the dataset
- To remove batch effects or confounding variables from a downstream analysis
[Figure: (B) the first two PCs from a genomic analysis of SNP variant calls from 10,000 human exomes [1].]