Saturday, November 20, 2021

Dimensionality Reduction is Bananas: understanding the concepts without the math


Dimensionality reduction methods like principal component analysis (PCA) or singular value decomposition (SVD) take high-dimensional data and project it onto a lower-dimensional space while retaining the maximum amount of variation. While there are many different dimensionality reduction methods, they all work based on similar principles. In this post, I'll focus on PCA, as it's commonly used in bioinformatics analyses.

A simple example of dimensionality reduction is the act of taking a picture of an object. Here I'll use a banana as an example. A banana has three dimensions: length, width, and height. A picture of a banana has only two: length and height. Therefore, taking a picture of a banana reduces its dimensionality from three dimensions to two.



Let's say the goal of your banana photography is to take a picture that is the best representation of the actual, three-dimensional banana. There are many different angles from which to take a picture of a banana, and some will result in better pictures than others. Compare the example two-dimensional pictures below: which is most clearly recognizable as a banana?



A) is clearly recognizable as a banana, while B) is less clear. In fact, B) starts to resemble pear C) at certain angles.

This example introduced two key concepts of dimensionality reduction:
 
1) The goal of dimensionality reduction is to reduce the size and complexity of a dataset. A picture is a smaller, simpler, and more convenient representation of an actual banana.

2) But the goal is not to reduce dimensionality at all costs; we want to reduce complexity while still retaining the essential features of the dataset.

To better understand the second point, let's reduce the dimensionality of the pictures even more, to just one dimension: height.



While this did make our representations of a banana much simpler, it also removed essential features of the banana that were included in the pictures, for instance the length and curvature. Now it is nearly impossible to distinguish between these three representations. The figure below summarizes these key concepts.


Picture #2 represents dimensionality reduction:
- it reduced the dimensionality of the original data (3D banana -> 2D photo)
- it retained key features of the original data (length, curvature)

What is PCA, in More Detail?

PCA is a statistical method that reduces the dimensionality of an input dataset by creating principal components (PCs). PCs are numeric vectors that are an abstract, compressed representation of the original data. The PCs are calculated using equations that compress the original data while retaining the maximum amount of variation. PCs are numbered according to how much variation from the original dataset they explain, with the first principal component explaining the most variation. Variation in a dataset can also be thought of as "patterns" in the data, or "essential features" of the dataset. Variation is what makes a dataset unique and interesting. 

PCA is an example of 'unsupervised' learning, as you can use it to 'learn' about your data without having to know anything about your data beforehand. Common uses of PCA are to compress a dataset for a downstream analysis, or to identify patterns or biases in the dataset.
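To make this concrete, here's a minimal sketch of running PCA in Python with scikit-learn. The library choice, the random toy data, and the variable names are mine, not part of the original post.

```python
# A minimal PCA sketch using scikit-learn (assumed library choice, toy data).
import numpy as np
from sklearn.decomposition import PCA

# Made-up high-dimensional data: 100 samples x 10 measurements
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

pca = PCA(n_components=2)        # keep only the first two PCs
scores = pca.fit_transform(X)    # compressed representation, shape (100, 2)

# Fraction of the original variation retained by each PC
print(pca.explained_variance_ratio_)
```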

Another Food-Motivated Example

I'll walk through an example of applying PCA and interpreting the outputs. In the example dataset below, we have records from five individuals on how much of each of six foods they've eaten in the last week. This dataset could be considered 'high-dimensional' data, as we have six measurements (dimensions) for each sample.

A common nomenclature you may have seen used to describe datasets is in terms of an n x p matrix:

n = number of samples/individuals (5)

p = number of measurements collected from those samples, i.e. 'dimensions' (6)

Running PCA on the original n x p data matrix generates two smaller matrices:

the scores matrix (n x p): contains the principal components

the loadings matrix* (p x n): contains information about how the original measurements contribute to the principal components

* confusingly, these matrices can go by many different names (ex: loadings matrix vs. rotation matrix). The matrices can always be identified by their shape (n x p) vs. (p x n)
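As a rough sketch of what this looks like in code, the block below builds a small 5 x 6 food table and pulls out the scores and loadings with scikit-learn. The numeric values are invented for illustration (the post's figure only shows dots), and note that scikit-learn keeps at most min(n, p) components, so the shapes differ slightly from the idealized n x p / p x n description above.

```python
# Hypothetical food table: n = 5 individuals, p = 6 foods (values invented).
import pandas as pd
from sklearn.decomposition import PCA

foods = ["beef", "chicken", "pork", "tofu", "rice", "carrots"]
X = pd.DataFrame(
    [[5, 4, 3, 0, 2, 1],   # A: meat eater
     [4, 5, 4, 0, 3, 1],   # B: meat eater
     [6, 3, 5, 1, 2, 2],   # C: meat eater
     [0, 0, 0, 5, 3, 1],   # D: no meat, lots of tofu
     [0, 0, 0, 2, 4, 1]],  # E: no meat, some tofu
    index=list("ABCDE"), columns=foods)

pca = PCA()                        # keep all computable PCs
scores = pca.fit_transform(X)      # "scores" matrix: one row per individual
loadings = pca.components_.T       # "loadings"/rotation matrix: one row per food

# scikit-learn keeps at most min(n, p) = 5 components here,
# so scores is 5 x 5 and loadings is 6 x 5.
print(scores.shape, loadings.shape)
```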

The figure below shows an example of what these two matrices look like. For simplicity, the actual numeric values are represented by dots.


Instead of walking through the math of how these matrices are created, I'll demonstrate the intuition behind the first two PCs of this dataset.

Principal Component 1

PC1 will identify the largest source of variation between individuals in the dataset. In this example, that is meat eaters (pink) vs. not (green). Individuals who eat meat (A, B, C) will get a high numeric value for PC1, while individuals who don't (D, E) will get a low, or possibly negative, value. This is illustrated in the scores matrix on the right. In this example, PC1 represents how 'meaty' an individual's diet is.

Principal Component 2

PC2 will identify any 'leftover' sources of variation that weren't captured by PC1. In this example, the next most striking difference between individuals is consumption of tofu (purple) vs. not (gold). PC2 represents how tofu-heavy an individual's diet is. This is a much more subtle pattern than the first.

While in this example it was easy to identify what patterns each PC represented just by looking at the data, this becomes impossible as the size of the data grows. The loadings matrix can help us understand what each PC represents, regardless of dataset size. 

The loadings matrix shows how important each of the original dataset's dimensions is in the construction of the PCs. Confirming what we noted above, PC1 is mainly constructed from the beef, chicken, and pork values, and PC2 from the tofu values. This matrix can also be used to find features that aren't very informative, such as carrots in this example.
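Here's a small sketch of how you might read the loadings in practice, again using the invented food numbers from the earlier block (rebuilt here so it runs on its own):

```python
# Inspect the loadings to see which foods drive PC1 and PC2 (invented data).
import pandas as pd
from sklearn.decomposition import PCA

foods = ["beef", "chicken", "pork", "tofu", "rice", "carrots"]
X = pd.DataFrame(
    [[5, 4, 3, 0, 2, 1], [4, 5, 4, 0, 3, 1], [6, 3, 5, 1, 2, 2],
     [0, 0, 0, 5, 3, 1], [0, 0, 0, 2, 4, 1]],
    index=list("ABCDE"), columns=foods)

pca = PCA(n_components=2).fit(X)
loadings = pd.DataFrame(pca.components_.T, index=foods, columns=["PC1", "PC2"])
print(loadings.round(2))
# We'd expect beef/chicken/pork to weigh heavily on PC1, tofu on PC2,
# and carrots to contribute little to either.
```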

What About Principal Components 3-5?

PCA will calculate as many PCs as there are samples in a dataset (or original dimensions, whichever is smaller), but the PCs aren't all equally informative. This can be demonstrated using a scree plot, a plot that shows what percent of the total variation in the original dataset each PC explains. A scree plot is a useful tool for deciding how many PCs are required to reasonably approximate your original dataset.

In this example, the first two PCs explain 70% and 16% of the total variation, respectively. PCs 3-5 together explain only 14%. This result tells us that our six-dimensional dataset is reasonably approximated by two dimensions: PC1 ('meatyness') and PC2 ('tofu-heavy').
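A scree plot is easy to produce from the explained-variance ratios; the sketch below uses matplotlib and the same invented food data:

```python
# Scree plot sketch: percent of total variation explained by each PC
# (matplotlib assumed; food values invented).
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA

foods = ["beef", "chicken", "pork", "tofu", "rice", "carrots"]
X = pd.DataFrame(
    [[5, 4, 3, 0, 2, 1], [4, 5, 4, 0, 3, 1], [6, 3, 5, 1, 2, 2],
     [0, 0, 0, 5, 3, 1], [0, 0, 0, 2, 4, 1]],
    index=list("ABCDE"), columns=foods)

pct = 100 * PCA().fit(X).explained_variance_ratio_

plt.bar(range(1, len(pct) + 1), pct)
plt.xlabel("Principal component")
plt.ylabel("% of total variation explained")
plt.title("Scree plot")
plt.show()
```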

Why Use Dimensionality Reduction?

There are two main ways in which dimensionality reduction is used in analyses:
- To better understand sources of variability in the dataset
- To remove batch effects or confounding variables from a downstream analysis

The output of PCA can give you a deeper understanding of your data:

1) Variance explained by each PC - The scree plot will tell you if there are obvious patterns in your data. If most of the variation is explained by PC1, there is one clear pattern in the data (A). If variation is spread over many PCs, the data has no obvious trends (B). In our food example, almost all of the variability in the data could be explained by one pattern: meat-eaters vs. vegetarians.

2) The loadings matrix - This matrix will tell you what features explain the patterns identified by PCA. In our example, the loadings matrix indicated that consumption of beef, chicken, and pork were the key features in the original dataset that explained the most variation between individuals.

3) The principal components - Plotting the first two PCs can help you visualize how samples in your dataset relate to each other, and identify groups of samples that are similar to each other (clusters). The plots below show the first two PCs from our food data (A) and the first two PCs from a genomic analysis of SNP variant calls from 10,000 human exomes [1] (B).

From our food example, you can see how individuals cluster by vegetarian vs. meat-eating diet. In the human genetics example, the largest source of variation, and the driving force behind the clustering, is ancestry.
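Here's a sketch of the kind of PC1-vs-PC2 scatter plot shown in panel (A), using the invented food data; the axis labels are my interpretation of the PCs, not output from the method:

```python
# Sketch: plot the first two PCs to look for clusters of similar individuals.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA

foods = ["beef", "chicken", "pork", "tofu", "rice", "carrots"]
X = pd.DataFrame(
    [[5, 4, 3, 0, 2, 1], [4, 5, 4, 0, 3, 1], [6, 3, 5, 1, 2, 2],
     [0, 0, 0, 5, 3, 1], [0, 0, 0, 2, 4, 1]],
    index=list("ABCDE"), columns=foods)

scores = PCA(n_components=2).fit_transform(X)

plt.scatter(scores[:, 0], scores[:, 1])
for name, (x, y) in zip(X.index, scores):
    plt.annotate(name, (x, y))       # label each individual
plt.xlabel("PC1 ('meatyness')")
plt.ylabel("PC2 (tofu-heavy)")
plt.show()
```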

The other most common use-case for PCA is removing batch effects or confounding variables. Imagine our example food dataset was much larger, and we were interested in finding differences in food consumption between two groups: biologists and bioinformaticians. When we analyze our PCA results, we see the biggest pattern in the data is whether individuals eat meat or not. We don't know if there is an equal number of vegetarians in both of our groups, and we aren't particularly interested in differences in diet due to vegetarianism. This is a pattern we aren't interested in for the purposes of our study, and is therefore a confounding effect.

We can use the PCA results to remove this confounding effect from our study. The first PC quantifies the 'meatyness' of an individual's diet, and therefore we can use PC1 to statistically 'subtract' the effect of eating meat from our original dataset. To do this, we would include PC1 in our statistical model that tests for differences between biologists and bioinformaticians. 

original model: 

biologist food vs. bioinformatician food

corrected model:

(biologist food - 'meatyness') vs. (bioinformatician food - 'meatyness')

The output of the corrected model will tell us if there are differences in food consumption between biologists and bioinformaticians after controlling for overall 'meatyness' of an individual's diet.
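As a rough sketch (not the author's actual analysis), this is what including PC1 as a covariate might look like with statsmodels; the data frame, column names, and simulated values are all hypothetical:

```python
# Sketch: 'subtracting' the meatyness pattern by including PC1 as a covariate
# in a regression (statsmodels assumed; data frame and values are hypothetical).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 40
df = pd.DataFrame({
    "group": rng.choice(["biologist", "bioinformatician"], size=n),
    "PC1": rng.normal(size=n),                  # each person's 'meatyness' score
})
# A made-up food-consumption outcome that is mostly driven by meatyness
df["food_total"] = 2.0 * df["PC1"] + rng.normal(size=n)

uncorrected = smf.ols("food_total ~ group", data=df).fit()
corrected = smf.ols("food_total ~ group + PC1", data=df).fit()  # controls for PC1

print(uncorrected.params)
print(corrected.params)
```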

In a similar vein, often when scientists study human genetics, they are interested in how genes relate to a disease or some other trait or characteristic (phenotype). Understanding the relationship between genetic variants and a phenotype of interest can be confounded by ancestry in the same way vegetarianism can confound our analysis of food consumption differences between groups of scientists. It is common practice to include the first two PCs in the statistical model when doing genetic association studies.

Similarly, PCs are often used to correct for batch effects, or differences between samples due to differences in collection methods. The beauty of using PCs to remove confounding effects is that you don't need to know beforehand what the confounding or batch effects will look like; you can learn about patterns in the dataset in an unbiased way.


[1] Buckley AR, Standish KA, Bhutani K, Ideker T, Lasken RS, Carter H, Harismendy O, Schork NJ. Pan-cancer analysis reveals technical artifacts in TCGA germline variant calls. BMC Genomics. 2017 Jun 12;18(1):458. doi: 10.1186/s12864-017-3770-y. PMID: 28606096; PMCID: PMC5467262.

All images from: https://pixabay.com/