Why High Dimensional Data Can Be So Troublesome
Have you ever been in the middle of telling someone a story or struggling through a long explanation of something complicated when the other person looks at you and asks, “What’s the point?”
First, your friend is so rude!
But we have all been in your friend's shoes, too: we are all busy people with places to go and folks to see. We want our information quick and to the point.
That is the essence of dimensionality reduction. When confronted with a ton of data, we can use dimensionality reduction algorithms to make the data “get to the point”.
In a previous post, I covered PCA, a powerful and versatile dimensionality reduction algorithm that unearths the underlying trends in our data. There are other options as well — such as non-negative matrix factorization, linear discriminant analysis and more (I promise to cover these in the future).
However, today our topic is not about a specific algorithm, but rather about why we need dimensionality reduction algorithms in the first place — The Curse of Dimensionality.
When is Data High Dimensional and Why Might That Be a Problem?
The Curse of Dimensionality sounds like something straight out of a pirate movie, but what it really refers to is what happens when your data has too many features.
The phrase, attributed to Richard Bellman, was coined to express the difficulty of using brute force (a.k.a. grid search) to optimize a function with too many input variables. Imagine running an Excel Solver optimization where there are 100,000 possible input variables, or in other words, 100,000 potential levers to pull (pretty sure that would crash your Excel).
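To see why brute force buckles so quickly, here is a minimal Python sketch (the 10 candidate values per variable is just an assumption for illustration) of how the number of grid points you would have to evaluate explodes as input variables are added:

```python
# Brute-force grid search has to evaluate every combination of candidate values.
# Assume just 10 candidate values per input variable (an illustrative choice).
candidates_per_variable = 10

for n_variables in [1, 2, 3, 6, 10]:
    grid_points = candidates_per_variable ** n_variables
    print(f"{n_variables:>2} variables -> {grid_points:,} grid points to evaluate")

# Ten variables already means 10,000,000,000 evaluations; 100,000 variables is hopeless.
```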
In today’s big data world it can also refer to several other potential issues that arise when your data has a huge number of dimensions:
If we have more features than observations, then we run the risk of massively overfitting our model, which would generally result in terrible out-of-sample performance (the quick regression sketch after this list illustrates the effect).
When we have too many features, observations become harder to cluster. Believe it or not, too many dimensions cause every observation in your dataset to appear roughly equidistant from all the others, and because clustering uses a distance measure such as Euclidean distance to quantify the similarity between observations, this is a big problem. If the distances are all approximately equal, then all the observations appear equally alike (as well as equally different), and no meaningful clusters can be formed (the short simulation after this list makes the effect visible).
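To make the first issue concrete, here is a rough sketch (using NumPy and scikit-learn on randomly generated data; the sizes are just illustrative assumptions) of what happens when a plain linear regression gets more features than observations:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# 50 observations but 200 features: far more levers than data points.
n_train, n_test, n_features = 50, 50, 200
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))

# The target really depends on only 5 of the features, plus some noise.
true_coef = np.zeros(n_features)
true_coef[:5] = 1.0
y_train = X_train @ true_coef + rng.normal(scale=0.5, size=n_train)
y_test = X_test @ true_coef + rng.normal(scale=0.5, size=n_test)

model = LinearRegression().fit(X_train, y_train)
print("train R^2:", model.score(X_train, y_train))  # essentially 1.0: a perfect-looking fit
print("test  R^2:", model.score(X_test, y_test))    # typically poor, often negative
```

The model can memorize the training set perfectly because it has more knobs than observations, but that memorized fit tells us almost nothing about new data.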
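And to see the second issue in action, here is a short simulation (pure NumPy plus SciPy's pdist; the uniformly random points are just an assumption for illustration) showing how the spread of pairwise Euclidean distances collapses as dimensions are added:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
n_points = 500

for n_dims in [2, 10, 100, 1000]:
    # Points drawn uniformly at random inside the unit hypercube.
    X = rng.uniform(size=(n_points, n_dims))
    dists = pdist(X)  # Euclidean distance between every pair of points
    spread = (dists.max() - dists.min()) / dists.mean()
    print(f"{n_dims:>4} dims: (max - min) / mean distance = {spread:.2f}")

# As the number of dimensions grows, the gap between the closest and farthest
# pairs shrinks relative to the typical distance: every observation starts to
# look roughly equidistant from every other one.
```

With everything sitting at roughly the same distance, a distance-based clustering algorithm has nothing to grab onto.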