Straightforward Guide to Dimensionality Reduction

discobot · November 15, 2022, 9:28pm

There is a golden rule in Machine Learning that states: the more data, the better. This rule is a double-edged sword. An indiscriminate addition of data might introduce noise, reduce model performance, and slow down its training process. In this case, more data can hurt model performance, so it’s essential to understand how to deal with it.

In Machine Learning, “dimensionality” refers to the number of features (i.e. input variables) in a dataset. These features (represented as columns in a tabular dataset) fed into a Machine Learning model are called predictors or “p”, and the samples (represented as rows) “n“. Most Machine Learning algorithms assume that there are many more samples than predictors. Still, sometimes the scenario is exactly the opposite, a situation referred to as “big-p, little-n” (“p” for predictors, and “n” for samples).

This is a companion discussion topic for the original entry at https://www.pinecone.io/learn/dimensionality-reduction/