How do you design the solution when the data the reducer is fit on changes every month? Any case studies worth sharing?
Coming from a quant-bio background I've wondered about this too, but I haven't seen it used for vector search. In genomic analysis, carrying PCA/UMAP projections over to unseen data/samples is common, but the projection is not re-calculated on the unseen data; the trained projection matrix is just reused.
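To make the "reuse the trained projection matrix" idea concrete, here is a minimal NumPy-only sketch (the data and dimensions are hypothetical): PCA is fit once on a reference batch, and the stored mean and projection matrix are then applied to later, unseen samples without refitting.

```python
import numpy as np

# Hypothetical setup: a reference batch we fit on once, and an "unseen"
# batch that arrives later (e.g. next month's data).
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 50))  # 200 samples, 50 dims
unseen = rng.normal(size=(30, 50))      # 30 new samples

# Fit: center the reference data and take the top-k right singular
# vectors as the PCA projection matrix.
mean = reference.mean(axis=0)
_, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
k = 10
projection = vt[:k].T                   # (50, 10) projection matrix

# Reuse: apply the *stored* mean and projection to the unseen samples,
# without re-fitting PCA on them.
reduced = (unseen - mean) @ projection
print(reduced.shape)                    # (30, 10)
```

The same pattern applies with `sklearn.decomposition.PCA`: call `fit` once and keep calling `transform` on new data.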
@nlpguy This article might be helpful: PCA using Python (scikit-learn) | by Michael Galarnyk | Towards Data Science
Nandan Thakur has a finished paper on this (currently in review on my desk; we will publish it next month).
PCA worked rather badly in our experiments for semantic search: performance dropped quite a bit.
The question is what you want from smaller dimensions: faster retrieval speed? Saving memory for the index? Two things that work quite well:
- Storing floats as FP16 or FP8 - Reduces the embedding matrix size by a factor of 2 or 4 without sacrificing performance. Whether the multiplications are also faster depends on your hardware and whether it supports FP16.
- Using PQ (product quantization) to compress your embeddings - Can reduce the embedding matrix size significantly and leads to faster search (depending on your ANN index), but also reduces performance.
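A quick sketch of the FP16 point above, with a hypothetical embedding matrix: casting the stored floats to half precision halves the memory footprint, and the round-trip error is small for typical embedding magnitudes.

```python
import numpy as np

# Hypothetical embedding matrix: 1000 vectors of dimension 384 in FP32.
rng = np.random.default_rng(1)
emb = rng.normal(size=(1000, 384)).astype(np.float32)

# Store as FP16: half the bytes.
emb_fp16 = emb.astype(np.float16)
print(emb.nbytes // emb_fp16.nbytes)  # 2

# Round-trip error is small relative to unit-scale embedding values.
max_err = np.abs(emb - emb_fp16.astype(np.float32)).max()
print(max_err < 1e-2)                 # True
```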
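And a minimal, NumPy-only sketch of the PQ idea (real systems would use a library like faiss; the sizes and codebook parameters here are made up for illustration): each vector is split into m subvectors, a small codebook is learned per subspace with a few k-means steps, and only one uint8 code per subspace is stored.

```python
import numpy as np

# Hypothetical embeddings: 500 vectors of dimension 64 in FP32.
rng = np.random.default_rng(2)
emb = rng.normal(size=(500, 64)).astype(np.float32)

m, ks = 8, 16                      # 8 subspaces, 16 centroids per subspace
sub = emb.reshape(500, m, 64 // m) # split each vector into m subvectors

codebooks, code_cols = [], []
for j in range(m):
    x = sub[:, j, :]
    # Initialize centroids from random data points, then run a few
    # plain k-means iterations in this subspace.
    cent = x[rng.choice(len(x), ks, replace=False)].copy()
    for _ in range(10):
        d = ((x[:, None, :] - cent[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for c in range(ks):
            pts = x[assign == c]
            if len(pts):
                cent[c] = pts.mean(0)
    codebooks.append(cent)
    code_cols.append(assign.astype(np.uint8))

codes = np.stack(code_cols, axis=1)  # (500, 8) uint8 codes per vector
print(emb.nbytes // codes.nbytes)    # 32x smaller (ignoring the codebooks)
```

At search time, distances are approximated from the codes and codebooks rather than the full vectors, which is where the speed and memory win comes from (and where the accuracy loss mentioned above comes in).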
Hey @greg! My question is more about the performance and engineering aspects of using it for search.
Thanks @NilsReimers for the heads-up! My motivation for this came from your insights on the SBERT site.