Euclidean distance vs. search space size

calexander0614 · January 2, 2025, 5:36pm

Calculating the Euclidean distance between two vectors yields different results depending on which index I query. I have two separate indices, one with 200,000+ embeddings and another with 5 embeddings. When I query both indices with the same embedding, the top match is the same embedding in both, but the Euclidean distance reported for it differs. The distance returned from the smaller index matches the distance I calculate locally; the distance from the larger index does not. Can I get confirmation that the Euclidean distance changes as the search space increases? If so, would either of the other metrics (cosine similarity or dot product) alleviate this issue?
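For reference, here is a minimal sketch of the kind of local check described above. It uses made-up random vectors rather than the real embeddings, and `euclidean` and `exact_top_match` are illustrative helper names, not part of any vector database API. Under an exact (brute-force) search, the distance to the top match depends only on the query and the matched vector, not on how many other vectors are in the collection.

```python
import math
import random


def euclidean(a, b):
    """Exact Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_top_match(query, vectors):
    """Brute-force search: return (position, distance) of the closest vector."""
    scored = ((i, euclidean(query, v)) for i, v in enumerate(vectors))
    return min(scored, key=lambda pair: pair[1])


random.seed(0)
dim = 8

# Hypothetical data: one stored vector of interest and a query close to it.
target = [random.random() for _ in range(dim)]
query = [t + 0.01 for t in target]


def far_vector():
    """A filler vector that lies far from the query in every component."""
    return [random.random() + 5 for _ in range(dim)]


# Two collections of very different sizes, both containing `target`.
small_index = [target] + [far_vector() for _ in range(4)]
large_index = [target] + [far_vector() for _ in range(2000)]

small_hit = exact_top_match(query, small_index)
large_hit = exact_top_match(query, large_index)

# With exact search, both collections report the same match and distance.
print(small_hit, large_hit)
```

If a hosted index reports a different number for the same pair of vectors, the difference comes from how that index stores or scores vectors, not from the Euclidean distance formula itself.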

arjun · January 15, 2025, 11:10pm

Hi calexander0614, welcome to the forum!

Could you reply with the dataset you are using and its properties, along with your motivation for choosing the Euclidean distance metric? If you can, please also include the exact results and the discrepancy you are seeing.

This will help us understand what could be causing the discrepancy.

Additionally, what specifically do you need the Euclidean distance score for? Many users rely on the rank (that is, the ordering) of the returned results rather than the raw scores themselves. Does your use case require the raw score?

The only hunch I have is that an approximate search may start to kick in once an index reaches hundreds of thousands of vectors, whereas the smaller index would be searched exactly, but it is hard to tell without understanding the data source.
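To illustrate that hunch only, here is a rough sketch of one way an approximate search could report a slightly different distance: if stored vectors are kept in a lossy compressed form, distances are computed against the reconstructed vector rather than the original. The 8-bit scalar quantization below, the value range, and the helper names `quantize` and `dequantize` are all assumptions made for the example; nothing in this thread confirms that the index in question works this way.

```python
import math


def euclidean(a, b):
    """Exact Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def quantize(vec, lo=-1.0, hi=1.0, levels=256):
    """Map each component to one of `levels` evenly spaced integer codes."""
    step = (hi - lo) / (levels - 1)
    return [round((min(max(x, lo), hi) - lo) / step) for x in vec]


def dequantize(codes, lo=-1.0, hi=1.0, levels=256):
    """Reconstruct approximate component values from the integer codes."""
    step = (hi - lo) / (levels - 1)
    return [lo + c * step for c in codes]


# Hypothetical stored vector and query, chosen only for illustration.
stored = [0.1234, -0.5678, 0.9012, 0.3456]
query = [0.1300, -0.5600, 0.9100, 0.3400]

# Distance against the original vector (what a local calculation gives).
exact = euclidean(query, stored)

# Distance against the lossy reconstruction (what a compressed index might give).
approx = euclidean(query, dequantize(quantize(stored)))

print(exact, approx, abs(exact - approx))
```

In this sketch the ranking of results would usually be unaffected by such small errors, which is one reason the ordering tends to be more stable than the raw score.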

Thanks in advance,

Sincerely,
Arjun