Pinecone sometimes returns farther vectors before closer vectors for a query

narek · July 15, 2024, 12:57pm

Hello Pinecone Community,

Until now, we had assumed that Pinecone’s scores are comparable in the same way distances are. So, when using L2 distance, for example, points closer to a query would ALWAYS get smaller scores. This assumption seems to fail at times. That is, Pinecone sometimes assigns smaller scores to points that are farther from the query vector.

This issue makes scores returned by Pinecone unusable.

Could you help us understand exactly how Pinecone-returned scores work and whether the discrepancy described in the attached gist is expected?

To reproduce the problem on Pinecone, you can use the following Jupyter notebook, which has example output saved.
The example output shows that, for a particular query, the first two results are the vectors with IDs 96298 and 132797, which are assigned Pinecone scores of 38598 and 38647, respectively.
So, according to the returned results, the vector with ID 96298 is closer to the query than the vector with ID 132797 (at least, that is our understanding of scores).

However, in reality, the order of the two returned vectors is reversed:
The vector with ID 132797 is closer to the query vector, with an L2 distance of 180, than the vector with ID 96298 (L2 distance: 200).
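
For reference, the comparison described above can be sketched as follows. This is a minimal illustration, not the notebook itself: the two-dimensional vectors are made up so that their exact L2 distances to the query are 200 and 180, the scores are the ones quoted above, and `find_order_violations` is a hypothetical helper name.

```python
import math


def l2_distance(a, b):
    """Exact Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_order_violations(query, matches):
    """Return pairs of IDs whose score order disagrees with exact L2 order.

    `matches` is a list of (id, score, vector) tuples in the order returned,
    where a lower score is assumed to mean "closer".
    """
    violations = []
    for i in range(len(matches)):
        for j in range(i + 1, len(matches)):
            id_i, score_i, vec_i = matches[i]
            id_j, score_j, vec_j = matches[j]
            d_i = l2_distance(query, vec_i)
            d_j = l2_distance(query, vec_j)
            # Flag the pair if the better-scored vector is actually farther.
            if score_i < score_j and d_i > d_j:
                violations.append((id_i, id_j))
    return violations


# Made-up vectors chosen only to reproduce the reported distances (200, 180).
query = [0.0, 0.0]
matches = [
    ("96298", 38598.0, [0.0, 200.0]),
    ("132797", 38647.0, [180.0, 0.0]),
]

violations = find_order_violations(query, matches)
print(violations)
```

Any pair reported here is a case where the returned score order and the exact distance order disagree.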

Thanks for any clarification!

evan_pinecone · July 18, 2024, 4:49pm

@narek,

Thank you for reaching out to the Pinecone Community Forum.

Pinecone uses approximate-nearest-neighbor (ANN) search instead of exact search, trading off some accuracy for a boost in performance. This is essential for querying large datasets.

If you are curious, try indexing your dataset and performing the same queries with a Python ANN library, such as PyNNDescent. You should still see sorting discrepancies. This should prove that what you are observing is expected algorithmic behavior.

Please don’t hesitate to reach out with any other questions.

Best,

Evan

narek · July 19, 2024, 4:53pm

I think the issue I am describing is not related to ANN vs. exact search.
Consider the example below and assume we are searching for vector A:

And let’s assume vectors B and C are returned as the nearest two vectors to A.

Per ANN approximations, it is OK that D is closer and yet is not returned.

But my understanding is that all nearest-neighbor algorithms, whether exact or approximate, return the correct ordering of vectors within the returned result set.

That is, when B and C are returned as the closest vectors to A, B should always receive a better score since it is closer.

Is the last sentence supposed to hold for Pinecone?

If so, then there may be an issue since the linked gist demonstrates a violation.

evan_pinecone · July 23, 2024, 5:15pm

@narek,

It is correct that when B and C are returned as the closest vectors to A, B should always receive a better score.

When vectors are upserted into a Pinecone index, Product Quantization is used to reduce the memory needed for storage and speed up nearest-neighbor search. In the process of quantization, some approximations of vector values are made. For the euclidean metric, the score is then computed as the square of the Euclidean distance. This leads to the differences in precision you have observed.
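
To illustrate how approximated vector values can flip an ordering, here is a minimal sketch. It is not Pinecone's implementation: it swaps in simple per-coordinate rounding as a stand-in for Product Quantization, and the vectors, the `step` parameter, and the function names are all made up for the example. The score is the squared Euclidean distance, as described above.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vec, step=1.0):
    """Toy quantizer: snap each coordinate to the nearest multiple of `step`.

    This is only a stand-in for Product Quantization, used to show the effect
    of storing approximate values instead of the original ones.
    """
    return [round(x / step) * step for x in vec]


query = [0.0, 0.0]
vectors = {
    "B": [1.4, 1.4],  # stored as [1.0, 1.0] after quantization
    "C": [1.6, 0.0],  # stored as [2.0, 0.0] after quantization
}

# Scores computed from the quantized (stored) vectors.
approx_scores = {k: squared_l2(query, quantize(v)) for k, v in vectors.items()}
# Scores computed from the original vectors.
exact_scores = {k: squared_l2(query, v) for k, v in vectors.items()}

approx_order = sorted(vectors, key=approx_scores.get)
exact_order = sorted(vectors, key=exact_scores.get)

print("order by approximate score:", approx_order)
print("order by exact score:", exact_order)
```

If exact ordering matters, one option is to fetch the original vectors for the returned IDs and re-sort them client-side by exact distance, as `exact_order` does here.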

Best,

Evan