Cosine similarity: same dataset (one copy local in memory, one in Pinecone), different answers

I am working on a project where I compare cosine similarity results for the same vectors stored in two places: a local in-memory dataset and a Pinecone index. The results differ between the two (both use cosine similarity), even though the data is supposed to be identical. Has anyone else run into this, or does anyone have suggestions for troubleshooting?

I would appreciate any insights or advice on how to resolve this.

Hi @mdigital,

There are a lot of factors that can influence the accuracy of your search. What is the source dataset you’re using? Are you using any filters on top of the similarity search? And which library are you using to perform the search locally? All of these will have an impact on your results.

Cory

Hi Cory,

I am using an ada embeddings dataset, generated by OpenAI, with 1536-dimension vectors. About 95k vectors total, with 3 or 4 properties tied to each vector. It is literally the exact same data in both places. Locally, I am using the cosine_similarity function in the openai Python package. Could Pinecone be storing the 1536 floats at reduced precision, for example as float16? The two should be giving the same answers, but they are not. My Pinecone index for this dataset uses the cosine metric.
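For what it's worth, here is a rough way to see how much a float16 round-trip alone would move a cosine score. This is only a sketch using NumPy: the random unit vectors are made-up stand-ins for real embeddings, and the local cosine_similarity helper is my own, not the one from the openai package. It says nothing about how Pinecone actually stores vectors.

```python
import numpy as np

def cosine_similarity(a, b):
    """Plain cosine similarity, computed in float64."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Assumption: random unit vectors stand in for real 1536-dim embeddings.
rng = np.random.default_rng(0)
a = rng.standard_normal(1536)
b = rng.standard_normal(1536)
a = (a / np.linalg.norm(a)).astype(np.float32)
b = (b / np.linalg.norm(b)).astype(np.float32)

# Score at float32 precision vs. after truncating both vectors to float16.
full = cosine_similarity(a, b)
half = cosine_similarity(a.astype(np.float16), b.astype(np.float16))
diff = abs(full - half)

print(f"float32: {full:.6f}  float16: {half:.6f}  abs diff: {diff:.2e}")
```

If the gap between the local and Pinecone scores is far larger than the difference printed here, reduced precision on its own seems unlikely to explain it.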

Thanks for the help!

I encountered the same problem, and I found that the openai cosine_similarity gives much better results than a Pinecone query. Can someone help?

Both the data and the query use the same ada-002 embedding model from OpenAI.

I ran into something similar when comparing Elasticsearch’s vector search to Pinecone’s. I believe this may be due to Pinecone using approximate kNN search instead of exact kNN search:

Pinecone uses approximate-nearest-neighbor (ANN) search instead of k-nearest-neighbor (kNN). ANN is a method of performing nearest neighbor search where we trade off some accuracy for a massive boost in performance. Instead of going through the entire list of objects and finding the exact neighbors, it retrieves a “good guess” of an object’s neighbors.

Source: Ludicrous BERT Search Speeds | Pinecone

Would love if someone from pinecone could confirm this :grinning:

Also wondering if there is a way to query Pinecone using an exact kNN algorithm instead of ANN. This article led me to believe it supports both, but I don't see any configuration in the Python SDK that would allow for different search algorithm strategies.
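In the meantime, one way to check whether approximation explains the gap is to compute the exact top-k locally by brute force and measure how many of those IDs the other system returns (recall@k). Below is a minimal sketch with NumPy. The data is synthetic, exact_top_k and recall_at_k are hypothetical helper names of my own, and approx_ids is a made-up placeholder for whatever IDs a Pinecone query returns. No Pinecone call is shown.

```python
import numpy as np

def exact_top_k(query, vectors, k):
    """Brute-force cosine top-k: returns row indices, best match first."""
    vectors = np.asarray(vectors, dtype=np.float64)
    query = np.asarray(query, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    sims = (vectors @ query) / norms
    return np.argsort(-sims)[:k].tolist()

def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact neighbors that the approximate result also found."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)

# Synthetic stand-in data: 1000 vectors, and a query very close to row 42.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 64))
query = vectors[42] + 0.01 * rng.standard_normal(64)

exact_ids = exact_top_k(query, vectors, k=5)

# Hypothetical ANN result: four correct neighbors plus one that is not in the exact set.
missing = next(i for i in range(len(vectors)) if i not in exact_ids)
approx_ids = exact_ids[:4] + [missing]

print("exact top-5:", exact_ids)
print("recall@5:", recall_at_k(exact_ids, approx_ids))
```

With real data, approx_ids would be replaced by the IDs from the vector database, mapped to the same row numbering as the local array. A recall well below 1.0 across many queries would point toward approximation, while a recall near 1.0 with different scores would point toward something else, such as mismatched data or metric.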