I am currently working on a project where I am comparing cosine similarity results between a local in-memory dataset and the same dataset stored in Pinecone. Even though both use cosine similarity on the same vectors, the results differ. Has anyone else experienced this issue, or does anyone have suggestions for troubleshooting?
I would appreciate any insights or advice on how to resolve this.
There are a lot of factors that can influence the accuracy of your search. What is the source dataset you’re using? Are you using any filters on top of the similarity search? And which library are you using to perform the search locally? All of these will have an impact on your results.
I am using an ada embeddings dataset generated with OpenAI, with 1536-dimension vectors. About 95k vectors total, with 3 or 4 properties tied to each vector. It is literally the exact same data in both places. Locally, I am using the cosine_similarity function in the openai Python package. Could Pinecone be truncating the 1536 floats, e.g. storing them as float16? The Pinecone index for this dataset uses the cosine metric, so both should return the same answers, but they don't.
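For what it's worth, this is roughly the comparison I'm doing (a simplified sketch, not my exact script; the API key, environment, and index name are placeholders, and I recompute cosine with numpy here instead of the openai helper so the helper isn't a variable):

```python
import numpy as np
import pinecone

# Placeholders: substitute your own key, environment, and index name
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("ada-embeddings")

def cosine_sim(a, b):
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Use one of your real 1536-dim ada vectors here; random is just to keep this runnable
query_vec = np.random.rand(1536).tolist()

# Ask Pinecone for its top matches, returning the stored vectors as well
res = index.query(vector=query_vec, top_k=10, include_values=True)

# Recompute the cosine score locally from the vector Pinecone actually stored.
# If Pinecone's score and the local score disagree for the *same* stored vector,
# precision/normalization is the issue; if only the set of returned IDs differs,
# it points at the search itself.
for match in res.matches:
    local = cosine_sim(query_vec, match.values)
    print(match.id, round(match.score, 6), round(local, 6))
```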
I ran into something similar when comparing Elasticsearch's vector search to Pinecone's. I believe this may be due to Pinecone using approximate kNN search instead of an exact kNN search:
Pinecone uses approximate-nearest-neighbor (ANN) search instead of k-nearest-neighbor (kNN). ANN is a method of performing nearest neighbor search where we trade off some accuracy for a massive boost in performance. Instead of going through the entire list of objects and finding the exact neighbors, it retrieves a “good guess” of an object’s neighbors.
Would love it if someone from Pinecone could confirm this.
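In the meantime, one way to sanity-check whether ANN explains the differences is to brute-force the exact top-k locally and measure the overlap with Pinecone's results. A rough sketch, assuming you still have the 95k vectors and their IDs in local files (the paths and helper names below are placeholders):

```python
import numpy as np

# Placeholder paths; load however you stored the vectors and IDs locally
vectors = np.load("embeddings.npy")          # shape (N, 1536)
ids = np.load("ids.npy", allow_pickle=True)  # N corresponding IDs

def exact_top_k_ids(query_vec, k=10):
    """Exact (brute-force) cosine kNN over the full local dataset."""
    q = np.asarray(query_vec, dtype=np.float64)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return {str(ids[i]) for i in np.argsort(-sims)[:k]}

def recall_at_k(pinecone_ids, query_vec, k=10):
    """Fraction of the exact top-k that Pinecone's result also returned."""
    exact = exact_top_k_ids(query_vec, k)
    return len(exact & set(pinecone_ids)) / k

# If recall comes back below 1.0 over a sample of queries, approximate search is
# the likely explanation; if the same IDs come back but with different scores,
# it's something else (e.g. vector precision or normalization).
```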
Also wondering if there is a way to query Pinecone using an exact kNN algorithm instead of ANN. I was led to believe it supports both in this article, but I don't see any configuration in the Python SDK that would allow for different search strategies.