I am working on a project where I compare cosine similarity results for the same set of vectors stored in two places: an in-memory local dataset and a Pinecone index. The results differ between the two (both use cosine similarity), even though they should be the same. Has anyone else run into this, or does anyone have suggestions for troubleshooting?
I would appreciate any insights or advice on how to resolve this.
There are a lot of factors that can influence the accuracy of your search. What is the source dataset you’re using? Are you using any filters on top of the similarity search? And which library are you using to perform the search locally? All of these will have an impact on your results.
I am using an ada embeddings dataset generated with OpenAI, with 1536-dimensional vectors. There are about 95k vectors in total, with 3 or 4 properties attached to each vector. It is literally the exact same data in both places. Locally, I am using the cosine_similarity function from the openai Python package. Could Pinecone be truncating the 1536 floats, perhaps to float16? The two should be returning the same answers, but they are not. My Pinecone index for this dataset uses the cosine metric.
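One way to test the float16 idea locally is to compute exact cosine scores at full precision, repeat the computation on a float16 copy of the same vectors, and compare both the scores and the top-k ranking. The following is a minimal sketch, assuming NumPy is available. The random data is a stand-in for the real embeddings, and the helper names cosine_scores and top_k are hypothetical. It makes no claim about how Pinecone stores vectors; it only shows how much reduced precision alone would change the result.

```python
import numpy as np

def cosine_scores(query, matrix):
    """Exact cosine similarity of one query against every row of matrix."""
    query = np.asarray(query, dtype=np.float64)
    matrix = np.asarray(matrix, dtype=np.float64)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return (matrix @ query) / norms

def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return np.argsort(-scores)[:k]

# Random stand-in data (assumption); replace with the real 1536-dimensional embeddings.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 1536)).astype(np.float32)
query = rng.normal(size=1536).astype(np.float32)

full = cosine_scores(query, vectors)
# Cast to float16 first to simulate reduced-precision storage, then score as before.
reduced = cosine_scores(query.astype(np.float16), vectors.astype(np.float16))

max_score_diff = float(np.max(np.abs(full - reduced)))
overlap = len(set(top_k(full, 10)) & set(top_k(reduced, 10)))
print(f"max score difference: {max_score_diff:.6f}")
print(f"top-10 overlap: {overlap}/10")
```

If the score differences seen against Pinecone are much larger than max_score_diff, or the rankings differ far more than the overlap here suggests, precision alone is unlikely to be the explanation, and it would be worth checking filters, the index metric, and whether the uploaded vectors match the local ones exactly.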
Thanks for the help!