Hey, all! This is my first post to the Pinecone community. I posted this on the LangChain Discord first, but it doesn’t seem to be a LangChain issue.
Using Pinecone to store my vector embeddings works perfectly well until I add metadata to the vectors. Adding the metadata itself works fine. It’s when I then query the vector store that things go weird: the similarity search is suddenly terrible at retrieving vectors similar to the query. Sometimes it returns the same vector four times! And in general it does a poor job. As soon as I index my data without the metadata, it works perfectly again.
I have two indexes set up in Pinecone. They’re identical except that one has metadata and the other doesn’t. When used in my LangChain-based chatbot, the one without metadata behaves exactly as I would expect, answering questions correctly. The one with the metadata seems to be brain dead. Instead of answering, the chatbot says “I’m sorry, but the document parts provided do not contain information about…”.
In my debug output I can see the vectors returned from the similarity search and the metadata associated with each. But as described, the vectors are very badly chosen, and sometimes the same one is returned multiple times.
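For anyone who wants to check their own debug output for the same symptom, here is a minimal sketch. It assumes the search results have already been copied into a plain list of dicts; the field names `id`, `score`, and `text`, the helper name `find_repeated_texts`, and the sample data are all hypothetical and made up for illustration. It makes no Pinecone or LangChain calls. It only tells you whether the repeated results share one ID or are the same text stored under several IDs.

```python
from collections import defaultdict


def find_repeated_texts(matches):
    """Group matches by their text and keep only texts that occur more than once.

    `matches` is assumed to be a list of dicts with "id" and "text" keys
    (hypothetical field names, not taken from any library).
    Returns {text: [ids in the order they appeared]}.
    """
    ids_by_text = defaultdict(list)
    for match in matches:
        ids_by_text[match["text"]].append(match["id"])
    # Keep only the texts that showed up in more than one result.
    return {text: ids for text, ids in ids_by_text.items() if len(ids) > 1}


# Made-up sample standing in for the debug output of one similarity search.
sample_matches = [
    {"id": "a1", "score": 0.91, "text": "chunk about refunds"},
    {"id": "b7", "score": 0.91, "text": "chunk about refunds"},
    {"id": "c3", "score": 0.74, "text": "chunk about shipping"},
]

repeated = find_repeated_texts(sample_matches)
for text, ids in repeated.items():
    # Distinct IDs mean separate stored vectors carry the same text;
    # a single repeated ID means the very same vector came back more than once.
    print(f"{text!r} returned {len(ids)} times under {len(set(ids))} distinct id(s)")
```

If the repeated results turn out to have different IDs, the index holds several copies of the same chunk, which is a different situation from one vector being returned multiple times.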
This doesn’t make any sense to me. Does adding metadata to vectors really affect the results of a similarity search? I’m not the only one with this problem.