I am creating a vector database for e-commerce skin and beauty care products, using OpenAI embeddings and Pinecone set up via LangChain.
The results are not always accurate. Many questions/answers work great, but certain categories of questions return semantically incorrect answers. When I use the retriever to inspect the documents for an example question it gets wrong, "I am looking for a detoxifying lotion", all the hits are incorrect: it pulls back hair products that have no mention (keyword or semantic) of detoxifying, and they are not technically lotions either. When I update the query to "I am looking for a detoxifying face lotion", the top 3 hits are correct, but the 4th hit is still a miss with hair products (and there are other, more relevant hits that should rank higher).
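For reference, this is roughly how I am inspecting the hits (the index name, model name, and helper function are illustrative, not my exact code). I use `similarity_search_with_score` rather than the plain retriever so I can see the raw similarity scores Pinecone assigns to each hit:

```python
# Sketch of how I inspect retrieval results with their similarity scores.
# Index name and embedding model are placeholders; requires OPENAI_API_KEY
# and PINECONE_API_KEY in the environment to actually run.
import os


def inspect_query(query: str, k: int = 4):
    """Return (document, score) pairs so individual misses can be examined."""
    # Imports are deferred so the sketch can be read/loaded without the keys set.
    from langchain_openai import OpenAIEmbeddings
    from langchain_pinecone import PineconeVectorStore

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    store = PineconeVectorStore(index_name="beauty-products", embedding=embeddings)
    # Unlike retriever.invoke(), this exposes the score Pinecone computed
    # for each hit, which is what I am trying to understand.
    return store.similarity_search_with_score(query, k=k)


if __name__ == "__main__" and os.getenv("PINECONE_API_KEY"):
    for doc, score in inspect_query("I am looking for a detoxifying lotion"):
        print(f"{score:.4f}  {doc.page_content[:80]}")
```

Looking at the scores, the hair-product hits are not even close in content, yet they still score high enough to make the top 4.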
When I feed the raw data files into OpenAI directly and ask it to rank the relevant documents for these questions, it never ranks the hair care products for either question, and it ranks the documents completely correctly, which leads me to want to better understand how Pinecone's similarity search works.
With this said, the Pinecone results are better when I create a significant amount of synthetic data and append it to the original files, but I still get misses. I unfortunately cannot share the raw data here, so this is more of a general question about what Pinecone is doing under the hood and why it is going in the wrong direction semantically.
- Is there a way to better understand why similarity search returns the results it returns?
- Are there best practices for enriching/enhancing data so Pinecone can understand it as well as ChatGPT does? (I assumed it was 1-to-1, since I am using an OpenAI embedding.)
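My mental model is that Pinecone just computes vector similarity (e.g. cosine) between the query embedding and each document embedding, with no reranking or reasoning on top. To sanity-check that, I have been recomputing cosine similarity myself outside Pinecone; the sketch below uses hypothetical toy vectors in place of real OpenAI embeddings so it runs standalone:

```python
# Offline sanity check of the ranking: cosine similarity between a query
# vector and candidate document vectors. The 4-d vectors below are made-up
# stand-ins for real embeddings, just to illustrate the computation.
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [0.9, 0.1, 0.2, 0.1]          # "detoxifying lotion" (hypothetical)
face_lotion = [0.8, 0.2, 0.1, 0.2]    # a relevant product (hypothetical)
hair_product = [0.1, 0.9, 0.8, 0.1]   # an irrelevant product (hypothetical)

# With embeddings that actually capture the semantics, the relevant
# product should score higher than the irrelevant one.
print(cosine(query, face_lotion))   # higher
print(cosine(query, hair_product))  # lower
```

If Pinecone is doing nothing more than this, then the misses presumably come from the embeddings themselves placing the hair products near my query, which is why I am asking about data enrichment.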
Also, if I am off-base here, please let me know! Thanks.