I have a Pinecone index that stores vectors for food items.
Each vector is the embedding of a JSON string that contains the name of a food item and a short description. I added the description to improve retrieval quality.
This is how the embedded strings are formatted:
{"name": "Pickle", "description": "Preserved cucumber, submerged in an acidic solution like vinegar or a saltwater brine, often with various spices, noted for their tangy flavor."}
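For concreteness, here is a minimal sketch of how a string in this format could be produced before embedding. It assumes Python and the standard `json` module; the helper name `format_food_item` is hypothetical, and the embedding call itself is not shown.

```python
import json


def format_food_item(name: str, description: str) -> str:
    """Serialize a food item into the JSON string that gets embedded.

    Hypothetical helper, for illustration only.
    """
    # ensure_ascii=False keeps any non-ASCII characters readable in the string.
    return json.dumps({"name": name, "description": description}, ensure_ascii=False)


pickle_text = format_food_item(
    "Pickle",
    "Preserved cucumber, submerged in an acidic solution like vinegar or a "
    "saltwater brine, often with various spices, noted for their tangy flavor.",
)
print(pickle_text)
```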
One particular search result confuses me and makes me wonder whether there is a better way to format the strings I embed.
I queried the index with the embedding of the following string:
{"name": "Coconut Oil", "original_query": "cocnut oil", "description": "edible oil extracted from the kernel or meat of mature coconuts"}
The top three results were the embeddings of the following strings:
Top 1:
{"name": "Olive Oil", "description": "fat extracted from olives, often used in cooking or as a condiment"}
Top 2:
{"name": "Coconut Oil", "description": "Coconut fat, often referred to as coconut oil, is a type of fat that is extracted from the meat of mature coconuts. It’s known for its unique composition, and high saturated fat content."}
Top 3:
{"name": "Coconut Fat", "description": "Coconut fat, often referred to as coconut oil, is a type of fat that is extracted from the meat of mature coconuts. It’s known for its unique composition, and high saturated fat content."}
Based on my understanding of how vector similarity works, I don't see why Olive Oil would rank first, ahead of two entries that are so much closer to the query.
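To spell out the expectation behind that statement: the sketch below ranks candidates by cosine similarity to a query vector. It is only an assumption that the index uses cosine similarity (the metric is not stated above), the function names are hypothetical, and the vectors are made-up toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate names sorted from most to least similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return [name for name, _ in sorted(scored, key=lambda pair: pair[1], reverse=True)]


# Toy 3-dimensional vectors, invented for illustration only.
toy_query = [1.0, 0.0, 0.0]
toy_candidates = {"near": [0.9, 0.1, 0.0], "far": [0.1, 0.9, 0.0]}
print(rank_by_similarity(toy_query, toy_candidates))
```

Under this assumption, the stored vector pointing in nearly the same direction as the query ranks first, which is why the Olive Oil result surprises me.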
What can I do to improve my retrieval?