Poor similarity search results

I’m trying to build natural language search using Pinecone Serverless to store embeddings created by Cohere-embed-multilingual-v3.0 in Amazon Bedrock, and I query with cosine similarity through the REST API (/query). I fetch one embedding per sentence and store it as a vector in Pinecone, with metadata containing the document ID. All sentences are in Swedish.

The issue is that my search hits are poor. Even if I search for a word exactly as it appears in a sentence, I get hits for sentences that don’t even contain that word.

I have built a test that fetches embeddings for three different sentences, stores them in a new index, and queries with a word that appears in one of the sentences but not in the others. The query results rank the expected hit below the other sentences.

Hi @pontus, thanks for the post. There are a couple of points to consider and a few avenues you could test.

First, note that with Pinecone’s vector database, you are performing semantic searches. Thus, the presence of an individual word does not necessarily guarantee high similarity between the query vector and vectors that include the word. Receiving results that don’t contain the search term is certainly possible, and should be expected.
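To illustrate the point, here is a minimal sketch of how cosine similarity ranks results. The three-dimensional vectors and the sentence labels are made up for illustration; real embeddings from your model have many more dimensions. The ranking depends only on the angle between the vectors, so nothing in the computation checks whether the query word appears in a sentence.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, not real model output.
query = [0.9, 0.1, 0.2]
candidates = {
    "sentence_a": [0.8, 0.2, 0.1],
    "sentence_b": [0.1, 0.9, 0.3],
    "sentence_c": [0.2, 0.1, 0.9],
}

# Rank candidates from most to least similar to the query vector.
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
print(ranked)
```

If you run the same calculation locally on the embeddings from your three test sentences and get the same order as Pinecone returns, the ranking comes from the embeddings themselves rather than from the index.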

Second, I encourage you to try other embedding models, or to test the same text in a different language, to evaluate the performance of the specific model you are using. How well the model embeds text in your language is critical and could play a role here.

Lastly, please feel free to share how you are querying your index. I’m curious to take a look and ensure that everything Pinecone-related is set up optimally. Looking forward to hearing from you!
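For comparison, a request to the /query endpoint might be built roughly like the sketch below. It uses only the Python standard library and does not send anything. The index host, API key, and query vector are placeholders, and `build_query_request` is a hypothetical helper name, so please adapt it to your setup rather than treating it as the definitive form.

```python
import json
import urllib.request


def build_query_request(index_host, api_key, vector, top_k=3):
    """Build (but do not send) a POST request for the /query endpoint."""
    body = {
        "vector": vector,          # embedding of the search text
        "topK": top_k,             # number of matches to return
        "includeMetadata": True,   # return the stored document ID metadata
    }
    return urllib.request.Request(
        url=f"https://{index_host}/query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values; replace with your own index host, key, and embedding.
request = build_query_request("YOUR-INDEX-HOST", "YOUR-API-KEY", [0.1, 0.2, 0.3])
print(request.full_url)
print(request.data.decode("utf-8"))
```

One thing worth checking when you compare this with your own code is that the query vector is produced by the same model, with the same dimension, as the stored vectors.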