Query results are irrelevant

Hi,
I am new to Pinecone. I am using it in conjunction with OpenAI's chat feature.

I get all the vectors from OpenAI's embeddings endpoint (for both the data and the query).

When I query Pinecone, I get back unrelated data (score 0.7+), while a particular record that I think is quite relevant is not returned.
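For reference, my flow is roughly equivalent to the sketch below (Python, using the openai and pinecone-client libraries; the index name and top_k are illustrative):

import openai
import pinecone

pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("membes-docs")  # hypothetical index name

# Embed the query with the same model used for the stored documents
response = openai.Embedding.create(
    input="can a payment plan be created manually in membes",
    model="text-embedding-ada-002",
)
query_vector = response["data"][0]["embedding"]

# Query Pinecone for the nearest matches
results = index.query(vector=query_vector, top_k=10, include_metadata=True)
for match in results["matches"]:
    print(match["id"], match["score"])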

For example the query below:
{
  "input": "can a payment plan be created manually in membes",
  "model": "text-embedding-ada-002"
}

With the returned vector, I query the Pinecone DB and get a list of matches; one example is below:
{
  "id": "",
  "score": 0.743270099,
  "values": ,
  "metadata": {
    "text": "The Group field will appear underneath, and from there select the relevant Profile Group the event is to be allocated to.\n\nProceed with completing your event set up as per usual.\n\nProfile interactions\nCommunications with members can be recorded via profile interactions. In addition to the standard system interactions, custom interaction types can be created and reported on.\n\n\nView Profile Interactions\n\nFirst, search for the relevant profile.\nOn the profile dashboard, the list of interactions appears on the right side of the screen.\n\n\nIn the list of profile interactions you can see the title of each entry, click on the tile to expand it to see the full entry details.\n\n\nThis list can be filtered by clicking the Filter button in the top right, you can select the types interactions you wish to see by checking or unchecking the box next to them. Once you've checked the relevant boxes, click the Filter button to display them."
  }
}

This is quite unrelated to the query above (and the other 9 matches have exactly the same score as this result).

Furthermore, there is an entry in Pinecone that should be more relevant to the question above, shown below:

{
  "id": "9b382c22-d6a9-4a3a-9c1b-70cb89c5ba3e",
  "metadata": {
    "text": "Click the Login to Website as This User button.\n\nYou will be taken to the website where you will be able to see exactly what the member sees.\nWhen you have finished, ensure you click the Logout From button. If you do not logout you will not be able to use the administration system.\n\n\n\nPayment plan\nIf you have enabled a Pinch merchant account, you can give your members the option to pay their membership in instalments either via direct debit from their bank account or recurring credit card payments.\nA payment plan can also be created manually via their profile if required, for example if a member is experiencing financial distress and cannot pay the entire membership fee at once.\n\n\n\nView payment plan\nOn a profile go to Other > Payment Plan.\n\n\nIf there is an active payment plan, the details will be displayed on the screen.\n\n\n\n\nView payment plan Edit payment plan\nCreate a manual payment plan"
  }
}

Yet it is not returned by the above query.

Anyone know what is wrong with my query?

Thank you for your help

Regards,
Jodi

The other 9 results have the exact same score? Meaning 0.743270099? If that is the case, I think it's very likely there is an error in your indexing pipeline, as getting the exact same score for different embeddings is very unlikely (if not impossible) as far as I know.

Beyond that, I'm not sure what the problem could be, other than the embedding model itself. So if you don't find any errors in your indexing process, I'd recommend trying other embedding methods such as Cohere's embedding models, or alternatively, you could try sparse-dense search: if you follow the linked guide, just use text-embedding-ada-002 in place of the msmarco-bert-base-dot-v5 model.
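If you want to try the sparse-dense route, a minimal sketch of a hybrid query might look like this (assuming a dotproduct index and the pinecone-text helper library for BM25 sparse vectors; all names are illustrative):

import openai
import pinecone
from pinecone_text.sparse import BM25Encoder

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("your-hybrid-index")  # hypothetical index name

# Pretrained on MS MARCO; you can also fit it on your own corpus
bm25 = BM25Encoder.default()

query = "can a payment plan be created manually in membes"
dense = openai.Embedding.create(
    input=query, model="text-embedding-ada-002"
)["data"][0]["embedding"]
sparse = bm25.encode_queries(query)  # {"indices": [...], "values": [...]}

# Pinecone combines the dense and sparse parts in a single query
results = index.query(
    vector=dense,
    sparse_vector=sparse,
    top_k=10,
    include_metadata=True,
)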

I hope that helps!


I have a similar situation/question. I am NOT seeing the same score but my theory of the case points to topK and how it works.

I have just seven vectors, and my query is two words: "metal abstraction".

I notice that the top score is 0.80+. That’s an article about abstraction in computer science. Makes sense.

The second ranked score is 0.77+ and that is an article about Metal as a Service.

The trouble starts with the rank 3 score: 0.75+. The article is about Stripe Payments. In fact, there is no article text, just the heading "Stripe Payments".

This surprises me. I would expect a drastic falloff in score. I noticed this problem because I had topK set to 100 and thus was surprised when all seven vectors came back.

I was thinking of “topK” as a kind of limit. Don’t bother me with more results… But it seems like it means “get k results no matter what”.

That makes sense :face_with_spiral_eyes: I guess. But what it seems to be doing is distributing the score in some way I don't understand. If the score can range from -1 to +1, would a topK of 100 automatically mean 50 scores will be < 0 and 50 will be > 0?

I was assuming the score is based only on the pairwise comparison of a single stored vector and the query vector, and thus, in my example, I would have expected only two vectors to score high and the others to be drastically lower.
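To test that assumption, I can recompute the similarity locally and compare it with Pinecone's score; if they agree, the score really is pairwise and topK is just a cap on how many matches come back. A sketch, assuming a cosine-metric index and the same ada-002 setup as above (the index name is illustrative):

import numpy as np
import openai
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("my-test-index")  # hypothetical index name

query_vector = openai.Embedding.create(
    input="metal abstraction", model="text-embedding-ada-002"
)["data"][0]["embedding"]

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Take the top match, fetch its stored values, and recompute the score by hand
top = index.query(vector=query_vector, top_k=1)["matches"][0]
stored = index.fetch(ids=[top["id"]])["vectors"][top["id"]]["values"]
print(top["score"], cosine_similarity(query_vector, stored))  # these should agree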

I suppose my sample set here is small, but I can also imagine that with thousands or millions of vectors, all results will "seem reasonable". So by studying this very small case, I hope to gain some understanding of how it all really works.


Hi All,
Thanks for all the responses.

It turns out, as James pointed out, there was a logic error in my indexing pipeline, specifically in the storing of the index and the data.
When I passed a list of texts to OpenAI for embedding, the embeddings came back out of order relative to the submitted list. I have now fixed the process to match each embedding to its text based on the "index" field returned by OpenAI before storing them in Pinecone.
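In code, the fix looks roughly like this (a sketch; the ids, texts, and variable names are illustrative):

import uuid
import openai

texts = ["chunk one ...", "chunk two ...", "chunk three ..."]
response = openai.Embedding.create(input=texts, model="text-embedding-ada-002")

# Do NOT assume response["data"] comes back in submission order;
# pair each embedding with its text via the "index" field instead
items = sorted(response["data"], key=lambda item: item["index"])
vectors = [
    (str(uuid.uuid4()), item["embedding"], {"text": texts[item["index"]]})
    for item in items
]
index.upsert(vectors=vectors)  # the Pinecone index object from earlier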

I am not sure what caused the "same score" issue in my question above, but it stopped happening once I fixed the bug.

Just a note on Roland's response: in my case, any score below 0.8 seems to be quite irrelevant, so I have a filter to cut out anything below that.
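Something like this (a minimal sketch; 0.8 is just the cutoff that works for my data):

relevant = [m for m in results["matches"] if m["score"] >= 0.8]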
