TopK with a unique doc_id

magicseth · March 5, 2023, 7:00am

I’ve indexed a bunch of documents. I’m storing the id of the document in metadata with doc_id.

Each document could have many vectors representing different pieces of the document.

When I search, I’d like to return 10 different doc_id results.

Is it possible to tell Pinecone to ignore vectors with duplicate doc_ids, so I can guarantee 10 distinct results instead of having to request a top-k of 20 or 30?

kbutler · March 6, 2023, 5:40pm

Hi magicseth,
Based on the similarity search that happens when you query, the most similar results are returned in descending order by similarity score. There isn’t a mechanism in Pinecone to ask for only 10 results based on the uniqueness of metadata values. Can you describe your use case a little more? Perhaps there is an alternative in your design.

magicseth · March 6, 2023, 5:58pm

Imagine I have 100 pages on my website.

I want to create an embedding for every paragraph, and index them all.

In my example, I have a “volcanoes” page with 15 paragraphs, all about volcanoes.

Now if I search for the top 10 results for “volcanoes”, I get 10 paragraphs from that one page, because they all rank above everything else. Ideally I’d also like to get results from the “lava” and “seismology” pages, even though their paragraphs aren’t in the top 10.

Does that make sense?

kbutler · March 6, 2023, 6:27pm

Ok. I think I get the idea. If you are trying to improve overall search results, have you considered these? “SPLADE for Sparse Vector Search Explained” (Pinecone) and “Sparse-dense embeddings”.

Otherwise, I think your approach of requesting a higher top-k, then collapsing them to their unique doc_ids as the final result set makes sense. Perhaps others will have some additional approaches I haven’t thought of.
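
For what it’s worth, here is a minimal sketch of that over-fetch-and-collapse idea in Python. It is not a Pinecone feature, just client-side post-processing. It assumes each match is a dict shaped like {"id": ..., "score": ..., "metadata": {"doc_id": ...}} and that matches arrive sorted by descending score. The names collapse_by_doc_id, search_unique_docs, and query_fn are hypothetical; query_fn stands in for your own wrapper around the index query that takes a top_k and returns the matches with their metadata.

```python
from typing import Any, Callable, Dict, List

# Assumed match shape: {"id": str, "score": float, "metadata": {"doc_id": str}}
Match = Dict[str, Any]


def collapse_by_doc_id(matches: List[Match], limit: int = 10) -> List[Match]:
    """Keep only the first (best-scoring) match per doc_id, in rank order."""
    seen = set()
    unique: List[Match] = []
    for match in matches:  # assumed already sorted by descending similarity
        doc_id = match["metadata"]["doc_id"]
        if doc_id in seen:
            continue  # a better-ranked chunk of this document was already kept
        seen.add(doc_id)
        unique.append(match)
        if len(unique) == limit:
            break
    return unique


def search_unique_docs(
    query_fn: Callable[[int], List[Match]],  # hypothetical wrapper around your query call
    limit: int = 10,
    start_top_k: int = 30,
    max_top_k: int = 1000,  # arbitrary safety cap chosen for this sketch
) -> List[Match]:
    """Re-query with a growing top_k until `limit` distinct doc_ids are found."""
    top_k = start_top_k
    while True:
        matches = query_fn(top_k)
        unique = collapse_by_doc_id(matches, limit)
        index_exhausted = len(matches) < top_k  # nothing more to fetch
        if len(unique) >= limit or index_exhausted or top_k >= max_top_k:
            return unique
        top_k = min(top_k * 2, max_top_k)


if __name__ == "__main__":
    # Fake index: 15 "volcanoes" chunks outrank everything else.
    doc_ids = ["volcanoes"] * 15 + ["lava"] * 3 + ["seismology"] * 3
    fake_index = [
        {"id": f"{d}#{i}", "score": 1.0 - i * 0.01, "metadata": {"doc_id": d}}
        for i, d in enumerate(doc_ids)
    ]
    results = search_unique_docs(lambda k: fake_index[:k], limit=3, start_top_k=10)
    print([m["metadata"]["doc_id"] for m in results])
```

The doubling loop is only there so you don’t have to guess a top-k up front. With the volcanoes example, the first request comes back as a single document, so it asks again with a larger top-k. If the index simply doesn’t contain enough distinct documents, you get back fewer than the limit.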