Returning one result per unique metadata value

chris_minimus · May 18, 2023, 5:14pm

Hi, my apologies if this was covered in the docs, but I did not find it. I am using Pinecone with Langchain to do similarity search on some text data.

The documents have a common metadata field, and this field is high cardinality but not unique to the specific document.

Is there a way to tell the Pinecone API to return only the result with the highest score for a unique metadata value? Right now there’s a possibility that my top_n could include only results with the same value in that metadata field, but with the data I am working with I actually don’t want any duplicates, only the top scoring result.

Thanks.

Sean · May 18, 2023, 5:26pm

Can’t you set it to top 1, take the first item, and discard the other results?

chris_minimus · May 18, 2023, 5:48pm

Just to expand on it a bit - I expect queries to return multiple “possible matches”, but I only want the top result for each unique value for a particular metadata field. For example, in this set of results (score descending):

1. “field1”: “value1”
2. “field1”: “value1”
3. “field1”: “value1”
4. “field1”: “value2”
5. “field1”: “value1”
6. “field1”: “value3”
7. “field1”: “value3”

I only want 1, 4, and 6.

Sean · May 18, 2023, 5:58pm

Hey Chris - You’re thinking like me: “this is so easy to do in SQL”. I’m not affiliated with Pinecone, but going by the documentation on metadata filtering, I’d suggest you either write your own logic that loops through your result set, or run 3 metadata-filtered queries, one per value, each with top_k=1 (example from the Metadata filtering docs):

index.query(
    vector=[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
    filter={
        "genre": {"$eq": "documentary"},
        "year": 2019
    },
    top_k=1,
    include_metadata=True
)

# Returns:
# {'matches': [{'id': 'B',
#               'metadata': {'genre': 'documentary', 'year': 2019.0},
#               'score': 0.0800000429,
#               'values': []}],
#  'namespace': ''}

chris_minimus · May 18, 2023, 6:10pm

That’s kind of what I figured I would have to do. Unfortunately in my case I don’t know the values ahead of time, so post-results filtering it is.

Thanks!
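
For anyone landing here later, the post-results filtering described above can be sketched roughly as follows. This is a minimal sketch, not a Pinecone feature: the helper name `top_match_per_value` is made up for illustration, and it assumes the matches are plain dicts shaped like the `'matches'` entries in the example response above, already sorted by score descending, with hashable metadata values (strings or numbers, not lists).

```python
from typing import Any, Dict, List


def top_match_per_value(matches: List[Dict[str, Any]], field: str) -> List[Dict[str, Any]]:
    """Keep only the first (highest-scoring) match for each distinct metadata value.

    Hypothetical helper. Assumes `matches` is sorted by score descending and
    that metadata values for `field` are hashable. Matches that lack the
    field are all grouped under None.
    """
    seen = set()
    unique = []
    for match in matches:
        value = match.get("metadata", {}).get(field)
        if value in seen:
            continue  # a better-scoring match with this value was already kept
        seen.add(value)
        unique.append(match)
    return unique


# The seven results from the example earlier in the thread, score descending.
example = [
    {"id": "a", "score": 0.97, "metadata": {"field1": "value1"}},
    {"id": "b", "score": 0.95, "metadata": {"field1": "value1"}},
    {"id": "c", "score": 0.93, "metadata": {"field1": "value1"}},
    {"id": "d", "score": 0.90, "metadata": {"field1": "value2"}},
    {"id": "e", "score": 0.88, "metadata": {"field1": "value1"}},
    {"id": "f", "score": 0.85, "metadata": {"field1": "value3"}},
    {"id": "g", "score": 0.80, "metadata": {"field1": "value3"}},
]

kept_ids = [m["id"] for m in top_match_per_value(example, "field1")]
print(kept_ids)  # ['a', 'd', 'f'], i.e. results 1, 4, and 6
```

The ids and scores in `example` are invented for the demonstration; only the `field1` values come from the thread.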

scotteuser · May 9, 2025, 8:30pm

Hmmm, this is a feature of the competition, for example Milvus (Grouping Search | Milvus Documentation), so it must be possible for Pinecone to build it.

Use case example: Suggested or Related content block. You don’t want the individual chunks back, you just want e.g. the top 5 related items.

If your content varies in length, with some items split into maybe 100s of chunks, you would need to set top_k very high to be certain you’ll get 5 distinct results from custom post-processing.
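
One workaround for the top_k problem is to over-fetch in growing steps instead of guessing a single large value up front. The following is only a sketch under stated assumptions: `run_query` is a hypothetical callable that takes a `top_k` and returns a list of match dicts sorted by score descending (for example a thin wrapper around `index.query`), and `start_k` and `max_k` are arbitrary illustration values, not documented Pinecone limits.

```python
from typing import Any, Callable, Dict, List

Match = Dict[str, Any]


def query_unique(
    run_query: Callable[[int], List[Match]],  # hypothetical: top_k -> matches
    field: str,
    wanted: int,
    start_k: int = 20,   # assumed starting point, tune for your data
    max_k: int = 1000,   # assumed cap, not a documented limit
) -> List[Match]:
    """Double top_k until `wanted` distinct values of `field` are found."""
    k = start_k
    while True:
        matches = run_query(k)

        # Keep the best-scoring match per distinct metadata value.
        seen = set()
        unique: List[Match] = []
        for match in matches:
            value = match.get("metadata", {}).get(field)
            if value not in seen:
                seen.add(value)
                unique.append(match)

        enough = len(unique) >= wanted
        exhausted = len(matches) < k  # fewer results came back than requested
        if enough or exhausted or k >= max_k:
            return unique[:wanted]
        k = min(k * 2, max_k)


# Fake data: one item with 100 chunks ranked first, then four single-chunk items.
corpus = [{"id": f"A-{i}", "metadata": {"doc": "A"}} for i in range(100)]
corpus += [{"id": d, "metadata": {"doc": d}} for d in ["B", "C", "D", "E"]]

requested = []  # records each top_k that was tried


def fake_query(top_k: int) -> List[Match]:
    requested.append(top_k)
    return corpus[:top_k]


top5 = query_unique(fake_query, "doc", wanted=5)
print([m["metadata"]["doc"] for m in top5])
print(requested)
```

The trade-off is extra round trips: each retry repeats the whole query, so a larger `start_k` may be cheaper when long documents are common.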