Returning one result per unique metadata value

Hi, my apologies if this is covered in the docs, but I could not find it. I am using Pinecone with LangChain to do similarity search on some text data.

The documents have a common metadata field, and this field is high cardinality but not unique to the specific document.

Is there a way to tell the Pinecone API to return only the highest-scoring result for each unique value of that metadata field? Right now my top_k results could all share the same value in that field, but with the data I am working with I don’t want any duplicates, only the top-scoring result for each value.


Can’t you set it to top 1, take the first item, and discard the other results?

Just to expand on it a bit - I expect queries to return multiple “possible matches”, but I only want the top result for each unique value for a particular metadata field. For example, in this set of results (score descending):

  1. “field1”: “value1”
  2. “field1”: “value1”
  3. “field1”: “value1”
  4. “field1”: “value2”
  5. “field1”: “value1”
  6. “field1”: “value3”
  7. “field1”: “value3”

I only want results 1, 4, and 6.

Hey Chris - You’re thinking like me: “this is so easy to do in SQL” :) I’m not affiliated with Pinecone, but going by the metadata filtering documentation, I’d suggest you either write your own logic that loops through the result set, or run three filtered queries, one per metadata value, each returning only the top match (example from the Metadata filtering docs):

    index.query(
        vector=[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
        filter={"genre": {"$eq": "documentary"}, "year": 2019},
        top_k=1,
        include_metadata=True
    )

    # Returns:
    # {'matches': [{'id': 'B',
    #               'metadata': {'genre': 'documentary', 'year': 2019.0},
    #               'score': 0.0800000429,
    #               'values': []}],
    #  'namespace': ''}
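
If the values are known up front, the "one query per value" idea might look roughly like the sketch below. The helper name `top_match_for_each` is made up for illustration, `index` is assumed to be an already-connected Pinecone index, and the code assumes the response and its matches can be indexed like dicts, as in the output above.

```python
def top_match_for_each(index, vector, field, values):
    """Run one filtered query per known metadata value, keeping the top hit.

    Hypothetical helper: `index` is assumed to be a connected Pinecone index.
    """
    results = []
    for value in values:
        response = index.query(
            vector=vector,
            filter={field: {"$eq": value}},  # restrict to one metadata value
            top_k=1,                         # only the best match for it
            include_metadata=True,
        )
        # The match list is empty when no vector carries this value.
        results.extend(response["matches"])
    # Merge the per-value winners back into one score-descending list.
    return sorted(results, key=lambda m: m["score"], reverse=True)
```

This costs one round trip per value, so it probably only makes sense for a small, known set of values.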

That’s kind of what I figured I would have to do. Unfortunately in my case I don’t know the values ahead of time, so post-results filtering it is.
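
For anyone landing here later, a minimal sketch of that post-results filtering, assuming each match is a dict with `score` and `metadata` keys like the query output earlier in the thread. The function name `top_match_per_value` and the scores in the sample data are made up for illustration; the sample mirrors the seven-result example above.

```python
def top_match_per_value(matches, field):
    """Keep only the highest-scoring match for each distinct metadata value."""
    best = {}
    for match in matches:
        # Matches that lack the field are grouped together under None.
        value = match["metadata"].get(field)
        if value not in best or match["score"] > best[value]["score"]:
            best[value] = match
    # Return the winners in score-descending order, like a normal result set.
    return sorted(best.values(), key=lambda m: m["score"], reverse=True)


# Illustrative data only: ids follow the numbered example, scores are invented.
matches = [
    {"id": "1", "score": 0.97, "metadata": {"field1": "value1"}},
    {"id": "2", "score": 0.95, "metadata": {"field1": "value1"}},
    {"id": "3", "score": 0.93, "metadata": {"field1": "value1"}},
    {"id": "4", "score": 0.90, "metadata": {"field1": "value2"}},
    {"id": "5", "score": 0.88, "metadata": {"field1": "value1"}},
    {"id": "6", "score": 0.85, "metadata": {"field1": "value3"}},
    {"id": "7", "score": 0.80, "metadata": {"field1": "value3"}},
]

unique = top_match_per_value(matches, "field1")
print([m["id"] for m in unique])  # ['1', '4', '6']
```

Since duplicates are dropped after the query, it is probably worth requesting a larger top_k than the number of unique results needed, otherwise the filtered list can come back shorter than expected.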