HELP! NOT WORKING: Exclude id & filter id

bobziyangding · December 22, 2023, 8:29pm

I am trying to observe the behaviour of exclude_ids or filter in a query. However, the argument does not seem to work at all! The query keeps returning the same results every time, even when I exclude the IDs returned by the previous query in the next one. Can anyone help me see what is going on?

Attached is my query and a screenshot of the output:

import pinecone
from collections import defaultdict

PINECONE_API_KEY = "something that is valid"

PINECONE_ENVIRONMENT = "gcp-starter"

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)

consumed_id = sorted([
    "582c156e-314d-43bc-9cbb-002863af892f",
    "6847af8a-8273-4adb-bddc-1e678cbbe8cc",
    "8bcaf79c-83c5-4efc-b78c-7cc63cb28091",
    "c7577db8-8233-4380-bc40-ddd8a9cf3042",
    "6ef3ee6c-9d6f-48ec-90c8-19f8021ab44c"
])

seen_time = defaultdict(int)

for p in consumed_id:
    seen_time[p] += 1

for i in range(10):

    # get read card data from pinecone

    index = pinecone.Index("iearn-content")
    results = index.query(
        vector=vector,
        top_k=5,
        namespace="read-card",
        exclude_ids=consumed_id
    )

    toRead_ids = sorted([result.id for result in results.matches])
    print(toRead_ids)

    for p in toRead_ids:
        seen_time[p] += 1

    consumed_id.extend(toRead_ids)
    consumed_id = list(set(consumed_id))

Cory_Pinecone · December 22, 2023, 9:16pm

Hi @bobziyangding. Pinecone doesn’t have an “exclude_ids” option for queries. You can use metadata filtering to filter out those vectors, but they would have to have a common piece of metadata to do that.

You could do something like this first:

for id in consumed_id:
    update_response = index.update(
        id=id,
        set_metadata={'excluded': True},
        namespace='read-card'
    )

Then you would filter for those in your original query like so:

    results = index.query(
        vector=vector,
        top_k=5,
        namespace="read-card",
        filter={'excluded': { '$ne': True}}
    )

Keep in mind that the metadata will remain the same after all of this, so if you want to reset it you’ll need to set those to False, using the same update operation as before and just changing the boolean value.
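
A minimal sketch of that reset step, assuming `index` is the same `pinecone.Index` object used in the snippets above. The helper name `set_excluded_flag` is hypothetical and not part of the Pinecone client; it simply wraps the same `index.update` call with the boolean as a parameter. It still issues one update per ID, so it will be slow for long lists.

```python
def set_excluded_flag(index, ids, value, namespace="read-card"):
    """Set the 'excluded' metadata flag on each vector ID, one update call per ID."""
    for vector_id in ids:
        index.update(
            id=vector_id,
            set_metadata={"excluded": value},
            namespace=namespace,
        )
    return len(ids)  # number of update calls issued


# Usage (assumes `index` and `consumed_id` from the earlier snippets):
# set_excluded_flag(index, consumed_id, True)   # mark as excluded
# set_excluded_flag(index, consumed_id, False)  # reset afterwards
```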

bobziyangding · December 22, 2023, 9:47pm

Hi @Cory_Pinecone, thank you so much for the help! However, I still need some sort of ID restriction. Essentially, I need to supply a list of content IDs that a user has already consumed, so that the query only returns results that have not been seen by that specific user.

This way, each user's query is effectively searching a different subset of all the samples in the database. That subset has to be specified in some way, and I was thinking of using the ID.

Given this situation, are you aware of any solution? I know that you can store the ID as a metadata field, but that will immensely increase the cardinality and make the performance really bad…

Is there any alternative way to solve this? I am okay with approximate, non-exact subset query solutions.

bobziyangding · December 22, 2023, 9:59pm

Also, I just tried this:

And it still doesn't work. Also, the for-loop update will be extremely slow when consumed_id gets long.

s170559 · December 24, 2023, 2:23am

I have integrated a similar structure on my current platform. I’ve found that relying solely on Pinecone’s metadata for exclusion criteria isn’t the most efficient method. To address this, I’ve integrated Redis into the workflow for handling exclusions more effectively.

Our process begins by querying our Redis database to determine which items the user has already seen. Once we have this information, we then proceed to interact with Pinecone, using it to specifically exclude the IDs of items already viewed by the user. This approach is facilitated by two specific metadata filters within Pinecone, which are tailored to support this exclusion process.

This hybrid method of using Redis for initial exclusion checks, followed by Pinecone queries with refined exclusion criteria, has proven to be a more optimal solution in our use case.

But it is tricky…
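
The post does not say which two metadata filters are used, so the following is only a sketch of the general flow under stated assumptions. A plain in-memory `SeenStore` class stands in for Redis (with redis-py the equivalent calls would be `sadd` and `smembers`), each vector is assumed to carry its own ID in a hypothetical `content_id` metadata field, and Pinecone's `$nin` filter operator does the exclusion. A client-side pass then drops any seen IDs that slip through.

```python
from collections import defaultdict


class SeenStore:
    """In-memory stand-in for Redis: maps a user ID to the set of seen content IDs."""

    def __init__(self):
        self._seen = defaultdict(set)

    def mark_seen(self, user_id, content_ids):
        self._seen[user_id].update(content_ids)

    def seen_ids(self, user_id):
        # Return a copy so callers cannot mutate the stored set.
        return set(self._seen.get(user_id, set()))


def build_unseen_filter(seen_ids, field="content_id"):
    """Build a metadata filter excluding seen IDs; `field` is a hypothetical metadata key."""
    if not seen_ids:
        return None  # nothing to exclude, so query without a filter
    return {field: {"$nin": sorted(seen_ids)}}


def drop_seen(match_ids, seen_ids, top_k):
    """Client-side safety net: keep ranked order, drop seen IDs, cap at top_k."""
    return [mid for mid in match_ids if mid not in seen_ids][:top_k]


# Usage (assumes `index` and `vector` from the original question):
# store = SeenStore()
# seen = store.seen_ids("user-1")
# results = index.query(vector=vector, top_k=5, namespace="read-card",
#                       filter=build_unseen_filter(seen))
# fresh = drop_seen([m.id for m in results.matches], seen, top_k=5)
# store.mark_seen("user-1", fresh)
```

Storing the ID as metadata is the high-cardinality approach the original poster was worried about, so this may not suit every index; the client-side `drop_seen` step works on its own if the query is made with a larger `top_k` instead.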

Cory_Pinecone · January 4, 2024, 5:56pm

@bobziyangding while I’m obviously reticent about telling Pinecone users to add another database to their stack, I think @s170559 may have the optimal solution for you here. Would this be feasible for you to implement?

s170559 · January 17, 2024, 5:18pm

So here is the very basic flow

I use Upstash Redis since it is quick and easy, but it does not really matter. A simple CSV file could do the same xD (which I use a lot, haha).
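
As a rough illustration of the CSV idea, assuming one row per (user, content ID) pair. The file layout and the helper names `append_seen` and `load_seen` are hypothetical and not taken from the poster's setup.

```python
import csv
from pathlib import Path


def append_seen(path, user_id, content_ids):
    """Append one (user_id, content_id) row per consumed item."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for content_id in content_ids:
            writer.writerow([user_id, content_id])


def load_seen(path, user_id):
    """Return the set of content IDs recorded for user_id (empty if no file yet)."""
    if not Path(path).exists():
        return set()
    with open(path, newline="") as f:
        return {row[1] for row in csv.reader(f) if len(row) == 2 and row[0] == user_id}
```

The set returned by `load_seen` can then be used to exclude already-seen IDs before or after the Pinecone query.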
