Issue with Filtering Results Based on Unique Session ID

chunyew24 · October 8, 2023, 3:49pm

Dear community,

I need help regarding an issue I have encountered while trying to filter results based on a unique session ID (unique_id metadata) during similarity searches.

Project Overview: I have developed a Streamlit application that matches uploaded documents against a given reference document. Each uploaded document is associated with a unique session ID (unique_id) to ensure results are fetched only for the current session and not from previous uploads by different users.

Issue Encountered: When I attempt to filter results based on unique_id during similarity searches, I receive an empty list. However, when I remove the unique_id filter, I do get results, but they include documents from other sessions, which is not the intended behavior.

Steps Taken:

Each uploaded document is associated with a unique_id and stored in Pinecone.
During the similarity search, I tried filtering using unique_id but got an empty result.
I also tried post-query filtering by fetching more results and then filtering them based on the unique_id in my code. However, it seems that none of the retrieved documents match the unique_id of the current session.

Request:

Can you provide insights on why the unique_id filtering might not be working as expected?
Is there a specific method or query structure to filter results based on metadata during similarity searches?
Any best practices or recommendations to achieve the desired functionality would be greatly appreciated.

Thank you for your assistance. I look forward to your guidance on this matter.

Cory_Pinecone · October 8, 2023, 4:27pm

How are you filtering for these unique IDs? Are you passing a single ID and expecting it to match multiple vectors, or filtering for multiple IDs? In the latter case, your filter could be looking for vectors that contain all of them.

Would you please share your filter logic so we can know exactly how you’re doing this?
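To make the two cases in the question concrete, here is a minimal sketch of how each filter could be written with Pinecone's `$eq` and `$in` metadata operators. The helper names (`session_filter`, `any_session_filter`, `matches`) are hypothetical, and `matches` is only a small local stand-in for the server-side check so the difference can be seen without an index.

```python
from typing import Any, Dict, List


def session_filter(unique_id: str) -> Dict[str, Any]:
    """Match vectors whose unique_id metadata equals one session ID."""
    return {"unique_id": {"$eq": unique_id}}


def any_session_filter(unique_ids: List[str]) -> Dict[str, Any]:
    """Match vectors whose unique_id is any one of several session IDs."""
    return {"unique_id": {"$in": unique_ids}}


def matches(metadata: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Local illustration only; understands just the $eq and $in operators."""
    for field, condition in flt.items():
        value = metadata.get(field)
        if "$eq" in condition and value != condition["$eq"]:
            return False
        if "$in" in condition and value not in condition["$in"]:
            return False
    return True
```

A single session ID with `$eq` can match any number of vectors, as long as each one was stored with exactly that value under the `unique_id` key.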

chunyew24 · October 8, 2023, 6:33pm

How are you filtering for these unique IDs? In the code, I am associating a unique session ID (unique_id) with each uploaded document. This unique_id is stored as metadata with the vectorized document in Pinecone. The intention behind this is to retrieve only the documents uploaded in the current session and exclude any documents uploaded in previous sessions.
Are you passing a single ID expecting it to match multiple vectors, or filtering for multiple IDs? I am passing a single unique session ID with the expectation that it will match multiple vectors (documents) that were uploaded during the same session. Each session has its distinct unique_id.
Would you please share your filter logic so we can know exactly how you’re doing this?
# Imports assumed by this snippet (not shown in the original post)
import pinecone
from langchain.vectorstores import Pinecone


# Function to push documents to the vector store (Pinecone)
def push_to_pinecone(pinecone_apikey, pinecone_environment, pinecone_index_name, embeddings, docs):
    pinecone.init(
        api_key=pinecone_apikey,
        environment=pinecone_environment
    )
    print("done…2")
    Pinecone.from_documents(docs, embeddings, index_name=pinecone_index_name)


# Function to pull information from the vector store (Pinecone)
def pull_from_pinecone(pinecone_apikey, pinecone_environment, pinecone_index_name, embeddings):
    pinecone.init(
        api_key=pinecone_apikey,
        environment=pinecone_environment
    )

    index_name = pinecone_index_name

    index = Pinecone.from_existing_index(index_name, embeddings)
    return index


# Function to get relevant documents from the vector store, based on user input
def similar_docs(query, k, pinecone_apikey, pinecone_environment, pinecone_index_name, embeddings, unique_id):
    pinecone.init(
        api_key=pinecone_apikey,
        environment=pinecone_environment
    )

    index_name = pinecone_index_name

    index = pull_from_pinecone(pinecone_apikey, pinecone_environment, index_name, embeddings)
    # The third positional argument is the metadata filter
    similar_docs = index.similarity_search_with_score(query, int(k), {"unique_id": unique_id})
    # print(similar_docs)
    return similar_docs

I have also tried:
all_similar_docs = index.similarity_search_with_score(query, int(k) * 5)
filtered_docs = [doc for doc in all_similar_docs if doc[0].metadata.get('unique_id') == unique_id]

First, I fetch a larger set of similar documents (5 times the desired number). Then, I filter these results in my code based on the unique_id associated with the current session. But neither of these two methods works.
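For comparison, a minimal sketch of the first approach with the filter passed by keyword and written with an explicit `$eq` operator. It assumes the LangChain Pinecone wrapper's `similarity_search_with_score` accepts `k` and `filter` keyword arguments, which is worth checking against the installed version; `similar_docs_for_session` is a hypothetical name, and `store` is whatever `Pinecone.from_existing_index` returned.

```python
def similar_docs_for_session(store, query, k, unique_id):
    """Ask the vector store to apply the metadata filter during the search.

    Assumption: store.similarity_search_with_score takes k= and filter=.
    """
    return store.similarity_search_with_score(
        query,
        k=int(k),
        filter={"unique_id": {"$eq": unique_id}},
    )
```

Naming the arguments makes it harder to pass the filter in the wrong position, and the filter only matches if the stored metadata really contains the same `unique_id` string.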

chunyew24 · October 8, 2023, 6:36pm

chunyew24 · October 8, 2023, 6:38pm

The error log:

C:\Users\User\anaconda3\lib\site-packages\langchain\__init__.py:40: UserWarning: Importing HuggingFaceHub from langchain root module is no longer supported.
warnings.warn(
done…2
C:\Users\User\anaconda3\lib\site-packages\langchain\__init__.py:40: UserWarning: Importing HuggingFaceHub from langchain root module is no longer supported.
warnings.warn(
done…2
Warning: Filtering on unique_id resulted in an empty list. Returning top results without filtering.
Uploading: {'name': 'java-programmer-resume-example.pdf', 'id': 1, 'type=': 'application/pdf', 'size': 25387, 'unique_id': '79ca2110cf42421aaeaf62d6d162ddae'}
Uploading: {'name': 'principal-software-engineer-resume-example.pdf', 'id': 2, 'type=': 'application/pdf', 'size': 49047, 'unique_id': '79ca2110cf42421aaeaf62d6d162ddae'}
Uploading: {'name': 'python-developer-resume-example.pdf', 'id': 3, 'type=': 'application/pdf', 'size': 33734, 'unique_id': '79ca2110cf42421aaeaf62d6d162ddae'}
Uploading: {'name': 'security-engineer-resume-example.pdf', 'id': 4, 'type=': 'application/pdf', 'size': 31316, 'unique_id': '79ca2110cf42421aaeaf62d6d162ddae'}
Uploading: {'name': 'senior-programmer-resume-example.pdf', 'id': 5, 'type=': 'application/pdf', 'size': 31485, 'unique_id': '79ca2110cf42421aaeaf62d6d162ddae'}
Uploading: {'name': 'software-engineer-iii-front-end-resume-example.pdf', 'id': 6, 'type=': 'application/pdf', 'size': 46300, 'unique_id': '79ca2110cf42421aaeaf62d6d162ddae'}
done…2
Pushing to Pinecone: {'name': 'java-programmer-resume-example.pdf', 'id': 1, 'type=': 'application/pdf', 'size': 25387, 'unique_id': '79ca2110cf42421aaeaf62d6d162ddae'}
Pushing to Pinecone: {'name': 'principal-software-engineer-resume-example.pdf', 'id': 2, 'type=': 'application/pdf', 'size': 49047, 'unique_id': '79ca2110cf42421aaeaf62d6d162ddae'}
Pushing to Pinecone: {'name': 'python-developer-resume-example.pdf', 'id': 3, 'type=': 'application/pdf', 'size': 33734, 'unique_id': '79ca2110cf42421aaeaf62d6d162ddae'}
Pushing to Pinecone: {'name': 'security-engineer-resume-example.pdf', 'id': 4, 'type=': 'application/pdf', 'size': 31316, 'unique_id': '79ca2110cf42421aaeaf62d6d162ddae'}
Pushing to Pinecone: {'name': 'senior-programmer-resume-example.pdf', 'id': 5, 'type=': 'application/pdf', 'size': 31485, 'unique_id': '79ca2110cf42421aaeaf62d6d162ddae'}
Pushing to Pinecone: {'name': 'software-engineer-iii-front-end-resume-example.pdf', 'id': 6, 'type=': 'application/pdf', 'size': 46300, 'unique_id': '79ca2110cf42421aaeaf62d6d162ddae'}
Retrieved: {'id': 2.0, 'name': 'principal-software-engineer-resume-example.pdf', 'size': 49047.0, 'type=': 'application/pdf', 'unique_id': '56251a61f71f4155a327fd58ddfb4074'}
Retrieved: {'id': 2.0, 'name': 'principal-software-engineer-resume-example.pdf', 'size': 49047.0, 'type=': 'application/pdf', 'unique_id': 'a47dce9f1ece4911ad3925e41cd82223'}
Retrieved: {'id': 16.0, 'name': 'principal-software-engineer-resume-example.pdf', 'size': 49047.0, 'type=': 'application/pdf', 'unique_id': '2dc07703428b4f6c84f52a59ca8b68c7'}
Retrieved: {'id': 16.0, 'name': 'principal-software-engineer-resume-example.pdf', 'size': 49047.0, 'type=': 'application/pdf', 'unique_id': 'dc6b3e7a1e354cf3b6b385aaf247cff7'}
Retrieved: {'id': 16.0, 'name': 'principal-software-engineer-resume-example.pdf', 'size': 49047.0, 'type=': 'application/pdf', 'unique_id': '0119e86f034440a295344c78729cf9d3'}
Retrieved: {'id': 2.0, 'name': 'principal-software-engineer-resume-example.pdf', 'size': 49047.0, 'type=': 'application/pdf', 'unique_id': '818084fa4cd34e059f4faf6aa2c29c2a'}
Retrieved: {'id': 2.0, 'name': 'principal-software-engineer-resume-example.pdf', 'size': 49047.0, 'type=': 'application/pdf', 'unique_id': '9418ea8751b841aea6eb64f6f5870557'}
Retrieved: {'id': 3.0, 'name': 'python-developer-resume-example.pdf', 'size': 33734.0, 'type=': 'application/pdf', 'unique_id': '9418ea8751b841aea6eb64f6f5870557'}
Retrieved: {'id': 3.0, 'name': 'python-developer-resume-example.pdf', 'size': 33734.0, 'type=': 'application/pdf', 'unique_id': 'fa57fdd87019475c9072c4e58c384652'}
Retrieved: {'id': 3.0, 'name': 'python-developer-resume-example.pdf', 'size': 33734.0, 'type=': 'application/pdf', 'unique_id': '818084fa4cd34e059f4faf6aa2c29c2a'}
Warning: Filtering on unique_id resulted in an empty list. Returning top results without filtering.
Uploading: {'name': 'java-programmer-resume-example.pdf', 'id': 1, 'type=': 'application/pdf', 'size': 25387, 'unique_id': '3c3e4175fdfe4d0394e04b4f164f036c'}
Uploading: {'name': 'python-developer-resume-example.pdf', 'id': 3, 'type=': 'application/pdf', 'size': 33734, 'unique_id': '3c3e4175fdfe4d0394e04b4f164f036c'}
Uploading: {'name': 'security-engineer-resume-example.pdf', 'id': 4, 'type=': 'application/pdf', 'size': 31316, 'unique_id': '3c3e4175fdfe4d0394e04b4f164f036c'}
Uploading: {'name': 'senior-programmer-resume-example.pdf', 'id': 5, 'type=': 'application/pdf', 'size': 31485, 'unique_id': '3c3e4175fdfe4d0394e04b4f164f036c'}
Uploading: {'name': 'software-engineer-iii-front-end-resume-example.pdf', 'id': 6, 'type=': 'application/pdf', 'size': 46300, 'unique_id': '3c3e4175fdfe4d0394e04b4f164f036c'}
done…2
Pushing to Pinecone: {'name': 'java-programmer-resume-example.pdf', 'id': 1, 'type=': 'application/pdf', 'size': 25387, 'unique_id': '3c3e4175fdfe4d0394e04b4f164f036c'}
Pushing to Pinecone: {'name': 'python-developer-resume-example.pdf', 'id': 3, 'type=': 'application/pdf', 'size': 33734, 'unique_id': '3c3e4175fdfe4d0394e04b4f164f036c'}
Pushing to Pinecone: {'name': 'security-engineer-resume-example.pdf', 'id': 4, 'type=': 'application/pdf', 'size': 31316, 'unique_id': '3c3e4175fdfe4d0394e04b4f164f036c'}
Pushing to Pinecone: {'name': 'senior-programmer-resume-example.pdf', 'id': 5, 'type=': 'application/pdf', 'size': 31485, 'unique_id': '3c3e4175fdfe4d0394e04b4f164f036c'}
Pushing to Pinecone: {'name': 'software-engineer-iii-front-end-resume-example.pdf', 'id': 6, 'type=': 'application/pdf', 'size': 46300, 'unique_id': '3c3e4175fdfe4d0394e04b4f164f036c'}
Retrieved: {'id': 2.0, 'name': 'principal-software-engineer-resume-example.pdf', 'size': 49047.0, 'type=': 'application/pdf', 'unique_id': 'a47dce9f1ece4911ad3925e41cd82223'}
Retrieved: {'id': 16.0, 'name': 'principal-software-engineer-resume-example.pdf', 'size': 49047.0, 'type=': 'application/pdf', 'unique_id': '0119e86f034440a295344c78729cf9d3'}
Retrieved: {'id': 2.0, 'name': 'principal-software-engineer-resume-example.pdf', 'size': 49047.0, 'type=': 'application/pdf', 'unique_id': '79ca2110cf42421aaeaf62d6d162ddae'}
Retrieved: {'id': 16.0, 'name': 'principal-software-engineer-resume-example.pdf', 'size': 49047.0, 'type=': 'application/pdf', 'unique_id': 'dc6b3e7a1e354cf3b6b385aaf247cff7'}
Retrieved: {'id': 16.0, 'name': 'principal-software-engineer-resume-example.pdf', 'size': 49047.0, 'type=': 'application/pdf', 'unique_id': '2dc07703428b4f6c84f52a59ca8b68c7'}
Retrieved: {'id': 2.0, 'name': 'principal-software-engineer-resume-example.pdf', 'size': 49047.0, 'type=': 'application/pdf', 'unique_id': '818084fa4cd34e059f4faf6aa2c29c2a'}
Retrieved: {'id': 2.0, 'name': 'principal-software-engineer-resume-example.pdf', 'size': 49047.0, 'type=': 'application/pdf', 'unique_id': '9418ea8751b841aea6eb64f6f5870557'}
Retrieved: {'id': 2.0, 'name': 'principal-software-engineer-resume-example.pdf', 'size': 49047.0, 'type=': 'application/pdf', 'unique_id': '56251a61f71f4155a327fd58ddfb4074'}
Retrieved: {'id': 3.0, 'name': 'python-developer-resume-example.pdf', 'size': 33734.0, 'type=': 'application/pdf', 'unique_id': '9418ea8751b841aea6eb64f6f5870557'}
Retrieved: {'id': 3.0, 'name': 'python-developer-resume-example.pdf', 'size': 33734.0, 'type=': 'application/pdf', 'unique_id': 'fa57fdd87019475c9072c4e58c384652'}
Warning: Filtering on unique_id resulted in an empty list. Returning top results without filtering.
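In the log above, none of the "Retrieved" rows carry the unique_id of the session that was just uploaded. A small diagnostic like the following sketch could make that visible in code by counting which session IDs appear in an unfiltered result set. The helper names are hypothetical, and it assumes results are (document, score) pairs whose documents expose a `metadata` dict, as in the posted code.

```python
from collections import Counter


def session_ids_in_results(results):
    """Count how many hits came from each unique_id in (document, score) pairs."""
    return Counter(doc.metadata.get("unique_id") for doc, _score in results)


def hits_for_session(results, unique_id):
    """Number of hits belonging to one session; 0 means nothing to post-filter."""
    return session_ids_in_results(results)[unique_id]
```

If `hits_for_session` returns 0 for the current session, client-side filtering can only ever produce an empty list, whatever the value of k.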

Cory_Pinecone · October 9, 2023, 3:12pm

Thanks @chunyew24, let me review this and get back to you with more updates later.

But a quick suggestion to help improve your process: have you considered using namespaces for your sessions instead? It sounds like you aren’t querying across multiple sessions and if that’s accurate then segmenting by namespace would be much more efficient than using metadata filters.

silas · October 9, 2023, 4:44pm

From my understanding, Pinecone applies metadata filters after the semantic search process completes. It does not use the filter before the search to limit the set of possible matches.

If that’s what you need, then I think using namespaces like @Cory_Pinecone suggested would be the way to do that.

tim · October 9, 2023, 8:37pm

For sure you want to namespace sessions as it makes the “scope” apparent without having to do expensive metadata filtering. The bonus to this is the ability to quickly delete entire namespaces or “sessions” to keep your pod clear - since deleting by metadata would be pretty long-running depending on the amount of data.

Namespaces are free, so just make a UUID for each namespace and track it in your application DB so you can tie the user session to the Pinecone namespace to focus queries.
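A minimal sketch of that bookkeeping, using an in-memory SQLite table as a stand-in for the application DB. The table name, the function names, and the session IDs are hypothetical; `end_session` assumes `index` is a Pinecone index object whose `delete` method accepts `delete_all=True` together with a `namespace`.

```python
import sqlite3
import uuid


def open_session_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the (illustrative) application DB and create the mapping table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS session_namespaces ("
        "session_id TEXT PRIMARY KEY, namespace TEXT NOT NULL)"
    )
    return conn


def namespace_for_session(conn: sqlite3.Connection, session_id: str) -> str:
    """Return the namespace tied to a session, creating a new UUID if needed."""
    row = conn.execute(
        "SELECT namespace FROM session_namespaces WHERE session_id = ?",
        (session_id,),
    ).fetchone()
    if row is not None:
        return row[0]
    namespace = uuid.uuid4().hex
    conn.execute(
        "INSERT INTO session_namespaces (session_id, namespace) VALUES (?, ?)",
        (session_id, namespace),
    )
    conn.commit()
    return namespace


def end_session(conn: sqlite3.Connection, index, session_id: str) -> None:
    """Delete every vector in the session's namespace, then drop the mapping."""
    row = conn.execute(
        "SELECT namespace FROM session_namespaces WHERE session_id = ?",
        (session_id,),
    ).fetchone()
    if row is None:
        return  # nothing was ever stored for this session
    index.delete(delete_all=True, namespace=row[0])
    conn.execute("DELETE FROM session_namespaces WHERE session_id = ?", (session_id,))
    conn.commit()
```

The same namespace string would then be passed on both the upsert and the query side, so every query is scoped to one session.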

Cory_Pinecone · October 10, 2023, 2:35pm

@silas that’s not correct; metadata filtering happens first. So the only vectors that are searched against are those that match your filter.
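The difference between filtering first and filtering afterwards can be illustrated with a toy, in-memory sketch. This is not how Pinecone is implemented; it only shows why searching everything and filtering in client code can come back empty, while restricting the candidates first cannot miss a session's vectors. All names and the sample data are made up for illustration.

```python
from typing import Dict, List, Tuple

Vector = Tuple[str, List[float], Dict[str, str]]  # (id, values, metadata)

# Two older vectors that score highest, plus two from the current session.
VECTORS: List[Vector] = [
    ("old-1", [1.0, 0.0], {"unique_id": "old"}),
    ("old-2", [0.9, 0.1], {"unique_id": "old"}),
    ("new-1", [0.5, 0.5], {"unique_id": "new"}),
    ("new-2", [0.1, 0.9], {"unique_id": "new"}),
]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(vectors: List[Vector], query: List[float], k: int) -> List[Vector]:
    """Return the k vectors with the highest dot-product score."""
    return sorted(vectors, key=lambda v: dot(v[1], query), reverse=True)[:k]


def post_filter(vectors, query, k, unique_id):
    """Search everything first, then discard hits from other sessions."""
    return [v for v in top_k(vectors, query, k) if v[2]["unique_id"] == unique_id]


def pre_filter(vectors, query, k, unique_id):
    """Restrict the candidates to one session, then search only those."""
    candidates = [v for v in vectors if v[2]["unique_id"] == unique_id]
    return top_k(candidates, query, k)
```

With the query `[1.0, 0.0]` and k=2, `post_filter(VECTORS, [1.0, 0.0], 2, "new")` is empty because both top hits belong to the old session, whereas `pre_filter` with the same arguments returns the two vectors of the new session.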

I still think using namespaces would be more efficient in this case. @tim makes some good points about data management in namespaces, too.

chunyew24 · October 12, 2023, 5:00pm

I’ve recently made changes to my code to transition from using a unique ID system to a namespace-based approach for handling documents in Pinecone. However, after making this change, I’m encountering an issue where the result returned is an empty list, even though the documents were pushed successfully.

Here’s a brief overview of the changes I made:

Previously, every batch of documents pushed to Pinecone was associated with a unique ID. I’ve now changed this to a namespace-based approach.
When I query the Pinecone index to retrieve similar documents based on a given job description, I’m expecting the documents within the same namespace to be returned. However, the result is consistently an empty list.
Here’s a snippet of the relevant functions:

# Imports assumed by this snippet (not shown in the original post);
# PINECONE_INDEX_NAME is defined elsewhere in the application.
import pinecone
from langchain.chains.summarize import load_summarize_chain
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Pinecone


# Create and return an embeddings instance for data
def create_embeddings_load_data():
    embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
    return embeddings


# Ensure that the Pinecone index exists, if not, create one
def ensure_pinecone_index_exists(index_name):
    existing_indexes = pinecone.list_indexes()
    if PINECONE_INDEX_NAME not in existing_indexes:
        pinecone.create_index(name=PINECONE_INDEX_NAME, dimension=384)


# Push data to Pinecone Vector Store
def push_to_pinecone(pinecone_apikey, pinecone_environment, namespace, embeddings, docs):
    pinecone.init(api_key=pinecone_apikey, environment=pinecone_environment)
    ensure_pinecone_index_exists(PINECONE_INDEX_NAME)
    Pinecone.from_documents(docs, embeddings, index_name=PINECONE_INDEX_NAME)


# Pull information from the Pinecone Vector Store
def pull_from_pinecone(pinecone_apikey, pinecone_environment, namespace, embeddings):
    pinecone.init(api_key=pinecone_apikey, environment=pinecone_environment)
    return Pinecone.from_existing_index(PINECONE_INDEX_NAME, embeddings)


# Retrieve documents from vector store that are similar to a given query
def similar_docs(query, k, pinecone_apikey, pinecone_environment, namespace, embeddings):
    index = pull_from_pinecone(pinecone_apikey, pinecone_environment, namespace, embeddings)
    all_similar_docs = [doc for doc in index.similarity_search_with_score(query, int(k))
                        if 'namespace' in doc[0].metadata and doc[0].metadata['namespace'] == namespace]
    return all_similar_docs


# Generate a summary for a given document
def get_summary(current_doc):
    llm = OpenAI(temperature=0)
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    summary = chain.run([current_doc])
    return summary

I’ve confirmed that the documents are being pushed correctly with the right namespace metadata. But, the similar_docs function always returns an empty list. I’ve added checks to ensure that the document has the ‘namespace’ metadata before processing, yet the result remains unchanged.

silas · October 13, 2023, 5:31pm

One thing that jumps out at me is this line seems like it could be filtering out results, so I would start there in your debugging.

all_similar_docs = [doc for doc in index.similarity_search_with_score(query, int(k))
                    if 'namespace' in doc[0].metadata and doc[0].metadata['namespace'] == namespace]

Based on how the Pinecone APIs work, you don’t need to do namespace filtering at the client level when you’re processing search results. Just specify the namespace you want the query to be limited to, and Pinecone will only search that namespace. Any search results are guaranteed to be from that namespace.
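As a rough sketch of what that could look like while staying with LangChain, assuming its Pinecone wrapper accepts a `namespace` keyword on both `from_documents` and `similarity_search_with_score` (worth verifying for the installed version). The function names are hypothetical; `vectorstore_cls` would be LangChain's `Pinecone` class and `store` the object returned by `from_existing_index`.

```python
def push_to_namespace(vectorstore_cls, docs, embeddings, index_name, namespace):
    """Write the documents into one namespace instead of the default one.

    Assumption: vectorstore_cls.from_documents forwards namespace= to the upsert.
    """
    return vectorstore_cls.from_documents(
        docs, embeddings, index_name=index_name, namespace=namespace
    )


def similar_docs_in_namespace(store, query, k, namespace):
    """Search a single namespace; the results need no client-side filtering."""
    return store.similarity_search_with_score(query, k=int(k), namespace=namespace)
```

Note that the namespace has to be given on both sides: documents pushed without it land in the default namespace, and a query scoped to the session's namespace would then find nothing.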

Also, I think using LangChain sometimes introduces unnecessary or overcomplicated layers of abstraction vs. just using the Pinecone SDK directly, which is easy to do. You can reference this post that contains lots of examples of using the Pinecone SDK: https://www.ninetack.io/post/first-steps-with-pinecone-db
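For reference, a minimal sketch of namespace-scoped writes and reads against the Pinecone SDK directly. It assumes `index` is a `pinecone.Index` whose `upsert` and `query` methods accept a `namespace` argument, and that the embedding vectors have already been computed; the function names are hypothetical.

```python
def upsert_session_docs(index, namespace, ids, vectors, metadatas):
    """Write one session's vectors into its own namespace."""
    items = [
        {"id": item_id, "values": values, "metadata": metadata}
        for item_id, values, metadata in zip(ids, vectors, metadatas)
    ]
    index.upsert(vectors=items, namespace=namespace)


def query_session(index, namespace, query_vector, k):
    """Search only the given namespace and return the raw query response."""
    return index.query(
        vector=query_vector,
        top_k=int(k),
        namespace=namespace,
        include_metadata=True,
    )
```

Here the namespace does the scoping, so no `unique_id` or `namespace` metadata check is needed on the results.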