I’m using Pinecone Serverless to store 9M vectorized chunks of text. Each chunk is assigned to one or more categories, enabling users to filter their searches by category.
Our system organizes categories in a tree structure, where each category has a parent_category_id. When a user searches within a root category, the filter must include all its descendant categories. The largest root category has around 5000 descendants.
So the chunk metadata would look like:
{
  "category_ids": [123, 234, 345],
  ...
}
And the search filter would be:
{
  "category_ids": {"$in": [ possibly thousands of values here ]}
}
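For what it's worth, building that large filter is straightforward on the client side: expand the root category into its descendant IDs with a BFS over the parent_category_id tree, then pass the list to $in. A minimal sketch, assuming a `{child_id: parent_category_id}` mapping is available (the helper name and mapping shape are hypothetical, not from your schema):

```python
from collections import deque

def descendant_ids(root_id, parent_of):
    """Collect root_id plus all of its descendants, given a
    {child_id: parent_category_id} mapping (hypothetical shape)."""
    # Invert the parent map into child lists once, then BFS.
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)
    ids, queue = [], deque([root_id])
    while queue:
        cid = queue.popleft()
        ids.append(cid)
        queue.extend(children.get(cid, []))
    return ids

# Toy tree: 1 -> {2, 3}, 2 -> {4}
parent_of = {2: 1, 3: 1, 4: 2}
cat_ids = descendant_ids(1, parent_of)
pinecone_filter = {"category_ids": {"$in": cat_ids}}
```

The resulting `pinecone_filter` dict is what you'd pass as the `filter` argument of a query.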
My questions are:
Could passing 5000 values to the $in filter result in low recall or poor performance?
Let’s brainstorm a bit: can you tell us more about the underlying dataset and how you’re using the results? I’m not even certain you can pass 5000 values to the $in filter (though I can find out what the actual limit is). 5000 categories suggests a detailed taxonomy whose structure may already overlap with the semantic meaning of the chunks, but I don’t know your dataset, so it’s hard to say.
I have a feeling that using a reranker, or post-filtering your top results, will be much more efficient here. Tell us more and we can think it through!
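To illustrate the post-filtering idea: over-fetch top_k, then keep only matches whose category_ids metadata intersects the allowed set. A minimal sketch, assuming each match is shaped like a Pinecone query result dict with category_ids in its metadata (as in the question):

```python
def post_filter(matches, allowed_ids, top_k=10):
    """Keep matches whose metadata category_ids intersect allowed_ids.

    `matches` is assumed to be a list of dicts shaped like query
    results: {"id": ..., "score": ..., "metadata": {...}}.
    """
    allowed = set(allowed_ids)
    kept = [m for m in matches
            if allowed & set(m["metadata"].get("category_ids", []))]
    return kept[:top_k]

# Usage with two toy matches: only "a" is in an allowed category.
matches = [
    {"id": "a", "score": 0.9, "metadata": {"category_ids": [123]}},
    {"id": "b", "score": 0.8, "metadata": {"category_ids": [999]}},
]
filtered = post_filter(matches, allowed_ids=[123, 234])
```

The trade-off versus the $in filter is that you must over-fetch (request a larger top_k than you need) so that enough results survive the filter.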
OK, thank you for the extra detail! Yes, you’re right: you should be able to pass that many values to the filter, though it’s hard to predict offhand what the performance hit might be.
If you are interested in fetching the entire contents of a section of the law to pass to the LLM, that’s likely more efficiently done as a fetch from a relational DB that stores the law’s contents structured by section. But please correct me if I’m still not quite understanding the user flow here and how you determine what gets sent to the LLM.
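To make the relational-DB alternative concrete, here is a sketch using SQLite with a recursive CTE to pull a section and all of its subsections in one query. The table and column names (`sections`, `parent_id`, `heading`, `body`) are hypothetical placeholders, not your actual schema:

```python
import sqlite3

# Hypothetical schema: sections(id, parent_id, heading, body).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sections (id INTEGER PRIMARY KEY, parent_id INTEGER,
                       heading TEXT, body TEXT);
INSERT INTO sections VALUES
  (1, NULL, 'Title I',   ''),
  (2, 1,    'Sec. 1',    'First section text.'),
  (3, 1,    'Sec. 2',    'Second section text.'),
  (4, 3,    'Sec. 2(a)', 'Subsection text.');
""")

def fetch_section_text(conn, root_id):
    """Return the full text of a section plus all its subsections,
    walking the tree with a recursive CTE instead of a vector query."""
    rows = conn.execute("""
        WITH RECURSIVE tree(id) AS (
            SELECT id FROM sections WHERE id = ?
            UNION ALL
            SELECT s.id FROM sections s JOIN tree t ON s.parent_id = t.id
        )
        SELECT body FROM sections WHERE id IN (SELECT id FROM tree)
        ORDER BY id
    """, (root_id,)).fetchall()
    return "\n".join(r[0] for r in rows if r[0])
```

With this shape, "give the LLM all of Sec. 2" is a single indexed tree walk rather than a filtered similarity search over 9M chunks.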