I found these docs about how metadata filtering works for serverless:
https://docs.pinecone.io/guides/data/filtering-with-metadata#considerations-for-serverless-indexes
“For string metadata, the metadata statistics for each cluster reflect a sampling of the values.”
Two questions come to mind:
- Let’s say the cluster contains only 1 record with metadata field x=A. What if this record wasn’t selected during sampling? Does that mean this record will never be found when using metadata filter x=A?
- How about metadata fields containing long strings that are unique across the entire dataset? Keeping statistics like MIN/MAX or sampling won’t work for filtering by such strings. How does Pinecone serverless deal with that?
Thanks!
Hi @maciej.pocwierz, thanks for the post. I’ll address your questions below:
- In this case, we will scan the clusters until we find the record that matches the filter x=A.
- The statistics for string metadata are a compressed set-membership data structure (a Bloom filter), which has a nonzero false-positive probability but zero false-negative probability. In most cases, we will handle unique strings well.
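
To illustrate the second point, here is a minimal sketch of how a per-cluster Bloom filter could be used to decide which clusters to scan for a string filter. This is not Pinecone's actual implementation: the class `BloomFilter`, the function `clusters_to_scan`, the parameters `num_bits` and `num_hashes`, and the sample cluster data are all hypothetical, chosen only to show why false negatives cannot occur.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (illustrative only, not Pinecone's code)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter packs bits.
        self.bits = bytearray(num_bits)

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(value))


def clusters_to_scan(filters, value):
    """Return the ids of clusters that cannot be ruled out for this value."""
    return [cid for cid, bf in filters.items() if bf.might_contain(value)]


# Hypothetical clusters and the string values of metadata field x in each.
cluster_values = {
    "c1": ["A", "B"],
    "c2": ["a-very-long-unique-string-0001"],
    "c3": ["C", "D", "E"],
}

filters = {}
for cid, values in cluster_values.items():
    bf = BloomFilter()
    for v in values:
        bf.add(v)
    filters[cid] = bf

# Always includes "c1"; may include other clusters on a false positive.
candidates = clusters_to_scan(filters, "A")
```

Because adding a value sets every bit position that a later lookup checks, the cluster holding that value can never be skipped, even if it holds only one such record. A false positive only costs an unnecessary scan of a cluster that turns out not to contain a match.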
@zeke_pinecone
Thanks for the reply, that's informative!
So my understanding is that when filtering by metadata on a pod-based index, you will always find the best-matching vectors, because filtering is done using standard indexing methods.
But with a serverless index, there are some cases where the best-matching vectors will be missed.
Could you confirm?