Let’s say I have a set of candidate vector IDs (e.g. from an external system), and I want to sort them by similarity score to a given query. How can I express this as a Pinecone query?
This does not seem to be well supported by Metadata Filtering, as the docs say:
High cardinality consumes more memory: Pinecone indexes metadata to allow
for filtering. If the metadata contains many unique values — such as a unique
identifier for each vector — the index will consume significantly more
memory. Consider using selective metadata indexing to avoid indexing high-cardinality metadata that is not needed for filtering.
Interesting. Filtering against a list of vector IDs isn’t supported, and as you rightly pointed out, high-cardinality metadata can impact overall index performance. But there might be an option using sparse-dense vectors instead: you could encode the vector IDs into the sparse vectors and filter on them in the query.
This notebook gives an example of doing something similar, in e-commerce search when filtering on different types of products, and the principle is basically the same.
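A minimal sketch of the sparse-vector idea above, not a confirmed Pinecone recipe. It assumes each record is upserted with one extra sparse dimension derived from its ID, and that the query's sparse vector lists the dimensions of all candidate IDs. The helper names (`id_to_sparse_index`, `sparse_for_record`, `sparse_for_query`) and the `weight` parameter are hypothetical; only the `{"indices": [...], "values": [...]}` shape follows Pinecone's sparse vector format. As far as I understand, sparse matches add to the score rather than excluding records, so this behaves like a boost and not a hard filter.

```python
import zlib


def id_to_sparse_index(vector_id: str) -> int:
    """Map a vector ID to a stable unsigned 32-bit sparse index.

    crc32 is stable across processes (unlike Python's built-in hash()),
    but two different IDs can collide on the same index.
    """
    return zlib.crc32(vector_id.encode("utf-8"))


def sparse_for_record(vector_id: str) -> dict:
    """Sparse part to store with a record: a single dimension for its own ID."""
    return {"indices": [id_to_sparse_index(vector_id)], "values": [1.0]}


def sparse_for_query(candidate_ids, weight: float = 1.0) -> dict:
    """Sparse part of the query: one dimension per candidate ID.

    `weight` is a hypothetical knob; a larger value pushes candidates
    further above non-candidates in the combined score.
    """
    indices = sorted({id_to_sparse_index(v) for v in candidate_ids})
    return {"indices": indices, "values": [weight] * len(indices)}
```

If the project already uses the sparse vector for keyword terms, the ID dimensions would have to share that space, which is where tuning gets harder.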
Thanks for the response, @Cory_Pinecone . So in that case, I would add all of the vector IDs to the query itself? Since I’m already using the sparse vector in this project, I worry that adding in the IDs might make it hard to tune the alpha parameter, but it’s worth a shot.
Having the very same challenge here. It’s a bit baffling that this is not readily available as a function.
What I’m doing is fetching the vectors by ID, calculating the similarity manually, and then ranking. That solved it for me, but I’m still in disbelief that this is not implemented yet and I have to do it manually. Feels like I’m doing something wrong.
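For anyone who wants to try the manual approach described above, here is a rough sketch. It assumes the candidate vectors have already been retrieved into a plain dict of ID to values (for example via the Pinecone client's `index.fetch(ids=...)`, which is an assumption about your setup) and that cosine similarity is the metric you want. The function names are made up for illustration.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Score every candidate against the query and sort best-first.

    `candidates` maps vector ID -> list of floats. Returns a list of
    (vector_id, score) tuples in descending score order.
    """
    scored = [
        (vector_id, cosine_similarity(query, values))
        for vector_id, values in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Toy data standing in for fetched vectors.
candidates = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = rank_candidates([1.0, 0.0], candidates)
```

This does a full scan of the candidate set, so it only makes sense when the list of IDs is small enough to fetch and score client-side.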
Are there any plans to support this in the near future?
This is absolutely a required feature for us and we’ll have to use a different DB if it is not supported.
And since we use sparse vectors for our similarity query (similar to the issue @steve1 had), filtering by IDs in the sparse vector isn’t a great workaround for us.
We are also forced to implement the same solution as hetnon, but there is little point in us using Pinecone and its vector indexing if we need to calculate all the similarity scores manually.
Any update on this now that Pinecone is serverless? I assume memory isn’t much of an issue anymore. But is query speed slow when you do this filtering?
Did anyone find other vector DBs that support this filtering out of the box?