Let’s say I have a set of candidate vector IDs (e.g. from an external system), and I want to rank them by similarity score to a given query. How can I express this as a Pinecone query?
This does not seem to be well supported by metadata filtering, as the docs say:
High cardinality consumes more memory: Pinecone indexes metadata to allow
for filtering. If the metadata contains many unique values — such as a unique
identifier for each vector — the index will consume significantly more
memory. Consider using selective metadata indexing to avoid indexing high-cardinality metadata that is not needed for filtering.
To clarify, you want to run a query using a vector to get similar ones but limit the results to a fixed set of vector IDs. Is that right?
Hey @Cory_Pinecone, yes that’s exactly right!
Interesting. Filtering against a list of vector IDs isn’t supported, and as you rightly pointed out, high-cardinality metadata can hurt overall index performance. But there might be an option using sparse-dense vectors instead. You could encode each vector’s ID into its sparse vector, then include the candidate IDs in the query’s sparse vector and use that to filter against.
This notebook gives an example of doing something similar, but in e-commerce search when filtering on different types of products. But the principle is basically the same.
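For illustration, here is a minimal sketch of that encoding in plain Python, with no Pinecone calls. The helper names, the CRC32 hashing, and the boost value are all assumptions of mine, not something from the notebook. It also assumes a dotproduct index, where the score is the dense dot product plus the sparse dot product, so strictly speaking this boosts the candidates to the top rather than filtering the others out.

```python
import zlib


def id_to_sparse_index(vector_id: str) -> int:
    """Map a vector ID to a stable non-negative 32-bit integer.

    CRC32 is an arbitrary choice for this sketch; distinct IDs can collide.
    """
    return zlib.crc32(vector_id.encode("utf-8"))


def sparse_tag_for(vector_id: str) -> dict:
    """Sparse values stored with a vector: a single entry that encodes its own ID."""
    return {"indices": [id_to_sparse_index(vector_id)], "values": [1.0]}


def candidate_query_sparse(candidate_ids, boost: float = 1000.0) -> dict:
    """Sparse part of the query: one entry per candidate ID, each weighted by `boost`.

    `boost` is a hypothetical value; it must exceed the spread of the dense
    scores for candidates to always rank ahead of non-candidates.
    """
    indices = sorted({id_to_sparse_index(v) for v in candidate_ids})
    return {"indices": indices, "values": [boost] * len(indices)}


def hybrid_score(dense_q, sparse_q, dense_v, sparse_v) -> float:
    """Local stand-in for dot-product scoring: dense dot product plus sparse dot product."""
    dense = sum(q * v for q, v in zip(dense_q, dense_v))
    stored = dict(zip(sparse_v["indices"], sparse_v["values"]))
    sparse = sum(stored.get(i, 0.0) * w
                 for i, w in zip(sparse_q["indices"], sparse_q["values"]))
    return dense + sparse
```

With the Python client, the output of sparse_tag_for would go in each vector's sparse_values at upsert time and the output of candidate_query_sparse in the query's sparse_vector, with top_k set to the number of candidates. Check the current limit on non-zero sparse values per vector before trying this with a large candidate list.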
Thanks for the response, @Cory_Pinecone. So in that case, I would add all of the vector IDs to the query itself? Since I’m already using the sparse vector in this project, I worry that adding in the IDs might make it hard to tune the alpha parameter, but it’s worth a shot.
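One way the alpha concern might be handled, as a rough sketch only: apply the alpha weighting to the dense vector and the existing term weights, and append the ID entries afterwards with a fixed boost so they do not move when alpha changes. Everything here is an assumption for illustration: the function names, the ID_OFFSET split that reserves the upper half of the 32-bit index range for ID tags, and the boost value. It also assumes the convex-combination style of weighting, where dense values are scaled by alpha and sparse values by 1 - alpha.

```python
import zlib

# Hypothetical convention: term indices stay below ID_OFFSET, ID tags sit at or above it.
ID_OFFSET = 2 ** 31


def id_index(vector_id: str) -> int:
    """Hash a vector ID into the reserved upper half of the uint32 index range."""
    return ID_OFFSET + (zlib.crc32(vector_id.encode("utf-8")) % ID_OFFSET)


def build_hybrid_query(dense, term_sparse, candidate_ids, alpha, id_boost=1000.0):
    """Scale dense by alpha and term weights by (1 - alpha); leave ID entries unscaled."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if any(i >= ID_OFFSET for i in term_sparse["indices"]):
        raise ValueError("term indices must stay below ID_OFFSET")

    scaled_dense = [alpha * x for x in dense]
    indices = list(term_sparse["indices"])
    values = [(1.0 - alpha) * v for v in term_sparse["values"]]

    # ID entries are appended after scaling, so tuning alpha does not change them.
    for idx in sorted({id_index(v) for v in candidate_ids}):
        indices.append(idx)
        values.append(id_boost)

    return scaled_dense, {"indices": indices, "values": values}
```

The stored vectors would need their own ID tagged at id_index(vector_id) with value 1.0 alongside their existing term weights. Hash collisions between IDs are still possible, so treat the result as a boosted ranking to verify, not a guaranteed filter.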