Question: how do I search an index for vectors with a keyword in their metadata?

For the use case: we are an integration company, and to explore this tech I am building a Slack bot. When a user asks a question, the bot first checks for a similar question via OpenAI embeddings and this Pinecone dataset. If it doesn't find a close match, it falls back to a natural language answer via OpenAI completions. It's still very much in the fiddling and testing phase with a small group – nothing production-grade, and we are favoring speed of iteration over perfect accuracy in the dataset at this stage.

Nice mix of existing embeddings and NLP answers via GPT. I like it.

The idea is that the admins will have an option to search for an answer and replace it with a different, better one (in a UI we are already building – the key question is just the API call). The admins are experts on different aspects of the question/answer set. We are initially crowdsourcing a big set of Q&A data, but would also like the ability to let, e.g., our expert on EDI improve an EDI answer that someone else may have provided. The main focus will be on updating the metadata, but I might also change the vector for a given ID. For clarity, I have the update mechanisms sorted out; it's just the ability to give these folks a keyword-style lookup via the API – finding specific items based on a keyword appearing in their metadata – that has me stumped currently.

Ok, I see what you're going for here. Letting the experts refine the answers over time makes sense, and storing tokens/keywords as metadata is one way to do it. I wonder if it will be the most performant approach, though. We have a new sparse/dense vector index under development; it should be in public preview in the next couple of weeks. I think that approach will be much more efficient in the long run for finding specific vectors that match a keyword, rather than filtering on metadata. Metadata filtering can be very fast, but it is really designed for values with low cardinality. If each vector has its own set of metadata keywords, that can affect both performance and the number of vectors you can fit in a single pod (meaning more pods per index, and higher costs overall).
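That said, the metadata route does work today. A keyword lookup against a list-valued metadata field would look roughly like this – assuming each vector was upserted with a `keywords` list in its metadata (the index name, field names, and embedding dimension below are illustrative, not from your setup):

```python
# Sketch of a keyword lookup via Pinecone metadata filtering.
# Assumes vectors were upserted with metadata like:
#   {"question": "...", "answer": "...", "keywords": ["edi", "x12"]}
# This builds the query kwargs locally; the live call is shown in comments.

def build_keyword_query(keyword: str, dimension: int = 1536, top_k: int = 10) -> dict:
    """Build kwargs for index.query(): match vectors whose list-valued
    'keywords' metadata field contains the given keyword."""
    return {
        # A zero vector lets the filter do the real narrowing when you
        # only care about keyword membership, not similarity ranking.
        "vector": [0.0] * dimension,
        # $in matches when a list-valued metadata field contains any of
        # the supplied values; lowercase at upsert and query time so the
        # match is case-insensitive.
        "filter": {"keywords": {"$in": [keyword.lower()]}},
        "top_k": top_k,
        "include_metadata": True,
    }

# With a live index and API key this would be:
#   import pinecone
#   pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENV")
#   index = pinecone.Index("qa-index")   # hypothetical index name
#   results = index.query(**build_keyword_query("EDI"))
#   for m in results["matches"]:
#       print(m["id"], m["metadata"].get("question"))
print(build_keyword_query("EDI")["filter"])
```

The main design caveat is the cardinality issue above: a unique keyword set per vector makes the metadata index large, so treat this as a stopgap for a small dataset.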

Keep an eye on our newsletter and learning center, and follow our LinkedIn page. That's where we'll announce when the preview for sparse/dense vectors is available.

In the meantime, this is an overview of a very popular sparse/dense model: https://dl.acm.org/doi/10.1145/3404835.3463098

We also have an article on the difference between sparse and dense vectors: https://www.pinecone.io/learn/dense-vector-embeddings-nlp/#dense-vs-sparse-vectors
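Purely as a sketch of the idea: a hybrid query pairs your dense embedding with a sparse vector whose non-zero entries are keyword weights (as in the SPLADE-style model linked above). The parameter names and the `{"indices": ..., "values": ...}` layout here are my assumptions, since the hybrid index isn't public yet – this just shows the shape of the data:

```python
# Hypothetical shape of a sparse/dense (hybrid) query, sketched before the
# feature is publicly available. The "sparse_vector" parameter name and its
# {"indices": ..., "values": ...} layout are assumptions based on common
# sparse-vector encodings, not a released Pinecone API.

def build_hybrid_query(dense, sparse_indices, sparse_values, top_k=10):
    """Combine a dense embedding with a sparse keyword-weight vector.
    sparse_indices are vocabulary positions; sparse_values their weights."""
    assert len(sparse_indices) == len(sparse_values)
    return {
        "vector": dense,
        "sparse_vector": {"indices": sparse_indices, "values": sparse_values},
        "top_k": top_k,
        "include_metadata": True,
    }

# e.g. a SPLADE-style encoding of a query mentioning "EDI" might produce a
# few non-zero term weights at (made-up) vocabulary positions 101 and 2048:
q = build_hybrid_query([0.1, 0.2, 0.3], [101, 2048], [0.9, 0.4])
print(q["sparse_vector"])
```

The point is that keyword relevance lives in the sparse vector itself, so lookups by term don't need per-vector metadata at all.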
