Question about searching for indexes with a keyword in their metadata?

Hi, first post here. I didn’t find my specific question from a quick search.

I’m indexing question/answer pairs (each under its own namespace of the same index) and then attaching the original question and answer to each index item as metadata. I’d like to be able to pull back all items where the metadata contains a keyword. My objective is an app feature where my admins can pull back all index items with a matching keyword, and then edit/improve the question/answer pairs.

I see that I can search for index items where the metadata equals X, Y, Z keywords (or equals one of X, Y, Z keywords), but I’m not seeing a way to have the search pull back items where an X, Y, or Z keyword is contained within a longer metadata sentence. E.g., if I search for the keyword “PCI” I would ideally like to return any items where the question metadata includes that term: “Is your company PCI compliant”, “Is your PCI/DSS certification up to date”, etc.

Am I missing a way to do this? Or is there a better way to carry the question and answer along with the index item as it’s being fetched in real time that would support this “metadata substring” sort of searching capability?
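For context, here is a minimal sketch of the two sides of this question, assuming the Python pinecone-client; the index name, namespace, and the “question” metadata field are illustrative, not real API details. The filter language supports exact matches, so one workaround for substrings is to over-fetch with metadata included and filter client-side:

```python
# Sketch under assumptions: the Python pinecone-client, a namespace "rfp_q",
# and a "question" metadata field -- all names are illustrative.

def exact_match_filter(keywords):
    """The kind of filter Pinecone metadata filtering supports:
    the 'question' field must equal one of the given strings exactly."""
    return {"question": {"$in": list(keywords)}}

def matches_keyword(metadata, keyword):
    """Substring matching is not available in the filter language, so one
    workaround is to over-fetch with include_metadata=True and then apply
    a case-insensitive substring check like this in application code."""
    return keyword.lower() in metadata.get("question", "").lower()

def search_by_keyword(index, query_embedding, keyword, top_k=100):
    """Query a namespace, then keep only matches whose question metadata
    contains the keyword."""
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        namespace="rfp_q",
        include_metadata=True,
    )
    return [m for m in results["matches"]
            if matches_keyword(m["metadata"], keyword)]
```

The obvious downside of the client-side workaround is that `top_k` caps how many candidates you can scan, so it is a stopgap rather than a real keyword index.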

Thanks for reading!

Hi @chris.cappetta,

When your admins edit the question/answer pairs, is the idea that they’ll replace the answer with a different one? And how would they determine what a better answer would be? Are they only updating metadata, or are they updating the vectors?

And are you storing just a single question/answer pair per namespace? That doesn’t sound like a very efficient approach. Can you share the use case you’re working with that requires you to segment your data that way?

This sounds like a very lexical, relational approach to search. Using semantic search and simply updating your vectors over time would seem much easier and wouldn’t require you to store all that metadata in the first place. But maybe there’s a piece to this I’m missing, so if you could expand on how your app is being built and how you envision using it, that would be a big help.

Thanks for the reply @Corey_pinecone !

Yep I’m happy to provide more detail, and it could well be that I’m not doing this in the most elegant way – it is my first foray into this sort of tech so I’m eager for your insights!

For the use case – we are an integration company, and to explore this tech I am building a Slack bot. When a user asks a question, the bot design will first check for a similar question via OpenAI embeddings and this Pinecone data set. If it doesn’t find a close match, the bot design will fall back to a natural-language answer via OpenAI completions. It’s still very much in the fiddling and testing phase with a small group – nothing production, and we are favoring speed of iteration over perfect accuracy in the dataset at this phase.

The idea is that the admins will have an option to search for an answer and replace it with a different, better one (in a UI that we are already building – the key question is just the API call). The group of admins are experts on different aspects of the question/answer set. We are initially crowdsourcing a big set of Q&A data but would also like the ability to let, e.g., our expert on EDI improve an EDI answer that someone else may have provided. The main focus will be on updating the metadata, but I might also change the vector of a given index ID. For clarity, I have the update mechanisms sorted out; it’s just the ability to give these folks a keyword-style lookup mechanism via the API – to find specific items based on their keyword appearing in the metadata – that has me stumped currently.

For your question on unique namespaces per pair – no, we are using two main namespaces and indexing all of the questions and answers in their respective namespaces (but mainly using the question for the search currently). There is, though, a unique metadata value for each question and answer.

E.g., an index item may have a namespace of rfp_q with the vector of the question and rfp_a with the vector of the answer, and then each would also have metadata “question”:“Who are you” and “answer”:“I am me” – the same JSON keys (question and answer) with unique JSON values for each index item (the specific questions and answers). This aggregated semantic search article aligns somewhat with what we’ve been setting up: Semantic Search.
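A rough sketch of that layout – the namespace and metadata field names are taken from the description above, while the function and IDs are hypothetical:

```python
# Sketch of the two-namespace layout described above: one Q&A pair yields
# two vectors (question and answer) that both carry the raw text as metadata.
# The embeddings and index handle are placeholders, not real API details.

def build_upserts(item_id, question, answer, q_vec, a_vec):
    """Return per-namespace (id, vector, metadata) tuples for one Q&A pair:
    the question vector goes to rfp_q, the answer vector to rfp_a."""
    meta = {"question": question, "answer": answer}
    return {
        "rfp_q": [(item_id, q_vec, meta)],
        "rfp_a": [(item_id, a_vec, meta)],
    }

# With a real client, you would then do something like:
# for namespace, vectors in build_upserts(...).items():
#     index.upsert(vectors=vectors, namespace=namespace)
```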

Could you provide more info on what you envision with storing just the vectors and not the metadata? Would that mean, for example, that the raw question and answer text is stored in a different database with the Pinecone index ID as a reference column, and after the Pinecone semantic search the raw Q&A is pulled from that other DB to be presented to the ‘asker’? Or is there some other place within the Pinecone index item where the raw Q&A text could be included?
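To make the external-store variant being asked about concrete (every name here is hypothetical): only vectors and IDs live in Pinecone, the raw text lives elsewhere keyed by the same IDs, and the two are joined after the query:

```python
# Hypothetical sketch: raw Q&A text is kept outside Pinecone, keyed by the
# same IDs as the vectors; after a semantic query we join by ID.

qa_store = {  # stand-in for a relational table keyed by vector ID
    "q1": {"question": "Who are you", "answer": "I am me"},
    "q2": {"question": "Is your company PCI compliant", "answer": "Yes"},
}

def attach_text(matches, store):
    """For each vector-search match, look up the stored question/answer
    text by ID, keeping the similarity score from the vector search."""
    return [
        {"id": m["id"], "score": m["score"], **store[m["id"]]}
        for m in matches
        if m["id"] in store
    ]
```

A nice side effect of this design is that keyword search over the raw text becomes an ordinary database query (e.g. SQL `LIKE` or full-text search), independent of the vector index.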

So yeah, to summarize – the Pinecone portion of the use case is to see if there is a question in our Pinecone data set similar to what a bot user is currently asking, and if so return the ‘curated’ answer associated with that matched question. The heart of the question here is how we can pull questions where the metadata matches a keyword; the edit and upsert mechanisms I do have sorted out. There may very well be a better approach than carrying those Q&As via the metadata, and I would love to hear how you might recommend this be solved more elegantly!

Thanks again for reading and for your feedback!

> For the use case – we are an integration company, and to explore this tech I am building a Slack bot. When a user asks a question, the bot design will first check for a similar question via OpenAI embeddings and this Pinecone data set. If it doesn’t find a close match, the bot design will fall back to a natural-language answer via OpenAI completions. It’s still very much in the fiddling and testing phase with a small group – nothing production, and we are favoring speed of iteration over perfect accuracy in the dataset at this phase.

Nice mix of existing embeddings and NLP answers via GPT. I like it.

> The idea is that the admins will have an option to search for an answer and replace it with a different, better one (in a UI that we are already building – the key question is just the API call). The group of admins are experts on different aspects of the question/answer set. We are initially crowdsourcing a big set of Q&A data but would also like the ability to let, e.g., our expert on EDI improve an EDI answer that someone else may have provided. The main focus will be on updating the metadata, but I might also change the vector of a given index ID. For clarity, I have the update mechanisms sorted out; it’s just the ability to give these folks a keyword-style lookup mechanism via the API – to find specific items based on their keyword appearing in the metadata – that has me stumped currently.

Ok, I see what you’re going for here. Letting the experts refine the answers over time does make sense, and storing tokens/keywords as metadata is one way to do it. I wonder if it will be the most performant approach, though. We have a new sparse/dense vector index under development; it should be in public preview in the next couple of weeks. I think that approach will be much more efficient in the long run for finding specific vectors that match a keyword, rather than filtering on metadata. Metadata filtering can be very fast but is really only designed for values that have low cardinality. If each vector has its own set of metadata/keywords, that can have an impact on both performance and the number of vectors you can fit in a single pod (meaning more pods per index, and higher costs overall).
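To illustrate the general idea behind sparse/dense search (this is a toy bag-of-words sketch, not the Pinecone preview API, and the vocabulary and function names are made up): a sparse vector encodes keyword presence as index/value pairs over a vocabulary, so shared keywords contribute directly to the match score instead of going through a metadata filter:

```python
# Illustrative only: a toy sparse encoding of text as {term_index: count}
# pairs, the general shape that hybrid sparse/dense retrieval builds on.
# Learned sparse models like SPLADE produce weighted vectors of this form.

def sparse_encode(text, vocabulary):
    """Map a text to {term_index: count} over a fixed vocabulary --
    a bag-of-words stand-in for a learned sparse model."""
    counts = {}
    for token in text.lower().split():
        if token in vocabulary:
            idx = vocabulary[token]
            counts[idx] = counts.get(idx, 0) + 1
    return counts

def sparse_dot(a, b):
    """Score two sparse vectors; shared keywords raise the score,
    which is how a keyword like 'PCI' surfaces matching items."""
    return sum(v * b[i] for i, v in a.items() if i in b)
```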

Keep an eye on our newsletter and learning center, and follow our LinkedIn page. That’s where we’ll announce when the private preview for sparse/dense vectors is available.

In the meantime, this is an overview of a very popular sparse/dense model: https://dl.acm.org/doi/10.1145/3404835.3463098

We also have an article on the difference between sparse and dense vectors: Dense Vectors | Pinecone

Got it thanks very much for the feedback! That all makes sense.