Hi all,
I inserted multiple CV (“curriculum vitae”) PDF files into Pinecone. Almost all of the CVs are for the same kind of job.
I created a bot, but when I query my Pinecone index through it, the bot doesn’t answer properly. It knows some information from the CVs but not, for example, the name of the person…
I guess it might be down to the way I insert the data into Pinecone?
Semantic search is going to find the items in your DB that are the most similar (in meaning) to the query, so it’s important to understand how your data will be queried in order to plan how your data should be indexed to support those queries.
You mentioned that all of your CVs are relevant to the same job, so likely they do contain somewhat similar info already, for example they might list similar skills, etc.
But it’s unclear whether the vector search itself is under-performing or whether it’s the bot consuming the search results that is under-performing.
Some questions to think about:
What type of queries do you need to support? Just look at the performance of the vector search on those types of queries. Are they returning the search results you expect? Once you’re happy with how the vector search is performing, then you can move on to consider the functionality of the bot itself, like the quality of the answers it’s giving.
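One way to look at the vector search on its own, before touching the bot, is to write down a handful of queries together with the chunk you would expect to come back, and count how often it shows up in the top results. The sketch below assumes you can wrap your search in a function that takes a query string and `k` and returns a list of chunk IDs; the name `hit_rate_at_k`, the chunk IDs, and the toy `fake_search` are all made up for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def hit_rate_at_k(
    search: Callable[[str, int], List[str]],
    labeled_queries: Iterable[Tuple[str, str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose expected chunk ID appears in the top-k results."""
    labeled = list(labeled_queries)
    if not labeled:
        return 0.0
    hits = sum(1 for query, expected_id in labeled if expected_id in search(query, k))
    return hits / len(labeled)


# Toy stand-in for a real vector search, only here to show the call shape.
def fake_search(query: str, k: int) -> List[str]:
    canned = {
        "who knows Kubernetes?": ["cv1-p3", "cv2-p1"],
        "candidate name": ["cv2-p4"],
    }
    return canned.get(query, [])[:k]


score = hit_rate_at_k(
    fake_search,
    [("who knows Kubernetes?", "cv1-p3"), ("candidate name", "cv1-p0")],
    k=2,
)
print(score)
```

If this number is low for the kinds of questions you care about, the problem is in indexing or retrieval rather than in the bot.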
Are you “chunking” the CVs i.e. breaking them up into smaller pieces? If you only create one vector embedding for an entire CV, it’s very unlikely to perform well against your queries, which might only relate to a small part of the CV.
What additional content (metadata) are you storing associated with each CV? You mentioned it doesn’t know the name of the person. Why would you expect it to know the name of the person unless you’re storing that detail in a way that it could reliably retrieve it? Without knowing the details of your implementation it’s hard to guess at what could be going on here.
Knowing that the content consists entirely of CVs, I would suggest chunking by paragraph: each paragraph in a CV is likely to be topical and unlikely to be “too large”, so the semantic search should perform well.
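A minimal sketch of paragraph chunking, assuming the text has already been extracted from the PDF and that paragraphs are separated by blank lines (PDF extraction doesn’t always give you that, so you may need to adapt the split). The function name and the `cv_id` / chunk ID scheme are just illustrative choices.

```python
import re
from typing import Dict, List


def chunk_by_paragraph(cv_id: str, text: str) -> List[Dict[str, str]]:
    """Split CV text on blank lines and tag each chunk with the CV it came from."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text)]
    non_empty = [p for p in paragraphs if p]
    return [
        {"id": f"{cv_id}-p{i}", "cv_id": cv_id, "text": para}
        for i, para in enumerate(non_empty)
    ]


sample = "Jane Doe\n\nSkills: Python\n\n\n\nExperience: 5 years"
for chunk in chunk_by_paragraph("cv1", sample):
    print(chunk["id"], "|", chunk["text"])
```

Each chunk would then be embedded separately, and the `cv_id` lets you get from any matching chunk back to the CV (and the person) it belongs to.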
You can associate whatever known data you have about each CV (name, location, job history, etc – basically whatever info would normally come through during the application process) with its chunks. You could put this associated data into Pinecone as metadata, but there’s really no reason you have to put it there.
The main reason that you might want to include the additional metadata is if you want to do metadata filtering on the search results, for example if you wanted to limit your search to only include those CVs where the person is located in San Diego.
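As a rough sketch of what that filtering looks like, assuming each vector was upserted with a `location` metadata field (that field name is my assumption, not something from your setup) and that `index` is a Pinecone index handle from the Python client:

```python
from typing import Any, List


def search_in_location(index: Any, query_vector: List[float], city: str, top_k: int = 5):
    """Run a similarity query restricted to vectors whose metadata matches the city."""
    return index.query(
        vector=query_vector,
        top_k=top_k,
        # Metadata filter: only consider vectors where location == city.
        # "location" is a hypothetical field you would have set at upsert time.
        filter={"location": {"$eq": city}},
        include_metadata=True,
    )
```

You would call it with the embedding of the user’s question, e.g. `search_in_location(index, query_embedding, "San Diego")`, where `query_embedding` comes from the same embedding model you used for the chunks.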
If you don’t need to do metadata filtering, then you could keep this associated data (and probably the entire original CV content) available in a more traditional data store, and you could just keep a pointer to the data in Pinecone so that when you’re processing a search result, you can use the pointer to retrieve the associated content.
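Here is a small sketch of that pointer pattern, using SQLite as the “traditional data store” purely for illustration. The table layout, the `cv_id` metadata key, and the sample row are all made up; the `matches` list is a plain-dict stand-in shaped like the matches a Pinecone query returns (ID, score, metadata).

```python
import sqlite3
from typing import Dict, List


def make_store() -> sqlite3.Connection:
    """Create an in-memory table holding the full CV and its known details."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cvs (cv_id TEXT PRIMARY KEY, name TEXT, location TEXT, full_text TEXT)"
    )
    return conn


def resolve_matches(conn: sqlite3.Connection, matches: List[Dict]) -> List[Dict]:
    """Follow each match's cv_id pointer back to the person's record."""
    resolved = []
    for match in matches:
        cv_id = match["metadata"]["cv_id"]
        row = conn.execute(
            "SELECT name, location FROM cvs WHERE cv_id = ?", (cv_id,)
        ).fetchone()
        if row is not None:
            resolved.append(
                {"chunk_id": match["id"], "score": match["score"],
                 "name": row[0], "location": row[1]}
            )
    return resolved


conn = make_store()
conn.execute(
    "INSERT INTO cvs VALUES (?, ?, ?, ?)",
    ("cv1", "Jane Doe", "San Diego", "full CV text goes here"),
)
# Stand-in for search results: only the ID, score, and a pointer live in the vector DB.
matches = [{"id": "cv1-p2", "score": 0.87, "metadata": {"cv_id": "cv1"}}]
people = resolve_matches(conn, matches)
print(people)
```

With this setup the bot can always be handed the candidate’s name alongside the matching chunk, regardless of which paragraph of the CV matched.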
You said that if I don’t need to do metadata filtering, I can keep the entire original CV content in a traditional data store. I understand, but then what do we have in Pinecone if everything is in the traditional data store? Just pointers?
Thank you very much in advance. It’s very helpful.
The only reason to put anything in Pinecone is to enable semantic similarity search, so the main things you keep in Pinecone are the embeddings of your content, which allow you to find it when you have some other piece of similar content (the “query”).
For storing non-embedded data, Pinecone can be fairly expensive compared to other non-vector DBs, so I’d suggest thinking through what you need to actually store in Pinecone to enable semantic search and what you can avoid storing there in order to save $$.