Inserting multiple "curriculum vitae" PDFs into Pinecone

sosdeveloppeur · November 2, 2023, 11:56am

Hi all,
I inserted multiple “curriculum vitae” PDF files into Pinecone. All of the CVs are for more or less the same job.
I created a bot, but when I query my Pinecone index, the bot doesn’t answer properly. It knows some information from the CVs but not, for example, the name of the person.

I guess the problem is perhaps in the way I insert the data into Pinecone?

What do you think?

Thank you for your help

silas · November 2, 2023, 4:18pm

Hi @sosdeveloppeur,

Semantic search is going to find the items in your DB that are the most similar (in meaning) to the query, so it’s important to understand how your data will be queried in order to plan how your data should be indexed to support those queries.

You mentioned that all of your CVs are relevant to the same job, so likely they do contain somewhat similar info already, for example they might list similar skills, etc.

But, it’s unclear whether the vector search is under-performing or whether it’s the bot that’s using the search results that is under-performing.

Some questions to think about:

What type of queries do you need to support? Just look at the performance of the vector search on those types of queries. Are they returning the search results you expect? Once you’re happy with how the vector search is performing, then you can move on to consider the functionality of the bot itself, like the quality of the answers it’s giving.
Are you “chunking” the CVs, i.e. breaking them up into smaller pieces? If you only create one vector embedding for an entire CV, it’s very unlikely to perform well against your queries, which might only relate to a small part of the CV.
What additional content (metadata) are you storing associated with each CV? You mentioned it doesn’t know the name of the person. Why would you expect it to know the name of the person unless you’re storing that detail in a way that it could reliably retrieve it? Without knowing the details of your implementation it’s hard to guess at what could be going on here.
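
To look at the vector search on its own, before judging the bot, a small hit-rate check can help. The sketch below is only an illustration: `search_fn` is a hypothetical stand-in for whatever function queries your index and returns CV ids, and the queries and ids are made up.

```python
from typing import Callable, Dict, List


def hit_rate(search_fn: Callable[[str, int], List[str]],
             cases: Dict[str, str],
             k: int = 3) -> float:
    """Fraction of queries whose expected CV id appears in the top-k results."""
    if not cases:
        return 0.0
    hits = 0
    for query, expected_id in cases.items():
        if expected_id in search_fn(query, k):
            hits += 1
    return hits / len(cases)


# Stand-in for a real vector search, used only to make the sketch runnable.
def fake_search(query: str, k: int) -> List[str]:
    canned = {
        "Java developer with 5 years of experience": ["cv-002", "cv-007", "cv-001"],
        "Project manager who speaks German": ["cv-004", "cv-009", "cv-003"],
    }
    return canned.get(query, [])[:k]


# Hypothetical test set: query text mapped to the CV you expect to come back.
cases = {
    "Java developer with 5 years of experience": "cv-002",
    "Project manager who speaks German": "cv-003",
}
score_top3 = hit_rate(fake_search, cases, k=3)
score_top1 = hit_rate(fake_search, cases, k=1)
```

If the score is low for the kinds of queries you care about, the problem is in indexing or retrieval rather than in the bot.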

Hope this helps,
Silas

sosdeveloppeur · November 2, 2023, 4:45pm

Thank you @silas.

My goal is to create a bot to help a recruitment agency. The knowledge base consists only of CVs. I only added the PDF files (CVs) with OpenAI embeddings.

The bot should answer questions like “Do you have a Java software developer with a minimum of 5 years of experience?”.

You’re right that the bot can’t know the name of the person: the name appears in the CV, but nothing says “name = John Doe”.
You’re right again that I should chunk the CVs into small pieces, perhaps 30 words!?

What is your recommendation for inserting the data into Pinecone? I guess that I have to transform the data, but I don’t know which approach I should take.

My idea is:

Extract the text from the PDF file (CV)
Add some information to the text like “name = John Doe”
Chunk it into small pieces of 30 words
Send the chunks to Pinecone

Thank you very much for your help

silas · November 2, 2023, 5:02pm

Knowing that the content consists entirely of CVs, I would suggest chunking by paragraph, as each paragraph in a CV is likely to be topical and is unlikely to be “too large”, so the semantic search should perform well.
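
As a rough sketch of paragraph chunking, assuming the text extracted from the PDF keeps blank lines between paragraphs (which depends on your extraction tool): the function name `chunk_by_paragraph`, the `min_chars` threshold, and the sample CV are all made up for illustration.

```python
import re
from typing import List


def chunk_by_paragraph(text: str, min_chars: int = 20) -> List[str]:
    """Split on blank lines; merge very short pieces (e.g. a lone name or
    heading) into the paragraph that follows them."""
    pieces = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    carry = ""
    for piece in pieces:
        candidate = f"{carry}\n{piece}" if carry else piece
        if len(candidate) < min_chars:
            carry = candidate  # too short to stand alone, keep accumulating
        else:
            chunks.append(candidate)
            carry = ""
    if carry:  # trailing short piece with nothing after it
        chunks.append(carry)
    return chunks


cv_text = """John Doe

Experience
Senior Java developer at Acme, 2017 to 2023. Built payment services.

Education
BSc in Computer Science, 2014.
"""
chunks = chunk_by_paragraph(cv_text)
```

Each resulting chunk would then be embedded and stored as its own vector.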

You can associate whatever known data you have about each CV (name, location, job history, etc.; basically whatever info would normally come through during the application process) with its chunks. You could put this associated data into Pinecone as metadata, but there’s really no reason you have to put it there.

The main reason that you might want to include the additional metadata is if you want to do metadata filtering on the search results, for example if you wanted to limit your search to only include those CVs where the person is located in San Diego.
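
To make the San Diego example concrete, here is a minimal sketch. The records, field names, and ids are hypothetical. The filter dict uses the `$eq` operator from Pinecone's documented metadata filter syntax; in a real query it would be passed as the `filter` argument and applied by Pinecone, so the local `matches` function below is only a stand-in to show what the filter selects.

```python
from typing import Any, Dict, List

# Hypothetical per-chunk records: an id plus metadata known from the application.
records: List[Dict[str, Any]] = [
    {"id": "cv-001#0", "metadata": {"name": "John Doe", "city": "San Diego"}},
    {"id": "cv-002#0", "metadata": {"name": "Jane Roe", "city": "Boston"}},
    {"id": "cv-003#0", "metadata": {"name": "Sam Poe", "city": "San Diego"}},
]

# Only consider CVs where the person is located in San Diego.
metadata_filter = {"city": {"$eq": "San Diego"}}


def matches(metadata: Dict[str, Any], flt: Dict[str, Dict[str, Any]]) -> bool:
    """Local stand-in that understands only the $eq operator."""
    for field, condition in flt.items():
        if metadata.get(field) != condition.get("$eq"):
            return False
    return True


selected = [r["id"] for r in records if matches(r["metadata"], metadata_filter)]
```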

If you don’t need to do metadata filtering, then you could keep this associated data (and probably the entire original CV content) available in a more traditional data store, and you could just keep a pointer to the data in Pinecone so that when you’re processing a search result, you can use the pointer to retrieve the associated content.
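
A minimal sketch of the pointer approach, assuming each vector carries a `cv_id` in its metadata. The dict `cv_store` stands in for the traditional data store, and the shape of `search_matches` is a made-up example of search results, not an exact copy of what any client returns.

```python
from typing import Any, Dict, List

# Stand-in for a traditional data store (SQL table, document DB, etc.).
cv_store: Dict[str, Dict[str, str]] = {
    "cv-001": {"name": "John Doe", "full_text": "John Doe. Senior Java developer."},
    "cv-002": {"name": "Jane Roe", "full_text": "Jane Roe. Java team lead."},
}

# Hypothetical search results: chunk ids, scores, and a pointer back to the CV.
search_matches: List[Dict[str, Any]] = [
    {"id": "cv-002#1", "score": 0.91, "metadata": {"cv_id": "cv-002"}},
    {"id": "cv-001#3", "score": 0.88, "metadata": {"cv_id": "cv-001"}},
    {"id": "cv-002#4", "score": 0.85, "metadata": {"cv_id": "cv-002"}},
]


def resolve(matches: List[Dict[str, Any]],
            store: Dict[str, Dict[str, str]]) -> List[Dict[str, str]]:
    """Follow each match's pointer to the full CV record, once per CV,
    keeping the order in which the CVs first appear in the results."""
    seen = set()
    results = []
    for match in matches:
        cv_id = match["metadata"]["cv_id"]
        if cv_id in seen or cv_id not in store:
            continue
        seen.add(cv_id)
        results.append(store[cv_id])
    return results


names = [cv["name"] for cv in resolve(search_matches, cv_store)]
```

This is also one way to get the person's name to the bot: it comes from the stored record, not from the embedded chunk.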

sosdeveloppeur · November 2, 2023, 5:40pm

Thank you very much @silas. It’s much clearer now.

sosdeveloppeur · November 3, 2023, 5:41pm

@silas, may I ask you another question?

You said that if I don’t need to do metadata filtering, I can keep the entire original CV content in a traditional data store. I understand, but what do we keep in Pinecone if everything is in the traditional data store? Just pointers?

Thank you very much in advance. It’s very helpful.

silas · November 3, 2023, 5:46pm

The only reason to put anything in Pinecone is to enable semantic similarity search, so the main thing you keep in Pinecone is the embeddings of your content, which allow you to find it when you have some other piece of similar content (the “query”).
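
As a toy illustration of what the embeddings are for, the sketch below ranks stored vectors by cosine similarity to a query vector. The 3-dimensional vectors and ids are invented (real embeddings have many more dimensions), and cosine is just one of the similarity metrics an index can use.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Invented chunk embeddings keyed by chunk id.
stored: Dict[str, List[float]] = {
    "cv-001#1": [0.9, 0.1, 0.0],
    "cv-002#0": [0.1, 0.9, 0.1],
    "cv-003#2": [0.7, 0.3, 0.1],
}


def top_k(query: List[float], vectors: Dict[str, List[float]], k: int = 2) -> List[str]:
    """Ids of the k stored vectors most similar to the query."""
    ranked = sorted(vectors, key=lambda cid: cosine(query, vectors[cid]), reverse=True)
    return ranked[:k]


query_vector = [1.0, 0.0, 0.0]  # stands in for the embedded query text
nearest = top_k(query_vector, stored)
```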

For storing non-embedded data, Pinecone can be fairly expensive compared to other non-vector DBs, so I’d suggest thinking through what you need to actually store in Pinecone to enable semantic search and what you can avoid storing there in order to save $$.

sosdeveloppeur · November 3, 2023, 5:59pm

Thank you, understood