I am trying to use RAG for resume and job description matching. I have chunked the resumes, and the embeddings are stored in a namespace ‘Resume’ in my Pinecone index. Is there a way to chunk the input JD, store its embeddings in a different namespace, and then run a similarity search to retrieve the most similar resumes? My concern is that, because there are different chunks, a chunk of the JD might get matched to an irrelevant resume. I don't know how to go about this. Please share any ideas to implement this. Anything would be appreciated.
Hi @sandraann1294, welcome to the Pinecone forums and thanks for your question!
Thinking out loud here - it sounds like your end goal is to match resumes to Job Reqs based on their similarity. Do I have that correct? If so, here’s an approach you can try:
- Namespace Setup: Continue using the ‘Resume’ namespace for storing chunks of resume embeddings. Create a separate namespace, perhaps named ‘JobDescription’, for the embeddings of JD chunks. This segregation will help you manage and search within specific datasets easily.
- Chunking and Embedding JDs: When you chunk the JDs, keep the chunk size close to what you used for the resumes so that each chunk carries a comparable amount of information. Embed the JD chunks with the same model you used for the resumes; vectors produced by different models don't live in the same space, so similarity scores between them wouldn't be meaningful. (The first sketch below shows one way to wire this up.)
- Similarity Search Setup: After storing the embeddings in the respective namespaces, you can perform a similarity search by querying each JD chunk against the ‘Resume’ namespace. To handle the issue of a JD chunk potentially matching an irrelevant resume chunk, consider the following approaches:
- Aggregate Scoring: Instead of judging similarity chunk by chunk, aggregate the similarity scores that all of a JD's chunks produce against each resume. You can average the scores, or take the maximum similarity any chunk of a resume achieves as that resume's representative score. (The second sketch below shows a max-score version of this.)
- Semantic Search Enhancements: Implement more advanced semantic search techniques if you find simple vector similarity lacking. Techniques like contextual embeddings from models fine-tuned on job/resume data can provide more meaningful similarity measures.
- Filtering and Ranking: Once you have the aggregate scores, sort the resumes by relevance and present the top matches. You can also filter on non-embedding criteria (like location, experience level, etc.) to refine the results further. Have a look at our metadata filtering functionality to see if it can help you build a more nuanced search; the third sketch below shows what a filtered query looks like.
- Feedback Loop: Incorporate a feedback mechanism where the results of the matches are evaluated by end-users. Use this feedback to continuously tune the chunking, embedding, and matching algorithms.
This approach ensures that even if individual chunks might not find their perfect counterpart, the overall matching process considers the cumulative evidence across multiple chunks to make a more informed and relevant connection between JDs and resumes.
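To make the namespace and chunking steps concrete, here's a rough sketch in Python. The index name, the `chunk_text`/`embed` helpers, and the choice of OpenAI's `text-embedding-3-small` are placeholder assumptions for illustration; swap in whatever embedding pipeline you already used for the resumes.

```python
# A minimal sketch, assuming an OpenAI embedding model and placeholder
# index / document names. The key points: both document types go through
# the same chunker and the same model, they land in separate namespaces,
# and each chunk carries a doc_id so it can be traced back to its parent.
from openai import OpenAI
from pinecone import Pinecone

oai = OpenAI()                        # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("your-index-name")   # hypothetical index name

def chunk_text(text: str, words_per_chunk: int = 150) -> list[str]:
    """Naive fixed-size word chunking; match whatever you did for the resumes."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def embed(texts: list[str]) -> list[list[float]]:
    resp = oai.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def upsert_document(doc_id: str, text: str, namespace: str) -> None:
    chunks = chunk_text(text)
    vectors = [
        {
            "id": f"{doc_id}#chunk{i}",
            "values": values,
            # Keep the parent document id so chunk matches can be grouped later.
            "metadata": {"doc_id": doc_id, "chunk": i, "text": chunk},
        }
        for i, (chunk, values) in enumerate(zip(chunks, embed(chunks)))
    ]
    index.upsert(vectors=vectors, namespace=namespace)

# Resumes go into the 'Resume' namespace, JDs into 'JobDescription'.
upsert_document("resume-001", "...resume text...", namespace="Resume")
upsert_document("jd-001", "...job description text...", namespace="JobDescription")
```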
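And here's a sketch of the per-chunk querying plus aggregate scoring, reusing the hypothetical `chunk_text`/`embed` helpers and `index` handle from the first sketch:

```python
# Query each JD chunk against the 'Resume' namespace and keep, for each
# resume, the best score any of its chunks achieved against any JD chunk
# ("max" aggregation). Averaging the scores per resume is an alternative.
def match_resumes(jd_text: str, top_k_per_chunk: int = 10) -> list[tuple[str, float]]:
    jd_chunks = chunk_text(jd_text)
    jd_vectors = embed(jd_chunks)

    best_score: dict[str, float] = {}
    for vector in jd_vectors:
        result = index.query(
            vector=vector,
            top_k=top_k_per_chunk,
            namespace="Resume",
            include_metadata=True,
        )
        for match in result.matches:
            resume_id = match.metadata["doc_id"]
            prev = best_score.get(resume_id)
            best_score[resume_id] = match.score if prev is None else max(prev, match.score)

    # Rank resumes by their aggregate score, best first.
    return sorted(best_score.items(), key=lambda kv: kv[1], reverse=True)

ranked = match_resumes("...job description text...")
print(ranked[:5])  # top 5 candidate resumes and their scores
```

Taking the max per resume rewards a strong match on any one requirement, while averaging rewards broad coverage of the JD, so it's worth trying both and seeing which ranking looks better on your data.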
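Finally, here's roughly what a metadata-filtered query would look like. The `location` and `years_experience` fields are made up for this example; filter on whatever metadata you actually store with your resume chunks.

```python
# Restrict the similarity search to resume chunks whose metadata matches
# the filter, using Pinecone's metadata filter operators ($eq, $gte, ...).
jd_vector = embed(["...one JD chunk..."])[0]

result = index.query(
    vector=jd_vector,
    top_k=10,
    namespace="Resume",
    include_metadata=True,
    filter={
        "location": {"$eq": "New York"},
        "years_experience": {"$gte": 3},
    },
)
```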
Here are a couple of other thoughts / pointers to some resources:
- We have an official guide to using namespaces here for some examples, and we just published a new chapter in our Vector DBs for Busy Engineers Series that focuses on multi-tenancy with Pinecone, which may be of interest as well.
- We have a number of open-source example Notebooks, such as this one demonstrating similarity search with Pinecone. If you start from the learn directory you’ll find a bunch of different examples - hopefully you’re able to lift some useful patterns from here.