My PDF RAG app isn't returning the correct documents for a query. What might be the reason?

Hello everyone,
I’m currently developing a PDF RAG app and running into a problem.

Here’s my app’s workflow.

A user uploads a PDF and clicks ‘Process’.
I use pymupdf4llm as the PDF parser. It stores all the textual data of the PDF as a single string and saves all the images from the PDF into a separate folder.
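Roughly, the parsing step looks like this (a simplified sketch; the file name and image folder are placeholders):

```python
# Sketch: extract PDF text as one Markdown string and dump images to a folder.
import pymupdf4llm

md_text = pymupdf4llm.to_markdown(
    "document.pdf",          # placeholder path to the uploaded PDF
    write_images=True,       # save embedded images to disk
    image_path="pdf_images", # folder where the images are written
)
# md_text now holds the whole PDF's textual content as a single string
```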

Then I use semantic chunking to split the PDF text stored in that string.
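For context, I’m doing the chunking roughly like this (a sketch using LangChain’s experimental SemanticChunker; `md_text` is the parsed text from the previous step):

```python
# Sketch: semantic chunking of the parsed PDF text.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(OpenAIEmbeddings(model="text-embedding-3-large"))
text_chunks = splitter.create_documents([md_text])  # list of Document chunks
```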

After this, I create summaries of the text chunks and the PDF images.
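The text-chunk summarization is roughly this (a sketch; the prompt wording is illustrative, and the image summaries follow the same pattern with a vision-capable prompt):

```python
# Sketch: summarize each text chunk with gpt-4o-mini for indexing.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template(
    "Summarize the following passage so it can be matched to user questions:\n\n{chunk}"
)
summarize = prompt | llm | StrOutputParser()

chunk_summaries = [
    summarize.invoke({"chunk": c.page_content}) for c in text_chunks
]
```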

I store both sets of summaries (text and image) in Pinecone, and the actual images and text chunks (generated via semantic chunking) in a MongoDB doc store.

For retrieval, I use LangChain’s MultiVectorRetriever.
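Put together, the indexing and retrieval side looks roughly like this (a sketch: the index name is a placeholder, and I’m showing an in-memory doc store in place of the MongoDB-backed one):

```python
# Sketch: summaries go to Pinecone, full chunks go to the doc store,
# linked by a shared id so retrieval swaps summary -> original chunk.
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore  # stand-in for the MongoDB doc store
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

id_key = "doc_id"
vectorstore = PineconeVectorStore(
    index_name="pdf-rag",  # placeholder index name
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
)
docstore = InMemoryStore()

retriever = MultiVectorRetriever(
    vectorstore=vectorstore, docstore=docstore, id_key=id_key
)

doc_ids = [str(uuid.uuid4()) for _ in text_chunks]
summary_docs = [
    Document(page_content=summary, metadata={id_key: doc_ids[i]})
    for i, summary in enumerate(chunk_summaries)
]
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, text_chunks)))

# At question time:
relevant_chunks = retriever.invoke("user question goes here")
```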

When a user uploads a PDF, processes it, and asks questions, the documents that Pinecone returns are often not even relevant.

What might be the reason?

I’m using gpt-4o-mini as the LLM and text-embedding-3-large as the embedding model.

Is this happening because of the “curse of dimensionality”?

While debugging, I came across this in the Pinecone docs:

In fact, in some cases, a short document may actually show higher in a vector space for a given query, even if it is not as relevant as a longer document. This is because short documents typically have fewer words, which means that their word vectors are more likely to be closer to the query vector in the high-dimensional space. As a result, they may have a higher cosine similarity score than longer documents, even if they do not contain as much information or context. This phenomenon is known as the “curse of dimensionality” and it can affect the performance of vector semantic similarity search in certain scenarios.

Reference: Differences between Lexical and Semantic Search regarding relevancy - Pinecone Docs

Because I use semantic chunking as the document chunking method, some of my text chunks are really small (some are only 5-7 words). Going by the quote above, it looks like the “curse of dimensionality” really could be the cause.

What do you think: is the “curse of dimensionality” really the reason in my case?

How can I resolve this issue? Should I reduce the number of dimensions when creating and storing vectors from text-embedding-3-large’s default of 3072 to 1024 or so?
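If reducing dimensions is the fix, I assume it would just be this (a sketch, assuming the langchain_openai wrapper):

```python
# Sketch: request truncated embeddings instead of the default 3072 dimensions.
from langchain_openai import OpenAIEmbeddings

embeddings_1024 = OpenAIEmbeddings(
    model="text-embedding-3-large",
    dimensions=1024,  # truncate from the default 3072
)
```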


Hi @gauravmindzk and thanks for reaching out!

In short, no, I don’t think reducing the dimensionality is going to help here, since the issue is more likely related to your small chunk sizes. Small chunks make it hard for the embedding model to capture enough semantic meaning to match the chunks against your questions accurately.

I would suggest increasing the chunk size. You might first try a simpler chunking strategy, like paragraph boundaries or ~500 tokens per chunk. Semantic chunking is best when the “meaning” of the text does not fit neatly into an obvious boundary. In your case, I’m not sure why it’s producing such small chunks, but that’s also something we could look into. You can read more about common chunking strategies here: Chunking Strategies for LLM Applications | Pinecone
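As a starting point, something like this size-based splitter would do (a rough sketch; the chunk size and overlap are illustrative, and `md_text` is your parsed PDF string):

```python
# Sketch: simple token-based chunking as an alternative to semantic chunking.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500,    # ~500 tokens per chunk
    chunk_overlap=50,  # small overlap to preserve context across boundaries
)
text_chunks = splitter.create_documents([md_text])
```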

Another approach (in addition to the above) could be to create “sparse” embeddings alongside your dense ones. These match exact keywords rather than semantic meaning and can be a great way to augment your search when users are looking for specific terms. You can read more about our latest developments in this area here: Introducing cascading retrieval: Unifying dense and sparse with reranking | Pinecone
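A minimal sketch of what that could look like with the pinecone-text BM25 encoder (the query string is just an example, and this assumes an index that supports hybrid dense + sparse queries):

```python
# Sketch: build sparse keyword vectors to pair with your dense embeddings.
from pinecone_text.sparse import BM25Encoder

bm25 = BM25Encoder()
bm25.fit([c.page_content for c in text_chunks])  # fit on your own corpus

sparse_doc_vectors = bm25.encode_documents([c.page_content for c in text_chunks])
sparse_query = bm25.encode_queries("example user question with specific terms")

# When querying a hybrid-enabled index, pass both representations, e.g.:
# index.query(vector=dense_query, sparse_vector=sparse_query, top_k=5)
```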
