Retrieval of irrelevant images for every query

Hey guys!
I am working on a multimodal RAG for complex PDFs, but I am facing a problem: irrelevant images are retrieved for every query.

Problem: when I ask a query that does not require any image in the answer, the model sometimes returns random images (from the uploaded PDF) for those queries. I checked the LangSmith traces: this happens when documents with images are retrieved from the Pinecone vectorstore, and the model doesn't ignore that context but displays the images anyway.

This happens even for a simple query such as "Hello". For this query, I expect only "Hello! How can I assist you today?" as the answer, but it also returns some images from the uploaded documents along with the answer.

Architecture:

For text and tables: summaries are embedded and stored in the vector database, while the original chunks are stored in a MongoDBStore. The two are linked using a doc_id.

For images: summaries are embedded and stored in the vector database, while the original image chunks (i.e. images in base64 format) are stored in the MongoDBStore, also linked using doc_id.
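
Roughly, the indexing is wired like this (a simplified sketch; original_chunks and summaries stand in for my actual parsed content):

import uuid
from langchain_core.documents import Document

id_key = "doc_id"

# One doc_id per original chunk: the summary goes to Pinecone,
# the original content (text / table / base64 image) goes to MongoDB.
doc_ids = [str(uuid.uuid4()) for _ in original_chunks]
summary_docs = [
    Document(page_content=summary, metadata={id_key: doc_ids[i]})
    for i, summary in enumerate(summaries)
]
vectorStore.add_documents(summary_docs)  # summary embeddings in Pinecone
docstore.mset(list(zip(doc_ids, original_chunks)))  # raw chunks in MongoDB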

For retrieval, I'm using LangChain's MultiVectorRetriever:

from langchain.retrievers.multi_vector import MultiVectorRetriever

# Summaries are searched in the vectorstore; the matching originals
# are fetched from the docstore via id_key.
retriever = MultiVectorRetriever(
    vectorstore=vectorStore,
    docstore=docstore,
    id_key=id_key,
)
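
At query time the retriever searches the summary embeddings and then returns the linked original chunks from MongoDB, so base64 image chunks can land in the context even when the query has nothing to do with them:

# The retriever returns ORIGINAL chunks (including base64 images),
# not the summaries, so whatever matches ends up in the model's context.
docs = retriever.invoke("Hello")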

Hi shikhar.crpf, welcome to the forum!

It sounds like some abstraction in the LangChain MultiVectorRetriever is querying the vector DB for images even when you don't need them for a given query.

This would explain why irrelevant images are being retrieved. Your system is querying the DBs on every request and simply returning whatever is most "relevant," even when the most relevant thing isn't actually relevant at all!
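
As a quick stopgap, you could inspect the raw similarity scores before any documents reach the model and drop weak matches. A rough sketch at the vectorstore level (the cutoff is illustrative, and its meaning depends on your index's similarity metric, so you'd need to tune it):

# Hypothetical pre-filter: only keep matches above a tuned score cutoff.
# similarity_search_with_score returns (Document, score) pairs.
results = vectorStore.similarity_search_with_score(query, k=4)
relevant = [doc for doc, score in results if score >= 0.75]  # 0.75 is illustrative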

You may need to implement query routing/classification to determine which queries need which vector DBs, or whether to query at all (see the sketch below). Some users build agentic workflows to handle this, too.
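
A minimal version of that routing could be a single LLM call (or even a keyword heuristic) that decides whether to hit the retriever at all. A sketch, assuming an OpenAI-style chat client; the model name and prompt are illustrative:

from openai import OpenAI

client = OpenAI()

def needs_retrieval(query: str) -> bool:
    # Ask a small model whether the query needs document context at all.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {
                "role": "system",
                "content": "Answer YES if the user's message requires "
                           "searching the uploaded documents (text, tables, "
                           "or images); otherwise answer NO.",
            },
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# Greetings like "Hello" would skip retrieval entirely.
docs = retriever.invoke(query) if needs_retrieval(query) else []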

Otherwise, it may be wise to ask this question in the LangChain forums, as they can help you identify what LangChain is doing to elicit this behavior.

Finally, if you are interested in implementing image-text embedding search, we did a fun spin on this with Anthropic at our webinar. We chopped up YouTube videos into images and text, preprocessed them with Claude, and then upserted them into Pinecone. That may help inspire your use case!

Watch here: https://www.youtube.com/watch?v=u-ocR-2P_YA

If you get the answer from LangChain, please feel free to update your post here to help others with similar issues.

Sincerely,
Arjun