We have a use case to process PDFs and PPTs that contain both images and text, and we use two different embedding models: openai/clip-vit-base-patch32 and text-embedding-3-small, which have different dimension sizes (512 and 1536, respectively). Currently these cannot be saved in a single index. I am curious what some alternative ways are of processing this data at scale (1 GB+ of data).
There is no way to manually check the files for images and/or text, so we use a step in processing to choose the right embedding model based on the contents of each file. I'd appreciate any best practices or anything that has worked in production use cases.
Hi kavita1, welcome to the forum!
Funny enough, we covered this exact use case in a recent webinar with Anthropic on chatting with PDFs built from recorded webinars, using Pinecone, contextual retrieval with Claude, and AWS Bedrock.
Check that out here, and the repo associated with that workflow here. You may find that extremely helpful!
It looks like you are using a CLIP-style model, which can accept both image and text input and embed them into the same vector space. Why did you decide to embed the images and text separately? If you use CLIP for both the text and the images, you can get away with one index instead of two and store everything in the same vector DB (see the sketch below).
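Here's a minimal sketch of what that could look like with Hugging Face transformers. The text snippet and image path are just placeholders, not anything from your pipeline; the point is that both modalities land in the same 512-dimensional space:

```python
# A sketch under assumptions: the text and file name below are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical inputs: a text chunk from a PDF and a rendered slide image.
texts = ["Q3 revenue grew 12% year over year"]
image = Image.open("slide_01.png")  # placeholder path

text_inputs = processor(text=texts, return_tensors="pt",
                        padding=True, truncation=True)
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_vecs = model.get_text_features(**text_inputs)     # shape (1, 512)
    image_vecs = model.get_image_features(**image_inputs)  # shape (1, 512)

# Normalize so cosine similarity is comparable across modalities.
text_vecs = torch.nn.functional.normalize(text_vecs, dim=-1)
image_vecs = torch.nn.functional.normalize(image_vecs, dim=-1)
```

Since both vectors share one space, a single 512-dimensional index can hold them, and a text query can retrieve image chunks and vice versa.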
Or, consider using a classifier trained to look for images within PDFs, and use it to route each file to the appropriate model.
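You don't necessarily need a trained classifier to start. A simple structural heuristic can do the routing; this sketch assumes PyMuPDF (my choice of library, not something from your stack) and just checks whether any page embeds an image:

```python
import fitz  # PyMuPDF, an assumed dependency for this sketch

def route_file(pdf_path: str) -> str:
    """Route a PDF to an embedding pipeline with a structural heuristic:
    if any page embeds an image, use the multimodal (CLIP) path;
    otherwise the text-only model is enough."""
    doc = fitz.open(pdf_path)
    has_images = any(page.get_images(full=True) for page in doc)
    doc.close()
    return "clip-vit-base-patch32" if has_images else "text-embedding-3-small"
```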
Finally, you could just turn ALL the content into images by screenshotting the PDF pages, kind of like we did in the video above!
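A hedged sketch of that page-rendering step, again assuming PyMuPDF; every page becomes an image, so everything can go through CLIP alone:

```python
import io

import fitz  # PyMuPDF
from PIL import Image

def pdf_pages_to_images(pdf_path: str) -> list[Image.Image]:
    """Render every page of a PDF to a PIL image, so the whole document
    can go through a single image-capable embedding model."""
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        # 2x zoom (~144 dpi) keeps small text more legible than the default.
        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
        pages.append(Image.open(io.BytesIO(pix.tobytes("png"))))
    doc.close()
    return pages
```

One thing to watch: clip-vit-base-patch32 downscales inputs to 224x224, so dense text on a full page can lose legibility. Worth testing retrieval quality on your data before committing, especially at 1 GB+.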
Claude/Anthropic has guidance on how to do this specifically with PowerPoints.
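PPTs don't render as directly as PDFs in Python. One common route, which is my own assumption here rather than what the Anthropic guidance specifically recommends, is to convert slides to PDF with headless LibreOffice (assuming it's installed) and then reuse the page-rendering step above:

```python
import subprocess
from pathlib import Path

def pptx_to_pdf(pptx_path: str, out_dir: str = "converted") -> Path:
    """Convert a PPT/PPTX to PDF with headless LibreOffice so its slides
    can be rendered to images via the same code path as PDFs."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         pptx_path, "--outdir", out_dir],
        check=True,
    )
    return Path(out_dir) / (Path(pptx_path).stem + ".pdf")
```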
Here’s the model page on HF, which shows how to do this.
Hope this helps!
-Arjun
@arjun Thank you for the webinar link covering the use case. Our initial thought was that our text embedding model works perfectly, so why change it; hence we introduced a new embedding model only for images. We will rethink that in the near future as we scale.
Sounds great, and good luck! Feel free to follow up with more questions later.