Vector Index Dimensions for text and text + image data

We have a use case to process PDFs and PPTs that contain both images and text, and we use two different embedding models: openai/clip-vit-base-patch32 and text-embedding-3-small, which produce different dimensions (512 and 1536, respectively). Because an index has a fixed vector dimension, these embeddings currently cannot be stored in a single index. What are some alternative ways of processing this data at scale (1 GB+ of data)?
There is no way to manually check the files for images and/or text, so we use a processing step that chooses the right embedding model based on the contents of each file. I'd appreciate any best practices or anything that has worked in production use cases.
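For context, here is a minimal sketch of the routing pattern described above: content from each file is split by modality, embedded with the matching model, and written to a separate index per dimension, with IDs that link everything back to the source document. The `embed_text` and `embed_image` functions are placeholders standing in for text-embedding-3-small (1536-d) and CLIP (512-d) calls, and `route_document`, `text_index`, and `image_index` are hypothetical names, not any specific library's API.

```python
# Sketch: route each modality of a document to its own fixed-dimension index.
# embed_text / embed_image are PLACEHOLDERS for the real model calls
# (text-embedding-3-small -> 1536 dims, clip-vit-base-patch32 -> 512 dims).

TEXT_DIM = 1536   # text-embedding-3-small output size
IMAGE_DIM = 512   # clip-vit-base-patch32 output size


def embed_text(chunk: str) -> list[float]:
    # Placeholder: real code would call the text embedding API here.
    return [0.0] * TEXT_DIM


def embed_image(image_bytes: bytes) -> list[float]:
    # Placeholder: real code would run the CLIP image encoder here.
    return [0.0] * IMAGE_DIM


def route_document(doc_id, text_chunks, images, text_index, image_index):
    """Write each modality to its own index, keyed back to the source doc."""
    for i, chunk in enumerate(text_chunks):
        text_index.append({"id": f"{doc_id}:text:{i}",
                           "vector": embed_text(chunk)})
    for i, img in enumerate(images):
        image_index.append({"id": f"{doc_id}:img:{i}",
                            "vector": embed_image(img)})


# Usage: one deck with one text chunk and one image lands in two indexes.
text_index, image_index = [], []
route_document("deck1", ["slide 1 text"], [b"<png bytes>"],
               text_index, image_index)
```

At query time you would search both indexes separately and merge results by `doc_id` (e.g. with reciprocal rank fusion), since scores from different embedding spaces are not directly comparable.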