Looking for best practices for multimodal embeddings for semantic search

Hi, great to meet you! I’m Myles, and I’m marching and moving!

The journey started over 18 months ago:

I have now completed inference summaries for a series of vocational training videos, and I want to create a chatbot that uses these summaries to let users draw on a rich (and growing) corpus of proprietary knowledge built up over three decades in the wheelchair custom seating vertical.

I don’t feel ready for model fine-tuning yet, mostly due to the risks of catastrophic forgetting and over/underfitting. I have moved beyond the limits of prompt engineering and am looking for a way to empower a user base to access this proprietary data set in a natural and conversational manner.

I hope to build multimodal embeddings using CLIP, combining structured textual summary data with video keyframes so that users can run both textual and visual queries against the dataset.
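For concreteness, here is roughly what I am picturing for the indexing side. This is an untested sketch, not a worked-out pipeline: the index name, file paths, and metadata fields are placeholders, and it assumes the sentence-transformers CLIP wrapper plus the Pinecone Python client.

```python
# Untested sketch: embed each keyframe with CLIP and carry the matching
# summary block and timestamps as metadata. Index name, paths, and field
# names are placeholders.
from PIL import Image
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone

model = SentenceTransformer("clip-ViT-B-32")            # CLIP image + text encoder, 512-dim
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("seating-training")                    # placeholder index (512 dims, cosine)

keyframe = Image.open("video_012/keyframe_00451.jpg")   # placeholder path
summary = "Adjusting lateral trunk supports on a tilt-in-space frame."

index.upsert(vectors=[{
    "id": "video_012_frame_00451",
    "values": model.encode(keyframe).tolist(),
    "metadata": {
        "video_id": "video_012",
        "timestamp_s": 451,
        "summary": summary,
        "skills": ["lateral support adjustment"],
    },
}])
```

I could encode the summary blocks with the same model so that text-only records share the index, though I am aware CLIP’s text encoder truncates at roughly 77 tokens, so long summaries may need chunking or a separate text model. Guidance on whether to keep text and image vectors in one index, separate namespaces, or separate indexes would be appreciated.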

I didn’t see any obvious examples in the pinecone-io/examples repository (Jupyter Notebooks to help you get hands-on with Pinecone vector databases), so I’m asking here whether anybody has done this before. If not, I’ll be happy to do so and contribute the knowledge back. Just looking for hints and best practices here.

I have granular structured data summaries for every video in a particular series, each with an AI-generated description that combines multimodal inputs (AI-transcribed spoken English and keyframes) with the applicable timestamps and a list of skills for each summary block. E-learning content distilled from each video is also available, with timestamps attached to each question. Series-level e-learning and summary content is aggregated and will be a candidate for embedding as well. I also produce metadata (keywords, products referenced, skills demonstrated, and search terms) on a per-video and per-series basis.
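To make the shape of that data concrete, a single summary block currently looks roughly like this (field names and values are illustrative for this post, not my real schema); I expect most of it to become Pinecone metadata so results can be filtered and traced back to a timestamp:

```python
# Illustrative shape of one summary block; field names and values are invented.
summary_block = {
    "series_id": "series_04",
    "video_id": "series_04_video_012",
    "block_start_s": 430,
    "block_end_s": 512,
    "description": "AI-generated description combining transcript and keyframes ...",
    "skills": ["lateral support adjustment", "seat depth measurement"],
    "keywords": ["tilt-in-space", "trunk support"],
    "products": ["<product referenced>"],
    "search_terms": ["how to adjust lateral supports"],
    "language": "en",
    "elearning": [{"question": "...", "timestamp_s": 447}],
}
```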

All textual summary content is made available in several other languages that are common in the predicted user base.
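Since stock CLIP’s text encoder is trained mostly on English, I am also considering a multilingual text encoder that is aligned to the CLIP image space, such as sentence-transformers’ clip-ViT-B-32-multilingual-v1, so translated summaries and non-English queries can hit the same keyframe vectors. Another untested sketch:

```python
# Untested sketch: multilingual text queries against the same CLIP image space.
# clip-ViT-B-32-multilingual-v1 maps text in many languages into the
# clip-ViT-B-32 embedding space (512 dims).
from sentence_transformers import SentenceTransformer

image_model = SentenceTransformer("clip-ViT-B-32")                  # keyframes
text_model = SentenceTransformer("clip-ViT-B-32-multilingual-v1")   # multilingual text

query_vec = text_model.encode("ajuste de los soportes laterales del tronco").tolist()
# query_vec can be sent to the same index that holds the keyframe embeddings
```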

Thanks!

Hi Myles and welcome to the Pinecone forum!

Fantastic question and really cool project! You are in luck. We just completed a webinar with Anthropic on a related topic: visual question answering using contextual retrieval over webinar videos.

In a sense, this may fit some of your requirements: that pipeline takes frames from the video, along with LLM-generated descriptions of those frames, and indexes them into our vector database so you can build a RAG workflow over them. In your case, the frames would be your video keyframes and the text would be your summaries.
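At a very high level, the indexing step of that route boils down to something like the sketch below. This is a simplified, untested illustration rather than the actual code from the repo; the embedding model, index name, and IDs are placeholders.

```python
# Simplified illustration of the frame-description route: an LLM writes a text
# description of each keyframe, and that description (not the raw pixels) is
# embedded and indexed. Model, index name, and IDs are placeholders and differ
# from the actual contextual-webinar-rag code.
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # any text embedding model
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("webinar-frames")

frame_description = (
    "Presenter stands next to a slide showing a retrieval pipeline diagram."
)  # produced by the LLM from the keyframe plus the surrounding transcript

index.upsert(vectors=[{
    "id": "webinar_01_frame_0210",
    "values": embedder.encode(frame_description).tolist(),
    "metadata": {"timestamp_s": 210, "description": frame_description},
}])
```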

The key difference between that example and yours is that we use Claude to describe the video frames, and then use Claude again to interpret the frames, the associated summaries, and the user’s query to assist the search, whereas you plan to use CLIP. Both approaches support visual question answering; the choice mostly depends on your input modalities and which embedding model you use, and Pinecone is agnostic to both.
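Whichever embedding route you pick, the retrieval side against Pinecone looks essentially the same. Here is a rough, untested sketch of a CLIP-style text query with an optional metadata filter; the index and field names are placeholders and would need to match whatever you indexed:

```python
# Rough, untested sketch of the query step: embed the user's question with the
# same CLIP text encoder used at indexing time, then query Pinecone with an
# optional metadata filter. Index and field names are placeholders.
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone

model = SentenceTransformer("clip-ViT-B-32")
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("seating-training")

question = "How do I set the seat-to-back angle on a tilt-in-space chair?"
results = index.query(
    vector=model.encode(question).tolist(),
    top_k=5,
    include_metadata=True,
    filter={"skills": {"$in": ["seat-to-back angle adjustment"]}},  # optional
)
for match in results.matches:
    print(match.score, match.metadata.get("summary"))
```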

Take a look at the webinar YouTube video here: https://www.youtube.com/watch?v=u-ocR-2P_YA

And the repo here: https://github.com/pinecone-io/contextual-webinar-rag (contextual RAG over webinar videos using Pinecone, Claude and AWS).

We also have a demo of multimodal search that indexes videos, images and text across a fashion dataset: https://www.youtube.com/watch?v=v5b-3-4NibI

Hope this helps, and please follow up with any additional questions you may have!

Sincerely,
Arjun