Hi, great to meet you, I’m Myles and I’m marching and moving!
The journey started over 18 months ago:
I have now completed inference summaries for a series of vocational training videos, and I want to create a chatbot that uses these summaries to let users pull from a rich (and growing) corpus of proprietary knowledge built up over three decades in the wheelchair custom seating vertical.
I don’t feel ready for model fine-tuning yet, mostly due to the risks of catastrophic forgetting and over/underfitting. I have moved beyond the limits of prompt engineering and am looking for a way to empower a user base to access this proprietary data set in a natural and conversational manner.
I hope to do multimodal embeddings using CLIP in which I combine textual structured summary data with video keyframes to best allow users to perform both textual and visual queries on the dataset.
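In case it helps anyone follow along, here's a minimal sketch of what I'm picturing for the embedding step, assuming the open-source clip-ViT-B-32 checkpoint from sentence-transformers and a serverless Pinecone index (the index name, region, and file paths are just my placeholders):

```python
# Rough sketch only: embed summary text and video keyframes into one shared
# CLIP space, then upsert both into Pinecone. Index name, region, checkpoint,
# and file paths are placeholders, not a final design.
from PIL import Image
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")  # text and images land in the same 512-dim space

pc = Pinecone(api_key="YOUR_API_KEY")
if "seating-videos" not in pc.list_indexes().names():
    pc.create_index(
        name="seating-videos",
        dimension=512,              # output size of clip-ViT-B-32
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
index = pc.Index("seating-videos")

# One summary block: its text and the keyframe it describes.
text_vec = model.encode("Adjusting lateral trunk supports on a tilt-in-space frame")
frame_vec = model.encode(Image.open("keyframes/video12_block03.jpg"))

index.upsert(vectors=[
    {"id": "video12-block03-text", "values": text_vec.tolist()},
    {"id": "video12-block03-frame", "values": frame_vec.tolist()},
])
```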
I didn't see any obvious examples of this in the pinecone-io/examples repository on GitHub (the Jupyter Notebooks that help you get hands-on with Pinecone vector databases), so I'm asking here whether anybody has done this before. If not, I'll be happy to do it and contribute the knowledge back. Just looking for hints and best practices here.
Here's what I have to work with:
- Granular structured summaries for every video in a particular series. Each summary block has an AI-generated description that combines multimodal inputs (AI-transcribed spoken English and keyframes), plus the timestamps and skill list that apply to that block.
- E-learning content distilled from each video, with timestamps tied to each question.
- Series-level e-learning and summary content, aggregated and also a candidate for embedding.
- Metadata (keywords, products referenced, skills demonstrated, and search terms) produced per video and per series (rough record sketch below).
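To make that concrete, this is roughly how I imagine a single summary block's record at upsert time. All field names and values here are my own placeholders inferred from the structure above, not a settled schema:

```python
# Hypothetical record for one summary block; every field name is a placeholder.
block_record = {
    "id": "series04-video12-block03-text",
    "values": text_vec.tolist(),        # CLIP text embedding from the sketch above
    "metadata": {
        "series_id": "series04",
        "video_id": "video12",
        "modality": "text",             # the paired keyframe vector would carry "keyframe"
        "timestamp_start": 195.0,       # seconds into the video
        "timestamp_end": 242.5,
        "skills": ["lateral support adjustment", "tilt-in-space setup"],
        "products": ["example-product-123"],
        "keywords": ["trunk support", "seating assessment"],
        "language": "en",
    },
}
index.upsert(vectors=[block_record])
```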
All textual summary content is also made available in several other languages that are common among the predicted user base.
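On the retrieval side, I'm imagining something like this for a text query scoped to the user's language, reusing the same CLIP model; since text and keyframe vectors share one space, visual matches can surface too whenever their metadata carries the same tags (again, field names and values are my placeholders):

```python
# Sketch of one retrieval step for the chatbot: embed the user's question with
# the same CLIP model, then filter on the placeholder language/series metadata.
question = "How do I set up lateral trunk supports?"
query_vec = model.encode(question).tolist()

results = index.query(
    vector=query_vec,
    top_k=5,
    include_metadata=True,
    filter={"language": {"$eq": "en"}, "series_id": {"$eq": "series04"}},
)

for match in results["matches"]:
    meta = match["metadata"]
    print(match["score"], meta["video_id"], meta["timestamp_start"], meta["modality"])
```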
Thanks!