Today, we run LLM evaluations in a fairly ad hoc way, either on our own or using OpenAI’s evaluation tool. Ideally, I would like to plug my prompts and ground-truth dataset into my vector database and run evals, so I can understand the impact of prompt engineering or of switching LLMs (e.g., GPT-4o vs. o1) in relation to our existing vector database (knowledge base).
Hi kavita1, welcome back to the forum!
That’s a great idea! We will consider this going forward!
For now, you might look into platforms like Braintrust, Arize AI, and Galileo, which offer tooling for exactly this kind of workflow: comparing models and their outputs side by side and running experiments as you iterate on prompts and parameters.
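If you want to start lightweight before adopting one of those platforms, here is a minimal sketch of the kind of loop you are describing. Everything in it is an assumption to adapt to your setup: a hypothetical index named "knowledge-base" whose records store their source text under a "text" metadata field, OpenAI's text-embedding-3-small for query embeddings, a tiny in-memory ground-truth dataset, and a crude string-match grader. Treat it as a starting point, not an official recipe.

```python
# Rough sketch of a retrieval-grounded eval loop over a Pinecone index.
# Assumptions: index "knowledge-base" exists, each match carries its source
# text in metadata["text"], and API keys are set via environment variables.
import os
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()                                  # reads OPENAI_API_KEY
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("knowledge-base")                        # hypothetical index name

# Ground-truth dataset: question + expected answer (replace with your own).
dataset = [
    {"question": "What is our refund window?", "expected": "30 days"},
]

MODELS = ["gpt-4o", "o1"]                                 # models being compared
PROMPT = (
    "Answer using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def retrieve(question: str, top_k: int = 5) -> str:
    """Embed the question and pull the top-k chunks from the index."""
    emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    results = index.query(vector=emb, top_k=top_k, include_metadata=True)
    return "\n".join(m.metadata["text"] for m in results.matches)

def grade(answer: str, expected: str) -> bool:
    """Crude string-containment grader; swap in an LLM judge if you prefer."""
    return expected.lower() in answer.lower()

for model in MODELS:
    correct = 0
    for row in dataset:
        context = retrieve(row["question"])
        response = openai_client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": PROMPT.format(context=context, question=row["question"]),
            }],
        )
        answer = response.choices[0].message.content
        correct += grade(answer, row["expected"])
    print(f"{model}: {correct}/{len(dataset)} correct")
```

From there you can swap the grader for an LLM-as-judge, vary `top_k` or the prompt template, and log per-model scores to whatever experiment tracker you use.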
Would you like to see a guide from us showing how to do this specifically with Pinecone in the future?
Sincerely,
Arjun
Thanks @arjun, I have scoped it down to OpenAI’s evaluation tools in their developer playground, and they work well for our use case. I would like to keep my AI tech stack footprint limited in the early stages for speed, transparency, and easier debugging. Yes, I would love for Pinecone to offer a simple guide/feature for connecting prompts and models to the knowledge base in Pinecone, instead of developers having to reinvent the wheel across multiple platforms.