Thanks a lot for the warm welcome and quick response!
Sure, I’d be happy to provide more context.
I’m building a web-based assistant for healthcare professionals in Turkey — specifically for private hospital managers and medical billing staff. The assistant answers questions about regulations, reimbursement rules, and contract details based on official healthcare legislation (like the SUT and related annexes).
Here are some rough figures:
- Around 30-40 documents, mostly PDF and DOCX, totaling about 30-40 MB.
- Roughly 2,500-3,000 vectors after chunking and embedding.
- Pinecone (Starter or Standard plan) + OpenAI for embeddings and responses.
- Relatively simple user requests: mostly short queries with low concurrency, maybe a few hundred queries per week at most in the beginning.
That said, my monthly cost has been higher than expected, especially when combining Pinecone vector storage with the Assistant’s LLM-based processing. I just want to make sure I’m not overpaying for something that could be optimized.
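If my math is right, the raw vector data should be tiny anyway: assuming 1536-dimensional embeddings (the default for OpenAI’s text-embedding-3-small / ada-002), 3,000 vectors × 1536 dims × 4 bytes ≈ 18 MB, which is part of why the bill surprised me.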
Are there any best practices for reducing costs in this kind of setup? For example:
- Using a more storage-efficient vector DB?
- Reducing redundancy in the chunks? (I sketched what I mean below.)
- Running embeddings differently?
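For the redundancy point, here’s a rough sketch of what I mean: dropping near-duplicate chunks by embedding similarity before upserting. Everything here is illustrative: `chunks`, `embeddings`, and the 0.95 threshold are placeholders, not my actual pipeline.

```python
# Illustrative near-duplicate filter, run once before upserting to Pinecone.
# `chunks` and `embeddings` stand in for my real data; 0.95 is a guessed
# similarity threshold that would need tuning.
import numpy as np

def dedupe_chunks(chunks: list[str], embeddings: np.ndarray,
                  threshold: float = 0.95) -> list[str]:
    """Greedily keep a chunk only if it is not too similar to any kept chunk."""
    # Normalize rows so a plain dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(chunks)):
        if all(normed[i] @ normed[j] < threshold for j in kept):
            kept.append(i)
    return [chunks[i] for i in kept]
```

The greedy pairwise pass is O(n²), which seems fine at ~3,000 vectors but obviously wouldn’t scale much further.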
I’d also love to hear if anyone has used a hybrid setup (e.g., local embedding + external retrieval) to cut down on expenses.
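To make the hybrid idea concrete, this is roughly what I’m picturing: embed queries locally with a multilingual model (so query-time embedding costs nothing per call) and keep Pinecone for retrieval only. The model and index names below are placeholders, and the corpus would need to be re-embedded with the same local model so query and document vectors live in the same space.

```python
# Hybrid sketch: local query embedding + Pinecone retrieval.
# The model and index names are placeholders; I picked a multilingual
# model on the assumption that it handles Turkish reasonably well.
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("sut-docs")  # placeholder index name

def retrieve(query: str, top_k: int = 5):
    # Encode locally (no per-query embedding API cost), then query Pinecone.
    vector = model.encode(query).tolist()
    return index.query(vector=vector, top_k=top_k, include_metadata=True)
```

The final answer would still be an OpenAI call, so this only trims the embedding line item; I genuinely don’t know whether that’s the big one on my bill, which is partly what I’m asking.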
Thanks again!