What are the biggest challenges you've faced when scaling vector search for real-time applications?

As part of the Pinecone team, I get to hear from a lot of folks working on real-time vector search. A few common themes keep coming up: managing query throughput, handling frequent index updates, and optimizing memory usage. I’ve also heard people experimenting with things like sharding strategies and embedding compression to help.
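To make the embedding-compression idea concrete, here's a minimal sketch of min-max scalar quantization, which packs each float32 dimension into a single byte (roughly a 4x memory reduction). All function names here are illustrative, not any particular library's API:

```python
# Hypothetical sketch: compress a float embedding to uint8 codes via
# min-max scalar quantization, keeping scale/offset for decoding.
# Names are illustrative, not a real library's API.

def quantize(vec):
    """Map each float to a 0-255 bucket; return codes plus decode params."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # avoid zero scale for constant vectors
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Approximate reconstruction of the original vector."""
    return [lo + c * scale for c in codes]

vec = [0.12, -0.83, 0.55, 0.99]
codes, lo, scale = quantize(vec)
approx = dequantize(codes, lo, scale)
# Each reconstructed value is within one quantization step of the original.
assert all(abs(a - b) <= scale for a, b in zip(vec, approx))
```

The trade-off, of course, is recall: coarser codes shrink the index but blur fine distance distinctions, which is exactly the tension people seem to be experimenting with.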

I’d love to open this up: what challenges have you run into when scaling vector search for real-time use cases, and what has (or hasn’t) worked?
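As one concrete anchor for the sharding discussion above, here's a minimal sketch of the fan-out-and-merge pattern: each shard returns its local top-k, and the results are merged globally. The shard layout, IDs, and brute-force dot-product scoring are purely illustrative assumptions:

```python
# Hypothetical sketch: fan a query out across shards, merge partial top-k.
# Shard contents, routing, and scoring are illustrative, not a real API.
import heapq

def search_shard(shard, query, k):
    """Brute-force dot-product scoring within one shard."""
    scored = [(sum(q * x for q, x in zip(query, vec)), vid)
              for vid, vec in shard.items()]
    return heapq.nlargest(k, scored)

def fan_out_search(shards, query, k):
    """Query every shard, then merge the partial top-k lists."""
    partials = [hit for shard in shards for hit in search_shard(shard, query, k)]
    return heapq.nlargest(k, partials)

shards = [
    {"a": [1.0, 0.0], "b": [0.5, 0.5]},
    {"c": [0.0, 1.0], "d": [0.9, 0.1]},
]
top = fan_out_search(shards, [1.0, 0.0], k=2)
# Highest dot products with [1.0, 0.0] are "a" (1.0) and "d" (0.9).
assert [vid for _, vid in top] == ["a", "d"]
```

In practice the interesting problems start where this sketch ends: keeping shards balanced under frequent upserts, and bounding tail latency when one shard is slow.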