Expected query latency on serverless

Hello, I wanted to ask: what is the expected query latency for Pinecone Serverless, and what does it depend on? Does it depend on namespace/collection size, dimensionality, etc.? Is it expected to improve once Serverless reaches later release stages?
Currently, while playing with Serverless, we’re getting ~300ms-400ms per query on 384-dimensional vectors with a collection size of around 500. This is a very small test set, and I was wondering whether we should expect higher latencies once we index more data.
Thank you!

In my own testing on a similar dataset I’m getting around 80ms consistently, other than the first couple of cold requests, which take around 300-400ms.

Given that Serverless is currently limited to us-west-2, one question is how close you are to us-west-2. I’m on the US West Coast, so it’s relatively close to me.

Interesting! We are on us-east-1 AWS, but I wouldn’t expect the latency to be that much higher. That said, when I was testing from my office (EU), I was getting 800ms+ latencies and I have no idea why! EU<->US network latency should be roughly 100ms, I believe.
I wasn’t previously testing with many requests, just a few, but even when I tested 1k requests over roughly 3 minutes, I consistently get around 330ms per query (just the Pinecone query, ignoring other overhead we have).
But I am just noticing that even hitting our index host with curl https://XXX.svc.apw5-4e34-81fa.pinecone.io, I am getting around 300ms, which seems higher than I’d expect!
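
For reference, a minimal sketch of the kind of timing loop I mean (assuming the v3+ Pinecone Python client; the host, warm-up count, and request count below are just placeholders):

```python
# Rough client-side latency check against a Pinecone index (a sketch, not a benchmark).
# Assumptions: host, dimension, and request counts are placeholders; v3+ Python client.
import os
import random
import statistics
import time

from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(host="XXX.svc.apw5-4e34-81fa.pinecone.io")  # your index host

DIM = 384
N_WARMUP = 5       # discard the first few cold requests
N_REQUESTS = 100

latencies = []
for i in range(N_WARMUP + N_REQUESTS):
    vector = [random.random() for _ in range(DIM)]
    start = time.perf_counter()
    index.query(vector=vector, top_k=10)
    elapsed_ms = (time.perf_counter() - start) * 1000
    if i >= N_WARMUP:
        latencies.append(elapsed_ms)

latencies.sort()
p95_idx = min(int(len(latencies) * 0.95), len(latencies) - 1)
print(f"p50: {statistics.median(latencies):.1f} ms")
print(f"p95: {latencies[p95_idx]:.1f} ms")
```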

I am not a Pinecone insider, so this is just an outside perspective, but if I were in your shoes, here is what I’d do.

First, I think it’s safe to say that Serverless only being available in us-west-2 is temporary. I’d expect that in the near future it will be possible to provision serverless environments in other regions, as well as multi-region deployments.

For right now, I don’t think it would be worth trying to troubleshoot the request times you’re seeing, because it’s unclear where that time is going: is it network time, or is it internal Pinecone processing time?
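
One quick sanity check you could run in the meantime (a rough sketch; the host below is the same placeholder as in your curl example): time a bare TCP connection to the index host. That connect time is roughly DNS lookup plus one network round trip, and a fresh HTTPS request pays several round trips for the TCP and TLS handshakes before any Pinecone work happens, so if a single round trip is already large, a big chunk of your 300ms is just distance.

```python
# Rough split of "network time" vs "service time" (a sketch; host is a placeholder).
# create_connection() resolves DNS and completes the TCP handshake, so the measured
# time is approximately one network round trip plus DNS resolution.
import socket
import time

HOST = "XXX.svc.apw5-4e34-81fa.pinecone.io"  # your index host

start = time.perf_counter()
sock = socket.create_connection((HOST, 443), timeout=5)
connect_ms = (time.perf_counter() - start) * 1000
sock.close()

print(f"TCP connect (~1 round trip + DNS): {connect_ms:.1f} ms")
```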

So I would try to eliminate as much uncertainty due to network topology as possible by moving the testing environment closer to the data, so that it better reflects a realistic production deployment scenario.

If you do that and still find you’re getting worse performance than you’d expect, then you’ve at least ruled out network topology and can focus on Pinecone performance, which includes evaluating the request you’re making and how much data you’re receiving in response. (For example, are you requesting vector values to be returned when you don’t need them? How much metadata is in the response? What is your top_k, etc.?)
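
As a sketch of what a lean query can look like (again assuming the v3+ Python client, with the host and dimension as placeholders), keeping the response payload small:

```python
# A lean query that keeps the response payload small (a sketch; host and dimension
# are placeholders, and the v3+ Pinecone Python client is assumed).
import os
import random

from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(host="XXX.svc.apw5-4e34-81fa.pinecone.io")

result = index.query(
    vector=[random.random() for _ in range(384)],
    top_k=5,                  # only as many matches as you actually use
    include_values=False,     # don't return 384 floats per match
    include_metadata=False,   # skip metadata unless you need it downstream
)
print(result)
```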
