A few questions about using Pinecone as a production online Vector DB

Hi Pinecone experts.

I am evaluating Pinecone as a candidate for my company’s production Vector DB solution and have a few concerns I am hoping you can clarify:

  1. What’s the SLA for the standard and dedicated plans, and how is index high availability ensured? We may use Pinecone for online serving in a recommendation system, so high availability is very important to us.

  2. The “February Release: Performance, Predictability, and Control” post tells us the p95 query latency including the network round trip. Do you have statistics that exclude network latency, so I can better understand Pinecone’s own processing latency?

  3. How is data durability and safety maintained?

  4. Is there any versioning support? I went through the docs but didn’t find any mention of it. In our typical flow, we train our models and embeddings daily, push the new embedding versions to the Vector DB, and online serving at some point picks up the new embedding version.

  5. Regarding upsert throttling: are there any statistics on the criteria that trigger write throttling, so I can estimate our write speed and design accordingly?

  6. Are there any metrics or dashboards for monitoring and alerting purposes?

These are quite a few questions, but they are very important for our production usage. Thanks in advance!

Hi! I’m in Product at Pinecone. Responding to your questions:

  1. We offer a high-uptime SLA on our dedicated plan; sales would be happy to discuss. The contract would require at least 2 replicas and excess throughput capacity for very high availability. Pinecone uses multiple AZs for high availability on the dedicated plan, but you need at least 1 replica to take advantage of the multi-AZ support. (The first sketch after this list shows where replicas are set.)

  2. The values are round-trip latency, but the round trip is measured within the same cloud and region, so we expect these to be the most meaningful numbers for you.

  3. Pinecone is durable. We use 3-way replication with flushes to disk.

  4. We don’t have a versioning feature. Other Pinecone customers often use either a namespace or a metadata filter to label embeddings by their version, and then query within only the version they want (the second sketch after this list shows this pattern).

  5. With our February release, for indexes created after Feb 22nd, we’ve greatly reduced the chance of getting throttled. I recommend using a batch size of around 100-200 vectors per upsert, as shown in the docs; you should be able to upsert 5K vectors/sec. (The second sketch after this list shows the batching as well.)

  6. Stay tuned! We plan to release metrics via Prometheus soon.
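
To make the replica requirement in 1 concrete, here is a minimal sketch of setting replicas at index creation, assuming the Python client of this era (pinecone-client v2); the index name, dimension, and pod type are illustrative, and exact signatures may vary by client version:

```python
import pinecone

# Placeholders: use your own API key and project environment.
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")

# replicas=2 matches the dedicated-plan HA requirement discussed above;
# with multi-AZ support, replicas are spread across availability zones.
pinecone.create_index(
    "recs-index",       # hypothetical index name
    dimension=768,      # hypothetical embedding dimension
    pod_type="p1.x1",
    replicas=2,
)
```

And a sketch of the namespace-per-version pattern from 4 combined with the 100-200 vector batches from 5, under the same client assumptions; the embeddings and namespace names are toy stand-ins:

```python
import random

import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
index = pinecone.Index("recs-index")

# Toy stand-in for a daily training run: (id, vector) pairs.
embeddings = [(f"item-{i}", [random.random() for _ in range(768)])
              for i in range(1000)]

BATCH_SIZE = 100  # within the 100-200 per-upsert range recommended above
for start in range(0, len(embeddings), BATCH_SIZE):
    batch = embeddings[start:start + BATCH_SIZE]
    # Each model version gets its own namespace.
    index.upsert(vectors=batch, namespace="model-v2")

# Online serving queries only the version it has picked up.
query_vector = embeddings[0][1]
results = index.query(vector=query_vector, top_k=10, namespace="model-v2")
```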

Thanks @dave for the detailed answers; they are very helpful! Regarding #2, #4, and #5, I have the following follow-up questions:

#2. For the round-trip latency value, what’s the throughput when the p95 latency is ~100ms? I didn’t see any info about the maximum reasonable query throughput that a p1 pod can support.

#4. Using a namespace label as the version should work. Just to double-check: can we have the same vector ID in different namespaces in the same index?

#5. About “upsert 5K vectors/sec”: are you referring to a single p1 pod? How about an s1 pod? Or is this the maximum speed no matter how many pods we are using?

Also, I am wondering if Pinecone has any benchmark tool we can reuse to benchmark our scenario (e.g., our scale and different dimensions).

Glad it was helpful. To your follow-ups:

#2: Generally speaking, query throughput is about 1/latency. But now I see why you asked about round-trip time: throughput is in fact a little better than 1/latency, because the network round trip adds to latency without contributing as much to throughput. Giving specific numbers is hard; the best way to estimate how much better than 1/latency you can achieve is to test. That said, 1/latency is a reasonable approximation.

#4 Yes, we support using the same ID in different namespaces (a short sketch follows below). Note: this behavior is a little different from other data systems, where reuse of IDs across partitions can be prohibited. It sounds like you would value this? A workaround would be to append a version suffix to the ID, something like “id-123-model1”. Would that be hard? Again, we support ID reuse across namespaces today, so it’s not necessary; I just ask out of curiosity.

#5 Unfortunately, no; we don’t currently have published benchmark code. We have an older benchmark below, but it uses an older version of Pinecone. Nonetheless, it may give you ideas for how to test.
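
To illustrate the ID-reuse answer in #4, a minimal sketch assuming the Python client and the namespace-per-version layout from earlier in the thread; the IDs, vectors, and namespace names are hypothetical:

```python
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
index = pinecone.Index("recs-index")  # hypothetical index name

vec_v1 = [0.1] * 768  # toy vectors; real ones come from your models
vec_v2 = [0.2] * 768

# The same ID can live in different namespaces; namespaces are independent.
index.upsert(vectors=[("id-123", vec_v1)], namespace="model-v1")
index.upsert(vectors=[("id-123", vec_v2)], namespace="model-v2")

# The single-namespace workaround mentioned above would instead look like:
# index.upsert(vectors=[("id-123-model1", vec_v1)])
```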

@dave Thanks for your quick response!
#2: Regarding query throughput, our online serving may currently reach ~6K QPS, and we also want to maintain reasonable latency at the same time.

#3: In our initial scenario, we fetch an embedding by ID and then use the fetched embedding to run a similarity query (sketched after this list). I am thinking of using one namespace per version, so the ID will be the same for the same target across different namespaces (i.e., versions, in my case).

#5: The link is useful to me. Regarding “upsert 5K vectors/sec”, is it a single pod’s limit, a single index’s limit, or even a whole project’s limit?
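
A minimal sketch of the fetch-then-query flow in #3, assuming the Python client; the response shape shown here matches that era’s client but may differ across versions, so treat it as illustrative:

```python
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
index = pinecone.Index("recs-index")   # hypothetical index name
NAMESPACE = "model-v2"                 # the version online serving picked up

# Step 1: fetch the stored embedding for a known target by ID.
fetched = index.fetch(ids=["id-123"], namespace=NAMESPACE)
# Depending on client version, dict-style access may also work:
# fetched["vectors"]["id-123"]["values"]
target_vector = fetched.vectors["id-123"].values

# Step 2: run a similarity query with that embedding, within the same version.
results = index.query(vector=target_vector, top_k=10, namespace=NAMESPACE)
```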

Glad it was helpful! Upserts are faster with more pods and when using the gRPC client; this may double the upsert rate or more (see the sketches below). What upsert rate do you plan?

6K QPS will definitely take replicas, and I hear you that at that rate the latency details matter. I suggest you reach out to support@pinecone.io to get more specifics on optimizing for QPS in your case.
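
To put rough numbers on the 1/latency rule of thumb and the replica question, a back-of-envelope sketch; the per-replica QPS below is a placeholder you would measure in a load test, not a published Pinecone figure:

```python
import math

# Placeholders: measure these for your own workload.
measured_per_replica_qps = 150   # hypothetical load-test result
target_qps = 6000

# The 1/latency rule of thumb gives a floor for one serial client
# (~10 QPS at p95 = 100 ms); concurrent clients raise per-replica QPS,
# which is why you measure rather than compute it.
replicas_needed = math.ceil(target_qps / measured_per_replica_qps)
print(f"replicas needed: {replicas_needed}")  # 40 with these placeholders
```

And since the gRPC client came up above, a sketch of parallel upserts with it, assuming pinecone-client installed with the grpc extra and the async_req upsert option of that era; check your client version’s docs before relying on this:

```python
import random

import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
grpc_index = pinecone.GRPCIndex("recs-index")  # needs pinecone-client[grpc]

# Toy data: 1,000 vectors split into 100-vector batches.
vectors = [(f"item-{i}", [random.random() for _ in range(768)])
           for i in range(1000)]
batches = [vectors[i:i + 100] for i in range(0, len(vectors), 100)]

# Fire batches concurrently; async_req returns futures in this client.
futures = [grpc_index.upsert(vectors=b, namespace="model-v2", async_req=True)
           for b in batches]
_ = [f.result() for f in futures]  # block until all writes complete
```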

Thanks @dave!
I will contact support@pinecone.io with more specific QPS optimization questions.