Guidance on making asynchronous queries


I was wondering if there is documentation on how to conduct asynchronous queries of data (like the example for upserts in parallel). My index is relatively small (20k vectors), but I need to query millions of ‘unseen’ vectors against the index to find similar vectors.



Hi @jltparc,

First, welcome to the Pinecone forums!

Queries are run in the order they’re received, and are non-blocking read operations. So you shouldn’t have to include any asynchronous logic when running them.

Can you share more details about the types of queries you’re running? If you’re running hundreds at a time, note that we don’t have a mechanism for batched reads like we do for upserts; there is a deprecated method in the docs for querying multiple vectors simultaneously, but we don’t recommend using it.

You might be able to average your vectors in groups, run a single query on the resulting vector, and then only query the constituent vectors if that one meets a certain threshold. This may take some experimenting to get right: if your outliers are too close to your corpus of expected values, they may get drowned out when averaged with too many expected values.
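To make the group-averaging idea concrete, here is a minimal, self-contained sketch. It uses a local cosine-similarity function as a stand-in for the real `index.query` call (so the snippet runs without a Pinecone index); the group size and threshold are illustrative values you would tune for your data:

```python
import numpy as np

def group_averages(vectors, group_size):
    """Split vectors into fixed-size groups and return one centroid per group."""
    groups = [vectors[i:i + group_size] for i in range(0, len(vectors), group_size)]
    centroids = [np.mean(g, axis=0) for g in groups]
    return centroids, groups

def top_score(index_vectors, v):
    """Stand-in for index.query: best cosine similarity against the index."""
    v = v / np.linalg.norm(v)
    return float((index_vectors @ v).max())

rng = np.random.default_rng(0)

# Toy "index" of normalized vectors.
index_vectors = rng.normal(size=(100, 8))
index_vectors /= np.linalg.norm(index_vectors, axis=1, keepdims=True)

# Millions of unseen vectors in the real use case; 1,000 here.
unseen = rng.normal(size=(1000, 8))
centroids, groups = group_averages(unseen, group_size=50)

THRESHOLD = 0.5  # illustrative; tune against your corpus
hits = []
for centroid, group in zip(centroids, groups):
    # One cheap query per group; drill into members only on a hit.
    if top_score(index_vectors, centroid) >= THRESHOLD:
        hits.extend(v for v in group if top_score(index_vectors, v) >= THRESHOLD)
```

The trade-off mentioned above shows up directly in `group_size`: larger groups mean fewer first-pass queries but more dilution of any single outlier in the centroid.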

Hi @Cory_Pinecone – thanks for the response. To clarify the point about non-blocking read operations: it seems that index.query is a blocking operation in the current version of the Python SDK, and (as you mention) the batched query system seems to be deprecated.

As such, in a use case where a large number of queries need to be made at once, the naive approach does not allow any parallelization. Is the guidance to use the built-in Python primitives for parallelism (asyncio, thread pools, etc.), or are there any Pinecone-specific recommendations? I’m not able to find much in the documentation and SDK source code, so I imagine the Python builtins are my best bet, but wanted to check.
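For what it’s worth, since `index.query` blocks, one workaround (not an official Pinecone recommendation) is to fan queries out over a standard-library thread pool. The sketch below uses a stub `query_one` in place of a real `index.query(vector=..., top_k=...)` call so it is self-contained; the worker count of 30 is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def query_one(vector):
    # In real use this would be:
    #   return index.query(vector=vector, top_k=10)
    # Stubbed here so the sketch runs without a Pinecone index.
    return {"matches": len(vector)}

vectors = [[0.1] * 8 for _ in range(100)]

# Threads work well here because each query is I/O-bound (an HTTP call),
# so the GIL is released while waiting on the network.
with ThreadPoolExecutor(max_workers=30) as pool:
    results = list(pool.map(query_one, vectors))
```

`pool.map` preserves input order, which makes it easy to pair each result back with its source vector.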



Were you able to find any best practices here?

I would also want to be able to await an index.query. Making it blocking greatly reduces performance when lots of sequential queries are necessary.


I am querying over 3 million input embeddings, so I need this to be as fast as possible. Parallel querying (e.g., across multiple pods or replicas?) would be very helpful. Best-practice example code in Python would be very useful.

I made changes to support async querying. It turns out the underlying HTTP library does support this (through a multiprocessing pool) if you pass async_req=True to query. However, the Pinecone library’s Index.query method doesn’t handle the AsyncResult that comes back: it calls parse_query_response assuming it received an actual response rather than an AsyncResult. So I subclassed Index and made my own class.

You can see the approach here:

It returns an ‘AsyncHandle’ (a wrapper class in the gist), so you can call ‘get’ when you want to wait for the results. I didn’t have an easier way to chain this on short notice.

To use it, instantiate AsyncIndex with num_threads set to something like 30. Remember that you want to keep the index object around rather than recreate it per query.
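For illustration, here is a runnable sketch of the handle pattern described above. It uses a plain `ThreadPool` and a stand-in `slow_query` function instead of the real HTTP call; in the actual subclass, `get()` would apply `parse_query_response` to the `AsyncResult` returned by `Index.query(async_req=True)`. The names and signatures here are illustrative, not the SDK’s API:

```python
from multiprocessing.pool import ThreadPool

class AsyncHandle:
    """Thin wrapper over an AsyncResult so the caller decides when to block."""

    def __init__(self, async_result, postprocess=lambda r: r):
        self._async_result = async_result
        # In the real AsyncIndex, postprocess would be parse_query_response.
        self._postprocess = postprocess

    def get(self, timeout=None):
        # Blocks until the underlying request completes, then parses it.
        return self._postprocess(self._async_result.get(timeout))

def slow_query(vector):
    # Stand-in for the HTTP request that async_req=True would dispatch.
    return {"top_k": max(vector)}

pool = ThreadPool(processes=4)  # num_threads in the gist's AsyncIndex
handles = [AsyncHandle(pool.apply_async(slow_query, (v,)))
           for v in ([1, 2], [3, 4])]

# Fire everything off first, then collect; this is where the speedup
# over sequential blocking queries comes from.
results = [h.get() for h in handles]
pool.close()
pool.join()
```

The key design point is that dispatch and collection are decoupled: you enqueue all the work up front and only block once, when you actually need each result.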