How to query all vectors in Pinecone index?

I have millions of vectors in a Pinecone index.

Goal: Cluster these vectors and retrieve the top n clusters.

I can achieve this using DBSCAN from sklearn. But I need all vectors in memory to do that.
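For reference, the in-memory approach described above could look roughly like the following. This is a minimal sketch, assuming the vectors are already loaded into a NumPy array; the function name `top_n_clusters` and the default `eps` and `min_samples` values are illustrative choices, not anything from Pinecone or from the original post.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import DBSCAN


def top_n_clusters(vectors, n, eps=0.3, min_samples=5):
    """Run DBSCAN with cosine distance and return the n largest clusters.

    Returns (labels, top), where labels holds one cluster id per vector
    (-1 means noise) and top is a list of (cluster_id, size) pairs.
    eps and min_samples are placeholder defaults and need tuning per dataset.
    """
    vectors = np.asarray(vectors, dtype=float)
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(vectors)
    # Count cluster sizes, ignoring points DBSCAN marked as noise.
    counts = Counter(int(label) for label in labels if label != -1)
    return labels, counts.most_common(n)
```

Calling `top_n_clusters(vectors, n=50)` would give the 50 largest clusters by size. Everything has to fit in memory, which is the limitation the question is about.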

Question: How can I query for all vectors in the Pinecone index? Is this even the right way to be approaching this problem?


Hi @pzsh,

First, welcome to the Pinecone community!

When you say you want to cluster your vectors, what problem are you solving by doing so? If you could give more insight into what you’re doing, we can come up with a more vector-friendly way of doing that.

You shouldn’t need to query for all vectors in the database. If you’re looking for clusters of matching vectors you should be using queries in the index to do that for you.

Cory

Hi Cory!

The goal is to label reviews. I don’t know what the labels are going to be ahead of time.

My first approach was to run DBSCAN on the vectors, choose the top 50 clusters, and then analyze what was clustered together and label them appropriately: “slow to load”, “used daily”, “crashing”, “too expensive”, etc.

This might be a good fit for a new feature we’re working on, collection IO. If you don’t object, I can ask one of our PMs to email you more details.

Happy to chat with them. You can use the email for this account.

Hey there, I’m looking to do the same thing - clustering. What’s the timeline for releasing this as a public feature? Thanks!

Hi @maxime ,

We don’t have a publicly available timeline yet, but keep an eye on the monthly newsletter. We’ll announce any big releases there.

Hi @Cory_Pinecone,
Just following up to see if there are any updates on this.

One workaround I see is that many clustering algorithms accept a custom (precomputed) distance matrix. So if I could get around the 10,000 top_k maximum, I could generate my own cosine distance matrix from all of my vectors. Is there anything like this currently available?
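To illustrate the distance-matrix idea, here is a minimal sketch that assumes the vectors are already in memory. The function name `cluster_with_precomputed_distances` and the default parameters are hypothetical; scikit-learn's DBSCAN is used only as one example of an algorithm that accepts `metric="precomputed"`.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_distances


def cluster_with_precomputed_distances(vectors, eps=0.3, min_samples=5):
    """Cluster using an explicit pairwise cosine distance matrix.

    eps and min_samples are placeholder defaults, not recommended values.
    """
    vectors = np.asarray(vectors, dtype=float)
    # Dense (n, n) matrix of cosine distances between every pair of vectors.
    distances = cosine_distances(vectors)
    # metric="precomputed" tells DBSCAN that the input is already a distance matrix.
    return DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed").fit_predict(distances)
```

Note that the dense matrix grows quadratically: for one million vectors in float64 it would need roughly 8 TB, so this only works on a sample or with a sparse neighborhood graph.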

Thanks in advance!

Hi @Inegoita,

No, nothing to share yet. We’re also working on ways to do what you mention with the custom distance metric, but more around setting minimum and maximum similarity score values. So you could bracket your results (e.g., give me all vectors between .7 and .8 confidence). Would that do what you’re looking for?
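Until something like that exists server-side, the bracketing can be approximated on the client by filtering the matches a query returns. The sketch below is not a Pinecone API: `bracket_by_score` is a hypothetical helper, and it assumes each match is a dict with `"id"` and `"score"` keys.

```python
def bracket_by_score(matches, low, high):
    """Keep only matches whose similarity score lies in [low, high].

    matches is assumed to be a list of dicts with "id" and "score" keys;
    adapt the field access if your client returns objects instead.
    """
    return [match for match in matches if low <= match["score"] <= high]


# Example with made-up scores:
example_matches = [
    {"id": "a", "score": 0.91},
    {"id": "b", "score": 0.78},
    {"id": "c", "score": 0.72},
    {"id": "d", "score": 0.55},
]
bracketed = bracket_by_score(example_matches, 0.7, 0.8)
```

This is still bounded by the top_k limit of the underlying query, so it narrows a result set rather than returning every vector in the bracket.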

Would you mind answering the original question please? I have the same clustering use case, and haven’t seen any updates on this, so it would be easier if I just load all the vectors and do it in Python.