I have millions of vectors in a Pinecone index.
Goal: Cluster these vectors and retrieve the top n clusters.
I can achieve this using DBSCAN from sklearn. But I need all vectors in memory to do that.
Question: How can I query for all vectors in the Pinecone index? Is this even the right way to be approaching this problem?
First, welcome to the Pinecone community!
When you say you want to cluster your vectors, what problem are you trying to solve by doing so? If you can give more insight into what you’re doing, we can come up with a more vector-friendly way of doing it.
You shouldn’t need to query for all vectors in the database. If you’re looking for clusters of matching vectors you should be using queries in the index to do that for you.
The goal is to label reviews. I don’t know what the labels are going to be ahead of time.
My first approach was to run DBSCAN on the vectors, choose the top 50 clusters, and then analyze what was clustered together and label each cluster appropriately: “slow to load”, “used daily”, “crashing”, “too expensive”, etc.
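For reference, the approach described above could be sketched roughly like this. It is only a minimal illustration: it assumes the vectors have already been exported into a NumPy array (it does not show how to get them out of Pinecone), the function name `top_n_clusters` is made up for this example, and the `eps`, `min_samples`, and cosine-metric settings are placeholder assumptions that would need tuning on real review embeddings. A small synthetic array stands in for the real data.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import DBSCAN


def top_n_clusters(vectors, n, eps=0.1, min_samples=5):
    """Cluster with DBSCAN and return (labels, [(cluster_label, size), ...])
    for the n largest clusters. Hypothetical helper, not a Pinecone API."""
    # eps / min_samples / metric are assumed values; tune them for real data.
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine")
    labels = model.fit_predict(vectors)
    # DBSCAN marks noise points with the label -1; leave them out of the ranking.
    sizes = Counter(int(label) for label in labels if label != -1)
    return labels, sizes.most_common(n)


# Synthetic stand-in for exported review embeddings: three tight groups
# of different sizes around distinct directions.
rng = np.random.default_rng(0)
centers = np.eye(8)[:3]
vectors = np.vstack([
    center + 0.01 * rng.standard_normal((count, 8))
    for center, count in zip(centers, (60, 40, 20))
])

labels, top = top_n_clusters(vectors, n=2)
for cluster_label, size in top:
    # Indices of the members, e.g. to look up the review text for manual labeling.
    members = np.flatnonzero(labels == cluster_label)
    print(cluster_label, size, members[:5])
```

With real data, the member indices of each top cluster are what you would inspect to decide on a label such as “crashing” or “too expensive”.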
This might be a good fit for a new feature we’re working on, collection IO. If you don’t object, I can ask one of our PMs to email you more details.
Happy to chat with them. You can use the email for this account.