I have millions of vectors in a Pinecone index.
Goal: Cluster these vectors and retrieve the top n clusters.
I can achieve this with DBSCAN from sklearn, but that requires having all of the vectors in memory.
Question: How can I query for all vectors in the Pinecone index? Is this even the right way to be approaching this problem?
First, welcome to the Pinecone community!
When you say you want to cluster your vectors, what are you solving by doing so? If you could give more insight into what you’re doing we can come up with a more vector-friendly way of doing that.
You shouldn’t need to query for all vectors in the database. If you’re looking for clusters of matching vectors you should be using queries in the index to do that for you.
The goal is to label reviews. I don’t know what the labels are going to be ahead of time.
My first approach was to run DBSCAN on the vectors, choose the top 50 clusters, and then analyze what was clustered together and label them appropriately: “slow to load”, “used daily”, “crashing”, “too expensive”, etc.
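For anyone following along, here's a minimal sketch of that approach: run DBSCAN over review embeddings already loaded into memory, drop the noise points, and keep the largest clusters for manual labeling. The random embeddings and the `eps`/`min_samples` values are placeholders, not tuned recommendations.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Stand-in for real review embeddings fetched out of the index.
embeddings = rng.normal(size=(1000, 64))

# Cosine distance is the usual choice for text embeddings.
labels = DBSCAN(eps=0.5, min_samples=5, metric="cosine").fit_predict(embeddings)

# DBSCAN marks noise as -1; rank the remaining clusters by size
# and keep the top 50 for manual inspection and labeling.
counts = Counter(l for l in labels if l != -1)
top_clusters = [cluster_id for cluster_id, _ in counts.most_common(50)]
```

Each cluster id in `top_clusters` can then be mapped back to its member reviews to decide on a label like “slow to load”.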
This might be a good fit for a new feature we’re working on, collection IO. If you don’t object, I can ask one of our PMs to email you more details.
Happy to chat with them. You can use the email for this account.
Hey there, I’m looking to do the same thing - clustering. What’s the timeline for releasing this as a public feature? Thanks!
Hi @maxime ,
We don’t have a publicly available timeline yet. But keep an eye on the monthly newsletter, we’ll announce any big releases there.
Just following up to see if there are any updates on this.
One work-around I see is that many clustering algorithms accept a precomputed distance matrix. So if I could get around the 10,000 top_k maximum, I could generate my own cosine distance matrix from all of my vectors. Is there anything like this currently available?
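To illustrate the work-around being described: sklearn's DBSCAN accepts `metric="precomputed"`, so you can hand it a cosine distance matrix you built yourself. The vectors below are random stand-ins, and note the pairwise matrix is O(n²) memory, so this only works on a sample of a million-vector index, not the whole thing.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_distances

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 32))  # stand-in for vectors pulled from the index

# Build the full pairwise cosine distance matrix (500 x 500) ...
dist = cosine_distances(vectors)

# ... and cluster directly on it instead of on the raw vectors.
labels = DBSCAN(eps=0.3, min_samples=5, metric="precomputed").fit_predict(dist)
```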
Thanks in advance!
No, nothing to share yet. We’re also working on ways to do what you mention with the custom distance metric, but more around setting minimum and maximum similarity score values. So you could bracket your results (e.g., give me all vectors between 0.7 and 0.8 confidence). Would that do what you’re looking for?
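Score bracketing isn't a released feature in this thread, but it can be approximated client-side today by filtering query matches on their similarity score. The `matches` list below just mimics the shape of a query response (`id` plus `score`); `bracket_matches` is a hypothetical helper, not a Pinecone API.

```python
def bracket_matches(matches, low=0.7, high=0.8):
    """Keep only matches whose similarity score falls in [low, high]."""
    return [m for m in matches if low <= m["score"] <= high]

# Fake matches in the shape a vector-index query typically returns.
matches = [
    {"id": "a", "score": 0.95},
    {"id": "b", "score": 0.75},
    {"id": "c", "score": 0.72},
    {"id": "d", "score": 0.40},
]

bracketed = bracket_matches(matches)  # keeps "b" and "c"
```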
Would you mind answering the original question please? I have the same clustering use case, and haven’t seen any updates on this, so it would be easier if I just load all the vectors and do it in Python.
Hi @Cory_Pinecone ,
Is there any update on this functionality? I’ll explain my use case in a bit more detail in case there is another way to do it I’m missing…
I have about 90k titles of different meetings that I want to tag (e.g., “Basketball with friends” → Social). At this stage I don’t have a predefined list of tags, so instead I want to index the vectors for all 90k titles and then look for similar groups of meeting titles. Once I have these I will feed them back into the LLM and ask it to give me a good tag description of all the similar titles.
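A rough sketch of that pipeline, under stated assumptions: the titles and embeddings below are random stand-ins, AgglomerativeClustering is just one possible grouping method, and `tag_prompt` is a hypothetical helper that builds the text you would send to an LLM (no API call is made here).

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import AgglomerativeClustering

titles = [f"meeting {i}" for i in range(200)]  # placeholder meeting titles
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(titles), 32))  # stand-in title embeddings

# Group similar titles; the real n_clusters would be chosen by experiment.
labels = AgglomerativeClustering(n_clusters=10).fit_predict(embeddings)

groups = defaultdict(list)
for title, label in zip(titles, labels):
    groups[label].append(title)

def tag_prompt(group_titles):
    """Build the prompt asking an LLM for one short tag covering the group."""
    joined = "\n".join(group_titles[:50])  # cap the prompt size
    return f"Give one short tag describing these meeting titles:\n{joined}"
```

Each group's prompt would then be sent to the LLM, and its answer becomes the tag for every title in that group.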
Does that make sense?
Any update on this? @Cory_Pinecone
No, sorry, no updates. Please keep an eye on our release notes to know when this feature is available.