When you say you want to cluster your vectors, what problem are you solving by doing so? If you could give more insight into what you're doing, we can come up with a more vector-friendly way of doing it.
You shouldn't need to query for all vectors in the database. If you're looking for clusters of matching vectors, queries against the index should do that work for you.
The goal is to label reviews. I don’t know what the labels are going to be ahead of time.
My first approach was to run DBSCAN on the vectors, keep the 50 largest clusters, and then inspect what was clustered together and label each cluster appropriately: “slow to load”, “used daily”, “crashing”, “too expensive”, etc. (rough sketch below).
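Roughly, this is the shape of that pipeline (a minimal sketch, not my exact code; it assumes the embeddings have already been exported into a NumPy array `embeddings` aligned with a list of review texts `reviews`, and the `eps`/`min_samples` values are placeholders to tune):

```python
from collections import Counter

import numpy as np
from sklearn.cluster import DBSCAN

# Cluster the review embeddings directly on cosine distance.
labels = DBSCAN(eps=0.2, min_samples=5, metric="cosine").fit_predict(embeddings)

# Keep the 50 largest clusters; label -1 is DBSCAN's noise bucket.
sizes = Counter(label for label in labels if label != -1)
top_clusters = [cluster_id for cluster_id, _ in sizes.most_common(50)]

# Skim a few members of each cluster to pick a human-readable label
# ("slow to load", "crashing", ...).
for cluster_id in top_clusters:
    members = np.flatnonzero(labels == cluster_id)
    print(f"cluster {cluster_id} ({len(members)} reviews)")
    for i in members[:3]:
        print("  ", reviews[i])
```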
This might be a good fit for a new feature we’re working on, collection IO. If you don’t object, I can ask one of our PMs to email you more details.
Hi @Cory_Pinecone,
Just following up to see if there are any updates on this.
One workaround I see: many clustering algorithms accept a precomputed distance matrix. So if I could get around the 10,000 top_k maximum, I could build my own cosine distance matrix from all of my vectors and cluster on that. Is there anything like this currently available?
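For concreteness, here's what I mean, assuming I could first pull everything into a NumPy array `embeddings` (scikit-learn's DBSCAN accepts `metric="precomputed"`):

```python
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_distances

# n x n cosine distance matrix; this is O(n^2) memory, so it only scales
# to the tens of thousands of vectors, which is exactly why the 10,000
# top_k cap is the bottleneck.
dist = cosine_distances(embeddings)

# metric="precomputed" makes DBSCAN treat `dist` as distances directly.
labels = DBSCAN(eps=0.2, min_samples=5, metric="precomputed").fit_predict(dist)
```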
No, nothing to share yet. We're also working on ways to do what you mention with the custom distance matrix, but more around setting minimum and maximum similarity score values, so you could bracket your results (e.g., give me all vectors with scores between 0.7 and 0.8). Would that do what you're looking for?
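In the meantime, a rough client-side approximation of that bracketing is to over-fetch and filter on the returned scores; `index` and `query_vec` below are assumed names for a Pinecone index handle and a query embedding:

```python
# Over-fetch (top_k is capped at 10,000), then keep only matches whose
# similarity score falls inside the bracket.
res = index.query(vector=query_vec, top_k=10_000)
bracketed = [m for m in res.matches if 0.7 <= m.score <= 0.8]
```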
Would you mind answering the original question, please? I have the same clustering use case and haven't seen any updates on this, so it would be easier to just load all the vectors and do the clustering in Python myself.
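For reference, this is the brute-force route I mean. It's a sketch that assumes the current Python client and an index where `index.list()` is supported for paging through vector IDs, with `index.fetch()` pulling back the values:

```python
import numpy as np
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("reviews")  # assumed index name

ids, vectors = [], []
for id_page in index.list():        # yields pages of vector IDs
    fetched = index.fetch(ids=id_page)
    for vid, vec in fetched.vectors.items():
        ids.append(vid)
        vectors.append(vec.values)

embeddings = np.array(vectors)  # ready for DBSCAN or a distance matrix
```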