How to query all vectors in Pinecone index?

pzsh · January 9, 2023, 1:24am

I have millions of vectors in a Pinecone index.

Goal: Cluster these vectors and retrieve the top n clusters.

I can achieve this using DBSCAN from sklearn. But I need all vectors in memory to do that.

Question: How can I query for all vectors in the Pinecone index? Is this even the right way to be approaching this problem?

Cory_Pinecone · January 9, 2023, 5:01pm

Hi @pzsh,

First, welcome to the Pinecone community!

When you say you want to cluster your vectors, what are you solving by doing so? If you could give more insight into what you’re doing we can come up with a more vector-friendly way of doing that.

You shouldn’t need to query for all vectors in the database. If you’re looking for clusters of matching vectors you should be using queries in the index to do that for you.

Cory

pzsh · January 9, 2023, 6:15pm

Hi Cory!

The goal is to label reviews. I don’t know what the labels are going to be ahead of time.

My first approach was to run DBSCAN on the vectors, choose the top 50 clusters, and then analyze what was clustered together and label each cluster appropriately: “slow to load”, “used daily”, “crashing”, “too expensive”, etc.
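
A minimal sketch of that approach, assuming the vectors are already in memory as a NumPy array. The `eps` and `min_samples` values, the synthetic data, and the helper name `top_n_clusters` are illustrative assumptions, not settings taken from this thread.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import DBSCAN


def top_n_clusters(vectors, n, eps=0.3, min_samples=5):
    """Cluster vectors with DBSCAN and rank the clusters by size.

    eps and min_samples are placeholder values; tune them for real embeddings.
    Returns the per-vector labels and a list of (label, size) for the n largest clusters.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(vectors)
    # DBSCAN marks noise points with the label -1; exclude them from the ranking.
    counts = Counter(int(label) for label in labels if label != -1)
    return labels, counts.most_common(n)


# Synthetic stand-in for review embeddings: three tight groups of different sizes.
rng = np.random.default_rng(0)
centers = np.eye(3, 8)
sizes = [60, 40, 20]
vectors = np.vstack([
    center + 0.01 * rng.normal(size=(size, 8))
    for center, size in zip(centers, sizes)
])

labels, largest = top_n_clusters(vectors, n=2)
for label, size in largest:
    print(f"cluster {label}: {size} reviews")
```

With real review embeddings, the members of each of the largest clusters would then be read through and given a label by hand.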

Cory_Pinecone · January 9, 2023, 8:43pm

This might be a good fit for a new feature we’re working on, collection IO. If you don’t object, I can ask one of our PMs to email you more details.

pzsh · January 9, 2023, 9:13pm

Happy to chat with them. You can use the email for this account.

maxime · February 2, 2023, 3:02am

Hey there, I’m looking to do the same thing - clustering. What’s the timeline for releasing this as a public feature? Thanks!

Cory_Pinecone · February 2, 2023, 5:01pm

Hi @maxime,

We don’t have a publicly available timeline yet. But keep an eye on the monthly newsletter; we’ll announce any big releases there.

lnegoita · March 6, 2023, 10:08am

Hi @Cory_Pinecone,
Just following up to see if there are any updates on this.

One work-around I see is that many clustering algorithms allow you to enter a custom distance matrix. So if I could get around the 10,000 top_k maximum, then I could generate my own cosine distance matrix from all of my vectors. Is there anything like this currently available?
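
A rough sketch of the precomputed-distance idea, assuming the vectors could be pulled into memory somehow. The synthetic data and the `eps` and `min_samples` values are placeholders; `metric="precomputed"` is scikit-learn's way of accepting a custom distance matrix.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_distances

rng = np.random.default_rng(0)
# Synthetic stand-in for stored vectors: two tight groups of 8-dimensional vectors.
group_a = np.array([1.0, 0, 0, 0, 0, 0, 0, 0]) + 0.01 * rng.normal(size=(30, 8))
group_b = np.array([0, 1.0, 0, 0, 0, 0, 0, 0]) + 0.01 * rng.normal(size=(30, 8))
vectors = np.vstack([group_a, group_b])

# Full pairwise cosine distance matrix, shape (n, n).
distances = cosine_distances(vectors)

# Pass the matrix itself instead of the raw vectors.
labels = DBSCAN(eps=0.3, min_samples=5, metric="precomputed").fit_predict(distances)
n_clusters = len(set(labels.tolist()) - {-1})  # -1 is DBSCAN's noise label
print(f"found {n_clusters} clusters")
```

Note that the dense matrix grows quadratically: at one million vectors, a float64 matrix would need about 8 TB, so this only works as-is for smaller collections.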

Thanks in advance!

Cory_Pinecone · March 17, 2023, 3:42pm

Hi @lnegoita,

No, nothing to share yet. We’re also working on ways to do what you mention with the custom distance metric, but more around setting minimum and maximum similarity score values. So you could bracket your results (e.g., give me all vectors between .7 and .8 confidence). Would that do what you’re looking for?

amar · May 1, 2023, 5:08pm

Would you mind answering the original question please? I have the same clustering use case, and haven’t seen any updates on this, so it would be easier if I just load all the vectors and do it in Python.

ed11 · July 13, 2023, 5:04am

Hi @Cory_Pinecone,
Is there any update on this functionality? I’ll explain my use case in a bit more detail in case there is another way to do it that I’m missing…

I have about 90k titles of different meetings that I want to tag (e.g. “Basketball with friends” → Social). At this stage I don’t have a predefined list of tags, so instead I want to index the vectors for all 90k titles and then look for groups of similar meeting titles. Once I have these, I will feed them back into the LLM and ask it for a good tag that describes all the similar titles.
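
A small sketch of the grouping step, assuming each title already has a cluster label from some clustering run. The titles, the labels, and the helper name `group_titles` are made up for illustration; `-1` follows the DBSCAN convention for unclustered points.

```python
from collections import defaultdict


def group_titles(titles, labels, noise_label=-1):
    """Collect titles by cluster label, skipping points marked as noise."""
    groups = defaultdict(list)
    for title, label in zip(titles, labels):
        if label != noise_label:
            groups[label].append(title)
    return dict(groups)


# Hypothetical titles and cluster labels, for illustration only.
titles = [
    "Basketball with friends",
    "Dinner with Sam",
    "Quarterly budget review",
    "Sprint planning",
    "Dentist",
]
labels = [0, 0, 1, 1, -1]

groups = group_titles(titles, labels)
for label, members in groups.items():
    # Each group could be pasted into an LLM prompt that asks for one tag.
    print(label, members)
```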

Does that make sense?

Thanks,
Ed.

jensen · September 18, 2023, 9:38pm

Any update on this? @Cory_Pinecone

Cory_Pinecone · October 8, 2023, 4:41pm

No, sorry, no updates. Please keep an eye on our release notes to know when this feature is available.

ayansengupta17 · January 27, 2024, 4:31am

Hi, I was wondering if Pinecone is still working on this. Will there be a release, or is this feature not a priority right now?

Thanks in advance