Understanding Vector Index Usage and Embedding Clustering in Pinecone

hmzakhalid · January 18, 2024, 6:42pm

Hello everyone,

I’m delving into the details of indexing and clustering mechanisms. I have a couple of questions and I’m seeking insights from the community.

Identifying the Active Vector Index:

I understand that Pinecone utilizes various vector indexes for efficient data retrieval. Could someone explain how I can determine which specific vector index is being actively used in my database setup?
Are there specific commands or parts of the Pinecone dashboard that provide this information?

Understanding Embedding Clusters:

In the context of these indexes, embeddings or data chunks are clustered for optimized search. How can I identify which embeddings belong to a specific cluster?
Is there a way to visualize these clusters or extract information about the distribution of embeddings across different clusters?

ayansengupta17 · January 27, 2024, 4:33am

Following this question.
I would also like to understand how the embeddings are clustered. If this is a proprietary method, just mentioning that would also help.

Cory_Pinecone · January 29, 2024, 11:02pm

Hi @hmzakhalid. I think there may be a misunderstanding based on the terms being used. Pinecone indexes are distinct services themselves; they’re not an index inside a database like you would have in an RDBMS. So you’ll know which index you’re using because you can only run queries against a single index at a time. And indexes don’t share data; they stand alone.

You can see all of your indexes in the Projects page of the console.

We don’t expose the clustering details in your index. You simply connect to the index and, after upserting your data, run your queries using a given vector. We’ll then return the vectors closest to it, up to the limit of what you’ve set for your top_k on that query. It’s designed to be simple to use and easy to get started.
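To make the meaning of top_k concrete, here is a minimal pure-Python sketch of an exhaustive cosine-similarity search. It only illustrates what "the top_k closest vectors" means; it is not how Pinecone computes results internally, and the names `cosine_similarity`, `brute_force_top_k`, and the toy vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_top_k(query, vectors, top_k):
    """Score every stored vector against the query and keep the best top_k.

    `vectors` maps a (hypothetical) vector ID to its embedding.
    """
    scored = [(vec_id, cosine_similarity(query, vec))
              for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy data, for illustration only.
vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
results = brute_force_top_k([1.0, 0.0], vectors, top_k=2)
print(results)
```

With this toy data the two best matches are "a" and "b", in that order. An exhaustive scan like this costs one similarity computation per stored vector for every query.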

hmzakhalid · January 30, 2024, 2:22pm

Hey @Cory_Pinecone, thanks for the reply.
My question wasn’t about Pinecone’s index service. Rather, I was asking about the underlying clusters that are formed over our vectors.

Say I have around 2 million vectors in a Pinecone index. When I run a query on it, it surely doesn’t compute cosine similarity against all 2 million vectors, does it? That would be painfully slow and resource-intensive.

So, I wanted to know what kind of clustering Pinecone performs on those vectors to speed up the search.
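To illustrate the kind of thing I mean, here is a minimal sketch of a generic inverted-file (IVF) style search: vectors are grouped under their nearest centroid, and a query only scans the clusters whose centroids are closest to it. This is purely an assumption about how a clustered index could work, not Pinecone's actual method, and the centroids are hand-picked here rather than learned (for example with k-means). All names (`assign_to_clusters`, `ivf_search`, `nprobe`) are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def assign_to_clusters(vectors, centroids):
    """Put each vector ID into the bucket of its most similar centroid."""
    clusters = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = max(range(len(centroids)),
                      key=lambda i: cosine(vec, centroids[i]))
        clusters[nearest].append(vec_id)
    return clusters

def ivf_search(query, vectors, centroids, clusters, top_k, nprobe=1):
    """Scan only the nprobe clusters whose centroids best match the query.

    Returns the top_k (id, score) pairs and how many vectors were scanned.
    """
    ranked = sorted(range(len(centroids)),
                    key=lambda i: cosine(query, centroids[i]),
                    reverse=True)
    candidates = [vid for i in ranked[:nprobe] for vid in clusters[i]]
    scored = sorted(((vid, cosine(query, vectors[vid])) for vid in candidates),
                    key=lambda pair: pair[1], reverse=True)
    return scored[:top_k], len(candidates)

# Toy data and hand-picked centroids, for illustration only.
vectors = {
    "a": [1.0, 0.0], "b": [0.9, 0.1],
    "c": [0.0, 1.0], "d": [0.1, 0.9],
    "e": [-1.0, 0.0], "f": [-0.9, -0.1],
}
centroids = [[1.0, 0.05], [0.05, 1.0], [-1.0, -0.05]]

clusters = assign_to_clusters(vectors, centroids)
hits, scanned = ivf_search([1.0, 0.0], vectors, centroids, clusters,
                           top_k=2, nprobe=1)
print(hits, scanned)
```

In this toy example the query scans only 2 of the 6 vectors and still returns "a" and "b". The trade-off is that results are approximate: a true nearest neighbor sitting in a cluster that was not probed would be missed, which is why raising `nprobe` trades speed for recall.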