Returning a list of IDs

This is a basic feature for any database… :face_with_diagonal_mouth:

What is the recommended approach to avoid inserting duplicate data in the database?
As I can’t get the IDs, how would I check if a document already exists in the database?
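One common pattern for this (a sketch only; the index name and dimension are placeholders): derive the vector ID deterministically from the document content, so re-upserting the same document overwrites it rather than duplicating it, and use fetch() on that ID to check existence.

```python
import hashlib

import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENV")
index = pinecone.Index("my-index")  # placeholder index name

def doc_id(text: str) -> str:
    # Deterministic ID: identical content always maps to the same ID,
    # so upserting it twice overwrites instead of duplicating.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def exists(text: str) -> bool:
    # fetch() takes exact IDs; unknown IDs are simply absent from the response.
    _id = doc_id(text)
    return _id in index.fetch(ids=[_id]).vectors

text, embedding = "some document", [0.1] * 1536  # placeholder dimension
if not exists(text):
    index.upsert(vectors=[(doc_id(text), embedding, {"text": text})])
```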

3 Likes

Hey! I tried this approach but updating 1.3M vectors takes forever. Do you have some clever approach for this?

Hi! In our case, we have a complex, dynamic knowledge base that changes on a daily basis. I need to perform regular syncs to Pinecone, and to do that I need to know which vectors to add, which to delete, and which to modify. In other words, I need the whole list of vector IDs.

3 Likes

All you would need to do is create an attribute for each index called ids_list, and update that every time a vector is added to the DB.
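A minimal sketch of that bookkeeping, assuming a local JSON file as the side store (the filename is illustrative; any durable store works):

```python
import json
import os

IDS_FILE = "ids_list.json"  # illustrative side store, one per index

def load_ids() -> set:
    if os.path.exists(IDS_FILE):
        with open(IDS_FILE) as f:
            return set(json.load(f))
    return set()

def save_ids(ids: set) -> None:
    with open(IDS_FILE, "w") as f:
        json.dump(sorted(ids), f)

def tracked_upsert(index, vectors) -> None:
    # Record every ID we write, since the index itself won't enumerate them.
    ids = load_ids()
    index.upsert(vectors=vectors)
    ids.update(v[0] for v in vectors)  # v is an (id, values, metadata) tuple
    save_ids(ids)
```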

As it is, I have 20k vectors in my Pinecone DB and I can only visualize half of them.

+1 to this use case; we need to add/update/delete from a source of truth DB to pinecone, and we need to know everything in pinecone to do that

4 Likes

My use case presents a similar challenge. Since querying requires a vector (as far as I know, I cannot query by ID) and fetching requires an exact ID, the only solution I can envisage as of today is to create a parallel ID database, which is crazy rework IMHO and will affect app performance… At a minimum, I would like to be able to fetch by ID using wildcards (yes, this would actually help in my specific use case). But as of now this is a bit of a showstopper.

I’m relatively new to Python and the world of LLM application development, transitioning from a background in .NET and SQL where I crafted business applications, so I apologize in advance for any beginner oversights. I am using UnstructuredAPIFileLoader (langchain.document_loaders.unstructured.UnstructuredAPIFileLoader — 🦜🔗 LangChain 0.0.266) from my own server locally, so you will have to change that part. I intend to adjust the code to give users the choice of retaining the file. Given that I’m processing numerous files in batches, I’d prefer no disruptions; files that aren’t incorporated will be set aside for future examination. GitHub - LarryStewart2022/pinecone_Index: Multiple file upload duplicate solution

I mirror everyone’s sentiment here: how can you refresh your vector DB without the ability to compare what you have against what’s new and what’s changed? Yes, the upsert function should handle that automatically, but how do you use it with, say, the GPTVectorStoreIndex or VectorStoreIndex that is creating the embeddings in the first place (in my case, transformed with Hugging Face to produce even better embeddings)? Can someone post a way to do this? It must be possible for an existing index in Pinecone, surely. If not, how can this even be a solution, and what alternative would you use instead, Weaviate or ChromaDB? It’s kind of the same problem with all vector indexes anyway, no? I do have a unique ID stored in my vectors’ metadata, so it should be possible to pull back all the documents in my vector DB using that metadata, but what a palaver, and how would I still compare the vector IDs to handle changed data from, say, a LlamaIndex data loader?
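On the "pull back documents via metadata" part, one thing that does work today (a sketch; the field name, value, and dimension are placeholders) is to query with a near-zero dummy vector and let a metadata filter do the selection:

```python
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENV")
index = pinecone.Index("my-index")  # placeholder index name

# "doc_id" stands in for the unique ID stored in the vector metadata.
res = index.query(
    vector=[1e-8] * 1536,  # dummy query vector; the filter does the real work
    top_k=100,
    include_metadata=True,
    filter={"doc_id": {"$eq": "document-42"}},  # placeholder value
)
for match in res.matches:
    print(match.id, match.metadata)
```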

Hi, all! Audrey from Pinecone here :slight_smile:

This is a feature we’ve talked about a lot internally. Since many of our users have an extremely high number of vectors in production, we are figuring out the best way to build it so that it both scales (in compute and cost) and is easy for users to adopt.

But fear not, we hear you & it’s definitely on our roadmap. We will update this thread as soon as it’s available!

3 Likes

I have done this for namespaces between 200K and 750K vectors without issues, in order to “replicate” the index locally on alternative vector stores. This is the script that is used in the project.

It does what people in this thread are recommending: query with an empty vector and a tracking ID that is updated each loop, so I can “walk” the entire database. From there, once it’s in the tool, I can migrate the data wherever I want to test locally, or just replicate it without messing up production data on Pinecone.

For reference, here is a simplified sketch of how I currently do this; the real script is in the project, and the `migrated` flag name, index name, and dimension below are placeholders.
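```python
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENV")
index = pinecone.Index("my-index")    # placeholder; must support namespaces
NAMESPACE = "production"              # placeholder namespace
DIM = 1536                            # placeholder dimension

all_vectors = {}
while True:
    # Near-zero query vector: we only care about pulling back vectors we
    # haven't flagged yet, not about similarity. (All-zero vectors are
    # rejected on cosine indexes, hence the epsilon.)
    res = index.query(
        vector=[1e-8] * DIM,
        top_k=1000,
        namespace=NAMESPACE,
        include_values=True,
        include_metadata=True,
        # Skip vectors already walked; relies on $ne also matching vectors
        # that don't have the flag yet.
        filter={"migrated": {"$ne": True}},
    )
    if not res.matches:
        break  # nothing unflagged left: the whole namespace has been walked
    for m in res.matches:
        all_vectors[m.id] = (m.values, m.metadata)
        # Flag the vector so the next query pass excludes it.
        index.update(id=m.id, set_metadata={"migrated": True}, namespace=NAMESPACE)

print(f"walked {len(all_vectors)} vectors")
```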

Caveat: you need a namespace-enabled index to use this script, because the delay between updates on the free starter tier can take 30+ seconds to save. Additionally, I have not tested this over 750K vectors! Your mileage may vary.

1 Like

My use case is that I use namespaces, and I want to duplicate a namespace to a new namespace (with a new name).

The GitHub project I linked can “clone” namespaces or entire indexes from Pinecone into the same index under a new name.

You’ll still need to “walk” the entire vectorspace in that specific namespace, but it works all the same. The way to do it via code is actually the exact same as the snippet above.

You basically need to:

1. Walk the vectors you want to copy and store them locally.
2. Loop over the stored vectors and upsert them into Pinecone with a new ID and the new namespace.

Voila, that’s it!
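A minimal sketch of those two steps, reusing `index` and the `all_vectors` dict collected by the walk snippet above (the target namespace is a placeholder):

```python
NEW_NAMESPACE = "staging-copy"  # placeholder target namespace

batch = []
for vec_id, (values, metadata) in all_vectors.items():
    # Reuse the same ID, or mint a new one here if you need fresh IDs.
    batch.append((vec_id, values, metadata))
    if len(batch) == 100:  # upsert in batches to stay under request limits
        index.upsert(vectors=batch, namespace=NEW_NAMESPACE)
        batch = []
if batch:
    index.upsert(vectors=batch, namespace=NEW_NAMESPACE)
```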

That way you aren’t paying for embedding twice or anything. As far as I know, there is no way to make a collection snapshot of a namespace or simply clone/duplicate a namespace from the SDK, so this is what we do in the meantime.

3 Likes

Thanks Tim, I’m using Vector Admin to clone a namespace/workspace right now! It’s a bit slow for 6K vectors, but perhaps Pinecone will offer a richer API. Great work on Vector Admin!

2 Likes