Sub-index upsert and search

Hello!

I am wondering if there is a way to create sub-indexes. The reason I want this is that I would like to have small collections of embeddings that can be referenced together, because they are all embeddings of the same document. If a document changes, I want to be able to re-embed it, which may produce a different number of vectors if it grows or shrinks, and upsert them all at once as the new collection of embeddings.

Is this doable? And can it be done efficiently with Pinecone?

I’d also like this to work so that when I run a cosine-similarity query with some other embedding against my whole index, it can select individual embeddings from different sub-indexes, just as it would if they were all grouped together.
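One way to sketch the re-embed-and-replace pattern described above is to give every vector a deterministic ID of the form `<doc_id>#<chunk_number>` and carry the document ID as metadata, so a document's old vectors can be dropped as a unit before the new ones are upserted. This is only an illustration of the bookkeeping: a plain dict stands in for the Pinecone index, and `embed` is a hypothetical embedding function (with Pinecone itself, a metadata delete such as `index.delete(filter={"doc_id": ...})` followed by `index.upsert(...)` would play the same roles).

```python
# Sketch: keep all chunks of one document replaceable as a unit by
# giving each vector a deterministic ID of the form "<doc_id>#<chunk_no>".
# A plain dict stands in for the Pinecone index here.

index = {}  # vector_id -> (embedding, metadata)

def embed(chunk: str) -> list:
    # Hypothetical embedding function; a real model goes here.
    return [float(len(chunk))]

def upsert_document(doc_id: str, chunks: list) -> None:
    # 1) Drop every old vector for this document (the count may differ
    #    after re-embedding, so plain per-ID upserts are not enough).
    stale = [vid for vid, (_, meta) in index.items() if meta["doc_id"] == doc_id]
    for vid in stale:
        del index[vid]
    # 2) Insert the fresh embeddings, one per chunk.
    for i, chunk in enumerate(chunks):
        index[f"{doc_id}#{i}"] = (embed(chunk), {"doc_id": doc_id})

upsert_document("doc-1", ["alpha", "beta", "gamma"])
upsert_document("doc-1", ["alpha", "beta"])  # document shrank on re-embed
print(sorted(index))                         # ['doc-1#0', 'doc-1#1']
```

Because every vector lives in one shared index, a similarity query still ranks chunks from all documents together; the metadata only groups them for replacement.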

I see that I can maybe use namespaces. But is there an issue with creating many namespaces? And it does not look like there is a way for me to run a single query against multiple namespaces at once.

You may want to consider keeping a parallel database in MySQL or similar that tracks the embedding IDs and links them to a master document ID. Then when you query Pinecone, you get the ID of the matching embedding and can figure out which document it came from with a separate lookup. Unless I am misunderstanding your question.
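That bookkeeping can be sketched with SQLite from the Python standard library standing in for MySQL; the table and column names here are my own invention, not anything Pinecone prescribes.

```python
import sqlite3

# Sketch: a small relational table linking each embedding ID to its
# master document ID, so a Pinecone match can be resolved to a document.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (vector_id TEXT PRIMARY KEY, doc_id TEXT)")
conn.executemany(
    "INSERT INTO embeddings VALUES (?, ?)",
    [("vec-1", "doc-A"), ("vec-2", "doc-A"), ("vec-3", "doc-B")],
)

def doc_for_vector(vector_id: str):
    # Given an ID returned by a Pinecone query, look up its document.
    row = conn.execute(
        "SELECT doc_id FROM embeddings WHERE vector_id = ?", (vector_id,)
    ).fetchone()
    return row[0] if row else None

print(doc_for_vector("vec-2"))  # doc-A
```

Re-embedding a document then means deleting its rows here, deleting the corresponding IDs in Pinecone, and inserting the new set on both sides.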

Hey richd, I have a question about this. I just started using Pinecone and am learning about vector databases in general. I have a collection of structured documents that I need to query, and I think a LangChain SQL agent would be the solution there. However, I would also like to allow the user to ask questions about the particular document that was returned.

That’s the part I’m having difficulty wrapping my head around. I think I would have to chunk all the documents in the database and embed each chunk. But then what happens when the user wants to ask questions about one specific document? The line of thinking I’m exploring is that chunking each document and attaching its unique ID from the other database as metadata might work.
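That line of thinking can be sketched like this: turn one SQL-stored document into Pinecone-style vector records, carrying the document's SQL primary key in each record's metadata. `fake_embed` is a stand-in for a real embedding model, and the chunker is deliberately naive; the record shape (`id`/`values`/`metadata` keys) follows the Pinecone upsert format, but the field names inside `metadata` are assumptions.

```python
# Sketch: one SQL document -> a list of vector records whose metadata
# carries the document's SQL primary key, so queries can be filtered
# down to a single document later.

def fake_embed(text: str) -> list:
    # Stand-in for a real embedding model.
    return [float(len(text))]

def chunk_text(text: str, size: int = 20) -> list:
    # Naive fixed-size chunking; real splitters respect sentence bounds.
    return [text[i:i + size] for i in range(0, len(text), size)]

def to_vector_records(sql_doc_id: int, text: str) -> list:
    return [
        {
            "id": f"{sql_doc_id}#{i}",
            "values": fake_embed(chunk),
            "metadata": {"sql_doc_id": sql_doc_id, "chunk": i},
        }
        for i, chunk in enumerate(chunk_text(text))
    ]

records = to_vector_records(42, "A structured document with several fields.")
print([r["id"] for r in records])  # ['42#0', '42#1', '42#2']
```

With the records shaped this way, a metadata filter on `sql_doc_id` would restrict a similarity query to the chunks of one document, while leaving the filter off searches across all of them.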

There are around 23,000 documents, and they are structured. I was wondering whether something like this is viable or cost-effective, or whether you have come up with better strategies since March. I’m not sure if syncing the vector ID and the SQL ID would do it.