Changing Metadata on Chunked Document (multiple IDs w/ same metadata, but only have access to 1 ID)

I am wondering if it’s possible (or how one might recommend) to change the metadata for a given vector that’s part of a larger document that’s been chunked.

I see this documentation: Manage data.

But it does not explain what happens if I have chunked a large document with metadata and only have access to one ID – I’d like to be able to change the metadata for that ID and have the change applied to all the chunked vectors that refer to the same document with the same metadata.

To clarify:

  1. I’ve taken a large PDF and chunked, embedded, and added it to the db with metadata
  2. Each chunk has the same metadata
  3. Now I’ve made a query and received one vector that has a unique ID and metadata identical to the other embedded chunks
  4. I would like to be able to edit that vector’s metadata and have all the other vectors from the same original PDF document updated with the new metadata as well

If Pinecone does not have a solution for this, I would appreciate any recommendation on how to accomplish it. Thank you very much!


Hi @xanderklein10

Generally, Pinecone keeps its API design simple and robust, which is why, for example, we do not have a bulk update operation. That said, your question is interesting and I would love to learn more.

Solution:
Given that you have a single chunk from a document, you can do the following:

  1. First, the chunk ID and document ID should be related; we recommend giving chunks IDs of the form {document_id}_{chunk_id}, where chunk_id is a running integer (1, 2, …, n).
  2. In the metadata of each chunk vector, add a field document_chunk_count and set it to n.
  3. Given that you queried a vector with include_metadata=True, you can now generate a list of all the chunk IDs by doing something like:
doc_id = '_'.join(vector_id.split('_')[:-1])
ids = [f"{doc_id}_{i+1}" for i in range(document_chunk_count)]
  4. You can now iterate over each ID and update it (a fuller sketch follows below):
for i in ids:
    index.update(id=i, set_metadata=new_metadata)
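
Putting the pieces together, a minimal end-to-end sketch might look like the following (the index name, chunk count, and new metadata values are placeholders, and I am assuming the v2 Python client):

# Sketch: bulk-update metadata for every chunk of a document, given one chunk ID.
# Assumes chunk IDs follow {document_id}_{chunk_id} and that each chunk's metadata
# carries document_chunk_count. Index name and new metadata are placeholders.
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("new")

# Suppose a query with include_metadata=True returned this match:
vector_id = "mydoc_3"     # e.g. chunk 3 of document "mydoc"
chunk_count = 10          # read from match.metadata["document_chunk_count"]

# Recover the parent document ID and enumerate all sibling chunk IDs.
doc_id = '_'.join(vector_id.split('_')[:-1])
ids = [f"{doc_id}_{i+1}" for i in range(chunk_count)]

# Apply the same metadata change to every chunk.
new_metadata = {"status": "updated"}    # placeholder change
for chunk_id in ids:
    index.update(id=chunk_id, set_metadata=new_metadata)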

Just out of curiosity – why do you wish to bulk update metadata for all the chunks? Can you elaborate a bit more?

best,

Roy

Roy, thank you very much for your response. It’s very helpful and interesting to see how this could be done.

The reason I would like to do this is that, over time, the metadata for a retrieved chunk might need to be updated based on real-time data – but since I only have access to one chunk ID, I need to be able to update all of the chunks with the new metadata.

Regarding your method of chunking the document with the document ID ({document_id}_{chunk_id}, where chunk_id is a running integer (1, 2, …, n)), I am wondering how this can be done. Currently, I do not think the retrieved IDs carry any info about the parent document. This is how I am setting up my db:

Process the JSON into the embeddable content vs the metadata and put it into Document format.

docs = []
with open('content_with_metadata.json', 'r', encoding='utf-8-sig') as jsonfile:
    data = json.load(jsonfile)
    for entry in data:
        to_metadata = {col: entry[col] for col in columns_to_metadata if col in entry}
        values_to_embed = {k: entry[k] for k in columns_to_embed if k in entry}
        to_embed = "\n".join(
            f"{k.strip()}: {' '.join(map(str, v)).strip() if isinstance(v, list) else v.strip()}"
            for k, v in values_to_embed.items()
        )
        newDoc = Document(page_content=to_embed, metadata=to_metadata)
        docs.append(newDoc)

Split the document using Character splitting.

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
documents = splitter.split_documents(docs)

Initialize the embeddings model

embeddings_model = OpenAIEmbeddings()

Generate embeddings from documents and store in Pinecone

db = Pinecone.from_documents(documents, embeddings_model, index_name="new")
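
To make the question concrete, here is a rough sketch of what I imagine assigning deterministic {document_id}_{chunk_id} IDs might look like with the native Pinecone client instead of Pinecone.from_documents (the document_id metadata field, credentials, and index name are all placeholders):

# Sketch: native Pinecone upsert with deterministic {document_id}_{chunk_id} IDs.
# Assumes each chunk's metadata carries a hypothetical "document_id" field that
# ties chunks from the same PDF together.
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("new")

counters = {}           # running chunk number per parent document
vectors = []
for doc in documents:   # the chunks produced by the splitter above
    doc_id = doc.metadata["document_id"]
    counters[doc_id] = counters.get(doc_id, 0) + 1
    chunk_id = f"{doc_id}_{counters[doc_id]}"
    embedding = embeddings_model.embed_query(doc.page_content)
    vectors.append((chunk_id, embedding, {**doc.metadata, "text": doc.page_content}))

# Stamp each chunk with the per-document total, per Roy's suggestion.
for chunk_id, embedding, metadata in vectors:
    metadata["document_chunk_count"] = counters[metadata["document_id"]]

index.upsert(vectors=vectors, batch_size=100)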

Please let me know if you have any advice! Thank you.

Just another thought/idea – I could probably edit the metadata for each chunk that refers to the same parent doc by writing a query to the db that includes all of the metadata from the single chunked vector I have. That way, the query would retrieve all vectors with identical metadata, which I could then iterate over and update with the desired changes (see the sketch below). I don’t know whether this is the best approach or whether it scales, but it could work…
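
Roughly, I am picturing something like this (the "source" field, vector dimension, and new values are placeholders, and retrieved_metadata stands for the metadata of the single chunk I retrieved; as far as I know, Pinecone caps top_k at 10,000, so a very large number of chunks would need a different strategy):

# Sketch: find all sibling chunks via a metadata filter, then update each one.
# Assumes a hypothetical metadata field "source" uniquely identifies the parent PDF.
import pinecone

index = pinecone.Index("new")

results = index.query(
    vector=[0.0] * 1536,   # dummy query vector (1536 dims for OpenAI ada-002)
    top_k=10000,           # Pinecone's maximum top_k
    filter={"source": {"$eq": retrieved_metadata["source"]}},
)
for match in results.matches:
    index.update(id=match.id, set_metadata={"status": "updated"})  # placeholder change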

And maybe I am not setting up my db in the best way (I’m using LangChain, but I would prefer to just use Pinecone native code if possible). I would appreciate any advice on that too!

I am simply trying to chunk PDFs and add the doc’s metadata to each chunk to be embedded. This way, I can do a similarity search based on content, but the metadata in the retrieved vector will include all of the important information I want to use/display in the frontend.


I think this is a case where maybe you’re relying too much on loading non-vector data into Pinecone vs a more traditional data store.

Your use case
If I read this correctly, you’re taking a PDF document and splitting it up into, say, 10 chunks. Each of those 10 chunks is vectorized and stored in Pinecone. In addition, you’re taking a set of metadata about that PDF (for example, the name of the document, its updated time, its source location, etc.) and associating that set of metadata with all 10 chunks (the same metadata for all 10 chunks). Is that right?

Metadata
The purpose of the metadata is primarily to allow you to filter search results, and it can also be used to store associative information alongside a vector to be used at retrieval time. The primary purpose of storing data in Pinecone at all is so that you can perform vector semantic search to locate a relevant piece of data.

Alternative
So rather than store the actual metadata in Pinecone, what if you stored a metadata_id identifier in the Pinecone metadata that points to a record in another database, like DynamoDB or MongoDB, which would contain the metadata for each document?

When you run the vector search and get back a search result, you can look up the metadata in the other database using the metadata_id.

When you need to manage/update the metadata, you can easily do that independently of its vector content in Pinecone.
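
For illustration, the retrieval flow might look something like this (the MongoDB connection string, collection, and field names are all placeholders, and query_embedding stands for whatever query vector you're searching with):

# Sketch: vector search in Pinecone, full metadata lookup in MongoDB.
# Collection, field names, and connection string are illustrative only.
import pinecone
from pymongo import MongoClient

index = pinecone.Index("new")
metadata_coll = MongoClient("mongodb://localhost:27017")["mydb"]["doc_metadata"]

results = index.query(vector=query_embedding, top_k=5, include_metadata=True)
for match in results.matches:
    meta_id = match.metadata["metadata_id"]
    doc_meta = metadata_coll.find_one({"_id": meta_id})  # full metadata lives here

# Updating the shared metadata is now a single write, independent of Pinecone:
metadata_coll.update_one({"_id": meta_id}, {"$set": {"status": "reviewed"}})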


@Silas This sounds like a fantastic approach. I love the idea of separating the data types, and it will also give me the ability to retrieve the actual PDF parent document from Mongo with the associated metadata.

The one issue I see with it is that as we build out this application, we may end up wanting to search/filter by metadata.

Yep, makes sense. You can still include the metadata that you want to filter on in Pinecone.

You can keep a back-reference in your metadata table that points to the Pinecone ID (or list of IDs) so that, when the primary metadata changes, the relevant filter metadata can be updated in Pinecone as well (see the sketch below).
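
Continuing the sketch above, that back-reference might look like this (the pinecone_ids field and the category value are hypothetical):

# Sketch: when the primary metadata changes, push the filterable fields back
# into Pinecone via a stored list of chunk IDs (hypothetical "pinecone_ids" field).
record = metadata_coll.find_one({"_id": meta_id})
metadata_coll.update_one({"_id": meta_id}, {"$set": {"category": "finance"}})
for pinecone_id in record["pinecone_ids"]:
    index.update(id=pinecone_id, set_metadata={"category": "finance"})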

The main point is that there is no reason you have to limit yourself to only a vector DB; in many cases you can save money by keeping non-vector data outside of Pinecone and limiting your Pinecone usage to just what enables your semantic search, and nothing else. That’s what it’s built for anyway.

Good luck, and happy to talk further if needed.
Silas @ninetack.io
