Roy, thank you very much for your response. It’s very helpful and interesting to see how this could be done.
The reason I would like to do this is that, over time, the metadata for a retrieved chunk might need to be updated based on real-time data. Since I only have access to one chunk ID at that point, I need a way to update all of the chunks from the same parent document with the new metadata.
Regarding your method of chunking the document with the document ID ({document_id}_{chunk_id}, where chunk_id is a running integer (1, 2, …, n)), I am wondering how this can be done. Currently, I do not think the retrieved IDs carry any information about the parent document. This is how I am setting up my db:
Process the JSON into the embeddable content vs the metadata and put it into Document format.
docs = []
with open('content_with_metadata.json', 'r', encoding='utf-8-sig') as jsonfile:
    data = json.load(jsonfile)
    for entry in data:
        # Columns kept as metadata vs. columns whose text gets embedded
        to_metadata = {col: entry[col] for col in columns_to_metadata if col in entry}
        values_to_embed = {k: entry[k] for k in columns_to_embed if k in entry}
        # One "key: value" line per column; list values are joined with spaces
        to_embed = "\n".join(
            f"{k.strip()}: {' '.join(map(str, v)).strip() if isinstance(v, list) else str(v).strip()}"
            for k, v in values_to_embed.items()
        )
        newDoc = Document(page_content=to_embed, metadata=to_metadata)
        docs.append(newDoc)
Split the document using Character splitting.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
documents = splitter.split_documents(docs)
Initialize the embeddings model
embeddings_model = OpenAIEmbeddings()
Generate embeddings from documents and store in Pinecone
db = Pinecone.from_documents(documents, embeddings_model, index_name="new")
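For reference, here is a minimal sketch of how {document_id}_{chunk_id} IDs could be generated for the chunks produced above. It assumes each parent Document carries a unique identifier in its metadata under a key I am calling "document_id" (a hypothetical name, not something in my current data); since the splitter copies the parent's metadata onto every chunk, that key would then be available on each chunk. The helper works on plain metadata dicts, so it does not depend on LangChain.

```python
from collections import defaultdict


def build_chunk_ids(chunk_metadatas, id_key="document_id"):
    """Return one "{document_id}_{chunk_id}" string per chunk.

    chunk_metadatas: list of metadata dicts, one per chunk, in split order.
    id_key: hypothetical metadata key holding the parent document's ID.
    """
    counters = defaultdict(int)  # running chunk number per parent document
    chunk_ids = []
    for metadata in chunk_metadatas:
        parent_id = metadata[id_key]
        counters[parent_id] += 1  # chunk numbering starts at 1
        chunk_ids.append(f"{parent_id}_{counters[parent_id]}")
    return chunk_ids


example = [
    {"document_id": "doc7"},
    {"document_id": "doc7"},
    {"document_id": "doc9"},
]
print(build_chunk_ids(example))  # ['doc7_1', 'doc7_2', 'doc9_1']
```

With the setup above, the input would be [d.metadata for d in documents]. If I read the LangChain wrapper correctly, the resulting list can be passed as ids=... when building the index, but that is an assumption worth checking against the installed version.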
Please let me know if you have any advice! Thank you.
Just another thought: I could probably update the metadata for every chunk that belongs to the same parent doc by querying the db with a filter built from all of the metadata on the single chunk vector I have. That query would retrieve all vectors with identical metadata, which I can then iterate over and update with the desired changes. I don't know if this is the best approach or whether it is scalable, but it could work…
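A rough sketch of that idea, under a few assumptions: the filter uses Pinecone's "$eq" operator on every scalar metadata field (list-valued fields are skipped here, since exact equality on lists is not something I am sure the filter syntax supports), and the index argument is assumed to be a Pinecone index object exposing update(id=..., set_metadata=...). The function names are made up for illustration.

```python
SCALAR_TYPES = (str, int, float, bool)


def metadata_to_filter(metadata):
    """Build a filter dict requiring equality on every scalar metadata field.

    Non-scalar values (e.g. lists) are skipped; this is a simplification.
    """
    return {
        key: {"$eq": value}
        for key, value in metadata.items()
        if isinstance(value, SCALAR_TYPES)
    }


def update_matches(index, match_ids, new_metadata):
    """Apply new_metadata to every vector ID returned by the filtered query.

    index is assumed to behave like a Pinecone index (one update call per ID).
    Returns the number of vectors updated.
    """
    for vector_id in match_ids:
        index.update(id=vector_id, set_metadata=new_metadata)
    return len(match_ids)


chunk_metadata = {"title": "Report A", "year": 2023, "tags": ["x", "y"]}
print(metadata_to_filter(chunk_metadata))
# {'title': {'$eq': 'Report A'}, 'year': {'$eq': 2023}}
```

The filter would go into the filter argument of a query, and the IDs of the returned matches into update_matches. Two caveats: any two parent documents with identical metadata would both be matched, and a query returns at most top_k results, so a document with many chunks might need more than one pass. A dedicated parent-ID field in the metadata would make the filter both simpler and unambiguous.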