I am trying to batch upsert text chunks and have Pinecone generate the embeddings server-side with its integrated embedding (my index uses pinecone-sparse-english-v0). My text is already chunked; below is my chunking code:
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import time
from typing import List, Dict, Any
def chunk_pdf(pdf_path: str, document_name: str) -> List[Dict[str, Any]]:
    created_at = int(time.time() * 1000)
    try:
        reader = PdfReader(pdf_path)
        text = " ".join(page.extract_text() or "" for page in reader.pages)
    except Exception as e:
        raise Exception(f"Error loading PDF: {str(e)}") from e
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=600,
        chunk_overlap=25,
    )
    chunks = splitter.split_text(text)
    formatted_chunks = []
    for i, chunk in enumerate(chunks, 1):
        # was document_doi, which is undefined here; use the document_name parameter
        chunk_id = f"{document_name}#chunk{i}"
        formatted_chunks.append({
            "_id": chunk_id,
            "chunk_text": chunk.strip(),
            "document_name": document_name,
            "chunk_number": i,
            "created_at": created_at,
        })
    return formatted_chunks
I then want to batch upsert these chunks into the index. I'm confused about the format Pinecone expects for the batches, since the example in the docs uses precomputed embeddings. Do I simply create my own batches of 96 chunks, pass each one to upsert_records, and run as normal?
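For reference, here is a sketch of what I have in mind, assuming the record dicts produced by chunk_pdf above are what upsert_records accepts and that 96 records is the per-request cap for text upserts with integrated embedding (the helper names make_batches and upsert_chunks, the namespace, and the index handle are my own placeholders):

```python
from typing import Any, Dict, List

def make_batches(records: List[Dict[str, Any]],
                 batch_size: int = 96) -> List[List[Dict[str, Any]]]:
    """Split records into sublists of at most batch_size items
    (96 assumed as the per-request text-upsert limit)."""
    return [records[i:i + batch_size]
            for i in range(0, len(records), batch_size)]

def upsert_chunks(index, records: List[Dict[str, Any]],
                  namespace: str = "default") -> None:
    """index is assumed to be a Pinecone Index whose embedding field
    was mapped to "chunk_text" at creation; each record carries "_id"
    plus the text field, and Pinecone embeds server-side."""
    for batch in make_batches(records):
        index.upsert_records(namespace, batch)
```

So the call would just be upsert_chunks(index, chunk_pdf("paper.pdf", "paper")) and each batch stays under the limit.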