Upsert text in batches using integrated embedding

I am trying to batch upsert text chunks and have Pinecone embed them with its integrated embedding. My index uses the pinecone-sparse-english-v0 model, and my text is already chunked. Below is my chunking code:

from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import time
from typing import List, Dict, Any

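def chunk_pdf(pdf_path: str, document_name: str) -> List[Dict[str, Any]]:
    """Split a PDF into text chunks formatted as records for upsert_records."""
    # Millisecond timestamp shared by every chunk from this document
    created_at = int(time.time() * 1000)
    try:
        reader = PdfReader(pdf_path)
        # Fall back to an empty string for pages that yield no text
        text = " ".join(page.extract_text() or "" for page in reader.pages)
    except Exception as e:
        raise Exception(f"Error loading PDF: {e}") from e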
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=600,
        chunk_overlap=25,
    )
    
    chunks = splitter.split_text(text)
    
    formatted_chunks = []
    for i, chunk in enumerate(chunks, 1):
        chunk_id = f"{document_name}#chunk{i}"
        formatted_chunks.append({
            "_id": chunk_id,
            "chunk_text": chunk.strip(),
            "document_name": document_name,
            "chunk_number": i,
            "created_at": created_at
        })
    
    return formatted_chunks

I then want to batch upsert these chunks into the index. I’m confused about the format Pinecone expects for the batches, since the example uses embeddings. Do I simply create my own batches of 96 chunks, call upsert_records once per batch, and run as normal?

Hi there! With integrated embedding indexes, you can use the index.upsert_records function described here. Each upsert_records call accepts a batch of up to 96 records, as shown below. Preparing your own batches of up to 96 records and passing each one to upsert_records should embed and upsert the records into your index.

index.upsert_records(
    "example-namespace",
    [
        {
            "_id": "rec1",
            "chunk_text": "Apples are a great source of dietary fiber, which supports digestion and helps maintain a healthy gut.",
            "category": "digestive system", 
        },
        # ... up to 96 records
    ],
)
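
For the batching step itself, here is a minimal sketch of one way to split a list of chunk records and send them one batch at a time. The helper names batched and upsert_in_batches are hypothetical and not part of the Pinecone SDK; the only SDK call assumed is index.upsert_records(namespace, records) as used above, and the limit of 96 is the per-call figure mentioned in this reply.

```python
from typing import Any, Dict, Iterator, List

MAX_BATCH_SIZE = 96  # per-call record limit mentioned in this reply


def batched(
    records: List[Dict[str, Any]], size: int = MAX_BATCH_SIZE
) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of at most `size` records (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(records), size):
        yield records[start:start + size]


def upsert_in_batches(
    index: Any,
    namespace: str,
    records: List[Dict[str, Any]],
    size: int = MAX_BATCH_SIZE,
) -> int:
    """Call index.upsert_records once per batch and return the number of calls.

    `index` is assumed to be any object exposing upsert_records(namespace, records).
    """
    calls = 0
    for batch in batched(records, size):
        index.upsert_records(namespace, batch)
        calls += 1
    return calls
```

With the chunking function from the question, a call might look like upsert_in_batches(index, "example-namespace", chunk_pdf(pdf_path, document_name)). The sketch has no retry or rate-limit handling, so a failed call stops the loop at that batch.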

   

Please let us know if you have any more questions about upserting to indexes or Pinecone Inference models.

Adam, Engineering @ Pinecone
