Error during upsert with LangChain

I’ve been trying to get LangChain and Pinecone working together to upsert a large number of Document objects. Here’s the code:

from langchain_community.document_loaders import BSHTMLLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_pinecone import PineconeVectorStore
from langchain.docstore.document import Document
import sys

def load_upsert_html(globFile, index_name):

  embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

  for f in globFile:
    if f.name.split(".")[0].isnumeric():
      loader = BSHTMLLoader(f)
      document = loader.load()
      if sys.getsizeof(document[0].page_content) > 40000:
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
        split_html = text_splitter.split_documents(document)
        for h in split_html:
          document_split = [Document(page_content=h.page_content, metadata={"source": h.metadata["source"], "title": h.metadata["title"]})]
          vectorstore = PineconeVectorStore.from_documents(document_split, index_name=index_name, embedding=embeddings, batch_size=100, pool_threads=10)
      else:
        vectorstore = PineconeVectorStore.from_documents(document, index_name=index_name, embedding=embeddings, batch_size=100, pool_threads=10)

load_upsert_html(globFile=myglobfile, index_name=my_index)

Everything works great until I get to about 1,400 vector uploads. The glob mentioned in the function is a recursive glob of about 1,500 HTML files, each containing anywhere from a single five-sentence paragraph to 2-3 pages of full text.

At around 1,400 vectors uploaded, I get the following error:

RuntimeError                              Traceback (most recent call last)
/usr/lib/python3.10/multiprocessing/pool.py in __init__(self, processes, initializer, initargs, maxtasksperchild, context)
    214         try:
--> 215             self._repopulate_pool()
    216         except Exception:

19 frames
RuntimeError: can't start new thread

I’ve tried adjusting the batch size and thread pool settings in the LangChain upsert call, but I may just be misunderstanding how they work.

I tried this in a Google Colab Pro Jupyter notebook on a T4 runtime to rule out a resource issue, but I got the same error.

Any assistance would be greatly appreciated!

Hello @clyde.hunter1984 and thank you for posting.

The error message RuntimeError: can't start new thread is a low-level error: the Python runtime cannot create any more OS threads. This normally happens when a program exhausts a per-process system limit, such as the maximum number of threads or open file handles.
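If you want to confirm that runaway thread creation is the culprit, you can watch the live thread count as the upsert loop runs. A quick diagnostic sketch using only the standard library:

```python
import threading

def log_thread_count(label: str) -> int:
    """Print and return the number of threads currently alive in this process."""
    count = threading.active_count()
    print(f"{label}: {count} live threads")
    return count

# Called once per loop iteration, a count that keeps growing
# points at thread pools that are never shut down.
log_thread_count("before upsert")
```
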

Looking at the code, I see that inside the loop you’re creating a new vectorstore on every iteration, each with 10 pool threads. I suspect those vectorstores are never cleaned up and keep their connections (and file handles) open.

The correct approach is to create the vectorstore once, outside the loop:

vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings)

for f in globFile:
  if f.name.split(".")[0].isnumeric():
    loader = BSHTMLLoader(f)
    document = loader.load()
    if sys.getsizeof(document[0].page_content) > 40000:
      text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
      split_html = text_splitter.split_documents(document)
      for h in split_html:
        document_split = [Document(page_content=h.page_content, metadata={"source": h.metadata["source"], "title": h.metadata["title"]})]
        vectorstore.add_documents(document_split)
    else:
      vectorstore.add_documents(document)

I recommend reading our LangChain guide.

Hello @patrick1 and thank you for the response!

I wrote my previous post from my phone and was worried I hadn’t included enough information to get a response! Thank you for your guidance. I had tried instantiating the PineconeVectorStore class outside the loop in the past, and I don’t know why I decided to put it inside the loop, lol. I’ll move it outside the loop and see if that’s the issue.

Edit / Update 1: I’m running it right now. I’ll update shortly.

One thing I noticed in your response is that you used: PineconeVectorStore(index_name=index_name, embedding=embeddings).

In the past I had used PineconeVectorStore(index=index_name, embedding=embeddings) and kept receiving errors. I think that caused me to rethink my code, and somehow I ended up putting the class instantiation inside the loop (dumb, I know).

I’ve now learned that using index=index_name vs index_name=index_name makes a big difference when instantiating the PineconeVectorStore class. I’m only posting this in case someone else has this issue in the future.

PineconeVectorStore(index_name=index_name, embedding=embeddings) allows for the class to take a string as an index name.

I assume PineconeVectorStore(index=index_name, embedding=embeddings) expects the index argument to be a Pinecone index object, such as pc.Index("indexName")?
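To illustrate the pattern for future readers, here’s a toy model of a constructor that accepts either keyword. This is my own sketch, not the actual langchain_pinecone code, but it shows why passing a plain string to index= fails while index_name= works:

```python
class FakeIndex:
    """Stand-in for a Pinecone index client object (like pc.Index("name"))."""
    def __init__(self, name: str):
        self.name = name

class FakeVectorStore:
    """Toy model of a constructor that accepts either an index object
    or an index name string, mirroring the two keyword arguments."""
    def __init__(self, index=None, index_name=None):
        if index is not None:
            if isinstance(index, str):
                raise TypeError("index= expects an index object, not a string")
            self.index = index
        elif index_name is not None:
            # The name is resolved to an index object internally.
            self.index = FakeIndex(index_name)
        else:
            raise ValueError("pass either index= or index_name=")

# index_name= takes a plain string
store = FakeVectorStore(index_name="myIndex")
print(store.index.name)  # myIndex

# index= requires an object; a bare string raises TypeError
try:
    FakeVectorStore(index="myIndex")
except TypeError as e:
    print(e)
```
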

Thanks again. I’ll post an update again shortly.

Edit / Update 2: Everything worked as intended and my issue is resolved!
I was able to upsert 4,184 vectors without an issue once I made that small change of shifting the instantiation of the PineconeVectorStore class out of the loop, lol.

Thanks again for the assistance.

For those interested in the future, here is the final code that allowed this to work. A couple of things to keep in mind:

  1. I was processing a large number of individual HTML files.
  2. Each HTML file had either 1-5 sentences of text or 2-3 pages of text; I wouldn’t know which until I processed them.
  3. I used LangChain for this project for AI agent development purposes, so I wanted to use it with Pinecone rather than the pure Pinecone client approach.

from langchain_community.document_loaders import BSHTMLLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_pinecone import PineconeVectorStore
from langchain.docstore.document import Document
from pathlib import Path
import sys

filePath = "/myPath/toDirectory"
htmlFiles = Path(filePath).rglob('*.html')
index_name = "myIndex"

def load_split_html(globFile):
  documents = []
  for f in globFile:
    if f.name.split(".")[0].isnumeric():
      loader = BSHTMLLoader(f)
      document = loader.load()
      # This ensures we don't get a metadata size error based on Pinecone's metadata size limit if the document is too big
      if sys.getsizeof(document[0].page_content) > 40000:
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
        split_html = text_splitter.split_documents(document)
        for h in split_html:
          document_split = [Document(page_content=h.page_content, metadata={"source": h.metadata["source"], "title": h.metadata["title"]})]
          documents.append(document_split)
      else:
        documents.append(document)
  return documents

# Get a list of documents to process
documents = load_split_html(globFile=htmlFiles)

# Instantiate the embeddings and vectorstore
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings)

# Upsert documents to the Pinecone index
for d in documents:
  vectorstore.add_documents(d)
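One caveat about the size check in the code above: sys.getsizeof on a string reports the Python object’s in-memory size (including interpreter overhead), not the encoded byte length that a service-side metadata limit is measured against. If you need a more precise check, something like this sketch is closer:

```python
import sys

text = "héllo world " * 100

# In-memory object size: includes CPython's per-object header and
# depends on the string's internal representation.
obj_size = sys.getsizeof(text)

# Encoded byte length: what a byte-based size limit actually counts.
byte_len = len(text.encode("utf-8"))

print(obj_size, byte_len)
```
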

I’m still learning, so there may be a more concise method / approach; however, I figured I’d share in case others are looking for a possible solution / example.
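If anyone wants a slightly more concise upsert step, one option (a sketch; the batch size of 100 is just an example, tune it for your workload) is to flatten the nested documents list and send fixed-size batches instead of one sub-list at a time:

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive lists of up to batch_size items from iterable."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch

# Flatten the list-of-lists produced by load_split_html, then
# upsert in batches of 100:
#   flat_docs = [doc for sublist in documents for doc in sublist]
#   for batch in batched(flat_docs, 100):
#       vectorstore.add_documents(batch)

# The helper itself is pure Python:
print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```
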

@clyde.hunter1984 Thank you for the update.

If you care about performance, you might want to consider using the GRPC version of Pinecone:

from pinecone.grpc import PineconeGRPC
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

embeddings = OpenAIEmbeddings()
pc = PineconeGRPC(api_key='XXXXXXXXXX')

index = pc.Index('test')
vectorstore = PineconeVectorStore(index=index, embedding=embeddings)

# async_req=False is important, as the async method differs between the GRPC and standard clients
vectorstore.add_texts(['hello pinecone index'], async_req=False)

Thank you for the tip! I was looking at possibly using that. I’ll post another thread shortly regarding that approach.

Thanks again!