Error during upsert with LangChain

I’ve been trying to get LangChain and Pinecone working together to upsert a large number of Document objects. Here’s the code:

from langchain_community.document_loaders import BSHTMLLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_pinecone import PineconeVectorStore
from langchain.docstore.document import Document
import sys

def load_upsert_html(globFile, index_name):

  embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

  for f in globFile:
    if f.name.split(".")[0].isnumeric():
      loader = BSHTMLLoader(f)
      document = loader.load()
      if sys.getsizeof(document[0].page_content) > 40000:
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
        split_html = text_splitter.split_documents(document)
        for h in split_html:
          document_split = [Document(page_content=h.page_content, metadata={"source": h.metadata["source"], "title": h.metadata["title"]})]
          vectorstore = PineconeVectorStore.from_documents(document_split, index_name=index_name, embedding=embeddings, batch_size=100, pool_threads=10)
      else:
        vectorstore = PineconeVectorStore.from_documents(document, index_name=index_name, embedding=embeddings, batch_size=100, pool_threads=10)

load_upsert_html(globFile=myglobfile, index_name=my_index)

Everything works great until I get to about 1,400 vector uploads. The glob mentioned in the function is a recursive glob of about 1,500 HTML files, each containing anywhere from a single five-sentence paragraph to 2-3 pages of full text.

At around 1,400 vectors uploaded, I get the following error:

RuntimeError                              Traceback (most recent call last)
/usr/lib/python3.10/multiprocessing/pool.py in __init__(self, processes, initializer, initargs, maxtasksperchild, context)
    214         try:
--> 215             self._repopulate_pool()
    216         except Exception:

19 frames
RuntimeError: can't start new thread

I’ve tried adjusting the batch size and thread pool settings in the LangChain upsert call, but I may just be misunderstanding how they work.

I tried this in a Google Colab Pro Jupyter notebook on a T4 runtime to rule out a resource issue, but I got the same error.

Any assistance would be greatly appreciated!

Hello @clyde.hunter1984 and thank you for posting.

The error message RuntimeError: can't start new thread is a low-level error: the Python runtime cannot create any more OS threads. This normally happens when a program exhausts a per-process system limit, such as the maximum number of threads or open file handles.
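If you want to confirm that runaway thread creation is the culprit, you can watch the live thread count as the upsert loop runs. A quick diagnostic sketch using only the standard library:

```python
import threading

def log_thread_count(label: str) -> int:
    """Print and return the number of threads currently alive in this process."""
    count = threading.active_count()
    print(f"{label}: {count} live threads")
    return count

# Called once per loop iteration, a count that keeps growing
# points at thread pools that are never shut down.
log_thread_count("before upsert")
```
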

Looking at the code, I see that inside the loop you’re creating a new vectorstore on every iteration, each with 10 pool threads. I suspect those vectorstores are never cleaned up and keep their connections (and file handles) open.

The correct approach is to create the vectorstore once, outside the loop:

vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings)

for f in globFile:
  if f.name.split(".")[0].isnumeric():
    loader = BSHTMLLoader(f)
    document = loader.load()
    if sys.getsizeof(document[0].page_content) > 40000:
      text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
      split_html = text_splitter.split_documents(document)
      for h in split_html:
        document_split = [Document(page_content=h.page_content, metadata={"source": h.metadata["source"], "title": h.metadata["title"]})]
        vectorstore.add_documents(document_split)
    else:
      vectorstore.add_documents(document)

I recommend reading our LangChain guide.

Hello @patrick1 and thank you for the response!

I wrote my previous post from my phone and was worried I hadn’t included enough information to get a response! Thank you for your guidance. I had tried instantiating the PineconeVectorStore class outside the loop in the past, and I don’t know why I decided to put it inside the loop, lol. I’ll move it outside the loop and see if that’s the issue.

Edit / Update 1: I’m running it right now. I’ll update shortly.

One thing I noticed in your response is that you used: PineconeVectorStore(index_name=index_name, embedding=embeddings).

In the past I had used PineconeVectorStore(index=index_name, embedding=embeddings) and kept receiving errors. I think that caused me to rethink my code, and somehow I ended up putting the class instantiation inside the loop (dumb, I know).

I’ve now learned that using index=index_name vs index_name=index_name makes a big difference when instantiating the PineconeVectorStore class. I’m only posting this in case someone else has this issue in the future.

PineconeVectorStore(index_name=index_name, embedding=embeddings) allows for the class to take a string as an index name.

I assume PineconeVectorStore(index=index_name, embedding=embeddings) expects the index argument to be a Pinecone index object, such as pc.Index("indexName")?
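To illustrate the pattern for future readers, here’s a toy model of a constructor that accepts either keyword. This is my own sketch, not the actual langchain_pinecone code, but it shows why passing a plain string to index= fails while index_name= works:

```python
class FakeIndex:
    """Stand-in for a Pinecone index client object (like pc.Index("name"))."""
    def __init__(self, name: str):
        self.name = name

class FakeVectorStore:
    """Toy model of a constructor that accepts either an index object
    or an index name string, mirroring the two keyword arguments."""
    def __init__(self, index=None, index_name=None):
        if index is not None:
            if isinstance(index, str):
                raise TypeError("index= expects an index object, not a string")
            self.index = index
        elif index_name is not None:
            # The name is resolved to an index object internally.
            self.index = FakeIndex(index_name)
        else:
            raise ValueError("pass either index= or index_name=")

# index_name= takes a plain string
store = FakeVectorStore(index_name="myIndex")
print(store.index.name)  # myIndex

# index= requires an object; a bare string raises TypeError
try:
    FakeVectorStore(index="myIndex")
except TypeError as e:
    print(e)
```
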

Thanks again. I’ll post an update again shortly.

Edit / Update 2: Everything worked as intended and my issue is resolved!
I was able to upsert 4,184 vectors without an issue once I made that small change of shifting the instantiation of the PineconeVectorStore class out of the loop, lol.

Thanks again for the assistance.

For those interested in the future, here is the final code that allowed this to work. A couple of things to keep in mind:

  1. I was processing a large number of individual HTML files.
  2. Each HTML file had either 1-5 sentences of text or 2-3 pages of text; I wouldn’t know which until I processed them.
  3. I used LangChain for this project for AI agent development purposes, so I wanted to use it with Pinecone rather than the pure Pinecone client approach.

from langchain_community.document_loaders import BSHTMLLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_pinecone import PineconeVectorStore
from langchain.docstore.document import Document
from pathlib import Path
import sys

filePath = "/myPath/toDirectory"
htmlFiles = Path(filePath).rglob('*.html')
index_name = "myIndex"

def load_split_html(globFile):
  documents = []
  for f in globFile:
    if f.name.split(".")[0].isnumeric():
      loader = BSHTMLLoader(f)
      document = loader.load()
      # This ensures we don't get a metadata size error based on Pinecone's metadata size limit if the document is too big
      if sys.getsizeof(document[0].page_content) > 40000:
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
        split_html = text_splitter.split_documents(document)
        for h in split_html:
          document_split = [Document(page_content=h.page_content, metadata={"source": h.metadata["source"], "title": h.metadata["title"]})]
          documents.append(document_split)
      else:
        documents.append(document)
  return documents

# Get a list of documents to process
documents = load_split_html(globFile=htmlFiles)

# Instantiate the embeddings and vectorstore
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings)

# Upsert documents to the Pinecone index
for d in documents:
  vectorstore.add_documents(d)
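One caveat about the size check in the code above: sys.getsizeof on a string reports the Python object’s in-memory size (including interpreter overhead), not the encoded byte length that a service-side metadata limit is measured against. If you need a more precise check, something like this sketch is closer:

```python
import sys

text = "héllo world " * 100

# In-memory object size: includes CPython's per-object header and
# depends on the string's internal representation.
obj_size = sys.getsizeof(text)

# Encoded byte length: what a byte-based size limit actually counts.
byte_len = len(text.encode("utf-8"))

print(obj_size, byte_len)
```
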

I’m still learning, so there may be a more concise method / approach; however, I figured I’d share in case others are looking for a possible solution / example.
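If anyone wants a slightly more concise upsert step, one option (a sketch; the batch size of 100 is just an example, tune it for your workload) is to flatten the nested documents list and send fixed-size batches instead of one sub-list at a time:

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive lists of up to batch_size items from iterable."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch

# Flatten the list-of-lists produced by load_split_html, then
# upsert in batches of 100:
#   flat_docs = [doc for sublist in documents for doc in sublist]
#   for batch in batched(flat_docs, 100):
#       vectorstore.add_documents(batch)

# The helper itself is pure Python:
print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```
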

@clyde.hunter1984 Thank you for the update.

If you care about performance, you might want to consider using the GRPC version of Pinecone:

from pinecone.grpc import PineconeGRPC
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

embeddings = OpenAIEmbeddings()
pc = PineconeGRPC(api_key='XXXXXXXXXX')

index = pc.Index('test')
vectorstore = PineconeVectorStore(index=index, embedding=embeddings)

# async_req=False is important, as the async method differs between the GRPC and standard clients
vectorstore.add_texts(['hello pinecone index'], async_req=False)

Thank you for the tip! I was looking at possibly using that. I’ll post another thread shortly regarding that approach.

Thanks again!