I’ve been trying to figure out how to get LangChain and Pinecone working together to upsert a large number of Document objects. Here’s the code:
from langchain_community.document_loaders import BSHTMLLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_pinecone import PineconeVectorStore
from langchain.docstore.document import Document
import sys

def load_upsert_html(globFile, index_name):
    embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
    for f in globFile:
        # Only process files whose name is numeric, e.g. "1234.html"
        if f.name.split(".")[0].isnumeric():
            loader = BSHTMLLoader(f)
            document = loader.load()
            # Split large pages into chunks before embedding
            if sys.getsizeof(document[0].page_content) > 40000:
                text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
                split_html = text_splitter.split_documents(document)
                # One from_documents call per chunk
                for h in split_html:
                    document_split = [Document(page_content=h.page_content,
                                               metadata={"source": h.metadata["source"],
                                                         "title": h.metadata["title"]})]
                    vectorstore = PineconeVectorStore.from_documents(
                        document_split, index_name=index_name,
                        embedding=embeddings, batch_size=100, pool_threads=10)
            else:
                vectorstore = PineconeVectorStore.from_documents(
                    document, index_name=index_name,
                    embedding=embeddings, batch_size=100, pool_threads=10)

load_upsert_html(globFile=myglobfile, index_name=my_index)
Everything works great until I get to about 1,400 vector uploads. The glob passed to the function is a recursive glob of about 1,500 HTML files, each containing anywhere from a single five-sentence paragraph to 2-3 pages of full text.
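For context, myglobfile comes from a recursive pathlib glob, roughly like this (the directory name here is just a placeholder):

    from pathlib import Path

    # Recursively collect the ~1,500 HTML files; each item is a Path,
    # which is why the function checks f.name
    myglobfile = Path("my_html_dir").rglob("*.html")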
At around 1,400 vectors uploaded, I get the following error:
RuntimeError Traceback (most recent call last)
/usr/lib/python3.10/multiprocessing/pool.py in __init__(self, processes, initializer, initargs, maxtasksperchild, context)
214 try:
--> 215 self._repopulate_pool()
216 except Exception:
... (19 frames hidden) ...
RuntimeError: can't start new thread
I’ve tried adjusting the batch_size and pool_threads arguments in LangChain’s Pinecone upsert call, but I may just be misunderstanding how they work.
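For example, one variation drops both values back toward what I believe are the library defaults (32 and 4; the exact numbers here are just one combination I tried):

    # Smaller batches and fewer upsert threads per call
    vectorstore = PineconeVectorStore.from_documents(
        document_split,
        index_name=index_name,
        embedding=embeddings,
        batch_size=32,
        pool_threads=4,
    )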
I also tried running this in a Google Colab Pro notebook on a T4 runtime to rule out a resource issue, but I get the same error.
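In case it helps with reproducing this, thread growth can be watched from inside the loop with something like the following (illustrative only, not part of my original run):

    import threading

    # Log the live thread count after each upsert to see whether
    # threads accumulate across repeated from_documents calls
    print(f"live threads: {threading.active_count()}")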
Any assistance would be greatly appreciated!