I am trying to split multiple PDFs into chunks and upsert them into a Pinecone index, so that I can load the PDFs once and reuse the index for similarity search instead of reloading every PDF on each application restart.
The "Ask a Book" example is the reference I used to start building.
I load my PDF with
loader = UnstructuredPDFLoader(filename)
data = loader.load()
and then split the text
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(data)
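For intuition, chunk_size and chunk_overlap behave roughly like fixed-size character windows over the text. A simplified sketch of that idea (my own toy function, not the real splitter, which also respects separators like paragraphs and sentences):

```python
def naive_split(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Toy character splitter: fixed windows, stepping forward by
    chunk_size - chunk_overlap each time."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 2500 characters with chunk_size=1000, chunk_overlap=0
# -> three chunks of lengths 1000, 1000, and 500.
chunks = naive_split("a" * 2500, chunk_size=1000, chunk_overlap=0)
```

With chunk_overlap > 0, the tail of each chunk repeats at the head of the next, which is why overlapping chunks can help a retriever catch sentences that straddle a boundary.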
But when I upsert the Documents, I receive an error saying my Documents cannot be interpreted as vectors (a step not covered by the example):
pinecone.init(
    api_key=pinecone_api_key,
    environment=pinecone_api_env,
)
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536)
index: Index = pinecone.Index(index_name)
index.upsert(vectors=texts)
Error: ValueError: Invalid vector value passed: cannot interpret type <class 'langchain.schema.Document'>
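From the error, my understanding is that index.upsert wants (id, values, metadata) tuples with plain lists of floats, rather than Document objects. A toy sketch of that payload shape, where fake_embed is my stand-in for a real embedding model (the names and values here are mine, not from any library):

```python
def fake_embed(text: str, dim: int = 1536) -> list[float]:
    """Stand-in for a real embedding model: returns a fixed-size
    list of floats so the payload shape can be shown offline."""
    return [float(ord(c) % 7) for c in text[:dim]] + [0.0] * max(0, dim - len(text))

chunks = ["first chunk of PDF text", "second chunk of PDF text"]

# Each entry is (id, 1536-dim list of floats, metadata dict) --
# the shape index.upsert(vectors=...) accepts.
vectors = [
    (f"chunk-{i}", fake_embed(chunk), {"text": chunk})
    for i, chunk in enumerate(chunks)
]
```

Storing the original chunk text in the metadata dict is what lets a later similarity search return readable text rather than just IDs and scores.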
The question is: how do I take unstructured PDF data, parsed into langchain.schema.Document objects, convert it to vectors for upserting, and then later use this index for a similarity search, as in the following snippet?
llm = OpenAI(temperature=0, openai_api_key=openai_api_key)
chain = load_qa_chain(llm, chain_type="stuff")
docsearch = Pinecone.from_texts([text-loaded-from-pinecone-vector-db], embeddings, index_name=index_name)
docs = docsearch.similarity_search(query, include_metadata=True)
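For intuition about what the index does at query time, my understanding is that a similarity search reduces to ranking stored vectors by closeness to the query vector. A pure-Python sketch using cosine similarity (the three-dimensional vectors here are toy values of mine, not real embeddings):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product of a and b divided by
    the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "index": id -> stored vector.
store = {
    "doc-1": [1.0, 0.0, 0.0],
    "doc-2": [0.0, 1.0, 0.0],
    "doc-3": [0.7, 0.7, 0.0],
}
query = [1.0, 0.1, 0.0]

# Rank stored ids from most to least similar to the query.
ranked = sorted(store, key=lambda k: cosine(store[k], query), reverse=True)
```

The query text must be embedded with the same model as the stored chunks, otherwise the distances are meaningless.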
I also referenced Batching Upserts, but I think I’m missing a critical piece of the puzzle that neither article covers.
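For reference, my reading of the batching pattern is that it reduces to slicing the vector list before upserting each slice. A minimal sketch (the batch size of 100 is my assumption, not a documented limit):

```python
from typing import Iterator, Sequence

def batched(items: Sequence, batch_size: int = 100) -> Iterator[Sequence]:
    """Yield successive slices of items, batch_size at a time;
    the final slice may be shorter."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Usage against a live index would be:
#   for batch in batched(vectors):
#       index.upsert(vectors=batch)
# 250 items -> batches of 100, 100, and 50.
batches = list(batched(list(range(250)), batch_size=100))
```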