Saving RecursiveCharacterTextSplitter results to an Index for reuse in a Similarity Search

I am attempting to split multiple PDFs into chunks and upsert them into an index, so that I can load the PDFs once and reuse the index for similarity search without reloading them on every application restart.

Ask a book is the reference I used to start building.

I load my PDF with

loader = UnstructuredPDFLoader(filename)
data = loader.load()

and then split the text

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(data)
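For anyone unfamiliar with what the splitter produces: conceptually it cuts the text into overlapping windows. A simplified illustration in plain Python (this is not the actual RecursiveCharacterTextSplitter, which also tries to split on separators like paragraphs and sentences before falling back to raw character counts):

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-size chunking with overlap; step must stay positive."""
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

pages = "A" * 1500
chunks = chunk_text(pages, chunk_size=1000, chunk_overlap=0)
# yields two chunks: one of 1000 characters and one of 500
```

The real splitter returns Document objects (each with a `page_content` string plus metadata), not bare strings, which matters for the error below.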

But when I upsert the documents, I receive an error that they cannot be interpreted as vectors (a step not covered by the example):

pinecone.init(
  api_key=pinecone_api_key,
  environment=pinecone_api_env,
)

if index_name not in pinecone.list_indexes():
  pinecone.create_index(index_name, dimension=1536)

index: Index = pinecone.Index(index_name)
index.upsert(vectors=texts)

Error: ValueError: Invalid vector value passed: cannot interpret type <class 'langchain.schema.Document'>

The question is: how do I take unstructured PDF data, parsed into langchain.schema.Document objects, convert it to vectors for upserting, and then later use this index to query with a similarity search, as in the following snippet?

llm = OpenAI(temperature=0, openai_api_key=openai_api_key)
chain = load_qa_chain(llm, chain_type="stuff")
docsearch = Pinecone.from_texts([text-loaded-from-pinecone-vector-db], embeddings, index_name=index_name)
docs = docsearch.similarity_search(query, include_metadata=True)

I also referenced Batching Upserts, but I think I'm missing a critical piece of the puzzle that neither article covers.
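For reference, the batching pattern from that article boils down to slicing the payload into fixed-size groups before upserting. A minimal sketch, assuming `vectors` is a list of (id, embedding, metadata) tuples already in the shape `index.upsert()` expects:

```python
def batched(items, batch_size=100):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# usage sketch:
# for batch in batched(vectors, batch_size=100):
#     index.upsert(vectors=batch)
```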

Hi!

To use the vector database, you need to vectorize your PDF texts and upsert those vectors (create embeddings via the OpenAI API, BERT, Doc2Vec, etc.). You can store the original texts in the database as metadata.

When you run a semantic search, you will have to vectorize your questions or queries the same way you vectorized your split PDFs.
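The reason the same embedding model must be used on both sides is that similarity search just compares the query vector against the stored vectors, e.g. with cosine similarity. A pure-Python sketch of that comparison (Pinecone does this internally, at scale):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Vectors from two different embedding models live in unrelated spaces, so comparing them this way produces meaningless scores.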

Helpful refs: Semantic Search, OpenAI

As per the documentation, the vectors you upsert (via the API or a client library) should be arrays of floats.
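Concretely, each upserted vector is an id, a list of floats matching the index dimension (1536 for OpenAI's text-embedding-ada-002, per the `create_index` call above), and optional metadata carrying the chunk text. A sketch with placeholder embeddings (real ones would come from the embedding model):

```python
# Placeholder embeddings -- in practice these come from your embedding model.
vectors = [
    ("chunk-0", [0.1] * 1536, {"text": "first chunk of the PDF"}),
    ("chunk-1", [0.2] * 1536, {"text": "second chunk of the PDF"}),
]
# index.upsert(vectors=vectors)
```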

Hope this helps.


Thanks jasper, your explanation helped me wrap my head around what I was dealing with.

In the end, what solved my issue was getting a reference to a vector database from an existing index and relying on Pinecone. Instead of splitting into vectors and upserting them myself, I simply let Pinecone do that:

embeddings = OpenAIEmbeddings()
Pinecone.from_texts(
    [t.page_content for t in texts],
    embeddings,
    index_name=index_name
)

And then, in the same way, I am able to produce text-completion results against an existing vector db:

embeddings = OpenAIEmbeddings()
docsearch = Pinecone(index, embeddings.embed_query, 'text')
llm = OpenAI()
chain = load_qa_chain(llm, chain_type="stuff")
docs = docsearch.similarity_search(query=query)
chain.run(input_documents=docs, question=query)