I have an index in Pinecone named ‘rag’ that already holds embeddings generated from an SQL query. Now I need to add embeddings from PDFs stored in Dropbox to the same index.
Is it possible to add information from two different sources to the same index? I’ve tried using upsert but my data isn’t loading.
This is the original code without the upsert, just the Dropbox connection:
import os
import tempfile

import dropbox
# Assumption: PyPDFLoader comes from LangChain; the import path depends on the installed version.
from langchain.document_loaders import PyPDFLoader


def pdf_loader(self, dropbox_folder: str):
    """
    Embed PDF.
    1. Load PDF document text data
    2. Split into pages
    3. Embed each page
    4. Store in Pinecone
    Note: it's important to make sure that the "context" field that holds the document text
    in the metadata is not indexed. Currently you need to specify explicitly the fields you
    do want to index. For more information check out
    https://docs.pinecone.io/docs/manage-indexes#selective-metadata-indexing
    """
    self.initialize()
    # Dropbox connection
    access_token = ""
    dbx = dropbox.Dropbox(access_token)
    # Collect every PDF file in the folder
    pdf_files = []
    for entry in dbx.files_list_folder(dropbox_folder).entries:
        if isinstance(entry, dropbox.files.FileMetadata) and entry.name.endswith('.pdf'):
            pdf_files.append(entry)
    for i, pdf_file in enumerate(pdf_files, start=1):
        print(f'Loading PDF {i} of {len(pdf_files)}: {pdf_file.name}')
        # Download the file from Dropbox; the body is binary, so keep it as bytes
        _, response = dbx.files_download(pdf_file.path_display)
        # PyPDFLoader reads from a file path, so write the bytes to a temporary file
        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
            tmp.write(response.content)
            tmp_path = tmp.name
        try:
            # Process the PDF content: one document per page
            pages = PyPDFLoader(tmp_path).load()
        finally:
            os.remove(tmp_path)
        for page in pages:
            documents = self.text_splitter.create_documents([page.page_content])
            document_texts = [chunk.page_content for chunk in documents]
            embeddings = self.openai_embeddings.embed_documents(document_texts)
            self.vector_store.add_documents(documents=documents, embeddings=embeddings)
    print(f"Finished loading PDFs.\n{self.index_stats}")
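On the two-source question: a Pinecone upsert overwrites any existing vector that has the same ID, so records from the PDFs need IDs that cannot collide with the ones already written from the SQL data. Below is a minimal sketch of one way to build such records. The helper name `build_records`, the `source:document:chunk` ID scheme, and the `source` and `document` metadata fields are all hypothetical, not part of the code above. The embeddings must also have the same dimension as the existing index, otherwise the upsert is rejected.

```python
from typing import Dict, List, Sequence


def build_records(
    source: str,
    doc_name: str,
    chunks: Sequence[str],
    vectors: Sequence[Sequence[float]],
) -> List[Dict]:
    """Build upsert-ready records whose IDs are prefixed with their source.

    Hypothetical helper: the ID scheme and metadata keys are illustrative.
    """
    if len(chunks) != len(vectors):
        raise ValueError("each chunk needs exactly one embedding")
    return [
        {
            # Prefixing with the source keeps PDF IDs apart from SQL IDs
            "id": f"{source}:{doc_name}:{i}",
            "values": list(vector),
            # "context" holds the chunk text, as in the docstring above
            "metadata": {"source": source, "document": doc_name, "context": text},
        }
        for i, (text, vector) in enumerate(zip(chunks, vectors))
    ]


# Toy two-dimensional vectors for illustration only
records = build_records(
    "pdf", "report.pdf", ["page one", "page two"], [[0.1, 0.2], [0.3, 0.4]]
)
# Assumption: `index` is a Pinecone index handle for 'rag'
# index.upsert(vectors=records)
```

Tagging each record with its source also makes it possible to filter queries by source later, provided that field is included in the index's metadata configuration.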