Add new data to an existing index

garciafaciog · March 25, 2024, 7:03pm

I have an index in Pinecone named ‘rag’ that already has embeddings from an SQL query. Now I need to add embeddings from PDFs that I have stored in Dropbox, to the same index.

Is it possible to add information from two different sources to the same index? I’ve tried using upsert but my data isn’t loading.

This is the original code without the upsert, just the Dropbox connection:

def pdf_loader(self, dropbox_folder: str):
        """
        Embed PDF.
        1. Load PDF document text data
        2. Split into pages
        3. Embed each page
        4. Store in Pinecone

        Note: it's important to make sure that the "context" field that holds the document text
        in the metadata is not indexed. Currently you need to specify explicitly the fields you
        do want to index. For more information check out
        https://docs.pinecone.io/docs/manage-indexes#selective-metadata-indexing
        """
        self.initialize()
        # Dropbox connection
        access_token = ''
        dbx = dropbox.Dropbox(access_token)

        pdf_files = []
        for entry in dbx.files_list_folder(dropbox_folder).entries:
            if isinstance(entry, dropbox.files.FileMetadata) and entry.name.endswith('.pdf'):
                pdf_files.append(entry)

        for i, pdf_file in enumerate(pdf_files, start=1):
            print(f'Loading PDF {i} of {len(pdf_files)}: {pdf_file.name}')

            # Download the file from Dropbox
            _, response = dbx.files_download(pdf_file.path_display)
            pdf_content = response.content.decode('utf-8')

            # Process the PDF content
            loader = PyPDFLoader(file_content=pdf_content)
            docs = loader.load()
            for doc in docs:
                documents = self.text_splitter.create_documents([doc.page_content])
                document_texts = [doc.page_content for doc in documents]
                embeddings = self.openai_embeddings.embed_documents(document_texts)
                self.vector_store.add_documents(documents=documents, embeddings=embeddings)

        print("Finished loading PDFs. \n" + self.index_stats)
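
A note on the download step in the snippet above: a PDF is binary, so `response.content.decode('utf-8')` will usually fail or corrupt the data, and LangChain's `PyPDFLoader` expects a file path rather than a `file_content` argument, as far as I can tell. Below is a minimal sketch of one possible workaround, assuming it is acceptable to spool the downloaded bytes to a temporary file. The helper name `write_temp_pdf` is hypothetical and not part of Dropbox or LangChain.

```python
import os
import tempfile


def write_temp_pdf(pdf_bytes: bytes) -> str:
    """Write raw PDF bytes to a temporary .pdf file and return its path.

    Hypothetical helper: the caller is responsible for deleting the file.
    """
    # delete=False so the file survives the `with` block and can be reopened by path
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(pdf_bytes)
        return tmp.name


# Intended use inside the loop (not executed here; needs dropbox and langchain):
#   _, response = dbx.files_download(pdf_file.path_display)
#   path = write_temp_pdf(response.content)   # keep the raw bytes, no .decode()
#   try:
#       docs = PyPDFLoader(path).load()
#   finally:
#       os.remove(path)
```

Passing the raw bytes through a file path keeps the loader call in the form it is documented to accept. The `try`/`finally` in the usage comment is there so the temporary file is removed even if parsing fails.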

Cory_Pinecone · March 28, 2024, 2:28am

Are you using the same embedding model for each set of documents? If so, you should be able to store them in the same index. What’s the error you’re getting when you try to upsert?
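
To make the point about a shared embedding model concrete: a Pinecone index has a fixed dimension, so vectors from both sources need the same length, and upserting a record whose ID already exists overwrites the earlier one. Below is a minimal sketch, assuming plain Python lists as vectors. `build_records`, the `source` metadata field, and the ID prefix scheme are illustrative choices, not part of the Pinecone client.

```python
from typing import Dict, List


def build_records(source: str, texts: List[str], vectors: List[List[float]],
                  index_dimension: int) -> List[Dict]:
    """Pair texts with vectors, checking dimensions and prefixing IDs by source."""
    if len(texts) != len(vectors):
        raise ValueError("texts and vectors must have the same length")
    records = []
    for i, (text, vec) in enumerate(zip(texts, vectors)):
        if len(vec) != index_dimension:
            raise ValueError(
                f"{source}-{i}: vector has {len(vec)} dims, index expects {index_dimension}"
            )
        records.append({
            "id": f"{source}-{i}",  # prefix keeps SQL and PDF IDs from colliding
            "values": vec,
            "metadata": {"source": source, "context": text},
        })
    return records


# Toy vectors stand in for real embeddings (assumption: dimension 3 for brevity)
sql_records = build_records("sql", ["row text"], [[0.1, 0.2, 0.3]], index_dimension=3)
pdf_records = build_records("pdf", ["page text"], [[0.4, 0.5, 0.6]], index_dimension=3)
ids = [r["id"] for r in sql_records + pdf_records]
print(ids)
```

Dictionaries with `id`, `values`, and `metadata` keys follow the record shape the Pinecone client accepts for upsert, though it is worth confirming against the client version in use. If the PDF records reuse IDs already taken by the SQL records, the new data replaces the old rather than being added alongside it.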