Querying if a Document already exists in an index

Hi,
I am new to pinecone and LLMs so excuse the basic question.

I want to add PDFs to a “knowledge base” and then be able to query these documents. So far this works pretty well, but I only want to add the documents to Pinecone if they don’t already exist. I added the documents with a GUID and stored this in the metadata.

But my code always fails with AttributeError: 'dict' object has no attribute '_composed_schemas'. Could anyone provide a hint as to what the problem might be? I understand that I have to pass a dummy vector; I cannot just pass None or only the metadata.

Thanks for the help!

def pinecone_document_exists(guid,index_name):
    
    # initialize pinecone
    #pinecone.init(
    #    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    #    environment=PINECONE_API_ENV # next to api key in console
    #)

    index = pinecone.Index(index_name)
    st.info(index)

    # Build a metadata filter for this document's guid
    query = {"guid": guid}
    st.info(f"query: {query}")

    # A vector must also be passed; a zero vector of the embedding dimension will do
    results = index.query(
        vector=[0] * 1536,
        filter=[query],
        top_k=1,
        include_metadata=True,
    )    

    # Return True if the id was found, False otherwise
    if len(results) > 0:
        return True
    else:
        return False

Hi!

Can you provide the code you used for upserting the vectors? :slight_smile:

At first glance: did you add the guid to metadata so you can filter on it that way? If you also used the guid as the vector id when upserting (so you know which id you are actually searching for), try a fetch with that id first, just to see whether you get back some vectors. After that, continue with the query.
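
A minimal fetch check could look something like this (just a sketch, assuming the index is already initialized and the guid doubles as the vector id):

index = pinecone.Index(index_name)

# fetch() looks up vectors directly by id - no query vector needed
fetch_response = index.fetch(ids=[guid])
if fetch_response.vectors:
    print("A vector with that id exists")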

Also as per documentation:

query_response = index.query(
    namespace='example-namespace',
    top_k=10,
    ...
    filter={
        'genre': {'$in': ['comedy', 'documentary', 'drama']}
    }
)

the filter is an object, not an array. Maybe it’s just that :smiley:
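
So for your case the call would look something like this (same dummy-vector idea; the 1536 dimension is assumed to match your index):

results = index.query(
    vector=[0] * 1536,       # dummy vector matching the index dimension
    filter={"guid": guid},   # a dict, not a list
    top_k=1,
    include_metadata=True,
)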

Hope this helps

Hi Jasper,
Thanks for taking the time to reply. I am using this as a learning exercise, so I am learning Python at the same time and of course making many dumb errors. I am using LangChain to do the upsert. The objects do appear to be created in Pinecone; I can see them through the console. My first confusion is whether vector is mandatory or optional in the query, and if mandatory, how to specify it. I will check the query part as suggested.

metadata = {
    "guid": guid, "chunk_number": i+1, "filename": uploaded_file.name
}

docsearch = Pinecone.from_texts(
    [text.page_content], embeddings,
    index_name=index_name, namespace=namespace,
    metadatas=[metadata]
)

Here is a vector in pinecone with the metadata.

{
  "chunk_number": 11,
  "filename": "UrolFlux.pdf",
  "guid": "e9eb84e2-77df-583f-b9eb-436c1c66b762",
  "text": "8.   Zulassungsnummer\n\n47251.00.00\n\n9.\n\nDatum der Erteilung der Zulassung/Verlängerung der Zulassung\n\n06.06.2001\n\n10.  Stand der Information\n\nJuni 2014\n\n11.    Verkaufsabgrenzung\n\nApothekenpflichtig\n\n4 von 4"
}

Thanks for your help! Much appreciated.

Great! Learning through example is the way to go imo :smiley:

About vector being mandatory: based on the API documentation, one of vector or id is required when calling query().

The problem you might be facing is that you do not know what the id of your vector(s) is, since LangChain takes care of the upserting. So you will need a dummy vector to query with, and with the filter from your first code snippet (as a dict, not in an array) you should get some results.
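
Alternatively, if your LangChain version supports it, from_texts also accepts an optional ids argument, so you could set the ids yourself at upsert time and later fetch() them directly (the guid-plus-chunk-number scheme below is just a made-up example):

docsearch = Pinecone.from_texts(
    [text.page_content], embeddings,
    index_name=index_name, namespace=namespace,
    metadatas=[metadata],
    ids=[f"{guid}-{i}"],  # hypothetical id scheme: guid plus chunk number
)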

Thanks Jasper - it seems to be working now with `vector=[0] * 1536` passed. You were right about the array - duh!

I was trying to get away without a local database of ids, so this at least works.
Maybe it’s not the most efficient way, and maybe I will end up with some local persistence in my app - let’s see.

Thanks for the help - have a great day!


Hi, sorry, I have a follow-on question. I can now detect if the document exists, so I can avoid adding documents to the index again. However, I am struggling to then use the returned index object to query.

Here is the code that returns the index object if a document is already found. It seems to return a sensible response and an index object.

    # Checks whether the document is already in the Pinecone index and returns the index if it is.
    #@st.cache(suppress_st_warning=True)
    def pinecone_document_exists(guid,index_name,namespace):
        # initialize pinecone
        #pinecone.init(
        #    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
        #    environment=PINECONE_API_ENV # next to api key in console
        #)

        index = pinecone.Index(index_name)
        #st.info(index)

        # Build a metadata filter for this document's guid
        query = {"guid": guid}
        #st.info(f"filtering by: {query}")

        # A vector must also be passed; a zero vector of the embedding dimension will do
        try:
            query_response = index.query(
                vector=[0] * 1536,
                filter=query,
                top_k=1,
                include_metadata=True,
                namespace=namespace,
            )
        except Exception:
            # If the index is empty it has no namespace yet, so the query will
            # fail the first time around; return False in that case
            return False

        st.info(query_response) 

        matches_list = query_response['matches']  # Extracting the "matches" list
        if len(matches_list) == 0:
            st.info("The document was not found in PINECONE")
            return False
        else:
            st.info(f"The document was found in PINECONE {matches_list}")
            return index

However, when I try to use this index to do a similarity search in LangChain, I get the following error. The query code runs fine in the cases where I add the documents to Pinecone first. This is the error I get:

AttributeError: 'Index' object has no attribute 'similarity_search'
Traceback:

File "C:\Users\jonathan.sutcliffe\Anaconda3\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 556, in _run_script
    exec(code, module.__dict__)
File "C:\Users\jonathan.sutcliffe\app.py", line 595, in <module>
    start_chat(docsearch,namespace,guid)
File "C:\Users\jonathan.sutcliffe\app.py", line 424, in start_chat
    docs = docsearch.similarity_search(query,include_metadata=True, namespace=namespace,k=4, filter=metadata)

This is the search code that executes. As I say, it works after an upsert, but not when I try to reuse the index.

        docs = docsearch.similarity_search(query,include_metadata=True, namespace=namespace,k=4, filter=metadata)
        for qry in query_list:
            st.warning(f"Question: {qry}")
            # Find the matching documents for the qry  
            message_response = chain.run(input_documents=docs, question=qry,verbose=True)
            st.success(message_response)

        st.info("Got to end of function - chat")

where docsearch is the index object returned from the first function.

Below is the actual code that does the upsert; after this runs I can successfully query.

        # If the document is not in Pinecone then we need to chunk it and upsert the chunks into the Pinecone index
        if docsearch == False:
            # upload the document to pinecone and create the embeddings 
            # We didn't find the document so now we need to create chunks and  all the embeddings (vectors) and store these in PINECONE 
            st.warning("Didn't find the document so processing and adding to DB")
            with st.spinner("Chunking, creating embeddings and storing in Pinecone - will take a few minutes, normally done already in the data ingestion stage"):
                # The files are in the temporary location: fpath = r"C:\Users\jonathan.sutcliffe\AppData\Local\Temp\*.pdf"
                # LOAD THE TEMPORARY FILE 
                loader = UnstructuredPDFLoader(str(tfp))
                data = loader.load()
                character_count = str(len(data[0].page_content))
                print (character_count)

                if data:
                    st.info(f"There are {character_count} total characters in your document")
                else:
                    st.error("Didn't find any content in the PDF document")

                # CHUNK THE BIG FILE INTO SMALLER UNITS 
                # set the chunking parameters; note chunk_size is measured in characters here,
                # and roughly 1000 tokens seems to be the maximum with this model
                chsize=2000 # in characters - can't be too big
                chover=30 # sliding window overlap between chunks

                # Now break the PDF file into the chunks 
                text_splitter = RecursiveCharacterTextSplitter(
                    chunk_size=chsize, chunk_overlap=chover)

                # Now we have the individual texts
                texts = text_splitter.split_documents(data)
                st.info (f'After chunking we now have {len(texts)} documents')

                # UPLOAD ALL THE CHUNKS INTO THE VECTOR DATABASE 
                # Upload each page to Pinecone with the same metadata
                for i, text in enumerate(texts):
                    # Create some metadata so we can identify this document only from the pinecone index/namespace
                    # Include the page number
                    metadata = {
                        "guid": guid, "chunk_number": i+1, "filename": uploaded_file.name
                    }
        
                    # Upsert the documents into Vector DB and return the object to the index 
                    docsearch = Pinecone.from_texts(
                        [text.page_content], embeddings,
                        index_name=index_name, namespace=namespace,
                        metadatas=[metadata]
                    )
                    st.success(f'Uploaded sub-document chunk of vectors to PINECONE and added Metadata: {metadata}')

Sorry for the very long post, but I have been trying to fix this for a few hours now and I am going round in circles. Hopefully it’s an easy spot for a better programmer than me.

Thanks in advance !
Jono

Hi!

I checked your code and I think the problem is with this line (as the error says):

docs = docsearch.similarity_search(query,include_metadata=True, namespace=namespace,k=4, filter=metadata)

Here docsearch is a Pinecone Index object. I think you are mixing up the LangChain Pinecone wrapper and the actual pinecone library. The official Pinecone library's Index object has NO method named similarity_search.

Check how you create the docsearch object before the line above is called.
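
If you want a LangChain vectorstore over an index that already contains vectors (without upserting anything), something like this should work, assuming the same Pinecone wrapper and embeddings object you already use for from_texts:

from langchain.vectorstores import Pinecone

# Wrap the existing index in a LangChain vectorstore; this object
# DOES have similarity_search()
docsearch = Pinecone.from_existing_index(
    index_name=index_name,
    embedding=embeddings,
    namespace=namespace,
)
docs = docsearch.similarity_search(query, k=4, filter=metadata)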

Hope this helps!

PS. If you add some more code showing the part where you create the docsearch instance, plus your imports, that would help in seeing where the problem arises!

Jasper,
Ah, I think you’re right. Thanks so much for the hint; this will save me lots of tokens if it is the fix. I will let you know.

Cheers!
Jono

Jasper, thanks, this was the issue. I couldn’t quite figure out how to get the index in the correct form, so for now I just upserted a dummy document to get the index. A bit rubbish, but it keeps me moving until I figure out the better approach.


Hey, I need help. I have two files in one namespace in Pinecone, added with the help of LangChain and OpenAI embeddings. I want to extract the difference between their content and a summarization. How do I do that?

That’s not something for Pinecone. But I am not really sure what you mean by summarization? Do you mean the semantics of the document expressed in the embeddings, or a summary you have created?

Maybe you could use the new indexing feature of LangChain? This indexing automatically takes care of updating already-indexed documents instead of creating new vector documents.
See: Indexing | 🦜️🔗 Langchain
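
A rough sketch of that indexing API (assuming a LangChain vectorstore named docsearch and a list of Document objects named texts; the SQLite path and the "filename" metadata key are just examples):

from langchain.indexes import SQLRecordManager, index

# The record manager tracks what has already been written to the vectorstore
record_manager = SQLRecordManager(
    "pinecone/my-index",  # hypothetical namespace key for the record manager
    db_url="sqlite:///record_manager_cache.sql",
)
record_manager.create_schema()

# cleanup="incremental" updates changed documents instead of re-inserting them;
# source_id_key must name a metadata field present on your documents
index(texts, record_manager, docsearch, cleanup="incremental", source_id_key="filename")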