Pinecone with LangChain: Index Object has no attribute 'exists'

zinan.yang · November 29, 2023, 1:59am

Hi all,

I am new to Pinecone and learning as I go. I am building a PDF reader application with LangChain and Pinecone. In one section of my code, I split the PDFs that users upload into chunks and store them in Pinecone, but I only want to create new embeddings when a user uploads a new PDF. I keep getting this error: AttributeError: ‘Index’ object has no attribute ‘exists’.

Can someone please help me? Thank you!

Here is the section of my code that handles this:

# read PDF

if pdf is not None: 
    pdf_reader = PdfReader(pdf)
    
    # split document into chunks
    # can also use a text splitter: good for PDFs that do not contain charts and visuals
    sections = []
    for page in pdf_reader.pages:
        # Split the page text by paragraphs (assuming two newlines indicate a new paragraph)
        page_sections = page.extract_text().split('\n\n')
        sections.extend(page_sections)

    chunks = sections
    # st.write(chunks)

    ## embeddings
    # Set up Pinecone
    pinecone.init(api_key=pinecone_api_key, environment='gcp-starter')
    index_name = 'langchainresearch'
    if index_name not in pinecone.list_indexes():
        pinecone.create_index(index_name, dimension=1536, metric="cosine")  # Adjust the dimension as per your embeddings
    index = pinecone.Index(index_name)

    file_name = pdf.name[:-4]
    
    # Check if embeddings are already stored in Pinecone
    file_id = hash(file_name)
    if index.exists(id=file_id):
        # Fetch embeddings from Pinecone
        VectorStore = index.fetch(ids=[file_id])[file_id]
        st.write('Embeddings Loaded from Pinecone')
    else:
        # Compute embeddings
        embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
        VectorStore = FAISS.from_texts(chunks, embedding=embeddings)

        # Store embeddings in Pinecone
        vectors = VectorStore.get_all_vectors()
        index.upsert(vectors={(file_id, vectors)})
        st.write('Embeddings Computation Completed and Stored in Pinecone')
    
    # Create chat history
    # Pinecone Setup for Chat History
    chat_history_index_name = 'chat_history'
    if chat_history_index_name not in pinecone.list_indexes():
        pinecone.create_index(chat_history_index_name, dimension=1)  # Dimension is 1 as we're not storing vectors here
    chat_history_index = pinecone.Index(chat_history_index_name)

    # Create or Load Chat History from Pinecone
    if pdf:
        # Check if chat history exists in Pinecone
        if chat_history_index.exists(id=pdf.name):
            # Fetch chat history from Pinecone
            chat_history = chat_history_index.fetch(ids=[pdf.name])[pdf.name]
            st.write('Chat History Loaded from Pinecone')
    else:
        # Initialize empty chat history
        chat_history = []

Jasper · November 29, 2023, 8:44am

Hi @zinan.yang

exists() indeed isn’t part of the Pinecone Index API. Check the API documentation to see which methods are available. I would suggest you use fetch(), which returns the vector if it already exists (like you do in the next line). You can also query() with a vector of 0s and use a filter to see if at least one of your vectors for that filter exists.
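
As a minimal sketch of the fetch() approach: the helper names below are hypothetical, and the response handling assumes that fetch() returns either a dict or an object with a `vectors` mapping keyed by id, which is how the 2.x Python client behaves as far as I know. `index` is whatever `pinecone.Index(index_name)` gives you.

```python
from typing import Any


def fetched_vectors(response: Any) -> dict:
    """Return the id -> vector mapping from a fetch() response.

    Assumption: the response is a dict or an object exposing `.vectors`.
    """
    if isinstance(response, dict):
        return response.get("vectors") or {}
    return getattr(response, "vectors", None) or {}


def vector_exists(index: Any, vector_id: str) -> bool:
    """True if `vector_id` is stored in `index`, checked via fetch()."""
    response = index.fetch(ids=[vector_id])
    return vector_id in fetched_vectors(response)
```

Note that fetch() takes string ids, so an integer such as the result of hash(file_name) would need to be converted with str() first.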

I see you want to split the PDF into chunks and upload them to Pinecone. Your id here is problematic because many vectors would share one ID. This will not work, as they will overwrite each other. Put file_id in the metadata and create the ids with a GUID/UUID.
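
A rough sketch of that id scheme, using only the standard library. The function names and the "text" metadata key are made up for illustration. It also swaps hash() for a SHA-256 digest, because Python salts string hashes per process, so hash(file_name) can change between runs. The dict record shape (id / values / metadata) is an assumption about what your client version's upsert() accepts; (id, values, metadata) tuples are the older equivalent.

```python
import hashlib
import uuid
from typing import Dict, List, Sequence


def stable_file_id(file_name: str) -> str:
    """Deterministic id for a file name (same value on every run)."""
    return hashlib.sha256(file_name.encode("utf-8")).hexdigest()


def build_records(
    file_name: str,
    chunks: Sequence[str],
    embeddings: Sequence[Sequence[float]],
) -> List[Dict]:
    """Pair each chunk with its embedding under a unique vector id.

    Every record carries the same file_id in its metadata, so all vectors
    of one PDF can later be found with a metadata filter.
    """
    if len(chunks) != len(embeddings):
        raise ValueError("need exactly one embedding per chunk")
    file_id = stable_file_id(file_name)
    return [
        {
            "id": str(uuid.uuid4()),  # unique per chunk, so nothing is overwritten
            "values": list(vector),
            "metadata": {"file_id": file_id, "text": chunk},
        }
        for chunk, vector in zip(chunks, embeddings)
    ]
```

The resulting list would then be passed to something like index.upsert(vectors=records).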

After that, you can query with a metadata filter of { "file_id": file_id } and an all-zero vector and see if you get any results back (I would also suggest setting returnValues and returnMetadata to false; in the Python client these are include_values and include_metadata).

The short answer to your problem: ‘Index’ object has no attribute ‘exists’ means you are calling a method that does not exist. Use query() to check if vectors exist in your vector database.
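
The query-based check could look roughly like this. It is a sketch, not a definitive implementation: `file_is_indexed` is a hypothetical helper, `index` is assumed to be a `pinecone.Index`, each vector is assumed to carry a `file_id` metadata field, and the default dimension of 1536 matches the index created earlier in the thread.

```python
from typing import Any


def file_is_indexed(index: Any, file_id: str, dimension: int = 1536) -> bool:
    """True if at least one vector has metadata file_id == `file_id`.

    Queries with an all-zero vector because only the metadata filter
    matters here; values and metadata are not returned to keep the
    response small.
    """
    response = index.query(
        vector=[0.0] * dimension,
        top_k=1,
        filter={"file_id": {"$eq": file_id}},
        include_values=False,
        include_metadata=False,
    )
    # Assumption: the response is a dict or an object exposing `.matches`.
    if isinstance(response, dict):
        matches = response.get("matches") or []
    else:
        matches = getattr(response, "matches", None) or []
    return len(matches) > 0
```

If your index rejects an all-zero query vector, any vector of the right dimension should do, since the result is only used as a yes/no answer.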

Hope this helps

tim · November 29, 2023, 5:19pm

First, if you are storing chat histories per document, you should not be making a new index for each! You should use namespaces.

Second, some issues with the current code:

# Create chat history
# Pinecone Setup for Chat History
chat_history_index_name = 'chat_history'
if chat_history_index_name not in pinecone.list_indexes():
    pinecone.create_index(chat_history_index_name, dimension=1)  # Dimension is 1 as we're not storing vectors here
chat_history_index = pinecone.Index(chat_history_index_name)

The above snippet creates a new index and indexes are not instantly available once created. You need to check if the index is ready before doing anything with it.
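
One way to wait for readiness is a small polling loop. This is a generic sketch: `wait_until_ready` is a hypothetical helper that takes any zero-argument callable, so it does not depend on a particular client version. With the 2.x client used in this thread, the readiness check is commonly written as `pinecone.describe_index(name).status["ready"]`, but treat that as an assumption to verify against your installed version.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 120.0,
    poll_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll `is_ready` until it returns True or `timeout_s` has elapsed."""
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"index not ready after {timeout_s} seconds")
        sleep(poll_s)


# Example call right after create_index (2.x client, assumed):
# wait_until_ready(lambda: pinecone.describe_index(index_name).status["ready"])
```

The `sleep` parameter exists only so the loop can be exercised without real waiting.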

if chat_history_index.exists(id=pdf.name):
    # Fetch chat history from Pinecone
    chat_history = chat_history_index.fetch(ids=[pdf.name])[pdf.name]
    st.write('Chat History Loaded from Pinecone')

.exists is not a function. Indexes have namespaces, so what you really want to do is grab all namespaces in the index and check if the desired namespace exists in that list. In Python I believe the method is called list_namespaces (or something like that); in the 2.x client, describe_index_stats() also reports the namespaces in an index. Either way, you get a collection of names you can do a for...in on.
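
A minimal sketch of that namespace check, assuming `index.describe_index_stats()` returns a dict or an object with a `namespaces` mapping keyed by namespace name (the helper names are hypothetical):

```python
from typing import Any, List


def list_namespace_names(index: Any) -> List[str]:
    """Names of all namespaces reported by describe_index_stats()."""
    stats = index.describe_index_stats()
    if isinstance(stats, dict):
        namespaces = stats.get("namespaces") or {}
    else:
        namespaces = getattr(stats, "namespaces", None) or {}
    return sorted(namespaces)


def namespace_exists(index: Any, namespace: str) -> bool:
    """True if `namespace` shows up in the index statistics."""
    return namespace in list_namespace_names(index)
```

As far as I know, a namespace is created implicitly by the first upsert into it, so a name with no vectors yet will not appear in the stats.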

These are just a few of the problems that I see in this code currently. I think the use of namespaces would make your code much easier to reason about! Have one index with many namespaces, one per document or per chat!

Hope this helps!

zinan.yang · December 6, 2023, 3:37pm

Thank you Jasper. I’ll try what you suggested here and see if I can get it right! Sorry, I am really new to this topic and don’t have a strong coding background.

zinan.yang · December 6, 2023, 3:37pm

Thank you Tim! This really helps me clear things up. I’ll read the API documentation again and try what you suggested here. Many, many thanks!

Jasper · December 6, 2023, 6:54pm

Not a problem at all. That is what the community is here for. Let us know if you need any more help!

zinan.yang · December 12, 2023, 8:37pm

Hi Jasper,

Sorry to bother you, but I have another question regarding Pinecone with LangChain. I just want to keep everything simple this time and process a single PDF, but I get error messages like this:
docsearch = pinecone.from_documents(chunks, embeddings, index_name = index_name)
^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module ‘pinecone’ has no attribute ‘from_documents’

or:
docsearch = pinecone.from_texts([t.page_content for t in chunks], embeddings, index_name = index_name)
^^^^^^^^^^^^^^^^^^^
AttributeError: module ‘pinecone’ has no attribute ‘from_texts’

What did I do wrong here?
Below is my current code for that part:

# upload a PDF file

pdf = st.file_uploader("Please upload your PDF here", type='pdf')
# st.write(pdf)

# read PDF
if pdf is not None: 
    pdf_reader = PdfReader(pdf)
    
    # split document into chunks
    # can also use a text splitter: good for PDFs that do not contain charts and visuals
    sections = []
    for page in pdf_reader.pages:
        # Split the page text by paragraphs (assuming two newlines indicate a new paragraph)
        page_sections = page.extract_text().split('\n\n')
        sections.extend(page_sections)

    chunks = sections
    # st.write(chunks)

    ## embeddings
    # Set up Pinecone
    pinecone.init(api_key=pinecone_api_key, environment='gcp-starter')
    index_name = 'langchainresearch'
    if index_name not in pinecone.list_indexes():
        pinecone.create_index(index_name, dimension=1536, metric="cosine")  # Adjust the dimension as per your embeddings
    index = pinecone.Index(index_name)
    
    docsearch = pinecone.from_documents(chunks, embeddings, index_name = index_name)

Thank you so much!

Jasper · December 12, 2023, 10:24pm

Hi @zinan.yang

Here you are using the wrong Python library. The pinecone library from Pinecone has no from_documents method, but the Pinecone class you can import from LangChain does! So you need to import the right one (the LangChain one) and then call the method on it.

from langchain.vectorstores import Pinecone

# Call from_documents on LangChain's Pinecone class, NOT on the pinecone module!
Pinecone.from_documents(...)

Hope this helps and good luck!

zinan.yang · December 13, 2023, 2:03am

Thank you! I got it figured out and the package is now called correctly.

However, I am a bit confused about this Pinecone.from_documents() function.
Here is what I am passing: docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)

But I got this error message: docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “C:\Users\ZinanYang\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\schema\vectorstore.py”, line 436, in from_documents
texts = [d.page_content for d in documents]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “C:\Users\ZinanYang\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\schema\vectorstore.py”, line 436, in <listcomp>
texts = [d.page_content for d in documents]
^^^^^^^^^^^^^^
AttributeError: ‘str’ object has no attribute ‘page_content’

I am confused since I am not using page_content. My variable chunks is a list of strings, one per page/paragraph of the PDF.

Thank you.

Jasper · December 13, 2023, 7:27am

Yeah, page_content is an attribute of LangChain’s Document objects that is read somewhere in the from_documents process.

I think your error has something to do with what you pass in as the parameters for from_documents, namely the first one, where you pass in chunks. As the traceback shows, from_documents reads d.page_content from every item, but your chunks are plain strings.

I’ve not used LangChain very much, so in your case I would check their docs (Pinecone | 🦜️🔗 Langchain) and see what they pass in as the first parameter and how they prepare the documents for insertion.

The part you would be interested in is probably this one:

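from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Pinecone

# Load the file and split it into Document objects (not plain strings)
loader = TextLoader("../../modules/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
...
docsearch = Pinecone.from_documents(docs, embeddings, index_name=index_name)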
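
For the list-of-strings case in this thread, the sketch below shows why the AttributeError appears and one way around it. The `Document` dataclass here is only a stand-in so the snippet runs without LangChain installed; it is not LangChain's class. The `to_documents` helper and the "source" metadata key are likewise made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class Document:
    """Stand-in for LangChain's Document (illustration only)."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def to_documents(chunks: Sequence[str], source: str) -> List[Document]:
    """Wrap plain-string chunks in Document objects, skipping blank chunks."""
    return [
        Document(page_content=chunk, metadata={"source": source})
        for chunk in chunks
        if chunk.strip()
    ]


chunks = ["First paragraph.", "   ", "Second paragraph."]
docs = to_documents(chunks, source="example.pdf")

# This is the same access from_documents performs; it works on Document
# objects but raises AttributeError on plain strings.
texts = [d.page_content for d in docs]
```

With LangChain itself, the real Document class would be used instead of the stand-in (in late-2023 releases it can be imported from langchain.schema), or the strings can be passed directly with Pinecone.from_texts(chunks, embeddings, index_name=index_name), which takes plain texts rather than Document objects.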

Hope this helps and good luck!

system · December 27, 2023, 7:28am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.