I am new to Pinecone and learning as I go. I am building a PDF reader application with LangChain and Pinecone. One section of my code splits the PDFs that users upload into chunks and stores them in Pinecone. But I only want to create new embeddings when a user uploads a new PDF, and I keep getting this error: AttributeError: 'Index' object has no attribute 'exists'.
Can someone please help me? Thank you!
Here is the section of my code that handles this:
# read PDF
if pdf is not None:
    pdf_reader = PdfReader(pdf)
    # split document into chunks
    # also can use text split: good for PDFs that do not contains charts and visuals
    sections = []
    for page in pdf_reader.pages:
        # Split the page text by paragraphs (assuming two newlines indicate a new paragraph)
        page_sections = page.extract_text().split('\n\n')
        sections.extend(page_sections)
    chunks = sections
    # st.write(chunks)
    ## embeddings
    # Set up Pinecone
    pinecone.init(api_key=pinecone_api_key, environment='gcp-starter')
    index_name = 'langchainresearch'
    if index_name not in pinecone.list_indexes():
        pinecone.create_index(index_name, dimension=1536, metric="cosine") # Adjust the dimension as per your embeddings
    index = pinecone.Index(index_name)
    file_name = pdf.name[:-4]
    # Check if embeddings are already stored in Pinecone
    file_id = hash(file_name)
    if index.exists(id=file_id):
        # Fetch embeddings from Pinecone
        VectorStore = index.fetch(ids=[file_id])[file_id]
        st.write('Embeddings Loaded from Pinecone')
    else:
        # Compute embeddings
        embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
        VectorStore = FAISS.from_texts(chunks, embedding=embeddings)
        # Store embeddings in Pinecone
        vectors = VectorStore.get_all_vectors()
        index.upsert(vectors={(file_id, vectors)})
        st.write('Embeddings Computation Completed and Stored in Pinecone')
# Create chat history
# Pinecone Setup for Chat History
chat_history_index_name = 'chat_history'
if chat_history_index_name not in pinecone.list_indexes():
    pinecone.create_index(chat_history_index_name, dimension=1) # Dimension is 1 as we're not storing vectors here
chat_history_index = pinecone.Index(chat_history_index_name)
# Create or Load Chat History from Pinecone
if pdf:
    # Check if chat history exists in Pinecone
    if chat_history_index.exists(id=pdf.name):
        # Fetch chat history from Pinecone
        chat_history = chat_history_index.fetch(ids=[pdf.name])[pdf.name]
        st.write('Chat History Loaded from Pinecone')
    else:
        # Initialize empty chat history
        chat_history = []
exists() indeed isn't part of the Pinecone Index API. Check the API documentation to see which methods are available. I would suggest you use fetch(), which returns the vector if it already exists (like you do in the next line). You can also query() with a vector of 0s and use a filter to see if at least one of your vectors for that filter exists.
I see you want to split the text into chunks and upload them to Pinecone. Your ID here is problematic because many vectors share one ID. This will not work, as they will overwrite each other. Put file_id in the metadata and create the IDs with a GUID/UUID.
After that you can query with a metadata filter of {"file_id": file_id} and an all-zero vector and see if you get any results back (I would also suggest setting include_values and include_metadata to False).
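A minimal sketch of the ID scheme described above, assuming the pinecone-client 2.x query signature (`vector`, `top_k`, `filter`, `include_values`, `include_metadata`) and a response object that exposes `.matches`. The helper names `stable_file_id`, `build_records` and `file_already_indexed` are made up for illustration, and the all-zero dummy vector simply follows the suggestion in this thread; how it behaves may depend on the index metric. The sketch derives file_id with hashlib rather than the built-in hash(), because hash() on strings changes from one Python process to the next.

```python
import hashlib
import uuid


def stable_file_id(file_name: str) -> str:
    """Return an ID that stays the same across runs (built-in hash() does not for str)."""
    return hashlib.sha256(file_name.encode("utf-8")).hexdigest()[:32]


def build_records(file_id: str, chunks: list[str], vectors: list[list[float]]) -> list[tuple]:
    """Pair each chunk's embedding with a unique ID; the shared file_id lives in metadata."""
    if len(chunks) != len(vectors):
        raise ValueError("need exactly one embedding per chunk")
    return [
        (str(uuid.uuid4()), vector, {"file_id": file_id, "text": chunk})
        for chunk, vector in zip(chunks, vectors)
    ]


def file_already_indexed(index, file_id: str, dimension: int = 1536) -> bool:
    """Ask for at most one match carrying this file_id; any hit means the PDF was stored."""
    response = index.query(
        vector=[0.0] * dimension,  # dummy vector (assumption): only the filter matters here
        top_k=1,
        filter={"file_id": file_id},
        include_values=False,
        include_metadata=False,
    )
    return len(response.matches) > 0
```

With a real index this would be used roughly as: if `file_already_indexed(index, file_id)` is false, embed the chunks and call `index.upsert(vectors=build_records(file_id, chunks, vectors))`.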
The short answer to your problem: 'Index' object has no attribute 'exists' means you are calling a method that does not exist. Use query to check whether vectors exist in your vector database.
First, if you are storing chat histories per document, you should not be making a new index for each! You should use namespaces.
Second, there are some issues with the current code.
1.
# Create chat history
# Pinecone Setup for Chat History
chat_history_index_name = 'chat_history'
if chat_history_index_name not in pinecone.list_indexes():
    pinecone.create_index(chat_history_index_name, dimension=1) # Dimension is 1 as we're not storing vectors here
chat_history_index = pinecone.Index(chat_history_index_name)
The above snippet creates a new index, and indexes are not instantly available once created. You need to check that the index is ready before doing anything with it.
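One way to wait for a new index, sketched under the assumption that pinecone-client 2.x is in use, where `pinecone.describe_index(name).status['ready']` reports readiness. `wait_until_ready` is a hypothetical helper, not part of the client; it takes the describe function as a parameter so the loop can be tried without a Pinecone account.

```python
import time
from typing import Callable


def wait_until_ready(
    describe_index: Callable,
    index_name: str,
    timeout_s: float = 120.0,
    poll_s: float = 1.0,
) -> None:
    """Poll describe_index(index_name).status['ready'] until it is True or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        if describe_index(index_name).status["ready"]:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"index {index_name!r} not ready after {timeout_s}s")
        time.sleep(poll_s)
```

After `pinecone.create_index(...)`, the call would look like `wait_until_ready(pinecone.describe_index, chat_history_index_name)`, placed before `pinecone.Index(...)` is used.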
2.
if chat_history_index.exists(id=pdf.name):
    # Fetch chat history from Pinecone
    chat_history = chat_history_index.fetch(ids=[pdf.name])[pdf.name]
    st.write('Chat History Loaded from Pinecone')
.exists is not a function. Indexes have namespaces, so what you really want to do is grab all namespaces in the index and check whether the desired namespace is in that list. In Python I believe the method is called list_namespaces (or something like that). It will return a list of strings you can do a for...in on.
These are just a few of the problems that I see in this code currently. I think using namespaces would make your code much easier to reason about! Have one index with many namespaces, one per document or per chat!
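A sketch of the namespace check, assuming pinecone-client 2.x, where `index.describe_index_stats()` returns an object whose `namespaces` mapping is keyed by namespace name. `namespace_exists` and `namespace_for_pdf` are illustrative helpers, and using one namespace per PDF is just one possible naming choice.

```python
def namespace_exists(index, namespace: str) -> bool:
    """Return True if the index stats list this namespace (i.e. it already holds vectors)."""
    stats = index.describe_index_stats()
    return namespace in stats.namespaces


def namespace_for_pdf(pdf_name: str) -> str:
    """Derive one namespace per uploaded PDF by dropping a trailing '.pdf' if present."""
    return pdf_name[:-4] if pdf_name.lower().endswith(".pdf") else pdf_name
```

Reads and writes would then pass the same name, for example `index.upsert(vectors=records, namespace=ns)` and `index.query(..., namespace=ns)`.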
Thank you Jasper. I'll try what you suggested here and see if I can get it right! Sorry, I am really new to this topic and don't have a strong coding background.
Sorry for bothering you, but I have another question regarding Pinecone with LangChain. I just want to keep everything simple this time and process a single PDF. But I get error messages like this:
docsearch = pinecone.from_documents(chunks, embeddings, index_name = index_name)
^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'pinecone' has no attribute 'from_documents'
or:
docsearch = pinecone.from_texts([t.page_content for t in chunks], embeddings, index_name = index_name)
^^^^^^^^^^^^^^^^^^^
AttributeError: module 'pinecone' has no attribute 'from_texts'
What did I do wrong here?
Below is my current code for that part:
# upload a PDF file
pdf = st.file_uploader("Please upload your PDF here", type='pdf')
# st.write(pdf)
# read PDF
if pdf is not None:
    pdf_reader = PdfReader(pdf)
    # split document into chunks
    # also can use text split: good for PDFs that do not contains charts and visuals
    sections = []
    for page in pdf_reader.pages:
        # Split the page text by paragraphs (assuming two newlines indicate a new paragraph)
        page_sections = page.extract_text().split('\n\n')
        sections.extend(page_sections)
    chunks = sections
    # st.write(chunks)
    ## embeddings
    # Set up Pinecone
    pinecone.init(api_key=pinecone_api_key, environment='gcp-starter')
    index_name = 'langchainresearch'
    if index_name not in pinecone.list_indexes():
        pinecone.create_index(index_name, dimension=1536, metric="cosine") # Adjust the dimension as per your embeddings
    index = pinecone.Index(index_name)
    docsearch = pinecone.from_documents(chunks, embeddings, index_name = index_name)
Here you are using the wrong Python library. The pinecone library from Pinecone has no from_documents method, but the Pinecone class you can import from LangChain does! So you need to import the LangChain one and then call from_documents on it.
from langchain.vectorstores import Pinecone
## You are using the Pinecone class (capital P), NOT the pinecone module!
Pinecone.from_documents(...)
Thank you! I got it figured out and the package is now called correctly.
However, I am a bit confused about this Pinecone.from_documents() function.
Here is what I am passing: docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
But I got this error message:
    docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ZinanYang\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\schema\vectorstore.py", line 436, in from_documents
    texts = [d.page_content for d in documents]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ZinanYang\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\schema\vectorstore.py", line 436, in <listcomp>
    texts = [d.page_content for d in documents]
             ^^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'page_content'
I am confused since I am not using page_content. My variable chunks is a list of strings, one per paragraph of the PDF pages.
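Yeah, page_content is an attribute of LangChain's Document objects, and from_documents reads it from every item you pass in.
I think your error has to do with what you pass in as parameters to from_documents, namely the first one, chunks. It is a list of plain strings, while from_documents expects Document objects; for plain strings, Pinecone.from_texts is the matching method.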
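The mismatch can be reproduced without LangChain. In the sketch below, `Document` is a stand-in dataclass that mimics only the `page_content` and `metadata` fields of LangChain's Document, and `texts_from` copies the line shown in the traceback; neither is the real library code.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in (assumption) for LangChain's Document: text plus optional metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def texts_from(documents):
    """Same list comprehension as the line in the traceback."""
    return [d.page_content for d in documents]


chunks = ["first paragraph", "second paragraph"]

try:
    texts_from(chunks)  # plain strings have no .page_content attribute
except AttributeError as err:
    print(err)

# Wrapping each string gives the function the shape it expects.
docs = [Document(page_content=c) for c in chunks]
print(texts_from(docs))
```

With the real library, the same wrapping would use LangChain's own Document class before calling from_documents.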
I've not used LangChain very much, so in your case I would check their docs (Pinecone | 🦜️🔗 Langchain) and see what they pass in as the first parameter and how they prepare the documents before inserting them.
The part you would be interested in is probably this one: