What am I doing wrong?
Is there a vector with id “1”? How did you set the ids when upserting them?
I’m not sure!
This is how:
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma, Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone
loader = OnlinePDFLoader("xxx")
data = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(data)
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
index_name = "test"
# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_API_ENV  # next to api key in console
)
docsearch = Pinecone.from_texts(
    [t.page_content for t in texts],
    embeddings,
    index_name=index_name,
    namespace="test2"
)
I’m trying to create 2 separate files: an “ingest” file that loads a PDF and its embeddings into Pinecone, and a “query” file where a user can provide any query, it searches Pinecone for the relevant chunks, and then sends those chunks plus the original query to ChatGPT for an answer.
But I’m really struggling to find all the necessary documentation. One thing I Googled today literally only returned 1 result, from the LangChain documentation.
I did give ChatGPT all the documentation I could find and asked it to write the ingest file, but I’m not sure how close it got, as it seems to have made up functions, e.g. deinit.
# ingest.py
from langchain.document_loaders import OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone
OPENAI_API_KEY = "your-openai-api-key"
PINECONE_API_KEY = "your-pinecone-api-key"
PINECONE_API_ENV = "your-pinecone-api-environment"
loader = OnlinePDFLoader("xxx")
data = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(data)
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
index_name = "test"
# Initialize Pinecone
pinecone.deinit()
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_API_ENV
)
# Create Pinecone index
pinecone.create_index(index_name=index_name, dimension=embeddings.embedding_dim, metric="cosine")
# Store embeddings
embedding_map = {str(i): embeddings.embed_text(t.page_content) for i, t in enumerate(texts)}
pinecone.fetch(index_name).upsert(items=embedding_map)
pinecone.deinit()
I see! Not the way I went about it, but I guess something worked!
I would suggest looking at the API reference at: Upsert
There you have examples of how to upsert the data. For reference, I’m adding my own snippet (you could do it in a one-liner, but I used the clearer way):
import pinecone

pinecone.init(api_key="...", environment="us-west1-gcp")
index = pinecone.Index("knowledgebase")

data = []
for id, embedding in enumerate(embeddings):
    metadata = {'metadata1': metadata_value}  # metadata_value is a placeholder
    data.append((str(id), embedding, metadata))  # Pinecone ids must be strings

index.upsert(data, namespace='namespace1')
Hope this helps
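(To make the id handling above concrete, here is a standalone sketch with dummy 4-dimensional embeddings; the metadata key and value are placeholders:)

```python
# Build the (id, values, metadata) tuples that index.upsert expects.
# Dummy embeddings stand in for real model output.
embeddings = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]

data = []
for i, embedding in enumerate(embeddings):
    metadata = {"source": "example.pdf"}        # placeholder metadata
    data.append((str(i), embedding, metadata))  # Pinecone ids must be strings

print(data[0])
# → ('0', [0.1, 0.2, 0.3, 0.4], {'source': 'example.pdf'})
```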
Thank you very much, it’s nearly midnight here in Perth, Australia. I’ll have a look tomorrow morning. Any idea what my IDs might be, given the code I showed you?
Based on this: Pinecone — 🦜🔗 LangChain 0.0.121
Based on the LangChain source code on GitHub (langchain/pinecone.py at master · hwchase17/langchain · GitHub), this is the code it uses for upserting:
ids = ids or [str(uuid.uuid4()) for _ in texts]
for i, text in enumerate(texts):
    embedding = self._embedding_function(text)
    metadata = metadatas[i] if metadatas else {}
    metadata[self._text_key] = text
    docs.append((ids[i], embedding, metadata))
# upsert to Pinecone
self._index.upsert(vectors=docs, namespace=namespace, batch_size=batch_size)
So from what I’m seeing here… you will not be able to fetch these vectors by id (unless you query first and then use the returned vector id, but by then you already have the vectors). The ids are uuid4s, so they aren’t set by any “simple” i += 1 logic.
What I suggest is: don’t fetch by id unless you just want to check that the vectors are there. Instead, use query: embed a new vector from the user’s input and query your index with it, so you get the similar vectors back.
Good luck!
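To see why fetching by “1” finds nothing, here is a minimal sketch (standard library only; the chunk texts are placeholders) of the id generation that from_texts performs when no ids are passed:

```python
import uuid

# LangChain's Pinecone.from_texts generates one uuid4 per chunk when no
# ids are supplied, so the stored ids look like
# "fd7079af-bbc9-43d7-b79c-a765067efc35" rather than "0", "1", ...
texts = ["chunk one", "chunk two", "chunk three"]
ids = [str(uuid.uuid4()) for _ in texts]

for vector_id in ids:
    print(vector_id)    # 36-character uuid strings

assert "1" not in ids   # so fetching by "1" returns nothing
```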
You are an absolute legend! Thank you so much! Wish I was working around or with people like you who are knowledgeable on this.
You’re right, just getting the basics sorted and just wanted to verify my data in Pinecone!
Didn’t realise my IDs were being generated that way. Just did a query and found my first ID, fd7079af-bbc9-43d7-b79c-a765067efc35. Much more to look into.
Hope I can provide some updates on how I’m going.
I am having a similar problem. I just activated and created a new index and used /upsert to create one vector and then I tried to read it back via /fetch.
The upsert gave a success message, upsertCount: 1. But when I go to fetch it back I get "message":"".
This is what I posted in the upsert:
{"vectors":[{"id":"9740186f-19c9-4065-bc34-2a1f879c4b31","metadata":{"model":"text-embedding-ada-002-v2"},"values":[[0.011733273,-0.003386857,0.00013826403,-0.03273309,-0.027460298,0.023714524,-0.010043108,0.0044505517,-0.012509836,-0.015035296,0.0014005861,0.0047376845,0.009703769,0.0063169124,0.016549267,0.0034357999,0.02163934,-0.016745038,0.0038632357,
many lines deleted
-0.0354739,-0.0014258733,0.00043069856,-0.0015474152,0.0048812507,-0.019355332,0.023427391,0.023296878,0.007980975,0.008209376,-0.027956253,-0.01897684,-0.014160847,0.0011827897,-0.020295039,-0.0190682,-0.007485019]]}],"namespace":""}
And when I try to fetch I pass in "9740186f-19c9-4065-bc34-2a1f879c4b31". I have also tried 1 and * and only get the same "message":"".
I’m sure I’m doing something obviously wrong…but what?
Hi @roland_alden,
It’s not super obvious, but fetch() expects a list of IDs, not just one. So if you use fetch(ids=['9740186f-19c9-4065-bc34-2a1f879c4b31']) that should work.
I was all optimistic but alas, it did not work:
/fetch?ids=['9740186f-19c9-4065-bc34-2a1f879c4b31']
I even tried
/fetch?ids=['9740186f-19c9-4065-bc34-2a1f879c4b31',' ']
No joy.
I tried /vectors/list too and it returns the same "message":"".
I did a query with the sample given (all 0’s) and got back an answer that seems correct. This would seem to suggest my original upsert did work…
This is what comes back from the query:
{
"results": [],
"matches": [
{
"id": "9740186f-19c9-4065-bc34-2a1f879c4b31",
"score": 0,
"values": [
0.010437889,
-0.00521894451,
0,
-0.0313136689,
-0.0260947235,
many lines deleted
-0.020875778,
-0.00521894451
],
"metadata": {
"model": "text-embedding-ada-002-v2"
}
}
],
"namespace": ""
}
Ah, you’re using the REST API. Sorry about that. I thought you were using the Python client.
The problem is the REST API expects there to be a namespace included in the request, but passing a blank namespace won’t be processed properly. Try upserting your vectors again, but this time assign them to a namespace.
Once you’ve done so, your fetch will work with the namespace appended:
/fetch?ids=9740186f-19c9-4065-bc34-2a1f879c4b31&namespace=yournamespace
Using namespaces is a good habit to get into and will unlock a lot of flexibility with Pinecone, including not having to run multiple indexes for a small project.
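For the REST flavor, the id goes into the query string directly, with the namespace alongside it; brackets and quotes around the id are not part of the syntax. A small sketch of building that URL (yournamespace is a placeholder):

```python
from urllib.parse import urlencode

vector_id = "9740186f-19c9-4065-bc34-2a1f879c4b31"

# ids is passed as-is, and the namespace must match the one used at
# upsert time.
query = urlencode({"ids": vector_id, "namespace": "yournamespace"})
print(f"/vectors/fetch?{query}")
# → /vectors/fetch?ids=9740186f-19c9-4065-bc34-2a1f879c4b31&namespace=yournamespace
```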
Ah ha! I will do that. Do you have any suggested “best practices” for namespaces and naming them? I read about them and they seemed to be just a way to chop up an index. I was wondering if indexes were “heavyweight” and thus subdividing one made a lot of sense, etc.
Oh, and is there any way to delete the (now) two items I inserted with the "" namespace? Somehow I suspect a delete REST API might not work. I don’t mind nuking the index and starting over if that’s the easiest thing to do.
Update: I deleted the index and recreated it with a namespace string. However, I still cannot get the /fetch or /list endpoints to work.
The /vectors/upsert and /query endpoints work fine, so I know for certain that the vectors I have stored are there. However, I really would like to read back a vector by id and retrieve a timestamp of when it was last updated.