Hi Cory, all,
Sharing the reason that brought me to this post, for reference. I'll try the workaround; please also redirect me if my seemingly simple use case is solved in a more direct way.
I have a small index that resulted from loading one big PDF file. I want to query the entire index without rerunning embeddings, but both fetch and query seem to require a list of IDs, which is the wall I keep hitting. So my thinking was to either feed it the IDs it wants, if that's possible, or pass some "give me everything" parameter, if such a thing exists and doesn't run into other limits.
I can rerun my whole script that loads the whole PDF text into the index and do what I need successfully, but can I skip that load-to-Pinecone (re-embedding) step and just fetch/query the whole index?
Basically, I'm trying to replace the statement below that sets my docsearch variable with something that uses the existing index 'jj', which already has my PDF texts stored as vectors. This is part of code using LangChain, Pinecone, and an LLM for the embeddings and similarity.
docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name='jj')
so I can proceed to:
query = "(for example...) What are the key recommendations made in this document?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)
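Something along these lines is what I'm after, assuming the LangChain Pinecone wrapper's from_existing_index constructor can attach to an index that already holds vectors (OpenAIEmbeddings below is just a stand-in for whatever embedding model is actually in use; I'm not certain this is the right approach):

import pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

pinecone.init(api_key='YOUR_API_KEY', environment='YOUR_ENVIRONMENT')
embeddings = OpenAIEmbeddings()  # placeholder for whatever model produced the stored vectors

# Attach to the existing 'jj' index instead of re-embedding the PDF texts;
# only the query string itself gets embedded at search time.
docsearch = Pinecone.from_existing_index(index_name='jj', embedding=embeddings)

docs = docsearch.similarity_search("What are the key recommendations made in this document?")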
Using top_k in a query isn't working for me without specifying IDs. query() still seems to require a list of index IDs (or a vector) as an argument; without one, my call raises an exception with a 400 Bad Request error.
So it sounds like I would have to assign unique IDs on upload to the index in order to know what's what there, and to be able to recall anything repeatedly once it's there. Except I don't really want to maintain a parallel store of those unique IDs somewhere else. I'm a bit uncertain whether Pinecone will help me avoid re-embedding the same vectors over and over, which was one of my main hopes in using a vector DB. Following!
index = pinecone.Index('jj')
index.query(
    top_k=367,
    namespace='',
    include_values=True
)
I tried this a number of ways, including with only top_k=367 (367 being the vector count of my whole index):
index.query(
    top_k=367
)
and with or without the other arguments (namespace='' and/or include_values=True),
but it returns:
ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'application/json', 'date': 'Thu, 30 Mar 2023 15:05:54 GMT', 'x-envoy-upstream-service-time': '1', 'content-length': '53', 'server': 'envoy', 'connection': 'close'})
HTTP response body: {"code":3,"message":"No query provided","details":[]}
I’ve been using a zero vector for the vector argument. Using @kathyh’s example, if your embedding size is 1536:
index = pinecone.Index('jj')
index.query(
    top_k=367,
    vector=[0] * 1536,  # embedding dimension
    namespace='',
    include_values=True,
    filter={'docid': {'$eq': 'uuid here'}}  # your filters here
)
Seems to be working OK in my testing, but I'm not really sure it's completely accurate, and it's still constrained by the top_k limits. You can also add metadata filters as needed. Perhaps @Cory_Pinecone can comment on whether this is the officially recommended way?
Another scenario: Suppose a user stores a bunch of ‘Document’ objects in pinecone, say 10,500.
What's the recommended way to retrieve all of his document objects to display in a UI? index.query() with top_k=10,500 and a filter matching user.id? But that will only return the first 10,000 due to the limits, right? (Or only 1,000 if returning metadata or values!)
Is there a “paging” function where I can retrieve in chunks? Eg, to retrieve items #20 to #29: num=10, start=20?
A limit of 1,000 or even 10,000, when the intent of Pinecone is to store millions of objects/vectors, is quite limiting (no pun intended!).
Here’s my use case. I just loaded a whole bunch of legume abstracts from PubMed. But, unfortunately, the legume genus “Lens” is also a word for the optical device. So, I’d like to pull all the IDs from my index, then list the titles from metadata, then remove the ones that are about lenses rather than Lens plants. Yes, I could query “Lens” with a large top-k and hope I find them all, but I’d rather just plow through the article titles and get it done properly.
My way: query with top_k=10000, save the IDs, then update their metadata with a field that marks them as already seen (e.g. "status": "saved_id"). Then continue querying with the metadata filter {"status": {"$nin": ["saved_id"]}} and update metadata again. Repeat.
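Roughly, as a sketch (the index name, embedding dimension, and how vectors that don't yet have a "status" field get picked up are assumptions on my part, not verified details):

import pinecone

pinecone.init(api_key='YOUR_API_KEY', environment='YOUR_ENVIRONMENT')
index = pinecone.Index('my-index')  # hypothetical index name
DIM = 1536  # embedding dimension of the index

all_ids = []
while True:
    # Zero vector just to satisfy the API; the filter skips vectors already marked.
    # Note: if filters don't match vectors missing the 'status' field entirely,
    # the first pass may need to run without the filter.
    res = index.query(
        vector=[0.0] * DIM,
        top_k=10000,
        filter={'status': {'$nin': ['saved_id']}},
    )
    if not res.matches:
        break
    for match in res.matches:
        all_ids.append(match.id)
        # Mark the vector so the next pass excludes it.
        index.update(id=match.id, set_metadata={'status': 'saved_id'})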
What is the recommended approach to avoid inserting duplicate data in the database?
As I can’t get the IDs, how would I check if a document already exists in the database?
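The best I've come up with on my own is a sketch like the one below, deriving the vector ID deterministically from the document content so that re-upserting the same content overwrites rather than duplicates, and using fetch() to test existence. This is just my guess, not a confirmed recommendation:

import hashlib
import pinecone

pinecone.init(api_key='YOUR_API_KEY', environment='YOUR_ENVIRONMENT')
index = pinecone.Index('my-index')  # hypothetical index name

def content_id(text):
    # The same text always produces the same ID, so duplicates collapse onto one record.
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

def upsert_if_new(text, embedding):
    vid = content_id(text)
    # fetch() only returns entries for IDs that actually exist in the index.
    existing = index.fetch(ids=[vid])
    if vid in existing.vectors:
        return False  # already stored; skip re-embedding/upserting
    index.upsert(vectors=[(vid, embedding, {'text': text})])
    return True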
Hi! In our case, we have a complex, dynamic knowledge base that changes on a daily basis. I need to perform regular syncs to Pinecone, and to do that I need to know which vectors to add, which to delete, and which to modify. In other words, I need the whole list of vector IDs.
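For concreteness, the diff I'd want to compute looks roughly like this (assuming I keep a content hash per document and can somehow get the id-to-hash pairs back out of the index, which is exactly the missing piece):

def plan_sync(local_docs, remote_docs):
    # local_docs: {id: content_hash} from the knowledge base
    # remote_docs: {id: content_hash} gathered by walking the Pinecone index
    to_add = [i for i in local_docs if i not in remote_docs]
    to_delete = [i for i in remote_docs if i not in local_docs]
    to_modify = [i for i in local_docs if i in remote_docs and local_docs[i] != remote_docs[i]]
    return to_add, to_delete, to_modify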
My use case presents a similar challenge. Since querying requires a vector (as far as I know, I cannot query by ID) and fetching requires an exact ID, the only solution I can envisage as of today is to create a parallel ID database, which is crazy rework IMHO and will affect app performance… At a minimum, I would like to be able to fetch by ID using wildcards (yes, this would actually help in my specific use case). But as of now this is a bit of a showstopper.
I'm relatively new to Python and the world of LLM application development, transitioning from a background in .NET and SQL where I crafted business applications, so I apologize in advance for any beginner oversights. I am using UnstructuredAPIFileLoader (langchain.document_loaders.unstructured.UnstructuredAPIFileLoader, LangChain 0.0.266) from my own server locally, so you will have to change that part. I intend to adjust the code to give users the choice of retaining the file. Given that I'm processing numerous files in batches, I'd prefer no disruptions; files that aren't incorporated will be set aside for future examination. GitHub - LarryStewart2022/pinecone_Index: Multiple file upload duplicate solution
I mirror everyone's sentiment here: how can you refresh your vector DB without the ability to compare what you have against what's new and what's changed? Yes, the upsert function should handle that automatically, but then how do you use it with, say, the GPTVectorStoreIndex or VectorStoreIndex that is creating the embeddings in the first place (in my case mine are transformed with Hugging Face to be even better embeddings)? Can someone post a way to do this? It must be possible for an existing index in Pinecone, surely? If not, how can this even be a solution, and what would you use instead, Weaviate or ChromaDB? It's kind of the same problem with all vector indexes as well, no? I do have a unique ID stored in my vectors as metadata, so it should be possible to pull back all the documents in my vector DB using the metadata, but what a palaver, and how would I still compare the vector IDs to handle changed data from, say, a LlamaIndex data loader?
This is a feature we’ve talked about internally a lot here. Since many of our users have an extremely high amount of vectors in production, we are figuring out the best way to enable this feature that both allows it to scale (both in compute & cost) and allows users to easily implement it.
But fear not, we hear you & it’s definitely on our roadmap. We will update this thread as soon as it’s available!
I have done this for namespaces between 200K and 750K vectors without issues, in order to "replicate" the index locally on alternative vector stores. This is the script that is used in the project.
It does what people in this thread are recommending: an empty vector with a tracking marker that is updated each loop so I can "walk" the entire database. From there, once it's in the tool, I can migrate the data wherever I want to test locally, or just replicate it without messing up production data on Pinecone.
For reference, here is the code snippet so you can see how I currently do this.
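In outline it does something like the following; the dimension, batch size, and the "status" marker field are illustrative stand-ins rather than the exact values the tool uses:

import pinecone

pinecone.init(api_key='YOUR_API_KEY', environment='YOUR_ENVIRONMENT')
index = pinecone.Index('my-index')  # hypothetical index name
NAMESPACE = 'my-namespace'          # hypothetical namespace to walk
DIM = 1536                          # embedding dimension of the index
BATCH = 1000                        # top_k ceiling when values/metadata are included

exported = []
while True:
    res = index.query(
        vector=[0.0] * DIM,         # empty vector just to satisfy the API
        top_k=BATCH,
        namespace=NAMESPACE,
        include_values=True,
        include_metadata=True,
        filter={'status': {'$nin': ['exported']}},  # skip vectors already walked
    )
    if not res.matches:
        break
    for match in res.matches:
        exported.append({'id': match.id, 'values': match.values, 'metadata': dict(match.metadata or {})})
        # Update the tracking marker so the next loop moves on to new vectors.
        index.update(id=match.id, set_metadata={'status': 'exported'}, namespace=NAMESPACE)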
Caveat: you need a namespace-enabled index to use this script, because the delay between updates on the free starter tier can take 30+ seconds to save. Additionally, I have not tested this over 750K vectors! Your mileage may vary.
The GitHub project I linked can "clone" namespaces or entire indexes from Pinecone into the same index under a new name.
You'll still need to "walk" the entire vector space in that specific namespace, but it works all the same. The way to do it via code is exactly the same as the snippet above.
You basically need to walk the vectors you want to copy and store them locally.
Then loop over the stored vectors and upsert them into Pinecone with a new ID and the new namespace (see the sketch just below).
Voila, that's it!
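Continuing that outline, the re-upsert half might look roughly like this (the fresh-ID scheme, batch size, and target namespace are placeholders):

import uuid

NEW_NAMESPACE = 'cloned-namespace'  # hypothetical target namespace

batch = []
for item in exported:  # 'exported' is the list collected by the walk above
    new_id = str(uuid.uuid4())  # assign a fresh ID for the copy
    metadata = dict(item['metadata'])
    metadata.pop('status', None)  # drop the walk marker before copying, if you used one
    batch.append((new_id, item['values'], metadata))
    if len(batch) == 100:
        index.upsert(vectors=batch, namespace=NEW_NAMESPACE)
        batch = []
if batch:
    index.upsert(vectors=batch, namespace=NEW_NAMESPACE)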
That way you aren't paying for embedding twice or anything. From what I know, there is no way to make a collection snapshot of a namespace or to simply clone/duplicate a namespace from the SDK, so this is what we do in the meantime.
Thanks Tim, I'm using Vector Admin to clone a namespace/workspace right now! It's a bit slow for 6K vectors, but perhaps Pinecone will offer a richer API. Great work on Vector Admin!