I’ve just created an index, and it would be great if I could retrieve all the IDs in it so that I only insert new vectors. It’s not obvious how to do this from the API docs: https://www.pinecone.io/docs/api/operation/fetch/
Hi George, this isn’t possible yet. See this similar request and the workaround discussed there. I’m sharing your post with our product team as yet another vote for such a feature!
Hi Greg, this is quite a basic feature for an advanced commercial tool to be missing. It’s the sort of thing that makes using Pinecone feel like a much less polished experience.
We appreciate the feedback, George. We have a lot of great features in the pipeline, some of which we certainly wish we had sooner. If this is a blocker to you taking Pinecone to production send me an email (greg@pinecone.io) and we can work with you on this.
Is this possible yet? Any workaround for it?
It’s over a year later and this is apparently still not possible(?)
I have a scenario where I want to re-process the metadata in the index but don’t have a cheap way to re-analyse the source data to determine the list of keys that would’ve been generated.
I’m frankly amazed I can’t fetch all items or even list all IDs and iterate through them.
There also doesn’t appear to be a dump/export function that would let me run a local analysis and extract the keys myself.
Is there really no workaround for this yet? If not, what’s the timeline for a resolution?
+1 I’m facing a similar challenge.
This is a must-have feature; I can’t believe it isn’t supported. With an unreliable network, retried operations can insert duplicate items into the database, and there seems to be no way to delete those duplicates. That also means similarity searches can return duplicate items and may fail to surface the genuinely similar records!
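For what it’s worth, the retry-duplicate problem can be avoided upstream: Pinecone upserts overwrite by ID, so deriving each record’s ID deterministically from its content makes a retried upsert idempotent. A minimal sketch — the hashing scheme here is my own choice, not anything Pinecone prescribes:

```python
import hashlib

def content_id(text: str) -> str:
    """Same content -> same ID, so a retried upsert overwrites the
    existing record instead of creating a duplicate."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Illustrative upsert call (doc_text and embedding are assumed to exist):
# index.upsert(vectors=[(content_id(doc_text), embedding, {"text": doc_text})])
```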
import random

import pinecone

# Assumes the client has already been initialized, e.g.:
# pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
my_index = pinecone.Index("YOUR_INDEX_NAME")

stats = my_index.describe_index_stats()
number_of_vectors = int(stats["namespaces"]["YOUR_NAMESPACE_NAME"]["vector_count"])

def generate_search_vector(idx, vector_dimension=100, min_val=0, max_val=1):
    # Probe the space with a deterministic sequence: all-min, all-max,
    # midpoint, then one-hot vectors, then random fallbacks.
    if idx == 0:
        return [min_val] * vector_dimension
    elif idx == 1:
        return [max_val] * vector_dimension
    elif idx == 2:
        return [(max_val - min_val) / 2] * vector_dimension
    elif 2 < idx < (vector_dimension + 3):
        search_vector = [min_val] * vector_dimension
        search_vector[idx - 3] = max_val
        return search_vector
    else:
        # Fall back to random probes once the deterministic ones run out,
        # so the loop below never passes None to query().
        return [random.uniform(min_val, max_val) for _ in range(vector_dimension)]

all_pinecone_ids = set()
i = 0
while len(all_pinecone_ids) < number_of_vectors:
    pinecone_documents = my_index.query(
        vector=generate_search_vector(i, 768, 0, 1),
        top_k=10000,
        include_metadata=False,
        include_values=False,
        namespace="YOUR_NAMESPACE_NAME",  # was mistakenly the index name
    )
    all_pinecone_ids.update(doc["id"] for doc in pinecone_documents["matches"])
    i += 1

print(f"I currently have {len(all_pinecone_ids)} IDs from Pinecone. It took {i} iterations.")
I came up with this hacky approach. The generate_search_vector strategy could probably be optimized to start with more ‘central’ vectors, but it works without changing anything in the index and regardless of how many vectors are in the DB.
LMK what you think
Facing the same issue.
I totally agree with @GeorgePearse; this is a basic feature for an advanced commercial tool to be missing. It took me a while to accept that it wasn’t possible.
It is still pretty ridiculous that we have to do this, but here is a workaround:
import numpy as np

def get_ids_from_query(index, input_vector, namespace=""):
    print("Searching Pinecone...")
    results = index.query(
        vector=input_vector,
        top_k=10000,
        include_values=False,
        namespace=namespace,
    )
    return {result["id"] for result in results["matches"]}

def get_all_ids_from_index(index, num_dimensions, namespace=""):
    num_vectors = index.describe_index_stats()["namespaces"][namespace]["vector_count"]
    all_ids = set()
    while len(all_ids) < num_vectors:
        # Query with a fresh random vector and collect whatever IDs come back.
        input_vector = np.random.rand(num_dimensions).tolist()
        all_ids.update(get_ids_from_query(index, input_vector, namespace=namespace))
        print(f"Collected {len(all_ids)} ids out of {num_vectors}.")
    return all_ids

all_ids = get_all_ids_from_index(index, num_dimensions=1536, namespace="")
print(all_ids)
index = pinecone.Index("jj")
index.query(
    top_k=367,
    vector=[0] * 1536,  # embedding dimension
    namespace="",
    include_values=True,
    filter={"docid": {"$eq": "uuid here"}},  # your filters here
)
We can use this to fetch the list of IDs in a namespace.
Thanks, your algo works!
I’m relatively new to Python and the world of LLM application development, transitioning from a background in .NET and SQL where I crafted business applications. So, I apologize in advance for any beginner oversights. This is my hack: GitHub - LarryStewart2022/pinecone_Index: Multiple file upload duplicate solution
This is helpful! I want to iterate through that list of ids and pass them to a fetch.
I added this to the bottom of your script:
all_ids = get_all_ids_from_index(index, num_dimensions=1536, namespace="")
all_ids_list = [str(id) for id in all_ids]
filtered_ids_list = [id for id in all_ids_list if not (".txt" in id or ".pdf" in id)]
fetch = index.fetch(ids=filtered_ids_list)
print(fetch)
I get this: UnicodeEncodeError: 'charmap' codec can't encode character '\u02bc' in position 5793: character maps to <undefined>
I have tried iterating through each ID and have had some success - it will work for x number of vectors, and then I get this:
pinecone.core.client.exceptions.ApiAttributeError: FetchResponse has no attribute 'id' at ['['received_data']']['id']
When I take that ID and run it individually, it works fine.
I thought it might be an API timeout, because I am working on the starter plan, but even when batching them I am having issues.
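One pattern that may help with both errors is fetching in small fixed-size batches and writing output with an explicit UTF-8 encoding. This is only a sketch, not an official fix: `index` and `filtered_ids_list` refer to the variables from the script above, and the batch size of 100 is an arbitrary guess, not a documented limit.

```python
def chunked(ids, batch_size=100):
    """Yield successive batches of at most batch_size IDs."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]

# Usage against the index (network calls, illustrative only):
# for batch in chunked(filtered_ids_list, 100):
#     response = index.fetch(ids=batch)
#     # An explicit encoding avoids the 'charmap' error on Windows consoles:
#     with open("vectors.jsonl", "a", encoding="utf-8") as f:
#         for vec_id in response["vectors"]:
#             f.write(vec_id + "\n")
```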
Hello! Everyone who made the request to retrieve all of the vector IDs in an index! I have good news and bad news.
The bad news is, this is still on our roadmap.
The good news is, we have a feature request to track this and to gauge the interest of the Pinecone community to have this feature.
So, if this is something that’s still important to you, please take a moment and go to the feature requests section of the forum and vote for this FR. And while you’re there, please vote for any others that you would like to see. Or create a new one if you have an idea that’s not represented yet.
Thanks!
The Pinecone Support Team
Are you serious? With all due respect, this isn’t simply a matter of how important the function is. That’s the basics, Cory! Go and deliver instead of taking our time by asking us to fill out requests while we still have to rely on workarounds for a problem your team hasn’t been able to address so far. I want to use Pinecone, and if I decide to do so, the only solution is to keep a list of all the IDs on Firestore to handle this properly. Do you really think that’s reasonable? Your post, asking us to repeat ourselves, is simply embarrassing considering how basic this is. I really hope it will all be set by early next year. Happy holidays!
I’m sorry you feel that way, Bruno. And I get how important this feature is for you and other Pinecone customers. That’s why it was one of the first ones I added when I created the Feature Request section of the forum.
Retrieving a list of all of the IDs from the index isn’t as simple as it first sounds. But it is definitely on our roadmap for early 2024.
I appreciate the passion you have for this feature and for Pinecone generally. Trust me when I say we don’t want to keep you waiting any longer than absolutely necessary.
Hey Cory, I want to clarify something. When I referred to it as a “basic feature,” I didn’t mean to imply that it’s simple – far from it. My point was that it’s a fundamental aspect that we expect any database to provide. One of the key reasons we consider a third-party solution is to offload some complexity to a company that can deliver a seamless experience, regardless of how complex achieving it may be. However, many of us are trapped in workarounds because we can’t simply retrieve a list of IDs. I considered Pinecone early this year, I experimented with a lot of other solutions, and I want to build with Pinecone, but I feel that those workarounds shouldn’t be there, and we as developers shouldn’t be constantly pointing this out. Unfortunately, since my first interaction with Pinecone, and almost a year later, the only path forward I can see to work with the vector database that I want is full of workarounds.
I’m glad to hear this is a priority, and I’m sure it will make many of us happy. I can’t wait for this! Ah, even though I don’t need it for now, it might be interesting to consider pagination when implementing the feature.
Thanks for your prompt reply. Now, let’s make Pinecone the world’s best vector database!
Hi all. I’m happy to share that Pinecone now offers a /list operation for getting the IDs of records in a serverless index. It’s currently limited to the REST API only, with client support coming soon. And as mentioned, the /list operation works only for serverless indexes, not for pod-based indexes.
You can find more details in our docs:
Note: Beyond returning record IDs, /list is also useful as part of workflows to manage RAG documents.
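For anyone who wants to try the REST endpoint before client support lands, here’s a sketch of paging through the results. The request shape (GET `https://{index_host}/vectors/list` with `namespace`, `limit`, and `paginationToken` query parameters, plus an `Api-Key` header) is my reading of the serverless docs at the time of writing, so double-check it there. The pagination loop is factored out from the HTTP call so it can be exercised without a network:

```python
import json
import urllib.parse
import urllib.request

def collect_all_ids(fetch_page):
    """Drain a paginated listing. fetch_page(token) must return a parsed
    /list response: {"vectors": [{"id": ...}, ...],
                     "pagination": {"next": token}}  # pagination is optional
    """
    all_ids, token = [], None
    while True:
        page = fetch_page(token)
        all_ids.extend(v["id"] for v in page.get("vectors", []))
        token = page.get("pagination", {}).get("next")
        if not token:
            return all_ids

def make_rest_fetcher(index_host, api_key, namespace="", limit=100):
    # index_host is the serverless index's host (e.g. from describe_index).
    def fetch_page(token):
        params = {"namespace": namespace, "limit": str(limit)}
        if token:
            params["paginationToken"] = token
        url = f"https://{index_host}/vectors/list?{urllib.parse.urlencode(params)}"
        req = urllib.request.Request(url, headers={"Api-Key": api_key})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return fetch_page

# ids = collect_all_ids(make_rest_fetcher("YOUR_INDEX_HOST", "YOUR_API_KEY"))
```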