I’ve just created an index, and it would be great if I could retrieve all the IDs in it so that I only insert new vectors. It’s not obvious how to do this from the API docs: https://www.pinecone.io/docs/api/operation/fetch/
Hi George, this isn’t possible yet. See this similar request and its workaround. I’m sharing your post with our product team as yet another vote for such a feature!
Hi Greg, it’s quite a basic feature for an advanced commercial tool not to have. It’s the sort of thing that makes using Pinecone feel like a much less polished experience.
We appreciate the feedback, George. We have a lot of great features in the pipeline, some of which we certainly wish we had sooner. If this is a blocker to you taking Pinecone to production, send me an email (greg@pinecone.io) and we can work with you on this.
Is this possible yet? Any workaround for it?
It’s over a year later and this is apparently still not possible(?)
I have a scenario where I want to re-process the metadata in the index but don’t have a cheap way to re-analyse the source data to determine the list of keys that would’ve been generated.
I’m frankly amazed I can’t fetch all items or even list all IDs and iterate through them.
There also doesn’t appear to be a dump/export function that would allow me to run a local analysis and extract the keys myself.
Is there really no workaround for this yet? If not, what’s the timeline for a resolution?
+1 I’m facing a similar challenge.
This is a must-have feature; I cannot believe it is not supported. If you have a bad network, retries of the network operation may insert duplicate items into the database, and it seems there is no way to delete those duplicates. This also means that similarity search would return duplicate items and could miss the truly similar records!
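On the duplicate-insert point, one possible mitigation, sketched below with hypothetical placeholder data and an assumed embedding dimension, is to derive each vector ID deterministically from the source content (for example a hash), so a retried upsert overwrites the same record instead of creating a duplicate:
import hashlib

import pinecone

# Placeholders: substitute your own key, environment, index name, and embeddings.
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("YOUR_INDEX_NAME")

def deterministic_id(text):
    # Hash the source content so a retry of the same upsert reuses the same ID
    # and overwrites the existing record rather than adding a duplicate.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# texts_and_embeddings is stand-in data; in practice it comes from your embedding pipeline.
texts_and_embeddings = [("example document text", [0.0] * 1536)]
index.upsert(
    vectors=[(deterministic_id(text), values) for text, values in texts_and_embeddings]
)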
import random

import pinecone

# Assumes the client has already been initialised, e.g.:
# pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
my_index = pinecone.Index("YOUR_INDEX_NAME")
number_of_vectors = int(
    my_index.describe_index_stats()["namespaces"]["YOUR_NAMESPACE_NAME"]["vector_count"]
)

def generate_search_vector(idx, vector_dimension=100, min_val=0, max_val=1):
    # Deterministic probe vectors first (all-min, all-max, midpoint, then one-hots),
    # falling back to random vectors once those are exhausted, so the function
    # never returns None no matter how many iterations the loop needs.
    if idx == 0:
        return [min_val] * vector_dimension
    elif idx == 1:
        return [max_val] * vector_dimension
    elif idx == 2:
        return [(max_val - min_val) / 2] * vector_dimension
    elif 2 < idx < (vector_dimension + 3):
        search_vector = [min_val] * vector_dimension
        search_vector[idx - 3] = max_val
        return search_vector
    return [random.uniform(min_val, max_val) for _ in range(vector_dimension)]

all_pinecone_vectors = set()
i = 0
while len(all_pinecone_vectors) < number_of_vectors:
    pinecone_documents = my_index.query(
        vector=generate_search_vector(i, 768, 0, 1),
        top_k=10000,
        include_metadata=False,
        include_values=False,
        namespace="YOUR_NAMESPACE_NAME",
    )
    all_pinecone_vectors.update([doc["id"] for doc in pinecone_documents["matches"]])
    i += 1
print(f"I currently have {len(all_pinecone_vectors)} IDs in Pinecone. It took me {i} iterations.")
I came up with this hacky approach. The generate_search_vector strategy could probably be optimized so that it starts with more ‘central’ vectors, but it works without having to change anything in the index and independently of the number of vectors in the DB.
LMK what you think
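One way to put the collected ID set to work for the original goal of the thread (only inserting vectors that aren’t already in the index) is to filter upserts against it. A minimal sketch, assuming new_items is a hypothetical list of (id, embedding) pairs from your own pipeline:
# all_pinecone_vectors is the set of existing IDs collected above.
new_items = [("doc-123", [0.1] * 768), ("doc-456", [0.2] * 768)]  # stand-in data
to_upsert = [
    (vec_id, values)
    for vec_id, values in new_items
    if vec_id not in all_pinecone_vectors  # skip anything already in the index
]
if to_upsert:
    my_index.upsert(vectors=to_upsert, namespace="YOUR_NAMESPACE_NAME")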
Facing the same issue.
I totally agree with @GeorgePearse, this is a basic feature for an advanced commercial tool not to have. It took me some time to believe that this was not possible.
It is still pretty ridiculous that we have to do this, but here is a workaround:
import numpy as np

def get_ids_from_query(index, input_vector, namespace=""):
    print("searching pinecone...")
    results = index.query(
        vector=input_vector,
        top_k=10000,
        include_values=False,
        namespace=namespace,
    )
    ids = set()
    print(type(results))
    for result in results["matches"]:
        ids.add(result["id"])
    return ids

def get_all_ids_from_index(index, num_dimensions, namespace=""):
    num_vectors = index.describe_index_stats()["namespaces"][namespace]["vector_count"]
    all_ids = set()
    while len(all_ids) < num_vectors:
        print("Length of ids list is shorter than the number of total vectors...")
        # Query with a fresh random vector and add whatever IDs come back.
        input_vector = np.random.rand(num_dimensions).tolist()
        print("creating random vector...")
        ids = get_ids_from_query(index, input_vector, namespace=namespace)
        print("getting ids from a vector query...")
        all_ids.update(ids)
        print("updating ids set...")
        print(f"Collected {len(all_ids)} ids out of {num_vectors}.")
    return all_ids

all_ids = get_all_ids_from_index(index, num_dimensions=1536, namespace="")
print(all_ids)
index = pinecone.Index('jj')
index.query(
    top_k=367,
    vector=[0] * 1536,  # embedding dimension
    namespace='',
    include_values=True,
    filter={'docid': {'$eq': 'uuid here'}},  # Your filters here
)
We can use this to fetch the list of IDs in a namespace.
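To actually pull the ID list out of that query, capture the response and read the matches; a small sketch following the same call as above:
response = index.query(
    top_k=367,
    vector=[0] * 1536,  # embedding dimension
    namespace='',
    include_values=False,
    filter={'docid': {'$eq': 'uuid here'}},  # Your filters here
)
ids_in_namespace = [match["id"] for match in response["matches"]]
print(ids_in_namespace)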
Thanks, your algo works.
I’m relatively new to Python and the world of LLM application development, transitioning from a background in .NET and SQL where I crafted business applications, so I apologize in advance for any beginner oversights. This is my hack: GitHub - LarryStewart2022/pinecone_Index (multiple file upload duplicate solution).
This is helpful! I want to iterate through that list of ids and pass them to a fetch.
I added this to the bottom of your script:
all_ids = get_all_ids_from_index(index, num_dimensions=1536, namespace="")
all_ids_list = [str(id) for id in all_ids]
filtered_ids_list = [id for id in all_ids_list if not (".txt" in id or ".pdf" in id)]
fetch = index.fetch(ids=filtered_ids_list)
print(fetch)
I get this: UnicodeEncodeError: 'charmap' codec can't encode character '\u02bc' in position 5793: character maps to <undefined>
I have tried iterating through each ID and have had some success - it will work for x number of vectors and then I get this:
pinecone.core.client.exceptions.ApiAttributeError: FetchResponse has no attribute 'id' at ['['received_data']']['id']
When I take that ID and run it individually, it works fine.
I thought it might be an API timeout, because I am working on the Starter plan, but even batching them I am having issues.
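For what it’s worth, here is a hedged sketch of one way to work around both errors: fetch the IDs in smaller batches so each request stays small, and write the results to a UTF-8 file instead of printing them to a console that uses the charmap codec (the source of the UnicodeEncodeError). The batch size of 100 and the output filename are just assumptions:
def batched(items, batch_size=100):
    # Yield successive slices of the ID list.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

with open("fetched_vectors.txt", "w", encoding="utf-8") as out:
    for id_batch in batched(filtered_ids_list):
        response = index.fetch(ids=id_batch)
        # response["vectors"] maps each fetched ID to its record.
        for vec_id, record in response["vectors"].items():
            out.write(f"{vec_id}\t{record}\n")  # pick out specific fields as needed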
Hello, everyone who made the request to retrieve all of the vector IDs in an index! I have good news and bad news.
The bad news is that this is still only on our roadmap.
The good news is that we have a feature request to track this and to gauge the Pinecone community’s interest in this feature.
So, if this is something that’s still important to you, please take a moment and go to the feature requests section of the forum and vote for this FR. And while you’re there, please vote for any others that you would like to see. Or create a new one if you have an idea that’s not represented yet.
Thanks!
The Pinecone Support Team