How to retrieve list of IDs in an index

I’ve just created an index, and it would be great if I could retrieve all the IDs in it so that I only insert new vectors. It’s not obvious how to do this from the API docs: https://www.pinecone.io/docs/api/operation/fetch/
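For the narrower goal of only inserting new vectors, listing every ID may not be strictly necessary when the candidate IDs are known up front: the fetch endpoint linked above can be asked which of them already exist. A minimal sketch of that idea, assuming the fetch response exposes the found vectors under a "vectors" key; `find_missing_ids`, `InMemoryIndex`, and the batch size of 100 are hypothetical names and values chosen for illustration, not part of the Pinecone client:

```python
from typing import Iterable, List


def find_missing_ids(index, candidate_ids: Iterable[str], namespace: str = "",
                     batch_size: int = 100) -> List[str]:
    """Return the candidate IDs that the index does not contain yet.

    `index` is any object with a Pinecone-style fetch(ids=..., namespace=...)
    method whose response exposes the found vectors under "vectors".
    """
    candidates = list(candidate_ids)
    missing = []
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        response = index.fetch(ids=batch, namespace=namespace)
        found = set(response["vectors"].keys())
        # Keep the original order of the candidates that were not found.
        missing.extend(i for i in batch if i not in found)
    return missing


class InMemoryIndex:
    """Hypothetical stand-in for a real index, for local dry runs only."""

    def __init__(self, stored_ids):
        self._stored = set(stored_ids)

    def fetch(self, ids, namespace=""):
        return {"vectors": {i: {"id": i} for i in ids if i in self._stored}}


index = InMemoryIndex(["doc-1", "doc-2"])
print(find_missing_ids(index, ["doc-1", "doc-3", "doc-2", "doc-4"]))
# ['doc-3', 'doc-4']
```

Only the IDs returned here would then need to be embedded and upserted. This does not list unknown IDs; it only checks IDs you can already enumerate from your source data.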

Hi George, this isn’t possible yet. See this similar request and the workaround for it. I’m sharing your post with our product team as yet another vote for such a feature!

Hi Greg, it’s quite a basic feature for an advanced commercial tool not to have. It’s the sort of thing that makes using Pinecone feel like a much less polished experience.

We appreciate the feedback, George. We have a lot of great features in the pipeline, some of which we certainly wish we had sooner. If this is a blocker to you taking Pinecone to production, send me an email (greg@pinecone.io) and we can work with you on this.

Is this possible yet? Any workaround for it?

It’s over a year later and this is apparently still not possible(?)

I have a scenario where I want to re-process the metadata in the index but don’t have a cheap way to re-analyse the source data to determine the list of keys that would’ve been generated.

I’m frankly amazed I can’t fetch all items or even list all IDs and iterate through them.

There also doesn’t appear to be a dump/export function that would allow me to run a local analysis and extract the keys myself.

Is there really no workaround for this yet? If not, what’s the timeline for a resolution?

+1 I’m facing a similar challenge.

This is a must-have feature; I cannot believe it is not supported. Suppose you have a bad network: retries of a network operation may insert duplicate items into the database, and there seems to be no way to delete those duplicates. This also means a similarity search could return duplicate items and possibly miss the records that are actually similar.
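On the duplicate worry: as far as I understand it, upsert writes by ID, so re-sending a vector under the same ID overwrites the existing record rather than adding a second one. Duplicates from retries would then mostly arise when each attempt generates a fresh random ID. A minimal sketch, assuming the ID can be derived from the source content; `stable_vector_id` and its ID format are hypothetical, not part of the Pinecone client:

```python
import hashlib


def stable_vector_id(source_text: str, chunk_index: int = 0) -> str:
    """Derive the same ID every time from the same content and chunk position."""
    digest = hashlib.sha256(source_text.encode("utf-8")).hexdigest()
    # 32 hex characters keep the ID short; the suffix separates chunks of one source.
    return f"{digest[:32]}-{chunk_index}"


first_attempt = stable_vector_id("contents of report.pdf", chunk_index=3)
retry_attempt = stable_vector_id("contents of report.pdf", chunk_index=3)
print(first_attempt == retry_attempt)
# True
```

With IDs like these, a retried upsert targets the same record as the first attempt, and the ID list can be regenerated from the source data without querying the index.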

import pinecone

# Assumes pinecone.init(api_key=..., environment=...) has already been called.
my_index = pinecone.Index("YOUR_INDEX_NAME")

NAMESPACE = "YOUR_NAMESPACE_NAME"
DIMENSION = 768  # must match the dimension of the index

number_of_vectors = int(my_index.describe_index_stats()['namespaces'][NAMESPACE]['vector_count'])


def generate_search_vector(idx, vector_dimension=100, min_val=0, max_val=1):
    """Return the idx-th probe vector: all-min, all-max, midpoint, then one-hot vectors."""
    if idx == 0:
        return [min_val] * vector_dimension
    elif idx == 1:
        return [max_val] * vector_dimension
    elif idx == 2:
        return [(min_val + max_val) / 2] * vector_dimension
    elif 2 < idx < (vector_dimension + 3):
        search_vector = [min_val] * vector_dimension
        search_vector[idx - 3] = max_val
        return search_vector
    raise ValueError(f"no probe vector defined for idx={idx}")


all_pinecone_vectors = set()

i = 0

# Stop once every ID has been seen or the probe vectors run out.
while len(all_pinecone_vectors) < number_of_vectors and i < DIMENSION + 3:

    pinecone_documents = my_index.query(
        vector=generate_search_vector(i, DIMENSION, 0, 1),
        top_k=10000,
        include_metadata=False,
        include_values=False,
        namespace=NAMESPACE,
    )

    all_pinecone_vectors.update([doc["id"] for doc in pinecone_documents["matches"]])

    i += 1


print(f"I currently have {len(all_pinecone_vectors)} vectors in Pinecone. It took me {i} iterations")

I came up with this hacky approach. Probably the generate_search_vector strategy could be optimized so that it starts with more ‘central’ vectors, but it works without having to change anything in the index and independently from the number of vectors in the DB.

LMK what you think

:wink:

Facing the same issue.
I totally agree with @GeorgePearse: this is a basic feature for an advanced commercial tool not to have. It took me some time to believe that this was not possible. :face_with_diagonal_mouth:

It is still pretty ridiculous that we have to do this, but here is a workaround:

import numpy as np


def get_ids_from_query(index, input_vector, namespace=""):
    """Return the IDs of the (up to 10,000) nearest matches for one query vector."""
    print("searching pinecone...")
    results = index.query(vector=input_vector, top_k=10000, namespace=namespace, include_values=False)
    ids = set()
    for result in results['matches']:
        ids.add(result['id'])
    return ids


def get_all_ids_from_index(index, num_dimensions, namespace=""):
    """Query with random vectors until as many IDs are collected as the namespace holds."""
    num_vectors = index.describe_index_stats()["namespaces"][namespace]['vector_count']
    all_ids = set()
    while len(all_ids) < num_vectors:
        print("Length of ids list is shorter than the number of total vectors...")
        print("creating random vector...")
        input_vector = np.random.rand(num_dimensions).tolist()
        print("getting ids from a vector query...")
        ids = get_ids_from_query(index, input_vector, namespace)
        print("updating ids set...")
        all_ids.update(ids)
        print(f"Collected {len(all_ids)} ids out of {num_vectors}.")

    return all_ids


# `index` is assumed to be an existing pinecone.Index for your index.
all_ids = get_all_ids_from_index(index, num_dimensions=1536, namespace="")
print(all_ids)



index = pinecone.Index('jj')
index.query(
    top_k=367,
    vector= [0] * 1536, # embedding dimension
    namespace='',
    include_values=True,
    filter={'docid': {'$eq': 'uuid here'}} # Your filters here
)

We can use this to fetch the list of IDs in a namespace (up to top_k matches for the given filter).
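The query above returns a response whose "matches" entries each carry an "id", which is the same shape the earlier workarounds in this thread read from. A minimal sketch of pulling the IDs out and flagging a result that may have been cut off by top_k; `ids_from_query_response` is a hypothetical helper, and the sample response is made up so the snippet runs without an index:

```python
from typing import List, Tuple


def ids_from_query_response(response, top_k: int) -> Tuple[List[str], bool]:
    """Pull the IDs out of a query response and flag a possibly truncated result."""
    ids = [match["id"] for match in response["matches"]]
    # A full page of results means more matching vectors may exist beyond top_k.
    possibly_truncated = len(ids) >= top_k
    return ids, possibly_truncated


# Hypothetical response in the shape used above, for a dry run without an index.
sample = {"matches": [{"id": "chunk-0", "score": 0.0}, {"id": "chunk-1", "score": 0.0}]}
print(ids_from_query_response(sample, top_k=367))
# (['chunk-0', 'chunk-1'], False)
```

If the flag comes back True, the filter probably matches more vectors than top_k allows, so the list should not be treated as complete.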

Thanks, your algo works :slight_smile:

I’m relatively new to Python and the world of LLM application development, transitioning from a background in .NET and SQL where I crafted business applications. So, I apologize in advance for any beginner oversights. This is my hack, on GitHub: LarryStewart2022/pinecone_Index (Multiple file upload duplicate solution).

This is helpful! I want to iterate through that list of ids and pass them to a fetch.

I added this to the bottom of your script:

all_ids = get_all_ids_from_index(index, num_dimensions=1536, namespace="")
all_ids_list = [str(id) for id in all_ids]
filtered_ids_list = [id for id in all_ids_list if not (".txt" in id or ".pdf" in id)]
fetch = index.fetch(ids=filtered_ids_list)
print(fetch)

I get this: UnicodeEncodeError: 'charmap' codec can't encode character '\u02bc' in position 5793: character maps to <undefined>
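A 'charmap' UnicodeEncodeError is typically raised when Python writes text to a console or file whose encoding cannot represent a character, so the print call may be the culprit here rather than the fetch itself; that is an assumption, since the traceback is not shown. A minimal sketch of printing with non-ASCII characters escaped; `console_safe` is a hypothetical helper and the sample string is made up:

```python
def console_safe(text: str) -> str:
    """Escape every non-ASCII character so any console encoding can print it."""
    return text.encode("ascii", errors="backslashreplace").decode("ascii")


# '\u02bc' is the character named in the error message above.
print(console_safe("O\u02bcBrien.pdf"))
# O\u02bcBrien.pdf
```

In the script above this would be used as print(console_safe(str(fetch))). Writing the output to a file opened with encoding="utf-8" is another option that keeps the original characters.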

I have tried iterating through each ID and have had some success: it works for some number of vectors, and then I get this:
pinecone.core.client.exceptions.ApiAttributeError: FetchResponse has no attribute 'id' at ['['received_data']']['id']

When I take that ID and run it individually, it works fine.

I thought it might be an API timeout, because I am working in Starter, but even when batching them I am having issues.
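One way to narrow down intermittent failures like this is to fetch in small batches and record which batches fail instead of stopping at the first error. A minimal sketch, assuming the fetch response exposes its results under "vectors"; `fetch_in_batches`, `FlakyIndex`, and the batch size are hypothetical, and the stand-in only simulates a failure so the snippet runs without an index:

```python
def fetch_in_batches(index, ids, namespace="", batch_size=100):
    """Fetch IDs batch by batch, collecting results and recording failed batches."""
    ids = list(ids)
    vectors, failed = {}, []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        try:
            response = index.fetch(ids=batch, namespace=namespace)
            vectors.update(response["vectors"])
        except Exception as exc:  # broad on purpose: the goal is to log and continue
            failed.append((batch, repr(exc)))
    return vectors, failed


class FlakyIndex:
    """Hypothetical stand-in that fails whenever a batch contains `bad_id`."""

    def __init__(self, bad_id):
        self.bad_id = bad_id

    def fetch(self, ids, namespace=""):
        if self.bad_id in ids:
            raise RuntimeError(f"simulated failure for {self.bad_id}")
        return {"vectors": {i: {"id": i} for i in ids}}


vectors, failed = fetch_in_batches(FlakyIndex("c"), ["a", "b", "c", "d"], batch_size=2)
print(sorted(vectors), [batch for batch, _ in failed])
# ['a', 'b'] [['c', 'd']]
```

The failed batches can then be retried one ID at a time, which helps separate a problem with a specific ID from a problem with request size or timing.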

Hello, everyone who asked to retrieve all of the vector IDs in an index! I have good news and bad news.

The bad news is, this is still on our roadmap.

The good news is, we have a feature request to track this and to gauge the interest of the Pinecone community to have this feature.

So, if this is something that’s still important to you, please take a moment and go to the feature requests section of the forum and vote for this FR. And while you’re there, please vote for any others that you would like to see. Or create a new one if you have an idea that’s not represented yet.

Thanks!

The Pinecone Support Team