Returning list of IDs

Is there some way to return a list of all the IDs in an index?

7 Likes

Not yet. A hacky way to do this would be to query with top_k = size_of_index. As of this writing, the max value for top_k — the number of results to return — is 10,000. Max value for top_k for queries with include_metadata=True or include_data=True is 1,000. See our docs on limits for more info.
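If you do go that route, here's a rough sketch of the idea in Python (a zero vector serves as the required query vector; the 1536 dimension, index name, and credentials are placeholders you'd swap for your own):

import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")  # placeholder credentials
index = pinecone.Index("my-index")  # placeholder index name

# query() requires a query vector, so use a zero vector as a stand-in.
# top_k is capped at 10,000 (1,000 if metadata or values are included).
res = index.query(vector=[0.0] * 1536, top_k=10000)
ids = [match.id for match in res.matches]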

Most engineers using Pinecone will be working with datasets much larger than 10,000 vectors, so sadly I don’t think the “hacky way” is a real option.

1 Like

Hi,

My company and I are interested in the ability to get all IDs in an index. Have there been any recent updates to support this?

1 Like

This is not a current feature of the product. However, it has been a popular request, and our product team is aware of it. Soon, we will have more tools available regarding the import/export of data. This may satisfy the requirement by allowing you to export the entire database into a cloud storage bucket for backup/validation/verification purposes.

1 Like

Yeah I need this too

Need this too. Would be huge for Clustering!

1 Like

Hey all! We’ve heard y’all’s request for returning IDs, and it’s something we’re actively researching how to do. There are some technical challenges associated with doing so, though. Not to get too lost in the weeds, but Pinecone doesn’t sequentially store vectors such that generating a list of IDs would be trivial. They’re stored in a much more vector-friendly and -native way, so traversing over the dataset doesn’t necessarily result in a list of IDs that are anywhere close to each other. But it is something we’ve heard from the community, so we are working on a way to provide this. Please stay tuned!

In the meantime, we would really appreciate some examples of when y’all’d use this feature. For instance, @mick-zelta, you mentioned clustering. How would you make use of a list of IDs when doing that? If you can provide some additional nuance, it would help us make sure we nail exactly what you’re looking for.

1 Like

I have a data pipeline that results in vectors being added to a Pinecone index. I want to see what actually made it all the way through and is in the DB. Unfortunately, my dataset is also well past 10k, so the “hacky” solution doesn’t really work.

I may be mistaken, but I think this is essential to being able to add additional vectors at a later time while knowing for sure that you aren’t overwriting your previous vectors. As it stands, I’m not clear on how to ensure that the next time I add a vector to the index, I’m picking an ID that hasn’t already been taken and therefore won’t overwrite a previous vector. Was this designed with the intended use case that all vector uploads happen at once? Or is there a way to guarantee your sequence of IDs isn’t duplicative (similar to SQL’s auto-increment ID)? This seems pretty basic to me. Thanks for all that you are doing; just chiming in on a use case as requested. The lack of this is a production-readiness blocker for me.

Hi @jejanov,

The usual way to handle this problem is to use UUIDs for your vector IDs. There are a lot of various UUID libraries out there but they all boil down to generating unique IDs with some measure of confidence. The math to do so can be tricky to implement on a database (especially an eventually-consistent one like Pinecone) for reasons related to data consistency, so it’s better to keep this function in the app itself.

If you’re using Python, you can use the built-in library uuid (import uuid) and use that to generate your vector IDs. Full details on using it are available in the Python docs.
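For instance, a minimal sketch (assuming index is an existing pinecone.Index handle and embedding is a vector you’ve already computed):

import uuid

# uuid4() generates a random 128-bit ID with a negligible chance of collision,
# so there's no risk of silently overwriting an existing vector.
vector_id = str(uuid.uuid4())
index.upsert(vectors=[(vector_id, embedding)])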

Cory, thank you. I was considering this approach. The mental model of sequential data entry kept me from it, as I thought it important to have some identifiable sequence and a readable ID. I’ll give this a go.

Here’s an easy use case:
My customers store documents in our db and query against them for relevance (hence Pinecone). At any given time, we can only say, “we’re sure that the documents we have listed are in Pinecone, but we don’t know whether, by error or by bad design, there are other documents floating around in our database”.

IF there are information pieces floating inside a customer’s knowledge base without their control, that spells huge trouble.

The easy response is, “code better systems,” but this is a hyperbolic scenario. In truth, the lack of this functionality makes it REALLY hard, in many ways, to debug what exactly is in our database.

Hi Cory, all,
Sharing here the reason that brought me to this post, for your reference. I’ll try the workaround; please also redirect me if my seemingly simple use case is solved in a more direct way.

I have a small index that’s the result of loading one big PDF file. I want to query the entire index without rerunning embeddings, but when I look at fetch or query, the required list of IDs is what I keep hitting up against. So, IDK, my thinking was I’d try to feed it the IDs it wants, if that’s possible, or a parameter for “all” if one exists and doesn’t hit other limits.

I can rerun my whole script that loads the whole PDF text into the index and do what I need to successfully, but can I skip that load-to-Pinecone (re-running embeddings) step and just fetch/query the whole index?

Basically, I’m trying to replace the statement below that sets my docsearch variable with something that uses the existing index ‘jj’, which already has my PDF texts stored as vectors. This is part of code using LangChain, Pinecone, and an LLM for the embeddings and similarity.

docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name='jj')

so I can proceed to:

query = "(for example...) What are the key recommendations made in this document?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)
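My thinking is something like the sketch below, assuming the LangChain version I’m on exposes Pinecone.from_existing_index (I haven’t verified this against my version):

from langchain.vectorstores import Pinecone

# Reconnect to the existing 'jj' index instead of re-embedding the PDF.
# 'embeddings' must be the same embedding model used when the index was loaded.
docsearch = Pinecone.from_existing_index(index_name='jj', embedding=embeddings)
docs = docsearch.similarity_search(query)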

‘jj’ index stats for reference:

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 367}},
 'total_vector_count': 367}

Using top_k in a query isn’t working for me without specifying IDs. Query seems to still require a list of index IDs as an argument. Without it, my query hits an exception that generates a 400 Bad Request API error.

So it’s sounding to me like I would have to assign unique IDs upon upload to the index in order to know what’s what there, and to be able to recall anything repeatedly once it’s there. Except I don’t really want to keep a store of those unique IDs elsewhere in parallel. I’m a bit uncertain whether Pinecone will help me avoid embedding the same vectors over and over, which was one of my main hopes in using a vector DB. Following!

index = pinecone.Index('jj')
index.query(
    top_k=367,
    namespace='',
    include_values=True
)

I tried this a number of ways, including only with ‘top_k=367’ (367 being my vector count in the whole index)

index.query(
    top_k=367
)

and with various combinations of namespace='' and/or including data values,

but it returns:
ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'application/json', 'date': 'Thu, 30 Mar 2023 15:05:54 GMT', 'x-envoy-upstream-service-time': '1', 'content-length': '53', 'server': 'envoy', 'connection': 'close'})
HTTP response body: {"code":3,"message":"No query provided","details":[]}

1 Like

I’ve been using a zero vector for the vector argument. Using @kathyh’s example, if your embedding size is 1536:

index = pinecone.Index('jj')
index.query(
    top_k=367,
    vector=[0] * 1536,  # embedding dimension
    namespace='',
    include_values=True,
    filter={'docid': {'$eq': 'uuid here'}} # Your filters here
)

Seems to be working OK in my testing, but I’m not really sure if it’s completely accurate… and it’s still constrained to the top_k limits. You can also add the metadata filter as needed. Perhaps @Cory_Pinecone can comment on whether this is the officially recommended way?

1 Like

Another scenario: Suppose a user stores a bunch of ‘Document’ objects in pinecone, say 10,500.

  1. What’s the recommended way to retrieve all of his document objects to display in a UI? index.query() with top_k = 10,500 and a filter matching user.id? But it will only return the first 10,000 due to the limits, right? (Or only 1,000 if returning metadata or values!)
  2. Is there a “paging” function where I can retrieve in chunks? Eg, to retrieve items #20 to #29: num=10, start=20?

A 1,000 or even 10,000 limit, when the intent of Pinecone is to store millions of objects/vectors, is quite limiting (no pun intended!).

1 Like

Here’s my use case. I just loaded a whole bunch of legume abstracts from PubMed. But, unfortunately, the legume genus “Lens” is also a word for the optical device. So, I’d like to pull all the IDs from my index, then list the titles from metadata, then remove the ones that are about lenses rather than Lens plants. Yes, I could query “Lens” with a large top_k and hope I find them all, but I’d rather just plow through the article titles and get it done properly.

2 Likes

I’m trying to do this too, but I can’t get

vector=[0] * 1536,

with my number of dimensions to return any IDs. So I went with a text query for the string “ ” and top_k=1000.

Seems to work! But what’s the right way to say SELECT * and query only by metadata?

My way: query with top_k=10000, save the IDs, then update their metadata with a field that marks them as already retrieved (e.g. "status": "saved_id"). Then continue querying with the metadata filter {"status": {"$nin": ["saved_id"]}}, update the metadata again, and repeat.
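A rough sketch of that loop (assuming a 1536-dimension index, and assuming that vectors without the status field still match the $nin filter — worth verifying on your own data):

import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")  # placeholder credentials
index = pinecone.Index("my-index")  # placeholder index name
DIM = 1536  # embedding dimension (assumption)

all_ids = []
while True:
    # Pull the next batch of not-yet-marked IDs with a zero query vector.
    res = index.query(
        vector=[0.0] * DIM,
        top_k=10000,
        filter={"status": {"$nin": ["saved_id"]}},
    )
    ids = [match.id for match in res.matches]
    if not ids:
        break
    all_ids.extend(ids)
    # Mark this batch so the next query skips it.
    for i in ids:
        index.update(id=i, set_metadata={"status": "saved_id"})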

3 Likes