Using Update and Fetch API

I'm wondering what these APIs are actually for. Here is my scenario: I ingested 10 documents into the DB. Now 2 of the documents have changed and I'd like to update the DB, but the update and fetch APIs take a vector ID to identify the vector to update. I have no clue what the IDs of the vectors I ingested were. I recall they were just 1, 2, 3…, but they could just as well have been GUIDs, which makes these two APIs impractical. Or maybe I'm misunderstanding something, or I'm not aware of a particular scheme for updating 2 documents out of 10. FYI, I do have a metadata key "Source" whose value is the document name (so maybe that's the way?). It is possible, but it looks like a really long way around: fetching all vectors with a filter on the document name, then deleting all of those vectors and ingesting new ones (???)

Hi @securigy,

If you know the source document of the data represented by the vector, then yes, you’d be better off using that as the vector ID rather than storing it as metadata.

Pinecone does not autogenerate vector IDs, so you have to supply something. It can be a string up to 64 bytes in size, so it’s not uncommon to use the source document (and possibly an ordinal if the document is chunked up into multiple vectors) as the vector ID. Then you know exactly what vectors correspond to what source document.
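A minimal sketch of such an ID scheme, assuming a `source#ordinal` layout. The function name `make_vector_id`, the `#` separator, and the fallback to a hash for long names are illustrative choices rather than anything Pinecone prescribes, and the 64-byte limit is simply the figure quoted above, so check it against your own index.

```python
import hashlib

# Limit quoted in this thread; an assumption, verify it for your index.
MAX_ID_BYTES = 64


def make_vector_id(source: str, chunk_index: int, max_bytes: int = MAX_ID_BYTES) -> str:
    """Build a deterministic vector ID from a document name and a chunk ordinal."""
    suffix = f"#{chunk_index}"
    candidate = f"{source}{suffix}"
    # Use the readable form when it fits within the size limit.
    if len(candidate.encode("utf-8")) <= max_bytes:
        return candidate
    # Otherwise fall back to a fixed-length SHA-1 hex digest (40 characters)
    # of the document name, which is still reproducible from the name alone.
    digest = hashlib.sha1(source.encode("utf-8")).hexdigest()
    return f"{digest}{suffix}"


print(make_vector_id("contracts/acme-2023.docx", 0))
print(make_vector_id("contracts/acme-2023.docx", 1))
```

Because the ID is a pure function of the document name and the chunk position, it can be recomputed at update time without having stored anything at ingestion time.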

Well, in an ideal world, or during development, I know exactly which documents were ingested, but that would not be the case at a client site, where thousands of documents reside in a specific folder or in some kind of common storage…

That said, having the document name in the ID along with an ordinal is a better idea. However, this again requires knowing the name of the document. I personally segment documents no matter how big or small they are, simply because there is a better chance of getting a good semantic answer at query time from the meaning that individual paragraphs carry… But then in practice, if page #16 and half of pages #21 and #32 of my 50-page document are replaced, there will be no choice but to get all vectors whose ID includes the document name, delete them, and re-ingest the document… So the end result is the same as having the document name in metadata: you need to filter out your document's vectors, delete them, and re-ingest the document…
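One way to avoid re-ingesting every chunk is to keep a content hash per chunk and compare hashes when a document changes. The sketch below assumes IDs of the form `source#ordinal` and assumes the old hashes were saved somewhere at ingestion time (a local manifest, or a metadata field). The names `plan_update` and `chunk_hash` are hypothetical, and no Pinecone call is made here.

```python
import hashlib
from typing import Dict, List, Tuple


def chunk_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_update(
    source: str,
    old_hashes: Dict[str, str],
    new_chunks: List[str],
) -> Tuple[List[str], List[str], Dict[str, str]]:
    """Compare stored chunk hashes with a re-chunked document.

    Returns (ids_to_upsert, ids_to_delete, new_hashes).
    """
    new_hashes = {
        f"{source}#{i}": chunk_hash(chunk) for i, chunk in enumerate(new_chunks)
    }
    # New or changed chunks need a fresh embedding and an upsert.
    to_upsert = [vid for vid, h in new_hashes.items() if old_hashes.get(vid) != h]
    # IDs that no longer exist (the document got shorter) must be deleted.
    to_delete = [vid for vid in old_hashes if vid not in new_hashes]
    return to_upsert, to_delete, new_hashes


old = {f"manual.docx#{i}": chunk_hash(c) for i, c in enumerate(["intro", "setup", "faq"])}
upsert_ids, delete_ids, _ = plan_update("manual.docx", old, ["intro", "setup v2"])
print(upsert_ids, delete_ids)
```

The two ID lists can then be passed to the index's upsert and delete-by-ID operations. Note the limitation: if an edit shifts chunk boundaries, every later chunk gets a different hash, and the plan degrades to a full re-ingest of the tail of the document.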

I've been trying to extract page numbers from a document, but parsing Word files with the Word Object Model is not viable, since I am not going to load hundreds or thousands of documents into Word, and doing it with OpenXML is not trivial…
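For what it's worth, a rough page estimate can be read straight from the `.docx` package without launching Word. The sketch below uses only the Python standard library and counts two markers in `word/document.xml`: explicit page breaks (`w:br` with `w:type="page"`) and `w:lastRenderedPageBreak`, which Word writes when it last saved the file. This is an approximation under those assumptions: it ignores section breaks and the "page break before" paragraph setting, it is stale if the file was edited by a tool that does not repaginate, and paragraphs nested inside text boxes may be visited twice. The function name `paragraphs_with_pages` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import List, Tuple

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs_with_pages(docx) -> List[Tuple[int, str]]:
    """Return (estimated_page, text) for each non-empty paragraph.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    page = 1
    result = []
    for para in root.iter(f"{{{W}}}p"):
        parts = []
        first_page = None
        for node in para.iter():
            if node.tag == f"{{{W}}}lastRenderedPageBreak":
                page += 1  # soft break recorded by Word at last save
            elif node.tag == f"{{{W}}}br" and node.get(f"{{{W}}}type") == "page":
                page += 1  # explicit (hard) page break
            elif node.tag == f"{{{W}}}t" and node.text:
                if first_page is None:
                    first_page = page  # page on which the paragraph's text starts
                parts.append(node.text)
        if parts:
            result.append((first_page, "".join(parts)))
    return result
```

Storing the estimated page in each chunk's metadata would at least make it possible to target the chunks on a changed page, though the caveats above mean it should be treated as a hint rather than an exact page number.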

I'd be glad to hear any other ideas, or to be told where my thinking is wrong here…