Fetching vector ids (1M+ vectors)

We’re currently storing 1.3M+ vectors in Pinecone. These vectors were created by chunking our text knowledge base and creating 1536-dimensional embeddings. Our knowledge base is updated on a daily basis: data gets added, deleted and updated all the time. We need the Pinecone db to stay in sync with our knowledge database.

Adding and updating vectors is fine for the moment. We can deduce the vector ids from our db data and check whether they exist in Pinecone to see if they should be added or updated. Deleting vectors is another question. To know which vectors to delete, we need to get ALL of the Pinecone vector ids, see if they still have a match in our database and, if they don’t, delete them.
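To make the delete step concrete, here is a minimal Python sketch of the comparison, assuming both id lists are already in memory. The names `plan_sync` and `batched` are hypothetical helpers, not part of any Pinecone client, and the batch size is an arbitrary choice for illustration.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def plan_sync(db_ids: Iterable[str], index_ids: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Compare the ids derived from our database with the ids found in the index.

    Returns (missing, stale):
      missing: ids in the database but not in the index (need to be added)
      stale:   ids in the index but no longer in the database (need to be deleted)
    """
    db, index = set(db_ids), set(index_ids)
    return db - index, index - db


def batched(ids: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield the ids in lists of at most `size`, so deletes can be sent in chunks."""
    batch: List[str] = []
    for vector_id in ids:
        batch.append(vector_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    missing, stale = plan_sync(["a", "b", "c"], ["b", "c", "d", "e"])
    print("to add:", sorted(missing))
    for chunk in batched(sorted(stale), size=1):  # batch size is an assumption
        print("delete batch:", chunk)
```

The set difference itself is cheap even at 1.3M ids; the expensive part is obtaining the full list of ids from the index in the first place.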

We were very surprised to see that there isn’t a good way to do this. At the moment, after looking at the suggestions on how to achieve this here and here, we hack our way through the query API and mark “queried” vectors through the update API, with some heavy parallelization, to fetch all of the ids in ~40 minutes. This is a huge bottleneck in our syncing process.

In what way could we achieve this in a reasonable time?

2 ways I can think of that would greatly speed things up are:

  1. Have a native way to fetch all of the vector ids in the db (not supported, according to the links above)
  2. Have a way to download a Pinecone collection and process the data locally. As far as I can see, that is not supported either

Something still hacky, but that would definitely take less time to go through all the records, is something like this.

I suggested a structured approach, but the random one outperforms it.

Our current approach involves querying Pinecone with random vectors and parallelizing the update of a metadata field. I managed to get the time down to ~30 minutes. That’s still the bottleneck in our sync process, though.
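For anyone trying to reproduce this, the loop looks roughly like the Python sketch below. It is a simplified, sequential illustration, not our production code: `query_unseen` and `mark_seen` are hypothetical callables standing in for a filtered Pinecone query (returning only ids whose metadata flag is unset) and a metadata update, and `FakeIndex` is an in-memory stand-in so the sketch runs without a Pinecone account.

```python
import random
from typing import Callable, Dict, List, Sequence, Set


def random_unit_vector(dim: int) -> List[float]:
    """Draw a random direction; Gaussian components give a uniform direction."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def collect_all_ids(
    query_unseen: Callable[[Sequence[float], int], List[str]],
    mark_seen: Callable[[List[str]], None],
    top_k: int = 1000,
    dim: int = 1536,
) -> Set[str]:
    """Query with random vectors until no unmarked ids come back."""
    seen: Set[str] = set()
    while True:
        ids = query_unseen(random_unit_vector(dim), top_k)
        if not ids:
            return seen
        mark_seen(ids)  # in practice this is the slow, parallelized step
        seen.update(ids)


class FakeIndex:
    """In-memory stand-in for the real index (hypothetical, for illustration)."""

    def __init__(self, ids: List[str]) -> None:
        self.flags: Dict[str, bool] = {i: False for i in ids}

    def query_unseen(self, vector: Sequence[float], top_k: int) -> List[str]:
        # A real query would rank by similarity; here we just return unmarked ids.
        return [i for i, marked in self.flags.items() if not marked][:top_k]

    def mark_seen(self, ids: List[str]) -> None:
        for i in ids:
            self.flags[i] = True


if __name__ == "__main__":
    fake = FakeIndex([f"doc-{n}" for n in range(2500)])
    all_ids = collect_all_ids(fake.query_unseen, fake.mark_seen, top_k=1000, dim=8)
    print(len(all_ids))
```

Each pass costs one query plus one metadata update per returned id, which is why the update side dominates the runtime and needs the parallelization. The flags also have to be reset (or a new flag value used) before the next sync run.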