Hey everyone, how have you solved the problem of vectors being overwritten when you ran into it? Any solutions?

If you save new vectors with IDs identical to those already present in your index, the new vectors will overwrite the existing ones. I’ve been trying to address this issue by starting to generate IDs from the last vector stored in the index. In other words, I obtain the total number of vectors using index.describe_index_stats().total_vector_count and then begin saving vectors starting from the last ID plus one. However, my code is not working perfectly because index.describe_index_stats().total_vector_count returns the value indefinitely. How have you addressed this problem?

from tqdm import tqdm

# Assumes index (the Pinecone index), model (the embedding model) and text_list() are defined earlier.
batch = 100

for i in range(0, len(text_list()), batch):
    last_id = index.describe_index_stats().total_vector_count
    ids = []
    vect = []

    print("Total vector count ", last_id)
    end = min(i + batch, len(text_list()))

    if last_id == 0:
        print("vector count ", last_id)
        for id in tqdm(range(0, end)):
            ids.append(str(id))
    else:
        for id in tqdm(range(last_id + 1, end + last_id)):
            ids.append(str(id))

    for text in tqdm(range(i, end)):
        vect.append(model.encode(text_list()[text]))

    print("Text is being converted to embeddings ", i, "to", end)

    # zip stops at the shorter list, so any extra IDs are dropped
    to_upsert = list(zip(ids, vect))

I would not use total_vector_count + 1 as the vector ID, mostly because the count is a quantitative value that only tells you how much data is in the index. Assuming the last insert has the highest ID can wind up overwriting a lot of data.
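To make that concrete, here is a minimal sketch in which a plain dict stands in for the index (no Pinecone calls are involved, and the deletions are my own assumed scenario). Once anything has been deleted, the count drops below the highest ID, so count + 1 can land on an ID that is still in use:

```python
# A plain dict stands in for the index: key = vector ID, value = vector.
fake_index = {str(n): [float(n)] for n in range(5)}  # IDs "0" through "4"

# Assumed scenario: two early vectors get deleted at some point.
del fake_index["1"]
del fake_index["2"]

# The count now says 3, but the highest ID in the index is still "4".
count = len(fake_index)
next_id = str(count + 1)

# True here: writing to next_id would overwrite the existing vector "4".
collides = next_id in fake_index
print(count, next_id, collides)
```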

In theory, you could give every item in an upsert run a metadata key called batch_id, set to some random UUID. Then, on the last element, add an additional property, is_last: true. Before the next batch, you can query the metadata for the previous batch_id plus is_last: true and know for certain that the record had the highest ID of the previous upsert, then start from there. You would keep a local copy of each batch ID that runs so you know what the last batch ID was.
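A rough sketch of that tagging step, in case it helps. The helper names tag_batch and last_record_filter are made up for illustration, and the record layout (id / values / metadata dicts) and the $eq filter form are what I believe Pinecone accepts, so check them against your client version before relying on this:

```python
import uuid


def tag_batch(ids, vectors):
    """Return (batch_id, records); only the final record carries is_last=True."""
    batch_id = str(uuid.uuid4())
    pairs = list(zip(ids, vectors))
    records = []
    for pos, (vec_id, values) in enumerate(pairs):
        metadata = {"batch_id": batch_id}
        if pos == len(pairs) - 1:
            metadata["is_last"] = True
        records.append({"id": vec_id, "values": values, "metadata": metadata})
    return batch_id, records


def last_record_filter(batch_id):
    """Metadata filter that should match only the last record of a batch."""
    return {"batch_id": {"$eq": batch_id}, "is_last": {"$eq": True}}


# Keep batch_id somewhere locally so the next run can look up the previous batch.
batch_id, records = tag_batch(["7", "8", "9"], [[0.1], [0.2], [0.3]])
previous_batch_filter = last_record_filter(batch_id)
```

You would pass records to the upsert call and use the filter in a metadata query before the next batch; both calls are left out here because they need a live index.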

This is not foolproof, but it would be less error-prone than the current implementation. Ideally, vector IDs aren’t even sequential and are UUIDs themselves, mostly because vector databases don’t work like relational ones. You would just track the vector IDs in some RDB that you run in your system alongside the vector DB.
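For the UUID route, a minimal sketch could look like this. make_records is a hypothetical helper and the vectors are dummy values; the output is the same (id, vector) tuple shape that the to_upsert list in the question uses:

```python
import uuid


def make_records(vectors):
    """Pair each vector with a freshly generated random UUID string."""
    return [(str(uuid.uuid4()), vec) for vec in vectors]


# Dummy vectors for illustration; real ones would come from model.encode(...).
to_upsert = make_records([[0.1, 0.2], [0.3, 0.4]])
```

Since the IDs no longer depend on what is already in the index, there is no need to call describe_index_stats() before each batch.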

If you go with the traditional “tracking vector IDs in an RDB alongside the vector DB” method, then there is no need for metadata: you can query the relational DB with SELECT MAX(vector_id) FROM vectors and be done, instead of all that metadata magic and management.
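Here is a small sketch of that idea using SQLite from the Python standard library. The vectors table with a text column, the in-memory database, and the helper names are all assumptions for illustration; any relational DB would do:

```python
import sqlite3

# In-memory database for the example; a real setup would use a persistent one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vectors (vector_id INTEGER PRIMARY KEY, text TEXT)")


def next_vector_id(conn):
    """Return MAX(vector_id) + 1, or 0 when the table is empty."""
    (max_id,) = conn.execute("SELECT MAX(vector_id) FROM vectors").fetchone()
    return 0 if max_id is None else max_id + 1


def register(conn, texts):
    """Record a batch of texts and return the string IDs to use for the upsert."""
    start = next_vector_id(conn)
    rows = [(start + n, t) for n, t in enumerate(texts)]
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO vectors (vector_id, text) VALUES (?, ?)", rows)
    return [str(vector_id) for vector_id, _ in rows]


first = register(conn, ["a", "b", "c"])
second = register(conn, ["d", "e"])
```

The second batch picks up where the first one ended, and the ID bookkeeping never depends on the vector count reported by the index.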

Do you really need the IDs to be consecutive numbers by insert order?
Otherwise, why not just use a uuid as id?
