Removing duplicate embeddings

I’m working on a ChatYourData project with LangChain and Next.js, and the search results from my Pinecone index suggest that I’ve somehow uploaded duplicate sets of embeddings: all my results come back in identical pairs.

It’s a small (<1000 vector) db and easily enough deleted and reimported, but it has me wondering: is there perhaps some simple native method of deduping a vector db?

There’s no native method. The best way I can think of is to query with the returned vectors, grab the IDs of those that have duplicates, and delete one from each pair. That’s fairly straightforward with a small db like yours, but it would get hard to do for a big dataset. In that case, you’d probably need a search-and-filter mechanism: give every vector a metadata field like {"clean": 0}; once you have retrieved a vector (and removed its duplicates), update its metadata to {"clean": 1}; then include filter={"clean": 0} in your next query so that only vectors you haven’t checked yet come back.
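
For a small index, the "grab the IDs of the duplicates" step could look roughly like the sketch below. It is only an illustration, not Pinecone functionality: it assumes you have already pulled (id, values) pairs out of your query results, and the helper name find_duplicate_ids and the rounding precision are made up for this example. It is written in Python for brevity; the same grouping logic would carry over to a Next.js project.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def find_duplicate_ids(
    records: Iterable[Tuple[str, Sequence[float]]],
    decimals: int = 6,
) -> List[str]:
    """Return the IDs to delete so one copy of each distinct vector remains.

    `records` is assumed to be (id, values) pairs already fetched from the index.
    """
    groups: Dict[Tuple[float, ...], List[str]] = defaultdict(list)
    for vec_id, values in records:
        # Round so tiny float differences do not hide an exact re-upload.
        key = tuple(round(v, decimals) for v in values)
        groups[key].append(vec_id)

    to_delete: List[str] = []
    for ids in groups.values():
        # Keep the first ID seen in each group; the rest are duplicates.
        to_delete.extend(ids[1:])
    return to_delete


# Hypothetical sample data: "b7" is a re-upload of "a1".
records = [
    ("a1", [0.1, 0.2, 0.3]),
    ("b7", [0.1, 0.2, 0.3]),
    ("c4", [0.9, 0.8, 0.7]),
]
duplicate_ids = find_duplicate_ids(records)
print(duplicate_ids)  # ['b7']
```

The resulting list can then be handed to whatever delete-by-ID call your client exposes (in the Pinecone Python client, something like index.delete(ids=duplicate_ids)). This only catches exact or near-exact copies of the same vector; it says nothing about documents that are merely similar.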

Ah-ha, very clever approach, appreciate that James! (I opted to fix my doubles by reinitializing my db, but it’s helpful to think through the filter operation as you describe, thanks.)