We are currently using Pinecone for our customer-facing application. However, we have noticed that the size of the index keeps increasing when we repeatedly ingest the same data into the vector store. Here is the code snippet we are using:
How can we identify if a document has already been ingested, so that we only upload new data to Pinecone or filter out duplicates within the Pinecone namespace?
What are the implications of uploading the same file multiple times? I have observed that the number of vectors keeps growing as if the namespace is being recreated from scratch every time.
There are a couple of alternatives to handle this outside of Pinecone:
In our data ingestion pipeline, we can keep track of documents we have already ingested and filter out duplicates before they reach Pinecone. We'd rather not do this, because it becomes cumbersome at the scale at which we want to upload documents.
We can delete the Pinecone namespace every time we upload new data to it. We are not sure about the side effects of this, such as whether we would hit a cold-start problem where the first few queries are unresponsive, whether latency would increase, etc.
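For reference, the namespace deletion we have in mind looks roughly like the following, a minimal sketch using the Pinecone Python client in which the API key, index name, and namespace are placeholders:

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")   # placeholder credentials
index = pc.Index("your-index-name")     # placeholder index name

# Remove every vector in the namespace before re-ingesting the data.
index.delete(delete_all=True, namespace="your-namespace")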
Please help us clarify these points and suggest improvements.
Two possible solutions come to mind for this. One is to make sure you’re using consistent vector IDs for your documents. If the same document has the same vector ID, then any subsequent upserts would only update the data for it; it wouldn’t add a new vector. However, this might not be feasible if you’re ingesting documents from different sources.
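As a minimal sketch of that first idea, you could derive the vector ID from a hash of the chunk text, so re-ingesting the same content overwrites the existing vector rather than adding a new one. The index name, namespace, and the split_docs/embeddings objects below are assumptions standing in for your own setup:

import hashlib
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")   # placeholder credentials
index = pc.Index("your-index-name")     # placeholder index name

def content_id(text: str) -> str:
    # Same chunk text -> same ID, so a repeat upsert updates instead of adding.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

for doc in split_docs:                  # assumes LangChain-style Document objects
    index.upsert(
        vectors=[{
            "id": content_id(doc.page_content),
            "values": embeddings.embed_query(doc.page_content),
            "metadata": {"source": doc.metadata.get("source", "")},
        }],
        namespace="your-namespace",     # placeholder namespace
    )

With IDs like that, repeated ingestion of the same data keeps the namespace size flat, since each upsert just overwrites the vector that is already there.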
The other is to run a query with a top_k of 1 before inserting. If you get a match back that is 100% the same, then obviously it's the same document. But since the real world is what it is, you might not get 100% matches if there are subtle differences in the document data (like metadata in the document being different and being picked up by the embedding model). So you may need to experiment and settle on something like a 99.999% match, assume the documents are the same, and discard the upsert.
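A rough sketch of that check with the Pinecone Python client, assuming a cosine-similarity index (so near-duplicates score close to 1.0), an existing index handle and embeddings object, and a cutoff you would tune for your own data:

SIMILARITY_CUTOFF = 0.99999             # tune this threshold for your data

query_vec = embeddings.embed_query(doc.page_content)
result = index.query(vector=query_vec, top_k=1, namespace="your-namespace")

if result.matches and result.matches[0].score >= SIMILARITY_CUTOFF:
    pass                                # close enough to an existing vector: treat it as a duplicate and skip it
else:
    index.upsert(
        vectors=[{"id": new_vector_id, "values": query_vec}],   # new_vector_id: however you assign IDs
        namespace="your-namespace",
    )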
The downside to the second approach is that it slows your ingestion rate, since you have to run that query first. And because it's not feasible to do this in batches, you'd have to upsert each document individually, or else add a queue that performs the upserts after documents have passed through a new pipeline stage that handles the comparison work.
Hi Cory,
I am interested in getting some more info on the first solution you are suggesting. How can you create and add a consistent vector ID? My guess is that Balaji could first add some more detail to their documents by attaching metadata? For example:
from langchain_community.vectorstores import Pinecone  # or langchain.vectorstores in older versions

# Assumes split_docs, embeddings, and index_name are defined earlier in the pipeline.
metadatas = []
for doc in split_docs:
    metadatas.append({
        "source": doc.metadata["source"]
    })

pinecone_vector_store = Pinecone.from_texts(
    [doc.page_content for doc in split_docs],
    embeddings,
    index_name=index_name,
    metadatas=metadatas,
)
But how would a consistent vector ID be created, and from what? And what would the vector-management process look like?
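My rough guess at what this could look like, assuming from_texts accepts an ids argument and that hashing the chunk text is a reasonable way to derive a stable ID:

import hashlib

ids = [
    hashlib.sha256(doc.page_content.encode("utf-8")).hexdigest()
    for doc in split_docs
]

pinecone_vector_store = Pinecone.from_texts(
    [doc.page_content for doc in split_docs],
    embeddings,
    index_name=index_name,
    metadatas=metadatas,
    ids=ids,
)

Is that roughly the idea?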
I have a very similar question. I want to use Pinecone Assistant to upload a large hierarchy of documents in which duplicates are possible, for example because copies are stored in a backup directory. I might also restart the document upload process without knowing where the previous upload ended.
One thing I could do is make a hash of each document, and not include any documents with duplicate hashes.
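Something like this is what I have in mind, just a sketch where the directory path and the upload step are placeholders for whatever Assistant upload call I end up using:

import hashlib
from pathlib import Path

seen_hashes = set()

for path in Path("docs_root").rglob("*"):   # placeholder root of the document hierarchy
    if not path.is_file():
        continue
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest in seen_hashes:
        continue                            # same bytes as an earlier file (e.g. a backup copy), skip it
    seen_hashes.add(digest)
    # upload the file here, e.g. the Assistant file-upload call

I could also persist seen_hashes to disk so that a restarted run knows what was already uploaded.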
But the question is: what is the effect of uploading the same document multiple times (possibly under different names)? Does that content get assigned a heavier weight, effectively giving it more importance during retrieval?
I presume that, since I intend to use Assistant, I don't have access to the underlying vectors, so I can't compare them.