I have a JSON file containing data for which I initially created embeddings in Pinecone. Now, when I ask questions about this data, I want to check if the embeddings are already available. How can I determine if embeddings exist for the data I previously provided? I’d appreciate your guidance on this matter.
Hi @anusha.gudipati, welcome to the Pinecone community!
This is what vector IDs are for. Since each vector in an index has to have a unique vector ID, you can quickly determine whether the embeddings from your JSON are present by using consistent vector IDs when upserting them. You can use the fetch() API to retrieve up to 1,000 vectors at once by their IDs.
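A minimal sketch of that check, assuming you derive each vector ID deterministically from the chunk text. The `make_id` helper and the `index` handle are hypothetical, not from the original post:

```python
import hashlib

def make_id(text: str) -> str:
    # Deterministic ID: the same chunk text always hashes to the same ID
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_missing(index, chunks):
    # `index` is assumed to be a pinecone.Index handle.
    # fetch() returns only the vectors that exist, so any ID absent
    # from the response has not been upserted yet.
    ids = [make_id(c) for c in chunks]
    fetched = index.fetch(ids=ids)["vectors"]
    return [c for c, i in zip(chunks, ids) if i not in fetched]
```

Anything `find_missing` returns still needs to be embedded and upserted; everything else is already in the index.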
Also, when upserting, if the same vector ID already exists, it will be updated with the current embedding values and metadata. So if your data has changed, you don’t have to check if it exists already, as long as you’re using consistent vector IDs.
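As a quick illustration of that overwrite behaviour (again with hypothetical helpers — `embed` stands in for whatever embedding function you use, and `make_id` for a deterministic ID scheme):

```python
import hashlib

def make_id(text: str) -> str:
    # Same chunk text -> same ID, every run
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def upsert_chunks(index, embed, chunks):
    # Pinecone upsert semantics: an existing ID is overwritten, a new ID
    # is inserted, so re-running this on unchanged data never duplicates.
    vectors = [(make_id(c), embed(c), {"text": c}) for c in chunks]
    index.upsert(vectors=vectors)
```

Run it twice on the same chunks and the index still holds exactly one record per chunk.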
@Cory_Pinecone's solution works well for tracking IDs. I'd just like to add some more context, since I'm reading the question a little differently.
If you embed the same sentence with the same model, you will likely get near-identical vectors, but almost never exact ones. Think a 0.02355 vs. 0.02354 difference, which makes hashing the vector itself into an ID a problem, since even a slight deviation produces a completely new hash.
You could run a similarity search before each upsert, and if the returned score is above 0.99, it's almost certain that data is already in the vector DB. Obviously, ID hashing is easier, but you'll get more duplicates. Similarity searching makes upserts slower, but should give better results for pre-emptive deduplication.
Again, it's much easier to just keep a DB mapping text chunk => vector ID, as Cory outlined, and check that way. It just depends on your use case.
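For the similarity-search approach, a sketch might look like this. The `index.query` call follows the Pinecone client's shape; `embed` and the 0.99 threshold are assumptions you would swap in and tune for your own setup:

```python
def is_duplicate(index, embed, text, threshold=0.99):
    # Query for the single nearest neighbour of the new chunk's embedding.
    # A near-perfect cosine score means this text is almost certainly
    # already stored, even if the ID or exact vector differs slightly.
    res = index.query(vector=embed(text), top_k=1, include_values=False)
    matches = res["matches"]
    return bool(matches) and matches[0]["score"] >= threshold
```

You would call this before each upsert and skip any chunk it flags; the trade-off is one extra query per chunk.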
Can you provide sample code for that?
Here is the code which I'm using:
import os
import time

import pinecone
from langchain.document_loaders import DirectoryLoader, JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from flask import Flask, request
from dotenv import load_dotenv

load_dotenv()
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY') or 'PINECONE_API_KEY'
PINECONE_ENVIRONMENT = os.getenv('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)
index_name = "first-index"

start_time3 = time.time()
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=1536,  # 1536 is the dimension of text-embedding-ada-002
        metric="cosine",
    )
end_time3 = time.time()

app = Flask(__name__)

def split_docs(data, chunk_size=100, chunk_overlap=10):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = text_splitter.split_text(str(data))
    return docs

@app.route('/ask')
def ask():
    # data = ...  # the JSON loading step (e.g. JSONLoader) was not included in the snippet
    start_time1 = time.time()
    docs = split_docs(data)
    end_time1 = time.time()
    elapsed_time1 = end_time1 - start_time1
    print(len(docs), "split_docs took", elapsed_time1)

    embeddings = OpenAIEmbeddings()

    start_time2 = time.time()
    print(pinecone.describe_index(index_name))
    index = Pinecone.from_existing_index(index_name, embeddings)
    # index = Pinecone.from_texts(docs, embeddings, index_name=index_name)
    end_time2 = time.time()
    elapsed_time2 = end_time2 - start_time2
    print("from_existing_index took", elapsed_time2)

    query = "What is the Underwritten DSCR?"
    t = time.time()
    docs = index.similarity_search(query)
    e = time.time()
    print("similarity_search took", e - t)

    llm = OpenAI(temperature=0, openai_api_key=os.environ['OPENAI_API_KEY'])
    chain = load_qa_chain(llm, chain_type="stuff")
    answer = chain.run(input_documents=docs, question=query)
    print(answer)
    return answer, 200

if __name__ == '__main__':
    app.run()
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.