How to check if embeddings exist for given data in Pinecone

anusha.gudipati · November 8, 2023, 4:47pm

I have a JSON file containing data for which I initially created embeddings in Pinecone. Now, when I ask questions about this data, I want to check if the embeddings are already available. How can I determine if embeddings exist for the data I previously provided? I’d appreciate your guidance on this matter.

Cory_Pinecone · November 8, 2023, 5:01pm

Hi @anusha.gudipati, welcome to the Pinecone community!

This is what vector IDs are for. Since each vector in an index has to have a unique vector ID, you can quickly determine if the embeddings in your JSON are present by using consistent vector IDs when upserting them. You can use the fetch() API to retrieve up to 1000 vectors at once by their IDs.

Also, when upserting, if the same vector ID already exists, it will be updated with the current embedding values and metadata. So if your data has changed, you don’t have to check if it exists already, as long as you’re using consistent vector IDs.
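
A minimal sketch of that idea, under a few assumptions: the vector ID is derived from a SHA-256 hash of the chunk text (one possible "consistent ID" scheme, not the only one), `index` is a Pinecone index handle, and the response of `index.fetch(ids=...)` exposes a `vectors` mapping keyed by ID. The helper names `chunk_id` and `find_missing` and the batch size of 100 are hypothetical choices for illustration.

```python
import hashlib
from typing import Dict, List


def chunk_id(text: str) -> str:
    """Derive a stable vector ID from the chunk text (illustrative scheme)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_missing(index, chunks: List[str], batch_size: int = 100) -> List[str]:
    """Return the chunks whose IDs are not yet stored in the index.

    `index` is assumed to offer fetch(ids=[...]) returning an object
    with a `vectors` mapping of the IDs that were found.
    """
    # Map each ID back to its text; identical chunks collapse to one entry.
    ids_to_chunks: Dict[str, str] = {chunk_id(c): c for c in chunks}
    ids = list(ids_to_chunks)
    missing: List[str] = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        response = index.fetch(ids=batch)
        found = set(response.vectors)  # IDs that already exist in the index
        missing.extend(ids_to_chunks[i] for i in batch if i not in found)
    return missing
```

Only the chunks returned by `find_missing` would then need to be embedded and upserted under `chunk_id(chunk)`. Upserting an ID that already exists simply overwrites it, so the check mainly saves embedding calls.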

tim · November 9, 2023, 1:09am

@Cory_Pinecone's solution works well for tracking IDs. I'd just like to add some more context, since I'm reading the question differently.

If you embed the same sentence twice with the same model, you will likely get nearly identical vectors, but almost never exactly identical ones. Think of a single component coming back as 0.02355 one time and 0.02354 the next. That makes hashing a vector into an ID a problem, since even a slight deviation produces a completely different hash.
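
To illustrate why hashing the vector is fragile while hashing the source text is stable, here is a small sketch. The two vectors are made-up values that differ in a single component, and `vector_hash` / `text_hash` are hypothetical helper names.

```python
import hashlib
import struct
from typing import Sequence


def vector_hash(vector: Sequence[float]) -> str:
    """Hash the raw float values of an embedding."""
    packed = struct.pack(f"{len(vector)}d", *vector)
    return hashlib.sha256(packed).hexdigest()


def text_hash(text: str) -> str:
    """Hash the text the embedding was created from."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Made-up embeddings of the "same" sentence from two separate runs.
run_1 = [0.02355, -0.10412, 0.33107]
run_2 = [0.02354, -0.10412, 0.33107]

print(vector_hash(run_1) == vector_hash(run_2))                   # False
print(text_hash("same sentence") == text_hash("same sentence"))  # True
```

So if you want a hash-based ID, derive it from the text chunk rather than from the embedding values.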

You could do a similarity search before each upsert, and if the returned score is above about 0.99 (that is, 99% similar on a cosine index, where scores top out at 1.0), it's almost certain that the data is already in the vector DB. Obviously, ID hashing is easier, but you'll get more duplicates. Similarity searching will make upserts slower, but should give better results for preemptive data deduplication.
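
A rough sketch of the similarity check, assuming a cosine-metric index and that `index.query(vector=..., top_k=1)` returns an object whose `matches` entries carry `id` and `score`. The function name `find_near_duplicate` is hypothetical, and the 0.99 threshold is only a starting point that you would tune for your data.

```python
from typing import Optional, Sequence


def find_near_duplicate(
    index, vector: Sequence[float], threshold: float = 0.99
) -> Optional[str]:
    """Return the ID of the closest stored vector if it is similar enough.

    Returns None when the index has no match at or above `threshold`,
    meaning the vector looks new and can be upserted.
    """
    response = index.query(vector=list(vector), top_k=1)
    matches = response.matches
    if matches and matches[0].score >= threshold:
        return matches[0].id
    return None
```

This costs one extra query per chunk before every upsert, which is where the slower upsert times come from.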

Again, it's much easier to just keep a DB mapping text chunk => vector ID, as Cory outlined, and check that way. It just depends on your use case.
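
One way the text chunk => vector ID lookup could look, sketched with SQLite from the standard library. The table layout and the names `open_store`, `lookup_vector_id`, and `record_chunk` are assumptions for illustration. The text is stored as a hash so that long chunks still make a compact key.

```python
import hashlib
import sqlite3
from typing import Optional


def _hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the lookup table of text hash -> vector ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "text_hash TEXT PRIMARY KEY, vector_id TEXT NOT NULL)"
    )
    return conn


def lookup_vector_id(conn: sqlite3.Connection, text: str) -> Optional[str]:
    """Return the vector ID recorded for this chunk, or None if unseen."""
    row = conn.execute(
        "SELECT vector_id FROM chunks WHERE text_hash = ?", (_hash(text),)
    ).fetchone()
    return row[0] if row else None


def record_chunk(conn: sqlite3.Connection, text: str, vector_id: str) -> None:
    """Remember that this chunk was upserted under `vector_id`."""
    conn.execute(
        "INSERT OR REPLACE INTO chunks (text_hash, vector_id) VALUES (?, ?)",
        (_hash(text), vector_id),
    )
    conn.commit()
```

Before embedding a chunk, call `lookup_vector_id`; if it returns None, embed and upsert the chunk, then call `record_chunk` with the ID you used.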

anusha.gudipati · November 9, 2023, 3:26am

Can you provide sample code for that?

anusha.gudipati · November 9, 2023, 3:39am

Here is the code I'm using:

import os
import json
import time

import pinecone
import openai
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.document_loaders import JSONLoader
from flask import Flask, request
from dotenv import load_dotenv

load_dotenv()
app = Flask(__name__)

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY') or 'PINECONE_API_KEY'
PINECONE_ENVIRONMENT = os.getenv('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'
start_time3 = time.time()

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENVIRONMENT
)
index_name = "first-index"
# Create the index only if it does not exist yet
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=1536,  # 1536 dim of text-embedding-ada-002
    )
end_time3 = time.time()

indexes = pinecone.list_indexes()  # renamed from `list` to avoid shadowing the builtin
print(indexes, end_time3 - start_time3)


def split_docs(data, chunk_size=100, chunk_overlap=10):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = text_splitter.split_text(str(data))
    return docs


# split the docs

@app.route('/PinconeIndexing', methods=['POST'])
def pinecone_indexing():
    req = request.json
    query = req['query']
    data = req['json']
    start_time1 = time.time()
    docs = split_docs(data)
    end_time1 = time.time()
    elapsed_time1 = end_time1 - start_time1

    print(len(docs), "time1", elapsed_time1)
    embeddings = OpenAIEmbeddings()
    start_time2 = time.time()
    print(pinecone.describe_index("first-index"), "uuuuuuuuuuu")
    # Connect to the existing index instead of re-embedding the data
    index = Pinecone.from_existing_index(index_name, embeddings)
    # index = Pinecone.from_texts(docs, embeddings, index_name=index_name)
    end_time2 = time.time()
    elapsed_time2 = end_time2 - start_time2

    print("innnnnnnn", elapsed_time2)

    # query = "What is the Underwritten DSCR?"
    t = time.time()
    docs = index.similarity_search(query, include_metadata=False)
    e = time.time()
    print(e - t, "yyyyyyyyy")
    llm = OpenAI(temperature=0, openai_api_key=os.environ['OPENAI_API_KEY'])
    chain = load_qa_chain(llm, chain_type="stuff")
    # Run the chain once and reuse the answer for both the log and the response
    answer = chain.run(input_documents=docs, question=query)
    print(answer)

    return answer, 200


if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0')
