For semantic search, I convert PDF contents to vectors and store them in Pinecone.
However, if the same PDF file is uploaded again, the embedding step and the vector storage are repeated unnecessarily.
Is it possible to check for duplicates by namespace or collection name before saving the data?
If a duplicate check is not possible in Pinecone, my plan is to save the PDF slug in MySQL and check it before every save to Pinecone, so that duplicate vectors are prevented in advance.
This is my code:
def post(self, request, *args, **kwargs):
    question = request.POST.get('question')
    if question:
        # Retrieve PDF object
        pdf = self.get_object()
        # Split the text of the PDF into chunks of 1500 characters
        text_splitter = CharacterTextSplitter(chunk_size=1500, separator="\n")
        file_path = pdf.pdf_file.path
        loader = UnstructuredPDFLoader(file_path)
        data = loader.load()
        texts = text_splitter.split_documents(data)
        embeddings = OpenAIEmbeddings(openai_api_key=self.request.user.profile.openai_api_key)
        pinecone.init(
            api_key=self.request.user.profile.pinecone_api_key,  # find at app.pinecone.io
            environment=self.request.user.profile.pinecone_api_env  # next to api key in console
        )
        index_name = "langchain2"
        docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)
        docs = docsearch.similarity_search(question, include_metadata=True)
        llm = OpenAI(temperature=0.5, openai_api_key=self.request.user.profile.openai_api_key)
        chain = load_qa_chain(llm, chain_type="stuff")
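
The fallback described in the question (record each PDF outside Pinecone and check that record before embedding) could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not a confirmed Pinecone feature: `pdf_fingerprint`, `chunk_ids`, and `IndexedPdfRegistry` are hypothetical names, and the in-memory set stands in for the MySQL table. It fingerprints the file's bytes instead of its slug, on the assumption that the same file might be uploaded under a different name.

```python
import hashlib
from typing import List, Set


def pdf_fingerprint(pdf_bytes: bytes) -> str:
    """Return a SHA-256 hex digest that identifies the PDF by its content."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def chunk_ids(fingerprint: str, n_chunks: int) -> List[str]:
    """Build one deterministic vector ID per chunk, e.g. '<digest>-0'."""
    return [f"{fingerprint}-{i}" for i in range(n_chunks)]


class IndexedPdfRegistry:
    """Hypothetical stand-in for the MySQL table of already-indexed PDFs."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def claim(self, fingerprint: str) -> bool:
        """Return True the first time a fingerprint is seen, False afterwards."""
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True


# Example: the second upload of identical bytes is rejected before embedding.
registry = IndexedPdfRegistry()
first = pdf_fingerprint(b"%PDF-1.4 example bytes")
again = pdf_fingerprint(b"%PDF-1.4 example bytes")
other = pdf_fingerprint(b"%PDF-1.4 different bytes")

results = [registry.claim(first), registry.claim(again), registry.claim(other)]
ids = chunk_ids(first, 3)
```

In the view above, `claim` would be called with the bytes of `pdf.pdf_file` before the loader and embedding steps, and those steps would be skipped when it returns `False`. Since a Pinecone upsert overwrites an existing vector that has the same ID, storing the chunks under these deterministic IDs should also keep a repeated upload from adding new vectors, assuming the vector store wrapper in use accepts explicit IDs.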