Can I do a duplicate check before saving vector data?

For semantic search, PDF contents are converted to vectors and stored in Pinecone.
However, if the same PDF file is uploaded again, unnecessary embedding processing and vector storage are performed.
Is it possible to check for duplicates by namespace or collection name before saving the data?
If a duplicate check is not possible in Pinecone, I will save the PDF slug in MySQL, check it every time before saving to Pinecone, and prevent duplicate vectors in advance.
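Pinecone does not offer a built-in "does this document already exist?" query, but you can get the same effect by deriving deterministic vector IDs from the PDF slug and then calling `fetch` on one of those IDs before upserting. A minimal sketch, assuming the pinecone-client 2.x API (`index` is a `pinecone.Index`); `chunk_id` and `already_indexed` are hypothetical helper names:

```python
import hashlib


def chunk_id(slug: str, i: int) -> str:
    """Deterministic vector ID: the same PDF slug and chunk index
    always map to the same ID, so re-uploads collide on purpose."""
    digest = hashlib.sha256(slug.encode("utf-8")).hexdigest()[:16]
    return f"{digest}-{i}"


def already_indexed(index, slug: str, namespace: str = "") -> bool:
    """Return True if chunk 0 of this PDF was upserted before.
    fetch() only returns the IDs it actually finds in the index."""
    result = index.fetch(ids=[chunk_id(slug, 0)], namespace=namespace)
    return len(result.vectors) > 0
```

With deterministic IDs, a duplicate upload overwrites the same vectors instead of adding new ones, so the MySQL slug table becomes an optional first-pass check rather than a requirement.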

This is my code:

    # module-level imports used by the view below
    import pinecone
    from langchain.chains.question_answering import load_qa_chain
    from langchain.document_loaders import UnstructuredPDFLoader
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.llms import OpenAI
    from langchain.text_splitter import CharacterTextSplitter
    from langchain.vectorstores import Pinecone

    def post(self, request, *args, **kwargs):
        question = request.POST.get('question')
        if question:
            # Retrieve the PDF object
            pdf = self.get_object()

            # Split the text of the PDF into chunks of 1500 characters
            text_splitter = CharacterTextSplitter(chunk_size=1500, separator="\n")
            file_path = pdf.pdf_file.path
            loader = UnstructuredPDFLoader(file_path)
            data = loader.load()
            texts = text_splitter.split_documents(data)

            # Embed the chunks and store them in Pinecone
            embeddings = OpenAIEmbeddings(openai_api_key=self.request.user.profile.openai_api_key)
            pinecone.init(
                api_key=self.request.user.profile.pinecone_api_key,  # find at app.pinecone.io
                environment=self.request.user.profile.pinecone_api_env  # next to api key in console
            )
            index_name = "langchain2"

            # NOTE: from_texts embeds and upserts the whole PDF on every request
            docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)
            docs = docsearch.similarity_search(question, include_metadata=True)

            # Answer the question from the retrieved chunks
            llm = OpenAI(temperature=0.5, openai_api_key=self.request.user.profile.openai_api_key)
            chain = load_qa_chain(llm, chain_type="stuff")
            answer = chain.run(input_documents=docs, question=question)
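Note that `Pinecone.from_texts` re-embeds and re-upserts the whole PDF on every request, which is exactly the duplicate work to avoid. Once the vectors are stored, LangChain's `Pinecone.from_existing_index(index_name, embeddings)` connects to the index without embedding anything again. A sketch, with the vectorstore class passed in as a parameter so the query path is easy to see (the function name here is illustrative):

```python
def search_existing_index(vectorstore_cls, index_name, embeddings, question, k=4):
    """Query vectors that were upserted earlier instead of re-embedding.

    vectorstore_cls is expected to behave like langchain's Pinecone
    wrapper: from_existing_index() connects to stored vectors without
    calling from_texts (and therefore without any embedding work).
    """
    docsearch = vectorstore_cls.from_existing_index(index_name, embeddings)
    return docsearch.similarity_search(question, k=k)
```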


ChatGPT lied to me like this…
I wish it were real:

        namespace = f"pdf_{self.slug}"
        if not index_exists(namespace):
            create_index(namespace, vector_dim=embeddings.vector_size)
        elif has_embedding(namespace, self.slug):
            return  # Skip pinecone storing for duplicate PDFs
        upsert_vectors(namespace, [vector], [self.slug])
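
None of those helpers (`index_exists`, `create_index(..., vector_dim=...)`, `has_embedding`, `upsert_vectors`) exist in the Pinecone client, but the idea maps onto real pinecone-client 2.x calls: `pinecone.list_indexes()`, `pinecone.create_index()`, `Index.fetch()`, and `Index.upsert()`. A sketch with the `pinecone` module passed in as `pc` and the slug used as the vector ID (function name and parameters are illustrative):

```python
def upsert_if_new(pc, index_name, slug, vector, namespace):
    """pc is the imported `pinecone` module (client 2.x).

    Returns True if the vector was stored, or False if the slug
    was already present in the namespace (a duplicate upload)."""
    if index_name not in pc.list_indexes():
        pc.create_index(index_name, dimension=len(vector))
    index = pc.Index(index_name)

    # fetch() by ID is the closest thing to a duplicate check
    existing = index.fetch(ids=[slug], namespace=namespace)
    if existing.vectors:
        return False  # duplicate PDF: skip storing the vector
    index.upsert(vectors=[(slug, vector)], namespace=namespace)
    return True
```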

I am also looking for a solution to this.

I’m relatively new to Python and the world of LLM application development, transitioning from a background in .NET and SQL where I crafted business applications. So, I apologize in advance for any beginner oversights. Here’s my hack: GitHub - LarryStewart2022/pinecone_Index: Multiple file upload duplicate solution