Pinecone with LlamaIndex RetrieverQueryEngine gives inconsistent results

Hi,

I’m using Pinecone with LlamaIndex to query existing embeddings that were previously inserted into the index by parsing a PDF document. My problem is that I get inconsistent results.

        try:
            # Connect to the existing Pinecone index that already holds the PDF embeddings
            index_name = os.getenv("pinecone_index")
            api_key = os.getenv("pinecone_api_key")
            vector_store = PineconeVectorStore(
                index_name=index_name,
                api_key=api_key,
                namespace=namespace,
            )
            logging.debug("Vector Store initialized")
            vector_index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
            logging.debug("Vector Store Index initialized")

            # Restrict retrieval to chunks that came from this particular file
            filters = MetadataFilters(
                filters=[MetadataFilter(key="file_id", value=file_id)]
            )
            retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=10, filters=filters)
            response_synthesizer = get_response_synthesizer()
            # Drop retrieved nodes whose similarity score falls below 0.7
            postprocessor = SimilarityPostprocessor(similarity_cutoff=0.7)

            query_engine = RetrieverQueryEngine(
                retriever=retriever,
                response_synthesizer=response_synthesizer,
                node_postprocessors=[postprocessor],
            )
            logging.debug("Running query engine with question.")
            # question is initialized elsewhere before this code block
            result = query_engine.query(question)
            logging.debug(f"Result = {result}")
        except Exception as ex:
            logging.error(f"Error: {str(ex)}")
            raise

When I run this code, I get the right response in roughly 10-15% of attempts; the rest of the time I get an empty response. What could be happening?

Thanks for your help!

Hi sinhavarmainc, welcome to the forum!

It would help to understand what you mean by the right response versus an empty response. Do you mean the responses change between calls even with the same query?

It looks like you are building a RAG pipeline, where the response_synthesizer generates a response based on what Pinecone returns. Have you checked whether the same chunks are coming back from Pinecone even when the response changes? It’s hard to tell without that specific information. My hunch is that the responses are changing at the generation layer; for a given query, the returned chunks will generally be the same, with few exceptions.
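One quick way to check is to call the retriever directly and log what it returns, before any synthesis happens. Here is a minimal sketch, assuming the retriever and question variables from your snippet (the score and get_content() attributes come from LlamaIndex’s NodeWithScore objects):

    # Debugging sketch: inspect the raw chunks the retriever returns for a given question
    nodes = retriever.retrieve(question)
    logging.debug(f"Retrieved {len(nodes)} chunks")
    for node_with_score in nodes:
        # Each NodeWithScore carries the similarity score and the underlying text chunk
        logging.debug(f"score={node_with_score.score} text={node_with_score.get_content()[:120]!r}")

If the same chunks come back every time, the variability is almost certainly downstream of Pinecone.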

If that is indeed the case, the difference comes from the generating LLM itself. You can turn the LLM’s temperature down to reduce randomness, which should make the responses less variable.
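For example, if you are using OpenAI models through LlamaIndex (an assumption on my part, and the model name below is just an example), pinning the temperature might look roughly like this:

    # Sketch: make generation as deterministic as the model allows
    from llama_index.core import Settings
    from llama_index.llms.openai import OpenAI

    Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.0)

With Settings.llm set before you build the query engine, the response synthesizer should pick up that LLM by default.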

Hope this helps, and please let us know if it still doesn’t work!
-Arjun

Hi arjun,

Thanks for your response. Indeed, I’m building a RAG pipeline. Yes, the same query was returning different results: in 1-2 out of 10 attempts (usually the first few), it would return a valid response based on the retrieved chunks; the other 8-9 times, the response would be empty. I’m using temperature=0.

Then I tweaked the prompt a little bit (i.e., made it a little more elaborate and explicit) and the issue went away. I’m curious what could be going on.

Just curious - how do I check if the same chunks are being returned?

Thanks for your help!

Hi sinhavarmainc!

It’s quite hard to tell exactly what’s going on without seeing your entire script. The fact that changing the prompt resolved your issue is interesting, but without exposing the chunks returned to the generating LLM it’s really hard to tell where the issue lies.

Like I said earlier, either the old query wasn’t “good” enough to return the chunks needed for the LLM to generate an appropriate answer, or the LLM itself was returning a poor answer even with good context. When you changed the query, that could have:

  • returned different chunks from the vector DB (Pinecone) than the previous query did, which then influenced the LLM’s generation, since the LLM uses those chunks to create its response

  • returned the same chunks but still influenced the LLM’s response, since the query itself is usually passed to the LLM along with the chunks when generating the answer. In this scenario, Pinecone is working as expected, and your LLM is unusually sensitive to the specific wording of your query

I would try to replicate your pipeline using the Pinecone API directly, to see if the issue persists. That is, run the query with the Pinecone Python client, pass the results to an LLM along with a prompt, and see whether that works as intended.
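Here’s a rough sketch of that idea. It assumes you embed the question with the same embedding model that was used to index the PDF (I’m using OpenAI’s text-embedding-3-small purely as an example) and that your chunk text is stored under a metadata key such as "text"; adjust both to match however your PDF was upserted:

    # Sketch: query Pinecone directly with the Python client, mirroring your LlamaIndex filters
    import os
    from pinecone import Pinecone
    from openai import OpenAI

    pc = Pinecone(api_key=os.getenv("pinecone_api_key"))
    index = pc.Index(os.getenv("pinecone_index"))
    oai = OpenAI()

    # Embed the question with the same model that built the index (example model name)
    query_vector = oai.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    # Same namespace and file_id filter as your LlamaIndex retriever
    results = index.query(
        vector=query_vector,
        top_k=10,
        namespace=namespace,
        filter={"file_id": {"$eq": file_id}},
        include_metadata=True,
    )

    for match in results.matches:
        # The metadata key holding the chunk text depends on how the data was upserted
        print(match.score, (match.metadata or {}).get("text", "")[:120])

You can then paste the returned chunks plus your question into an LLM prompt by hand and compare the answer against what your pipeline produces.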

Here’s our quickstart in case you need it!

The idea is to create visibility into what your pipeline is doing so you can diagnose where it breaks.

If there’s a difference, then the issue lies with the LlamaIndex configuration inside the RetrieverQueryEngine rather than with Pinecone, and I’d advise you to look to their forums for assistance.

I’m not an expert in the LlamaIndex packages, so I can’t advise you on the best practice for exposing the chunks there.

If you find the answer there, please come back and post it so others can learn from your experience.

Thanks again, and hope this helps!
-Arjun