My Pinecone index (cosine metric, dimension 3072) is not giving proper search results

I have a set of lines. I am embedding each line using "text-embedding-3-large" and saving it into the vector DB with cosine as the metric and 3072 as the dimension. I have only 20 lines as sample data. During a query, even if I search with a single word, I am not getting a proper response.

Sample code to insert the sample data

embed_model = AzureOpenAIEmbeddings(
    model="text-embedding-3-large",
    api_key=os.getenv('OPENAI_API_KEY'),
    azure_endpoint=os.getenv('OPENAI_END_POINT'),
    api_version=os.getenv('OPENAI_API_VERSION'),
)

embedding = embed_model.embed_documents(line_from_sample_file.strip())
index.upsert([(str(row[0]), embedding[0], {"title": "RCA", "text": .strip()})])

Sample code to query the sample data

question="<sample text>"
embed_model.embed_documents(question)

Is there anything I am missing in this flow?

Hi @krishnendud, and welcome to the Pinecone community forums!

Thanks for your question.

Your overall workflow sounds solid, but I think there may be some issues in the following code:

Embedding Generation:

  • Ensure that embed_model.embed_documents(line_from_sample_file.strip()) correctly generates a 3072-dimensional embedding. Try printing len(embedding[0]) as a sanity check (see the sketch below).
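
For example, a quick check could look like this (a sketch, assuming the embed_model and one line_from_sample_file string from your snippet; note that embed_documents expects a list of texts):

    # Dimension sanity check (assumes `embed_model` and `line_from_sample_file`
    # are already defined as in your snippet).
    embedding = embed_model.embed_documents([line_from_sample_file.strip()])
    print(len(embedding))     # number of texts embedded; 1 here
    print(len(embedding[0]))  # vector dimension; should be 3072 for text-embedding-3-large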

Upsert Process:

  • The upsert line seems to have an error in the metadata dictionary. Specifically, "text": .strip() is incomplete and incorrect.

You want your upsert to look like this:

index = pc.Index(index_name)

index.upsert(
    vectors=[
        {"id": "vec1", "values": [1.0, 1.5]},
        {"id": "vec2", "values": [2.0, 1.0]},
        {"id": "vec3", "values": [0.1, 3.0]},
    ],
    namespace="ns1"
)

Where values contains all the floating point numbers returned by OpenAI’s text embedding model.
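
Combined with your embedding call, that might look roughly like this (a sketch, assuming the embed_model, index, row, and line_from_sample_file variables from your snippet):

    # Sketch: upsert a real embedding plus metadata (assumes `embed_model`,
    # `index`, `row`, and `line_from_sample_file` from the earlier snippet).
    text = line_from_sample_file.strip()
    embedding = embed_model.embed_documents([text])  # list of texts in, list of vectors out

    index.upsert(
        vectors=[
            {
                "id": str(row[0]),
                "values": embedding[0],  # the 3072 floats for this line
                "metadata": {"title": "RCA", "text": text},
            }
        ],
        namespace="ns1",
    )

If you upsert into a namespace, remember to pass the same namespace when you query.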

Also, you mention you’re not getting a proper response:
1. What is your query?
2. What are you getting back?
3. What are you expecting to get back?

I hope that helps!

Best,
Zack

@ZacharyProser Thanks a lot for your response. Let me share the sample data, code snippet, question, and the returned scores.

Sample data file (Col0 → used as ID, Col5 → string to be embedded, Col14 → saved as part of metadata)

SL No,xx,xx,xx,Impacted service,Summary,xx,xx,xx,xx,xx,xx,RCA,
1,xx,xx,xx,xx-stream,System stream pod having memory issue. Pods are getting restarted frequently.,x,x,x,x,x,x,x,x,Increase the stream Pod’s ram size and redeploy the pod,
2,xx,xx,xx,xx-system,Newcar|DATA analytics API reports are empty which was feeded from Dsystem directly.,x,x,x,x,x,x,x,x,Dsystem issue. Pls contact the Dsystem support team,
19,xx,xx,xx,xxxx,Lag issue for FITA-Topic-abc-Dsystem-medium in FITA region.,x,x,x,x,x,x,x,x,Check the data consumption rate in the abc connector from the topic FITA-Topic-abc-Dsystem-medium,
20,xx,xx,xx,MYAPP-data-curation,MYAPP data curation issue due to wrong formatted source file,x,x,x,x,x,x,x,x,The wrong formatted source file are ingested into the system. Pls verify the input file format.,
21,xx,xx,xx,MYAPP-data-curation,MYAPP data curation issue due to source file issue,x,x,x,x,x,x,x,x,The wrong formatted/empty source file are ingested into the system. Pls verify the input file format/contains.,

Code snippet
**Embedding and storing part**

def extract_info_from_file(self, file_path):
        print("within extract_info_from_file")
        with open(file_path, 'r', encoding='utf-8') as file:
            # Read the first line of the file
            first_line = file.readline().strip()
            reader = csv.reader(file)
            for row in reader:
                # Print the row
                print(f"{row[0]} --- {row[5]} --->{row[14]}<--")
                embedding = self.embed_model.embed_documents(row[5].strip())
                print(len(embedding[0]))
                self.index.upsert(
                    vectors=[{"id": str(row[0]), "values": embedding[0],
                              "metadata": {"title": "RCA", "text": row[14].strip()}}],namespace="ns1")
        return True

**Retrieval query part**
    question="data curation function gives issue"

    embedding=None
    query_result=None
    embedding = self.embed_model.embed_documents(question)
    print(len(embedding[0]))

    query_result=self.index.query(
      vector=embedding[0],
      top_k=3,
      include_values=False,
      include_metadata=True,
      namespace="ns1"
    )
    print(query_result)

**Question is "data curation function gives issue"**
**Objective is to find the best-matching RCA details from the saved historical data**

**Post query submission, the response is**

{'matches': [{'id': '19',
              'metadata': {'text': 'Check the data consumption rate in the abc '
                                   'connector from the topic '
                                   'FITA-Topic-abc-Dsystem-medium',
                           'title': 'RCA'},
              'score': 0.417792231,
              'values': []},
             {'id': '21',
              'metadata': {'text': 'The wrong formatted/empty source file are '
                                   'ingested into the system. Pls verify the '
                                   'input file format/contains.',
                           'title': 'RCA'},
              'score': 0.404748976,
              'values': []},
             {'id': '20',
              'metadata': {'text': 'The wrong formatted source file are '
                                   'ingested into the system. Pls verify the '
                                   'input file format.',
                           'title': 'RCA'},
              'score': 0.404748976,
              'values': []}],
 'namespace': 'ns1',
 'usage': {'read_units': 6}}

**Here are my doubts:
1. Why are we getting the record with ID 19 with the highest score, though it doesn't contain the phrase "data curation"?
2. The records with ID 20 or 21 should appear at the top with a higher score, as both lines contain "data curation". But that's not happening.**

I am facing this kind of search-related issue for almost all types of questions. Can you help me to understand why this is happening?

Hi @krishnendud,

Thanks for your response!

Could you share what this line outputs?

    print(len(embedding[0]))

If it’s not exactly 3072, that could be your issue.
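
It can also be worth confirming what the index itself was created with. With the current Python client, that might look like this (a sketch, assuming a Pinecone client pc and your index_name):

    # Sketch: confirm the index configuration matches the embedding model
    # (assumes `pc = Pinecone(api_key=...)` and `index_name` are already defined).
    desc = pc.describe_index(index_name)
    print(desc.dimension)  # should be 3072
    print(desc.metric)     # should be "cosine"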

@ZacharyProser I printed the embedding's length before both the upsert and the query; both times it is exactly 3072.

@ZacharyProser or anyone from your team, if you can help me with guidance on why the search is not working as expected, it would be very helpful. I am working on a POC and it is stuck here due to this search issue.

Any suggestions on my issue?

Hi @krishnendud,

In my experience building semantic search and RAG apps using Pinecone, once you’ve eliminated obvious configuration and usage errors like we have here, the main consideration becomes data quality and volume.

For example, when building a RAG chatbot designed to answer questions as if it were a popular TV show character, my first approach (which failed miserably) was to chunk, embed, and upsert every spoken line of the series (essentially the entire script). This led to poor performance because my data corpus did not actually contain the information my application needed to answer on-topic questions successfully.

When I changed my indexing process to instead ingest short, human-written descriptions of every episode, performance went through the roof - my app started working exactly as I expected.

My point is that you may be experiencing unexpected results due to your data quality and volume.

A few thoughts:

  • Consider trying one of the pre-built example notebooks we have on GitHub (pinecone-io/examples: Jupyter Notebooks to help you get hands-on with Pinecone vector databases). Pick the one closest to your desired use case and run it with your own API keys and indexes to get a sense of exactly how everything fits together, then swap out the example dataset for your own. This will help rule out any subtle application-level errors in your code that could be affecting performance.
  • If you keep your application code unchanged but swap in a public dataset and ask questions tuned to that dataset, is the performance higher or unchanged? If it's higher, that's an indication that your original dataset may have quality issues or be too small.
  • Remember that semantic search is not keyword search: you will get back the results that best represent the underlying semantic structure of the queries you embed and send to Pinecone. This again relates to the quality and volume of your data (see the sketch after this list).
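
As a rough illustration of that last point, you can compare cosine similarities for your query against a couple of the stored summaries directly (a sketch, assuming the embed_model from your snippets; numpy is only used for the math):

    # Sketch: compare the query against two of the stored summaries directly
    # (assumes `embed_model` from the earlier snippets).
    import numpy as np

    def cosine(a, b):
        a, b = np.array(a), np.array(b)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    query = "data curation function gives issue"
    summaries = [
        "Lag issue for FITA-Topic-abc-Dsystem-medium in FITA region.",   # row 19
        "MYAPP data curation issue due to wrong formatted source file",  # row 20
    ]

    query_vec = embed_model.embed_documents([query])[0]
    for summary, vec in zip(summaries, embed_model.embed_documents(summaries)):
        print(round(cosine(query_vec, vec), 4), summary)

The scores reflect how semantically similar the whole sentences are, not whether a particular keyword appears, which is why a record without the literal words "data curation" can still rank highest.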

I hope this helps!

Best,
Zack