@ZacharyProser Thanks a lot for your response. Let me share the sample data, code snippet, question, and return score details.
Sample data file (Col0 → used as the ID, Col5 → string to be embedded, Col14 → saved as part of the metadata):
SL No,xx,xx,xx,Impacted service,Summary,xx,xx,xx,xx,xx,xx,RCA,
1,xx,xx,xx,xx-stream,System stream pod having memory issue. Pods are getting restarted frequently.,x,x,x,x,x,x,x,x,Increase the stream Pod’s ram size and redeploy the pod,
2,xx,xx,xx,xx-system,Newcar|DATA analytics API reports are empty which was feeded from Dsystem directly.,x,x,x,x,x,x,x,x,Dsystem issue. Pls contact the Dsystem support team,
19,xx,xx,xx,xxxx,Lag issue for FITA-Topic-abc-Dsystem-medium in FITA region.,x,x,x,x,x,x,x,x,Check the data consumption rate in the abc connector from the topic FITA-Topic-abc-Dsystem-medium,
20,xx,xx,xx,MYAPP-data-curation,MYAPP data curation issue due to wrong formatted source file,x,x,x,x,x,x,x,x,The wrong formatted source file are ingested into the system. Pls verify the input file format.,
21,xx,xx,xx,MYAPP-data-curation,MYAPP data curation issue due to source file issue,x,x,x,x,x,x,x,x,The wrong formatted/empty source file are ingested into the system. Pls verify the input file format/contains.,
Code snippet
**Embedding and storing part**
```
def extract_info_from_file(self, file_path):
    print("within extract_info_from_file")
    with open(file_path, 'r', encoding='utf-8') as file:
        # Read the first line of the file
        first_line = file.readline().strip()
        reader = csv.reader(file)
        for row in reader:
            # Print the row
            print(f"{row[0]} --- {row[5]} --->{row[14]}<--")
            embedding = self.embed_model.embed_documents(row[5].strip())
            print(len(embedding[0]))
            self.index.upsert(
                vectors=[{"id": str(row[0]), "values": embedding[0],
                          "metadata": {"title": "RCA", "text": row[14].strip()}}],
                namespace="ns1")
    return True
```
**Retrieval query part**
```
question = "data curation function gives issue"
embedding = None
query_result = None
embedding = self.embed_model.embed_documents(question)
print(len(embedding[0]))
query_result = self.index.query(
    vector=embedding[0],
    top_k=3,
    include_values=False,
    include_metadata=True,
    namespace="ns1"
)
print(query_result)
```
**The question is "data curation function gives issue".**
**The objective is to find the best-matching RCA details in the saved historical data.**
**After submitting the query, the response is:**
```
{'matches': [{'id': '19',
              'metadata': {'text': 'Check the data consumption rate in the abc '
                                   'connector from the topic '
                                   'FITA-Topic-abc-Dsystem-medium',
                           'title': 'RCA'},
              'score': 0.417792231,
              'values': []},
             {'id': '21',
              'metadata': {'text': 'The wrong formatted/empty source file are '
                                   'ingested into the system. Pls verify the '
                                   'input file format/contains.',
                           'title': 'RCA'},
              'score': 0.404748976,
              'values': []},
             {'id': '20',
              'metadata': {'text': 'The wrong formatted source file are '
                                   'ingested into the system. Pls verify the '
                                   'input file format.',
                           'title': 'RCA'},
              'score': 0.404748976,
              'values': []}],
 'namespace': 'ns1',
 'usage': {'read_units': 6}}
```
**Here are my doubts:**
1. Why does the record with ID 19 get the highest score, even though it doesn't contain the words "data curation"?
2. Records 20 and 21 should appear at the top with higher scores, since both rows contain the words "data curation", but that is not happening.
I am facing this kind of search-related issue for almost all types of questions. Can you help me to understand why this is happening?
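One thing that may be worth ruling out (a guess, not a confirmed diagnosis): records 20 and 21 come back with exactly the same score, 0.404748976, even though their summaries differ. That pattern can appear when `embed_documents` expects a list of strings but receives a single string, so it iterates over the characters and `embedding[0]` is the vector for the first character only. The sketch below reproduces the effect with a toy embedder. `ToyEmbedder` and its two-number "embedding" are hypothetical stand-ins, not the real model, and whether `self.embed_model` behaves this way depends on the library, which isn't named above.

```python
from typing import List


class ToyEmbedder:
    """Hypothetical stand-in for an embedding model with a list-based API."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # One vector per element of `texts`.
        # If `texts` is a plain str, iterating it yields single characters.
        return [self._embed(t) for t in texts]

    @staticmethod
    def _embed(text: str) -> List[float]:
        # Toy 2-d "embedding": text length and sum of code points.
        return [float(len(text)), float(sum(map(ord, text)))]


model = ToyEmbedder()
a = "MYAPP data curation issue due to wrong formatted source file"
b = "MYAPP data curation issue due to source file issue"

# Passing a bare string: element 0 is the vector for "M" in both cases.
wrong_a = model.embed_documents(a)[0]
wrong_b = model.embed_documents(b)[0]

# Passing a one-element list: element 0 is the vector for the whole sentence.
right_a = model.embed_documents([a])[0]
right_b = model.embed_documents([b])[0]

print(wrong_a == wrong_b)  # True: both summaries collapse to the same vector
print(right_a == right_b)  # False: the full sentences differ
```

If that turns out to be the cause, passing `[row[5].strip()]` when storing, and `[question]` (or a single-string query method, if the library offers one) when querying, would be the first thing to try, followed by re-upserting the records.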