Best way to store conversation data

Hey there Pinecone community.
I have conversation data that looks like this:

conversation = {
  "_id": "a498c5e1-6d73-442a-893a-a4e875170ec5",
  "patient_id": "eca50a0f-4fd7-4a78-bea7-50b1a62a7890",
  "messages": [
    {
      "role": "user",
      "text": "Hallo, meine 13 Wochen alte Katze ist gerade vom Fensterbrett gefallen und humpelt seitdem, wenn man die Pfote anfässt miaut sie aber nicht. Sie belastet die Pfote gar nicht."
    },
    {
      "role": "agent",
      "text": "Hallo Sarah, ich bin Julia"
    },
    {
      "role": "user",
      "text": "Hallo."
    },
    {
      "role": "agent",
      "text": "Um einschätzen zu können, wie schlimm es ist und entscheiden zu können, was die notwendigen weiteren Schritte sind und ob Du ggf. noch heute in eine Klinik fahren solltest, würde ich Dich gerne mit einer unserer Tierärztinnen verbinden."
    },
    {
      "role": "user",
      "text": "Ok danke."
    }
  ]
}

How can I store such data in the vector database? Should I separate the messages and store them individually, or would it be better to join the whole conversation into one string and embed the whole thing?
I’d like to store the data in a way that lets my LLM get contextual knowledge and relate the messages to each other.

I’d be glad to hear your suggestions.

@khalid.salama

Hi!

The optimal approach varies depending on the purpose of storing the conversation data.

  1. Natural Conversation: If the main purpose is natural conversation between the AI and the user, one option is not to vectorize the text at all and to store it as is. Given the ever-expanding token limits of LLMs, a brute-force method like retaining all conversation history in Redis and providing it all as context to the LLM when needed might become the best solution in the future.

  2. Vectorization for Search: On the other hand, if you want to vectorize the conversation data and make it searchable, you need a different approach. Storing the entire text as a single vector is not recommended, because a single conversation can contain a mix of different topics, and condensing them all into one vector makes proper searching difficult. However, vectorizing each message individually loses information such as the relationships between messages and the flow of the conversation. Therefore, to make it work well, you might want to divide the conversation into relevant parts and vectorize each one (a sketch follows below).
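
To make option 2 concrete, here is a minimal sketch of splitting a conversation into small windows of messages, embedding each window, and upserting it with metadata so the chunks can later be related back to the full conversation. It assumes the OpenAI embeddings API and the classic Pinecone Python client; the index name, window size, and metadata fields are placeholders you would adapt:

```python
import openai
import pinecone

openai.api_key = "YOUR_OPENAI_KEY"                        # placeholder
pinecone.init(api_key="YOUR_PINECONE_KEY",                # placeholder
              environment="YOUR_ENVIRONMENT")
index = pinecone.Index("conversations")                   # assumes an existing 1536-dim index

def embed(text):
    # text-embedding-ada-002 returns a 1536-dimensional vector
    res = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return res["data"][0]["embedding"]

def upsert_conversation(conversation, window_size=4):
    """Embed consecutive windows of messages so neighbouring turns stay together."""
    messages = conversation["messages"]
    vectors = []
    for start in range(0, len(messages), window_size):
        window = messages[start:start + window_size]
        chunk_text = "\n".join(f"{m['role']}: {m['text']}" for m in window)
        vectors.append((
            f"{conversation['_id']}#{start}",              # unique id per chunk
            embed(chunk_text),
            {                                               # metadata ties the chunks back together
                "conversation_id": conversation["_id"],
                "patient_id": conversation["patient_id"],
                "position": start,
                "text": chunk_text,
            },
        ))
    index.upsert(vectors=vectors)
```

At query time you can take the top matching chunks and, via the conversation_id and position metadata, pull in the surrounding messages to restore the flow of the conversation.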

If any of this doesn’t match what you have in mind, I’d be glad to hear more details.

@dra thanks for your reply!
I think it would be more reasonable to store the conversations as a whole, since almost all of them are about a single topic each, or more specifically, a single problem each.

The data I have is conversations between vets and clients who have concerns about their pets (health or behavioral problems, for example). I want to create an app using an LLM that can extract useful information from the vets’ responses and maybe mimic them.

If you have any suggestion, I’d be glad to hear them.

@khalid.salama thank you for your reply.
Now I understand what you are trying to achieve.

My first suggestion was based on the idea that not all conversations necessarily revolve around a single, concrete issue. For example, a conversation might start with a cat-related issue, then move on to dog health or something completely unrelated. To address this, I suggested splitting the conversation into relevant parts while preserving the context and vectorizing each part separately. However, as you said, if all the conversations really do revolve around a single topic each, then there’s less need for such splitting.

If I were developing a similar application, I would reformat each conversation around the following factors and embed that in Pinecone, so similarity searches are driven by symptoms and contextual information.

  • Symptoms: This section contains specific information about the pet’s health, such as physical symptoms (coughing, weight loss, loss of appetite, etc.) and behavioral changes (unusual behaviors or patterns reported by the pet owner).

  • Veterinarian Advice: This section contains the specific advice and recommendations provided by the veterinarian, such as treatment options, suggested tests, and whether further examination is needed.

  • Pet Type: This section identifies the type of pet, such as dog, cat, or other animal.

  • Age of the pet: The pet’s age is very important for judging the significance of certain symptoms and problems.

  • Background: This section contains background provided by pet owners and detailed background information available to veterinarians before giving specific advice.

To achieve this, just instruct the LLM to reformat each conversation into this structure. :blush:


Thanks for the suggestion @dra.
I’m just afraid I didn’t understand the last part completely. What exactly do you mean by formatting the conversation? Did you mean including metadata in the upsert? And how can I tell the LLM to do the formatting?
I’m really new to this topic and maybe my questions sound a bit naive :smile:

I’ve been busy, so my reply is a bit late. :melting_face:

formatting the conversation

For example, the following prompt instructs the LLM to analyze the conversation and convert it into the specified JSON format. While this is a well-known method, it’s not necessarily the only correct approach. To achieve better results, it may be necessary to refine the prompt over time.

# Define the example conversation and its corresponding JSON format
example_conversation = """
**Client:** "Doctor, my dog has been scratching a lot recently and he seems a bit restless."
**Veterinarian:** "I see. Has there been any change in his food or environment?"
**Client:** "No, everything has been the same."
**Veterinarian:** "I understand. Excessive scratching and restlessness could be signs of skin issues or allergies. It's best to bring him in for a checkup. We may need to conduct a skin test."
"""
example_json = """
{
  "symptoms": {
    "description": "Dog has been scratching a lot recently and seems a bit restless."
  },
  "veterinarian_advice": {
    "description": "Excessive scratching and restlessness could be signs of skin issues or allergies. It's best to bring him in for a checkup. We may need to conduct a skin test."
  },
  "pet_information": {
    "type": "Dog",
    "breed": "",
    "age": ""
  },
  "background": {
    "description": "Dog has been scratching a lot recently and seems a bit restless. No changes in food or environment."
  }
}
"""

# Define the JSON format to be used
json_format = """
{
  "symptoms": {
    "description": ""
  },
  "veterinarian_advice": {
    "description": ""
  },
  "pet_information": {
    "type": "",
    "breed": "",
    "age": ""
  },
  "background": {
    "description": ""
  }
}
"""

# Combine everything into the final prompt
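# `conversation` below is assumed to hold the conversation you want to format
# (for example, the conversation dict from the question, or its plain-text transcript).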
prompt = f"""
Translate the following conversations into JSON format. The conversations are between a client and a veterinarian.

**Example 1:**
{example_conversation}

**JSON Format:**
{example_json}

Translate this conversation into the following JSON format.

{conversation}

{json_format}
"""

For querying, create a query in the same JSON format, using the information extracted from the conversation you want to search with. For example:

prompt = """
Translate the following conversation into a JSON format query for a vector database. The conversation is between a client and a veterinarian.

**Conversation:**
**Client:** "Doctor, my cat has been drinking a lot of water recently and she seems a bit lethargic."
**Veterinarian:** "I see. How about her food intake and urination? Has there been any change?"
**Client:** "Yes, she's been eating less and she seems to urinate more than usual."
**Veterinarian:** "I understand. Excessive drinking, decreased appetite, and increased urination could be signs of kidney issues. It's best to bring her in for a checkup. We may need to conduct a blood test and urine test."

Translate this conversation into the following JSON format query.

```json
{
  "symptoms": {
    "description": ""
  },
  "veterinarian_advice": {
    "description": ""
  },
  "pet_information": {
    "type": "",
    "breed": "",
    "age": ""
  },
  "background": {
    "description": ""
  }
}
```
"""

The expected result is as follows.

The “breed” and “age” fields under “pet_information” are left blank as they are not evident from the conversation.

```json
{
  "symptoms": {
    "description": "Cat has been drinking a lot of water, appears lethargic, eating less, and urinating more than usual."
  },
  "veterinarian_advice": {
    "description": "Excessive drinking, decreased appetite, and increased urination could be signs of kidney issues. It's best to bring her in for a checkup. We may need to conduct a blood test and urine test."
  },
  "pet_information": {
    "type": "Cat",
    "breed": "",
    "age": ""
  },
  "background": {
    "description": "Cat has been drinking a lot of water and appears lethargic. Her food intake has decreased and she seems to urinate more than usual."
  }
}
```
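
On the retrieval side, a minimal sketch of turning that extracted query JSON into an actual Pinecone query (same client assumptions as above, reusing the `embed()` helper and `index` handle from the earlier sketches, with the pet type used as a metadata filter) could look like this:

```python
import json

# `query_json` is assumed to be the LLM output shown above.
query = json.loads(query_json)

results = index.query(
    vector=embed(query["symptoms"]["description"]),
    top_k=5,
    include_metadata=True,
    filter={"pet_type": {"$eq": query["pet_information"]["type"]}},
)

for match in results.matches:
    print(match.score, match.metadata["veterinarian_advice"])
```

The matched metadata (including the veterinarian’s advice) can then be fed back to the LLM as context when you want it to mimic the vets’ responses.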

This is the kind of flow I had in mind in my earlier reply.