Debugging embeddings / results relevance

heiko · January 3, 2023, 9:03pm

Hi Guys,
I embedded a couple of thousand text fragments using OpenAI’s embedding API.
My plan was to run a semantic search over them.
So when I search for animals, I expect to get paragraphs that are about bees, dogs, and so on.
However, the returned items and their scores seem quite random to me. The first result has no connection to animals whatsoever, the second is OK, the third has no relation to animals at all, and so on.

Is there a smart way to debug that? Am I missing some best practices for creating the embeddings?

greg · January 3, 2023, 9:29pm

@heiko Could you share the code you’re using (minus API keys), both for creating the index and querying?

heiko · January 3, 2023, 11:17pm

Gladly.
This is how I create the embeddings:

import openai

def get_embedding(text, model="text-embedding-ada-002"):
  return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

This is how I walk through an SQL database and upsert the embeddings:

import ast
import sys

import pandas as pd
import pinecone
import psycopg2

def chunker(seq, size):
  """Yields successive slices of the DataFrame, each at most `size` rows long."""
  for pos in range(0, len(seq), size):
    yield seq.iloc[pos:pos + size]

def convert_data(chunk):
  """Converts a pandas DataFrame to a simple list of tuples, formatted how the `upsert()` method in the Pinecone Python client expects."""
  data = []
  for i in chunk.to_dict('records'):
    if len(str(i['embedding'])) < 100:
      print("########### wrong embedding format found ######")
      continue

    embedding_string = str(i['embedding'])
    embedding_list = ast.literal_eval(embedding_string)    
    data.append((str(i['id']),embedding_list))
  return data

pinecone.init(
    api_key="xxxxxxxxxxxxxx",
    environment="us-west1-gcp"
)

# check if 'bgb' index already exists (only create index if not)
if 'bgb' not in pinecone.list_indexes():
    pinecone.create_index('bgb', 1536)
# connect to index
index = pinecone.Index(index_name="bgb")



# Connect to the database
conn = psycopg2.connect(
    "postgres://lawdb_user:xxxxxxxxxx@dpg-cel1g7ha6gdkdn1dt4cg-a.oregon-postgres.render.com/lawdb"
)

# Create a cursor
cur = conn.cursor()

# Execute a SELECT statement to retrieve all fields
cur.execute("SELECT * FROM german_laws where embedding is not null")

# Fetch all rows
rows = cur.fetchall()

# Create a Pandas data frame from the rows
df = pd.DataFrame(rows)

# Set the column names to the names of the fields
df.columns = [desc[0] for desc in cur.description]

# Close the cursor and connection
cur.close()
conn.close()

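# Postgres array literals use braces; swap them for brackets so that
# ast.literal_eval in convert_data() can parse each string into a list.
df["embedding"] = df["embedding"].str.replace("{", "[", regex=False)
df["embedding"] = df["embedding"].str.replace("}", "]", regex=False)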

for chunk in chunker(df,100):
  try:
    data = convert_data(chunk)
    #print ("data created")

    index.upsert(data)
    #print ("data upserted")
  except Exception as e:
    print(e)
    #sys.exit()
print("index Status: " +str(index.describe_index_stats()))

And this is the function with which I retrieve semantically close doc IDs from Pinecone:

def get_related_doc_ids(embedding):
    pinecone.init(
        api_key="xxxxxxxxxxxxxxxx",
        environment="us-west1-gcp"
    )
    index=pinecone.Index(index_name="bgb")
    response = index.query(
        vector=embedding,
        top_k=25,
        include_values=True
    )
    ids = []
    scores = []
    for i in response.matches:
        ids.append(i['id'])
        scores.append(i['score'])
    return ids, scores
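
One way to debug rankings like these is to score a handful of stored vectors locally and compare the order with what the index returns. The following is only a sketch: `cosine_similarity` and `rank_locally` are hypothetical helper names, and it assumes the index uses the cosine metric and that the stored vectors are available as plain lists of floats.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def rank_locally(query_vec, stored, top_k=5):
    """Rank a {doc_id: vector} mapping by similarity to query_vec, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

If the local ranking for a small sample looks sensible but the index results do not, the problem is more likely in the upsert or query path than in the embeddings themselves.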

Cory_Pinecone · January 7, 2023, 12:39am

Hi @heiko,

I see you have the get_embedding() function, but I don’t see where it gets called later. It looks like the “embeddings” you’re upserting are just strings, not arrays of floats as would be expected. This line, for instance:

df["embedding"] = df["embedding"].str.replace("}", "]")

Just replacing braces with brackets doesn’t convert the string into a vector. If the embeddings you’re working with already exist in the SQL database you’re pulling from, you first need to transform them into arrays of floats and upsert those as your data.

If you give an example of the raw strings you’re working with I can help you write a Python function that will do that.

Cory
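
A conversion of the kind described here could look like the sketch below. It assumes the column holds one-dimensional PostgreSQL array literals such as `{0.1,-0.2,3e-05}` (the actual raw strings were not shown in the thread), and `parse_pg_float_array` is a hypothetical name.

```python
import math

def parse_pg_float_array(raw):
    """Parse a string such as '{0.1,-0.2,3e-05}' into a list of floats."""
    text = raw.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("expected a string wrapped in braces")
    body = text[1:-1].strip()
    if not body:
        return []
    # float() raises ValueError on anything non-numeric, e.g. 'NULL'.
    values = [float(part) for part in body.split(",")]
    if not all(math.isfinite(v) for v in values):
        raise ValueError("embedding contains NaN or infinite values")
    return values
```

Parsing explicitly, rather than rewriting braces and evaluating the result, makes malformed rows fail loudly with a `ValueError` instead of slipping through.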

heiko · January 11, 2023, 7:23pm

Hi Cory,
Thanks for your input. If I understand you correctly, you are saying I am putting the wrong values into Pinecone in the first place? I assumed that if I don’t get an error from the Pinecone API on upsert, the values would be in the right format. Is that assumption wrong?
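
A successful upsert call does not by itself confirm that the vectors hold the intended values, so a local check before upserting can help settle the question. The sketch below assumes the data is a list of `(id, vector)` tuples like the one `convert_data()` builds and that the expected dimension is 1536, the value passed to `create_index`; `validate_vectors` is a hypothetical helper.

```python
import math

EXPECTED_DIM = 1536  # assumption: same dimension the index was created with

def _is_finite_number(value):
    # bool is a subclass of int, so exclude it explicitly.
    return (isinstance(value, (int, float))
            and not isinstance(value, bool)
            and math.isfinite(value))

def validate_vectors(data, expected_dim=EXPECTED_DIM):
    """Return a list of (id, problem) pairs for malformed (id, vector) tuples."""
    problems = []
    for vec_id, vec in data:
        if not isinstance(vec, (list, tuple)):
            problems.append((vec_id, f"not a list: {type(vec).__name__}"))
        elif len(vec) != expected_dim:
            problems.append((vec_id, f"wrong dimension: {len(vec)}"))
        elif not all(_is_finite_number(v) for v in vec):
            problems.append((vec_id, "contains non-numeric or non-finite values"))
    return problems
```

An empty result means every vector is a list of finite numbers of the expected length; it says nothing about whether the query embedding was produced with the same model as the stored ones, which is worth checking separately.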