Struggling with upsert syntax

Hey guys,
I am struggling to get the parameters for index.upsert() right.
I have a CSV where one column is a paragraph of text and the other column is the embedding for that text, created with GPT-3. I want to store the embeddings in the database, along with two strings as metadata.

Right now I'm doing something like:

for i, row in df.iterrows():
    # Extract the values from the "paragraph" and "embeddings" columns
    paragraph = row["paragraph"]
    embeddings = row["embeddings"]

    # Call the upsert function with the values
    index.upsert(str(i), embeddings)

Can anybody help me out here?
Thanks
Heiko

Hi @heiko,

In your CSV file, the embeddings column is a string formatted as a two-dimensional list with only a single entry; it looks like "[[ ... embeddings ... ]]". You should edit those strings down to a single pair of brackets as part of your data transformation. I also recommend adding an explicit ID column to each line with a string that will serve as the vector's ID. These steps will make the rest of the load process much easier.
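A minimal sketch of that transformation, assuming hypothetical column names "paragraph" and "embeddings" (adjust to match your CSV):

```python
import ast
import pandas as pd

# Hypothetical miniature of the CSV: an "embeddings" column whose cells are
# strings with doubled brackets, like "[[ ... ]]".
df = pd.DataFrame({
    "paragraph": ["first text", "second text"],
    "embeddings": ["[[0.1, 0.2, 0.3]]", "[[0.4, 0.5, 0.6]]"],
})

# Parse each string into a real nested list, then keep only the inner list,
# which drops the extra pair of brackets
df["embeddings"] = df["embeddings"].apply(lambda s: ast.literal_eval(s)[0])

# Add an explicit string ID column to serve as each vector's ID
df["id"] = df.index.astype(str)
```

After this, each cell in "embeddings" is an actual list of floats and each row has a string ID, which is the shape the rest of the load process wants.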

Once you start upserting, it's important to note that upsert() requires its vectors argument to be a list of tuples, not just the row ID and its embeddings. So more like this:

index.upsert(vectors=[(str(i), embeddings)])

Mind you, doing it this way will only upload one row at a time, so it will be slow. I’d recommend chunking your dataframe and doing the upserts in batches of 100; you may also need to convert your data into a specific format to ensure the upsert works as expected. You can use something simple like these examples:

def chunker(seq, size):
    """Yield successive slices of the original iterable, each up to `size` rows."""
    for pos in range(0, len(seq), size):
        yield seq.iloc[pos:pos + size]

def convert_data(chunk):
    """Convert a pandas dataframe into the list of tuples the `upsert()` method in the Pinecone Python client expects."""
    data = []
    for i in chunk.to_dict('records'):
        if 'metadata' in i:
            data.append((str(i['id']), i['embeddings'], i['metadata']))
        else:
            data.append((str(i['id']), i['embeddings']))
    return data

Then you would use it by breaking up the dataframe df:

for chunk in chunker(df, 100):
    index.upsert(vectors=convert_data(chunk))

Let me know how that works for you or if you have any problems.

Cory


Thanks, Cory. Really appreciated. I'll go through it and let you know how it went.

I'm getting this error now:
PineconeProtocolError: Failed to connect; did you specify the correct index name?
Can you help me out again?

from pinecone.core.client.model.vector import Vector
import pinecone
import pandas as pd
import ast

df = pd.read_csv("/content/drive/MyDrive/semantic_search_mat/bgb_csv_with_embeddings.csv")

def chunker(seq, size):
    """Yield successive slices of the original iterable, each up to `size` rows."""
    for pos in range(0, len(seq), size):
        yield seq.iloc[pos:pos + size]

def convert_data(chunk):
    """Convert a pandas dataframe into the list of tuples the `upsert()` method in the Pinecone Python client expects."""
    data = []
    for i in chunk.to_dict('records'):
        if 'metadata' in i:
            data.append((str(i['id']), i['embedding'], i['metadata']))
        else:
            data.append((str(i['id']), i['embedding']))
    return data

pinecone.init(
    api_key="<REDACTED>",
    environment="us-west1-gcp"
)
# check if 'bgb' index already exists (only create index if not)
if 'bgb' not in pinecone.list_indexes():
    pinecone.create_index('bgb', 2048)
# connect to index
index = pinecone.Index('bgb')
for chunk in chunker(df, 100):
    index.upsert(vectors=convert_data(chunk))

Hi @heiko,

I’ll take a look in a little bit, but I recommend you immediately delete your API key from your project and generate a new one.

Remember to never disclose your API keys as they give full control over your index, including the permission to delete or modify data in it. Because this key has been published in a public place it should be considered compromised and should be deleted.

Pinecone team members will never ask for your API key.
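One common way to keep the key out of the script entirely is to read it from an environment variable. A minimal sketch (the variable name PINECONE_API_KEY here is just a convention, not required by the client):

```python
import os

# Read the key from the environment instead of hard-coding it in the notebook.
# Set it first in the shell, e.g.:  export PINECONE_API_KEY="your-new-key"
api_key = os.environ.get("PINECONE_API_KEY", "")

if not api_key:
    print("PINECONE_API_KEY is not set")

# Then pass it to the client instead of a literal string:
# pinecone.init(api_key=api_key, environment="us-west1-gcp")
```

That way the key never lands in a file you might paste into a public forum.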

Cory

@heiko please include the whole trace from the error. I need more context to know what the problem might be.

Thanks for pointing that out. Can't believe that slipped through.

Here you go.

Hey @Cory_Pinecone

I am getting a new error now. Don't know if I made it better or worse :)

ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-length': '34482', 'content-type': 'text/plain', 'date': 'Sat, 24 Dec 2022 16:45:12 GMT', 'server': 'envoy', 'connection': 'close'})
HTTP response body: vectors[0].values: invalid value "[0.03192412108182907, 0.003208060981705785, 0.0036474389489740133, -0.04201175644993782, -0.030696269124746323, -0.021559614688158035, -0.013855453580617905, 0.02006693370640278, -0.02341342903673649, -0.022594861686229706, 0.03437982127070427, 0.010045504197478294, -0.016937118023633957, -0.019007612019777298, -0.008185671642422676, 0.02751830220222473, 0.05181048810482025, -0.022739315405488014, 0.009136654436588287, 0.0078004635870456696, 0.003451825585216284, -0.010502939112484455, 0.001489671878516674, -0.0048211198300123215, -0.01325356587767601, 0.005883451551198959, 0.012447035871446133, -0.04003756493330002, 0.009624183177947998, -0.020741047337651253, -0.00345784449018538, -0.0005966211319901049, -0.011809035204350948, -0.0037136466708034277, -0.01110482681542635, 0.003948383033275604, -0.020512331277132034, -0.012338696047663689, 0.0028243577107787132, -0.0053236959502100945, 0.016828779131174088, 0.0010826453799381852, -0.0008960601990111172, -0.0012301078531891108, -0.012398885563015938, 0.02334120310842991, -0.004559298977255821, -0.009834843687713146, -0.03743740916252136, 0.010924260132014751, 0.016985269263386726, 0.030070306733250618, -0.01552870124578476, 0.006668915040791035, -0.02317267470061779, -0.03895416855812073, -0.01875481940805912, 0.004941497463732958, -0.011700695380568504, 0.00015696100308559835, -0.011291411705315113, 0.005952668841928244, -0.005985772702842951, 0.0033585329074412584, -0…

from pinecone.core.client.model.vector import Vector
import pinecone
import pandas as pd
import ast
import sys

def chunker(seq, size):
    """Yield successive slices of the original iterable, each up to `size` rows."""
    for pos in range(0, len(seq), size):
        yield seq.iloc[pos:pos + size]

def convert_data(chunk):
    """Convert a pandas dataframe into the list of tuples the `upsert()` method in the Pinecone Python client expects."""
    data = []
    for i in chunk.to_dict('records'):
        #if 'metadata' in i:
        #    data.append((str(i['id']), i['embedding'], i['metadata']))
        #else:
        if len(str(i['embedding'])) < 100:
            print("########### wrong embedding format found ######")
            continue
        embeddings_str = i['embedding']
        embeddings_str = embeddings_str.replace("\n", " ")

        data.append((str(i['id']), embeddings_str))

    df_debug_convert = pd.DataFrame(data)
    df_debug_convert.to_csv("/content/drive/MyDrive/semantic_search_mat/debug_convert.csv")
    print(data)
    return data

pinecone.init(
    api_key="",
    environment="us-west1-gcp"
)

# check if 'bgb' index already exists (only create index if not)
if 'bgb' not in pinecone.list_indexes():
    pinecone.create_index('bgb', 2048)
# connect to index
index = pinecone.Index(index_name="bgb")

df = pd.read_csv("/content/drive/MyDrive/semantic_search_mat/bgb_csv_with_embeddings.csv")

for chunk in chunker(df, 1):
    #print(chunk)
    df_debug_chunks = pd.DataFrame(chunk)
    df_debug_chunks.to_csv("/content/drive/MyDrive/semantic_search_mat/debug_chunks.csv")
    index.upsert(vectors=convert_data(chunk))

Here is some additional info:

So something seems to be wrong with the embedding? Any advice on how to go on from here?

HTTP response body: vectors[0].values: invalid value "[-0.0024958476424217224, 0.010389539413154125, -0.008653582073748112, -0.01779192127287388, (and the rest of the list ]" for type TYPE_FLOAT

I think I got it. Seems like the list of vectors was a string object.
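For anyone landing here later, a minimal sketch of that fix: convert the string back into a real list of floats before upserting (the sample values below are made up; ast is already imported in the scripts above):

```python
import ast

# The CSV cell holds the vector as a string like "[0.03, 0.003, -0.04, ...]".
# ast.literal_eval turns that string back into an actual list of floats,
# which is what upsert() expects for the vector values.
embedding_str = "[0.0319241, 0.0032080, -0.0420117]"
embedding = ast.literal_eval(embedding_str)
```

So in convert_data, appending ast.literal_eval(i['embedding']) instead of the raw string should make the TYPE_FLOAT error go away.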