Cannot upsert more than 2000 records from CSV

areeb · March 3, 2023, 5:46pm

I am trying to upsert embeddings to my pod, and it always stops right when it hits 2000 records. How can I get around this? Is there a workaround? This is driving me mad. I’ve got about 200000 rows to go. Any help is appreciated.

Here’s the code I am currently using:

from tqdm.auto import tqdm
import openai

MODEL = "text-embedding-ada-002"

count = 0  # we'll use the count to create unique IDs
batch_size = 250  # process everything in batches of 250
for i in tqdm(range(0, len(trec['Content']), batch_size)):
    # set end position of batch
    i_end = min(i+batch_size, len(trec['Content']))
    # get batch of lines and IDs
    lines_batch = trec['Content'][i: i+i_end]
    #print(lines_batch)
    ids_batch = [str(n) for n in range(i+3600, i_end+3600)]
    # create embeddings
    
    res = openai.Embedding.create(input=list(lines_batch), engine=MODEL)
    embeds = [record['embedding'] for record in res['data']]
    # prep metadata and upsert batch
    meta = [{'Content': line} for line in lines_batch]
    to_upsert = zip(ids_batch, embeds, meta)
    # upsert to Pinecone
    index.upsert(vectors=list(to_upsert))

Anyone know how I can get it to continuously keep uploading?

Cory_Pinecone · March 3, 2023, 6:06pm

Hi @areeb,

Do you get any error messages when the process stops? Or does the process just hang? If the latter, can you add some print() statements so we can see where it’s hanging? It could be the OpenAI call, it could be the upserts, it could be something else entirely. Without knowing where it’s getting blocked, it’s hard to say how to fix it.
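
As a rough illustration of the print() suggestion, here is a minimal sketch of the same loop with a log line before and after each stage. `run_batches`, `embed_fn`, and `upsert_fn` are hypothetical names: the two callables stand in for the `openai.Embedding.create` call and `index.upsert` from the post above, so the sketch runs without either service.

```python
import time

def run_batches(texts, batch_size, embed_fn, upsert_fn, log=print):
    """Embed and upsert texts batch by batch, logging every stage.

    embed_fn and upsert_fn are hypothetical stand-ins for the real
    embedding and upsert calls; the last log line printed shows which
    stage was running when the process stopped or hung.
    """
    for start in range(0, len(texts), batch_size):
        end = min(start + batch_size, len(texts))
        batch = list(texts[start:end])
        log(f"[{start}:{end}] embedding {len(batch)} texts")
        t0 = time.perf_counter()
        embeds = embed_fn(batch)
        log(f"[{start}:{end}] embedded in {time.perf_counter() - t0:.2f}s, upserting")
        ids = [str(n) for n in range(start, end)]
        meta = [{"Content": t} for t in batch]
        upsert_fn(list(zip(ids, embeds, meta)))
        log(f"[{start}:{end}] upsert done")
```

If the last line printed is an "embedding" line, the embedding call is the likely culprit; if it is an "upserting" line, look at the upsert. The logged text count also makes an unexpectedly large batch easy to spot.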

We also recommend using batch sizes of 100 when performing upserts. I doubt it has anything to do with the issue you’re running into, but I wanted to call it out as just a good habit to be in.
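
A small sketch of how upserts could be capped at 100 vectors regardless of the embedding batch size. `chunked` is a hypothetical helper, not part of the Pinecone client.

```python
from itertools import islice

def chunked(items, size=100):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Hypothetical usage with the loop from the original post:
# for chunk in chunked(to_upsert, 100):
#     index.upsert(vectors=chunk)
```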

areeb · March 3, 2023, 6:10pm

Here’s the error that I get:

15% 8/54 [00:25<02:38, 3.45s/it]
---------------------------------------------------------------------------
InvalidRequestError                       Traceback (most recent call last)
<ipython-input-98-80e716b39ad7> in <module>
     15     # create embeddings
     16 
---> 17     res = openai.Embedding.create(input=list(lines_batch), engine=MODEL)
     18     embeds = [record['embedding'] for record in res['data']]
     19     # prep metadata and upsert batch

4 frames
/usr/local/lib/python3.8/dist-packages/openai/api_requestor.py in _interpret_response_line(self, rbody, rcode, rheaders, stream)
    677         stream_error = stream and "error" in resp.data
    678         if stream_error or not 200 <= rcode < 300:
--> 679             raise self.handle_error_response(
    680                 rbody, rcode, resp.data, rheaders, stream_error=stream_error
    681             )

InvalidRequestError: ['In the 2007 season, mcgowdu01 had a record of 12 wins and 10 losses, with an earned run average (ERA) of 4.08. He was the starting pitcher in 27 of the 27 games in which he pitched. He recorded 1 shutouts, 0 saves, and finished 0 games. He walked 61 batters, and struck out 144. He allowed 14 home runs, was responsible for 2 hit by pitches, and threw 13 wild pitches. He had 2 complete games.', 'In the 2007 season, mcleama01 had a record of 0 wins and 0 losses, with an earned run average (ERA) of 8.22. He was the starting pitcher in 0 of the 4 games in which he pitched. He recorded 0 shutouts, 0 saves, and finished 0 games. He walked 2 batters, and struck out 5. He allowed 4 home runs, was responsible for 0 hit by pitches, and threw 0 wild pitches. He had 0 complete games.', 'In the 2007 season, mclemma02 had a record of 3 wins and 0 losses, with an earned run average (ERA) of 3.86. He was the starting pitcher in 0 of the 29 games in which he pitched. He recorded 0 shutouts, 0 saves, and finished 9 games. He walked 18 batters, and struck out 35. He allowed 5 home runs, was responsible for 1 hit by pitches, and threw 2 wild pitches. He had 0 complete games.', 'In the 2007 season, mechegi01 had a record of 9 wins and 13 losses, with an earned run average (ERA) of 3.67. He was the starting pitcher in 34 of the 34 games in which he pitched. He recorded 0 shutouts, 0 saves, and finished 0 games. He walked 62 batters, and struck out 156. He allowe...

areeb · March 3, 2023, 6:11pm

I even tried using smaller batches. But it always stops at around 2000 records.

Cory_Pinecone · March 3, 2023, 8:39pm

Hi @areeb,

That looks like an OpenAI error. I hate to punt on customer problems, but I’m far from an OpenAI expert and fear I could give you the wrong advice if I tried to debug this. I recommend you crosspost on their community: https://community.openai.com/

Sorry.

Cory
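
One possible cause, consistent with the traceback above: the slice in the original loop is `trec['Content'][i: i+i_end]`, not `[i:i_end]`, so each request sends `i_end` texts instead of `batch_size`. The batch therefore grows every iteration, whatever batch size is chosen. The sketch below reproduces only the slicing arithmetic. The row count of 13500 is an assumption inferred from the 54 iterations shown by tqdm, and the 2048-item cap on the embeddings input array is an assumption about the OpenAI endpoint at the time, not something confirmed in this thread.

```python
def slice_lengths(n_rows, batch_size, buggy):
    """Return the number of rows each iteration would send for embedding."""
    rows = range(n_rows)  # stand-in for trec['Content']
    lengths = []
    for i in range(0, n_rows, batch_size):
        i_end = min(i + batch_size, n_rows)
        stop = i + i_end if buggy else i_end  # original slice vs. corrected slice
        lengths.append(len(rows[i:stop]))
    return lengths

MAX_INPUTS = 2048  # assumed per-request limit on the embeddings input array

buggy = slice_lengths(13500, 250, buggy=True)
fixed = slice_lengths(13500, 250, buggy=False)
first_over = next(k for k, n in enumerate(buggy) if n > MAX_INPUTS)

print(buggy[:3])    # [250, 500, 750]
print(first_over)   # 8, i.e. the 9th iteration, matching "8/54" in the traceback
print(set(fixed))   # {250}
```

Under these assumptions, changing the slice to `trec['Content'][i:i_end]` keeps every request at `batch_size` texts. The earlier upserts would still have been paired correctly, because `zip` truncates to the length of `ids_batch`.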