How to set up a recurring upsert job? Any best-practice advice?

Hi there,

I’m a data scientist from Outside Inc. We recently started using Pinecone.
I’ve been reading Pinecone’s GitHub and example code, which provide very clear instructions on how to batch upsert data.

The problem we are trying to solve:

  • After batch-upserting the historical data, we’d like to schedule a cron job that upserts new data periodically.

One possible approach I can think of (sketched below):

  • first get the latest published date stored in Pinecone metadata
  • then query our Postgres database (our content catalog) for rows with published date >= that latest Pinecone date
  • upsert the selected rows in batches
  • a better option might be to upsert each new item into Pinecone asynchronously (e.g., via Celery) as soon as it is ingested into our database
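To make that concrete, here’s a rough sketch of the cron-job version. One wrinkle I noticed while writing it: as far as I can tell, Pinecone has no query that returns the max of a metadata field, so this sketch keeps the watermark in Postgres instead. The table, column, and index names (content_items, sync_state, content-catalog) and the embed() helper are placeholders for our setup, and I’m assuming the current Pinecone Python client plus psycopg2:

```python
import os
import psycopg2
from pinecone import Pinecone

BATCH_SIZE = 100  # Pinecone's docs suggest upserting in batches of ~100 vectors

def embed(text: str) -> list[float]:
    """Placeholder: call our actual embedding model here."""
    raise NotImplementedError

def run_incremental_upsert():
    pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
    index = pc.Index("content-catalog")  # placeholder index name

    # Keep the watermark in our own database rather than asking Pinecone,
    # since Pinecone can't aggregate (e.g. max) over metadata fields.
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    with conn, conn.cursor() as cur:
        # Assumes a seeded row for this job exists in sync_state.
        cur.execute("SELECT last_synced_at FROM sync_state WHERE job = 'pinecone'")
        (last_synced_at,) = cur.fetchone()

        cur.execute(
            """
            SELECT id, body, published_at
            FROM content_items
            WHERE published_at >= %s
            ORDER BY published_at
            """,
            (last_synced_at,),
        )
        rows = cur.fetchall()

    # Upsert in batches; reusing ids makes re-runs idempotent.
    for start in range(0, len(rows), BATCH_SIZE):
        batch = rows[start : start + BATCH_SIZE]
        index.upsert(
            vectors=[
                {
                    "id": str(item_id),
                    "values": embed(body),
                    "metadata": {"published_at": published_at.isoformat()},
                }
                for item_id, body, published_at in batch
            ]
        )

    # Advance the watermark only after all batches succeed.
    if rows:
        with conn, conn.cursor() as cur:
            cur.execute(
                "UPDATE sync_state SET last_synced_at = %s WHERE job = 'pinecone'",
                (max(r[2] for r in rows),),
            )
```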

I’m curious whether anyone has experience with this approach, and I’m eager to hear what the best practices are for handling recurring upsert jobs.

Thanks everyone in advance!

Best,

Wen


> The problem we are trying to solve:
>
> After batch-upserting the historical data, we’d like to schedule a cron job that upserts new data periodically.

That doesn’t sound like a problem so much as an objective: you want to set up a cron job to do batch upserts.

What problem are you actually running into?

This seems similar to what I am doing at my startup too.

Currently we maintain a status column on every row in the database and bulk-update it as we index the data in Pinecone. Each subsequent cron job then just filters on this status column plus a datetime column. Having a status column also lets you build metrics for failures, the amount of data still pending indexing, and so on.
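Roughly, the per-run logic looks like the sketch below. The table, column, and index names here are illustrative rather than our real schema, and it assumes the current Pinecone Python client plus psycopg2:

```python
import os
import psycopg2
from pinecone import Pinecone

def index_pending_rows(embed, batch_size: int = 100):
    """One cron-job run: index rows marked 'pending', then bulk-update their status."""
    pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
    index = pc.Index("content-catalog")  # illustrative index name

    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, body FROM content_items
            WHERE index_status = 'pending'
            ORDER BY created_at
            LIMIT %s
            """,
            (batch_size,),
        )
        rows = cur.fetchall()
        if not rows:
            return

        try:
            index.upsert(
                vectors=[{"id": str(i), "values": embed(body)} for i, body in rows]
            )
            new_status = "indexed"
        except Exception:
            new_status = "failed"  # counting rows per status gives you failure metrics

        # Bulk-update the status column in a single statement.
        cur.execute(
            "UPDATE content_items SET index_status = %s WHERE id = ANY(%s)",
            (new_status, [i for i, _ in rows]),
        )
```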

Hopefully this helps!

Hi @Wen,

I think @apinard’s suggestion of using a cron job to upsert new data periodically is the simplest way to go. But I’d like to learn more about your setup, including your current architecture; there might be another option that suits your needs better. For instance, you mentioned Celery: queue systems like that can be great when lots of data streams in constantly, less so when the data itself arrives in batches.
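For what it’s worth, the streaming variant can be as small as a single Celery task fired from your ingest path. Everything named below (broker URL, index name, vector shape) is illustrative rather than a prescription:

```python
import os
from celery import Celery
from pinecone import Pinecone

app = Celery("tasks", broker=os.environ["CELERY_BROKER_URL"])

@app.task(bind=True, max_retries=3, default_retry_delay=30)
def upsert_item(self, item_id: str, vector: list, metadata: dict):
    """Upsert a single item into Pinecone; retry on transient failures."""
    try:
        pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
        index = pc.Index("content-catalog")  # illustrative index name
        index.upsert(
            vectors=[{"id": item_id, "values": vector, "metadata": metadata}]
        )
    except Exception as exc:
        raise self.retry(exc=exc)
```

On ingest you’d just call upsert_item.delay(item_id, vector, metadata) and let the workers absorb retries; whether that beats a batched cron job really depends on your arrival pattern, which is why I’d like to hear more about your setup.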

I’ll email you offline to find a time that works for you.

Cory