Adding Dataframe to pinecone | ApiValueError: Unable to prepare type DataFrame for serialization

rstang · June 19, 2023, 10:59pm

Hi, I am trying to add DataFrames to Pinecone. I’m using Haystack to create a QA model for tables.

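# Imports assumed; they were not shown in the original post.
import json

import pandas as pd
from haystack import Document

processed_tables = []
# Use a separate name for the file handle so it is not shadowed by the parsed data.
with open(f"{table_dir}/tables.json") as f:
    tables = json.load(f)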
    print(tables)
    for key, table in tables.items():
        current_columns = table["header"]
        print(current_columns)
        current_rows = table["data"]
        current_df = pd.DataFrame(columns=current_columns, data=current_rows)
        document = Document(content=current_df, content_type="table", id=key)
        processed_tables.append(document)

document_store.write_documents(processed_tables)

And this is the error I’m getting:

ApiValueError: Unable to prepare type DataFrame for serialization

This is what my document_store looks like:

document_store = PineconeDocumentStore(
    api_key=pinecone_key,
    environment='northamerica-northeast1-gcp',
    similarity="cosine",
    index='new_scrapped',
    embedding_dim=1536
)

Can anyone please help out?

chris.a · October 17, 2023, 5:42pm

Hi there! This is a great use case, and I hope I can offer a few options here.

In your code snippet, you’re creating a Document object with a pandas DataFrame as the content. However, the Pinecone client cannot serialize DataFrame objects, hence the error. Below are two steps to work around this, with two format options.

Convert DataFrame to a Serializable Format:

Before storing the DataFrame in Pinecone, convert it to a format that Pinecone can understand. Common serializable formats include JSON or string representations.
You can convert a DataFrame to a JSON string using the to_json() method, or to a simple string using the to_string() method provided by pandas.

json_string = current_df.to_json(orient="split")
# or
string_representation = current_df.to_string()
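To see what each call produces, here is a minimal sketch on a made-up two-row table. The column names and values are placeholders, not from the original post, and the sketch only assumes pandas is installed.

```python
import json

import pandas as pd

# Hypothetical sample table standing in for one entry of tables.json.
sample_df = pd.DataFrame(
    columns=["item", "qty"],
    data=[["apple", 3], ["pear", 5]],
)

# orient="split" keeps columns, index, and rows as separate JSON keys.
json_string = sample_df.to_json(orient="split")

# to_string() gives a human-readable text rendering of the table.
string_representation = sample_df.to_string()

# Both results are plain Python strings. The JSON one can be parsed back.
payload = json.loads(json_string)

# Rebuild the table later from the parsed JSON payload.
rebuilt_df = pd.DataFrame(columns=payload["columns"], data=payload["data"])
```

The JSON form is the easier one to turn back into a DataFrame later. The to_string() form reads more naturally but is harder to parse reliably.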

Store String Representation:

Update the Document initialization to use the string or JSON representation of the DataFrame instead of the DataFrame object itself.

document = Document(content=json_string, content_type="table", id=key)
# or
document = Document(content=string_representation, content_type="table", id=key)
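Putting both steps together, the loop from the question could be sketched roughly as follows. To keep it runnable without Haystack or a tables.json file, it uses an inline dictionary with made-up data and collects plain dicts. The `records` name and the sample tables are assumptions for illustration only. Each record's fields map onto the Document(content=..., id=...) call above.

```python
import pandas as pd

# Made-up stand-in for the parsed tables.json, using the same
# "header"/"data" layout as in the question.
tables = {
    "table_1": {"header": ["item", "qty"], "data": [["apple", 3], ["pear", 5]]},
    "table_2": {"header": ["name"], "data": [["x"], ["y"]]},
}

records = []
for key, table in tables.items():
    current_df = pd.DataFrame(columns=table["header"], data=table["data"])
    # Serialize before building the document so that no DataFrame object
    # is handed to the document store.
    records.append({"id": key, "content": current_df.to_json(orient="split")})
```

Whether the document store accepts string content together with content_type="table" is not confirmed here, so it is worth checking against the Haystack version in use.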

Finally, make sure Haystack is able to digest this format. You may need to adjust how you upsert and query the documents to account for the JSON representation. I hope this helps you get up and running!