Hello,
I’ve been stuck for a while trying to upsert data into my Pinecone index using vector embeddings generated by a BERT model. Pinecone rejects the payloads I pass it, whether the vectors are numpy arrays, lists, or strings. Even after collecting the embeddings in a DataFrame and attempting to upsert from there, the error persists. I also tried passing each vector as a tuple, but got an error that a tuple of length 768 is unsupported. Any advice on how to proceed would be appreciated.
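For reference, Pinecone’s `upsert` does not accept numpy arrays or a bare 768-element tuple of floats; it expects a list of `(id, values)` pairs where `values` is a plain Python list of floats. A minimal sketch of that conversion, using random numpy arrays as stand-ins for real BERT embeddings:

```python
import numpy as np

# Random stand-ins for real BERT embeddings: two 768-dim numpy arrays
embeddings = [np.random.rand(768) for _ in range(2)]
text_ids = ["text_id_1", "text_id_2"]

# Pinecone expects (id, values) pairs where values is a plain float list,
# so convert each numpy array with .tolist() before upserting
vectors_to_upsert = [
    (text_id, emb.tolist()) for text_id, emb in zip(text_ids, embeddings)
]
```

The resulting `vectors_to_upsert` is what would be passed as the `vectors=` argument to `index.upsert(...)`.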
!pip install -qU \
  openai==0.27.7 \
  "pinecone-client[grpc]" \
  pinecone-datasets=='0.5.0rc11' \
  transformers \
  pymupdf \
  tqdm \
  markdown \
  pandas  # pandas is required for the DataFrame step below
from transformers import BertTokenizer, BertModel
import pinecone
import numpy as np
import pandas as pd
from google.colab import drive
drive.mount('/content/gdrive')
# Initialize the Pinecone client (setting the variables alone is not
# enough; pinecone.init must be called before any index operation)
api_key_pinecone = "<your_pinecone_api_key>"
env_pinecone = "<your_pinecone_environment>"
pinecone_index_name = '<your_pinecone_index_name>'
pinecone.init(api_key=api_key_pinecone, environment=env_pinecone)
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
# Function to generate a 768-dimensional embedding for a piece of text
def generate_embeddings(text):
    # truncation=True guards against texts longer than BERT's 512-token limit
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = model(**inputs)
    # Mean-pool the token embeddings into a single 768-dim numpy vector
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().detach().numpy()
    return embeddings
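Independent of Pinecone, it is worth confirming what shape the pooling step actually produces. A numpy-only sketch of the same mean-pooling, with a random array standing in for `outputs.last_hidden_state` (batch 1, 10 tokens, hidden size 768):

```python
import numpy as np

# Dummy stand-in for outputs.last_hidden_state: (batch, seq_len, hidden)
last_hidden_state = np.random.rand(1, 10, 768)

# Mean over the token axis, then drop the batch axis -> one 768-dim vector
embedding = last_hidden_state.mean(axis=1).squeeze()

print(embedding.shape)  # (768,)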
# Sample text data
texts = [
"Pursuant to the legal representation of Ms. Samantha Turner, ...",
"As you know our office represents the interests of Mr. John Doe ...",
"We act on behalf of Mr. John Doe regarding a recent incident ...",
"Representing Mr. John Doe in connection with a personal injury ...",
"In light of a medical malpractice incident involving Mr. John Doe ...",
"Our representation of Mr. John Doe pertains to a product liability ...",
"Our firm is representing Mr. John Doe, who was involved in a recent ..."
]
# Assign unique text_ids to each text
text_ids = ["text_id_1", "text_id_2", "text_id_3", "text_id_4", "text_id_5", "text_id_6", "text_id_7"]
# Create embeddings for each text
embeddings = [generate_embeddings(text) for text in texts]
# Convert embeddings to DataFrame with text_ids
data = {'text_id': text_ids, 'embedding': embeddings}
df = pd.DataFrame(data)
# Specify the filename for the DataFrame
df_filename = "embeddings_dataframe.csv"
# Save DataFrame to a CSV file
df_save_path = f"/content/gdrive/MyDrive/Colab Notebooks/Folder/{df_filename}" # Specify the desired path
df.to_csv(df_save_path, index=False)
# Print the DataFrame
print("DataFrame:")
print(df)
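One caveat about the CSV step: writing numpy arrays to CSV stores their string representation, so if the DataFrame is ever reloaded from that file, the `embedding` column comes back as strings, not arrays. A small sketch of the round-trip with a dummy 3-element vector (using an in-memory buffer instead of a real file):

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "text_id": ["text_id_1"],
    "embedding": [np.random.rand(3)],  # small vector for illustration
})

# Round-trip through CSV: numpy arrays are written as their string repr
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df_loaded = pd.read_csv(buf)

print(type(df["embedding"].iloc[0]))         # numpy.ndarray in memory
print(type(df_loaded["embedding"].iloc[0]))  # str after reloading
```

For the upsert below, use the in-memory DataFrame (where the column still holds numpy arrays), not one reloaded from the CSV.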
# Check whether the knowledge base index exists and create it if not.
# bert-base-uncased produces 768-dimensional vectors, so the index
# dimension must be 768; 1536 is the dimension of OpenAI embeddings
# and would cause every upsert of these vectors to be rejected.
if pinecone_index_name not in pinecone.list_indexes():
    pinecone.create_index(
        pinecone_index_name,
        dimension=768,
        metric='cosine',
    )
# Connect to the knowledge base index
index_knowledge_base = pinecone.Index(index_name=pinecone_index_name)
# Upsert (id, vector) pairs from the DataFrame. Pinecone's upsert takes a
# single `vectors` argument of (id, values) tuples (there is no `ids`
# parameter), and `values` must be a plain Python list of floats, so
# convert each numpy array with .tolist().
vectors_to_upsert = [
    (text_id, embedding.tolist())
    for text_id, embedding in zip(df['text_id'], df['embedding'])
]
index_knowledge_base.upsert(vectors=vectors_to_upsert)
print("Embeddings uploaded successfully to Pinecone.")