Hello,
I’ve been stuck for a while trying to upsert data into my Pinecone index using vector embeddings generated by a BERT model. Pinecone rejects the payloads I pass it, whether the vectors are numpy arrays, lists, or strings. Even after collecting the embeddings in a DataFrame and attempting to upsert from there, the error persists. I also tried passing each vector as a tuple, but got an error that a tuple of length 768 is unsupported. Any advice on how to proceed would be appreciated.
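For reference, Pinecone’s `upsert` does not accept numpy arrays or a bare 768-element tuple of floats; it expects a list of `(id, values)` pairs where `values` is a plain Python list of floats. A minimal sketch of that conversion, using random numpy arrays as stand-ins for real BERT embeddings:

```python
import numpy as np

# Random stand-ins for real BERT embeddings: two 768-dim numpy arrays
embeddings = [np.random.rand(768) for _ in range(2)]
text_ids = ["text_id_1", "text_id_2"]

# Pinecone expects (id, values) pairs where values is a plain float list,
# so convert each numpy array with .tolist() before upserting
vectors_to_upsert = [
    (text_id, emb.tolist()) for text_id, emb in zip(text_ids, embeddings)
]
```

The resulting `vectors_to_upsert` is what would be passed as the `vectors=` argument to `index.upsert(...)`.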
!pip install -qU \
  openai==0.27.7 \
  "pinecone-client[grpc]" \
  pinecone-datasets=='0.5.0rc11' \
  transformers \
  pymupdf \
  tqdm \
  markdown \
  pandas  # pandas is required for the DataFrame step below
from transformers import BertTokenizer, BertModel
import pinecone
import numpy as np
import pandas as pd
from google.colab import drive
drive.mount('/content/gdrive')
# Initialize the Pinecone client (setting the variables alone is not
# enough; pinecone.init must be called before any index operation)
api_key_pinecone = "<your_pinecone_api_key>"
env_pinecone = "<your_pinecone_environment>"
pinecone_index_name = '<your_pinecone_index_name>'
pinecone.init(api_key=api_key_pinecone, environment=env_pinecone)
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
# Function to generate a 768-dimensional embedding for a piece of text
def generate_embeddings(text):
    # truncation=True guards against texts longer than BERT's 512-token limit
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = model(**inputs)
    # Mean-pool the token embeddings into a single 768-dim numpy vector
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().detach().numpy()
    return embeddings
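Independent of Pinecone, it is worth confirming what shape the pooling step actually produces. A numpy-only sketch of the same mean-pooling, with a random array standing in for `outputs.last_hidden_state` (batch 1, 10 tokens, hidden size 768):

```python
import numpy as np

# Dummy stand-in for outputs.last_hidden_state: (batch, seq_len, hidden)
last_hidden_state = np.random.rand(1, 10, 768)

# Mean over the token axis, then drop the batch axis -> one 768-dim vector
embedding = last_hidden_state.mean(axis=1).squeeze()

print(embedding.shape)  # (768,)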
# Sample text data
texts = [
"Pursuant to the legal representation of Ms. Samantha Turner, ...",
"As you know our office represents the interests of Mr. John Doe ...",
"We act on behalf of Mr. John Doe regarding a recent incident ...",
"Representing Mr. John Doe in connection with a personal injury ...",
"In light of a medical malpractice incident involving Mr. John Doe ...",
"Our representation of Mr. John Doe pertains to a product liability ...",
"Our firm is representing Mr. John Doe, who was involved in a recent ..."
]
# Assign unique text_ids to each text
text_ids = ["text_id_1", "text_id_2", "text_id_3", "text_id_4", "text_id_5", "text_id_6", "text_id_7"]
# Create embeddings for each text
embeddings = [generate_embeddings(text) for text in texts]
# Convert embeddings to DataFrame with text_ids
data = {'text_id': text_ids, 'embedding': embeddings}
df = pd.DataFrame(data)
# Specify the filename for the DataFrame
df_filename = "embeddings_dataframe.csv"
# Save DataFrame to a CSV file
df_save_path = f"/content/gdrive/MyDrive/Colab Notebooks/Folder/{df_filename}" # Specify the desired path
df.to_csv(df_save_path, index=False)
# Print the DataFrame
print("DataFrame:")
print(df)
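One caveat about the CSV step: writing numpy arrays to CSV stores their string representation, so if the DataFrame is ever reloaded from that file, the `embedding` column comes back as strings, not arrays. A small sketch of the round-trip with a dummy 3-element vector (using an in-memory buffer instead of a real file):

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "text_id": ["text_id_1"],
    "embedding": [np.random.rand(3)],  # small vector for illustration
})

# Round-trip through CSV: numpy arrays are written as their string repr
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df_loaded = pd.read_csv(buf)

print(type(df["embedding"].iloc[0]))         # numpy.ndarray in memory
print(type(df_loaded["embedding"].iloc[0]))  # str after reloading
```

For the upsert below, use the in-memory DataFrame (where the column still holds numpy arrays), not one reloaded from the CSV.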
# Check whether the knowledge base index exists and create it if not.
# bert-base-uncased produces 768-dimensional vectors, so the index
# dimension must be 768; 1536 is the dimension of OpenAI embeddings
# and would cause every upsert of these vectors to be rejected.
if pinecone_index_name not in pinecone.list_indexes():
    pinecone.create_index(
        pinecone_index_name,
        dimension=768,
        metric='cosine',
    )
# Connect to the knowledge base index
index_knowledge_base = pinecone.Index(index_name=pinecone_index_name)
# Upsert (id, vector) pairs from the DataFrame. Pinecone's upsert takes a
# single `vectors` argument of (id, values) tuples (there is no `ids`
# parameter), and `values` must be a plain Python list of floats, so
# convert each numpy array with .tolist().
vectors_to_upsert = [
    (text_id, embedding.tolist())
    for text_id, embedding in zip(df['text_id'], df['embedding'])
]
index_knowledge_base.upsert(vectors=vectors_to_upsert)
print("Embeddings uploaded successfully to Pinecone.")