Using namespaces for different schools' files

Hey, my use case is that I want to develop an AI chatbot using Pinecone for different schools. Each school has a different curriculum, and I want to use namespaces, but I can't figure out how to provide the directory or path for each school's folder so it can be upserted into the Pinecone index under that school's namespace. I used Chroma before this, but it has a lot of limitations around segmentation. Can someone help me or point me to related Pinecone code or a repo?

Hi @taimoorqureshi80, and welcome to the Pinecone forums! Thanks for your question.

It sounds like you’d want to follow our Using Namespaces guide to create a new namespace for each school name.

You’d do this within your application code that is performing the upsert - ensuring that you are targeting the right school’s namespace when upserting your vectors, like so:

from pinecone import Pinecone

pc = Pinecone(api_key='YOUR_API_KEY')
index = pc.Index(index_name)

index.upsert(
    vectors=[
        {"id": "vec1", "values": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},
        {"id": "vec2", "values": [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]},
        {"id": "vec3", "values": [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]},
        {"id": "vec4", "values": [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]}
    ],
    namespace="ns1"
)

The cool thing is that you don’t need a separate API call to create your namespace - it’s automatically created the first time you attempt to upsert to it.
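
If you want to confirm which namespaces exist after your upserts, you can check the index stats. Here's a minimal sketch (assuming the `index` object from above):

# List the namespaces that currently exist in the index and their
# vector counts - handy for verifying that each school's namespace
# was created by your upserts.
stats = index.describe_index_stats()
print(stats.namespaces)  # roughly: {'ns1': {'vector_count': 4}, ...}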

What you want to ensure is that your application code is writing to and querying from the correct namespace for each school when iterating.
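
For example, a rough sketch of that pattern might look like this - the school names and the tiny 8-dimensional vectors are placeholders for your own data:

# Hypothetical mapping of school name -> that school's vectors
school_documents = {
    "springfield-elementary": [{"id": "doc1", "values": [0.1] * 8}],
    "shelbyville-high": [{"id": "doc1", "values": [0.2] * 8}],
}

for school, vectors in school_documents.items():
    # Each school's vectors land in its own namespace,
    # created automatically on the first upsert.
    index.upsert(vectors=vectors, namespace=school)

# At query time, target the same school's namespace so results
# only come from that school's data.
results = index.query(vector=[0.1] * 8, top_k=3, namespace="springfield-elementary")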

We also just published a chapter on using namespaces for this exact purpose (multi-tenancy) - you may find it interesting.

Let me know if that helps or if you have any follow-up questions!

index.upsert(
    vectors=[
        {"id": "vec1", "values": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},
        {"id": "vec2", "values": [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]},
        {"id": "vec3", "values": [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]},
        {"id": "vec4", "values": [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]}
    ],
    namespace="ns1"

How do I fetch these, and how can I provide it the file name and directory instead of the ID and values?

Please help me with this; I'd be really grateful.

Hi @taimoorqureshi80,

In that example code, the index's upsert method requires that you supply vectors in that format: each record has an ID and a values field, plus optional metadata, and the upsert call itself can optionally target a namespace.
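
If you want the file name or path to travel with each vector, you can put it in the optional metadata field rather than in the ID or values. A minimal sketch - the metadata keys and the `embedding`/`chunk_text` variables are just illustrative placeholders:

# One record in the shape upsert expects; the metadata keys are up to you.
record = {
    "id": "ELA-G3-U5-L2-chunk-0",
    "values": embedding,  # list of floats from your embedding model
    "metadata": {
        "source": "curriculums/ELA-G3-U5-L2.docx",  # file path for later lookup
        "text": chunk_text,  # the chunk's text, so query results can return it
    },
}
index.upsert(vectors=[record], namespace="your-school-namespace")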

If I’m understanding your question correctly, the usual pattern is to:

  1. Send your data (the text about each school's curriculum) through an embedding model (for example, one of OpenAI's) - you'll get back vectors, also known as embeddings.
  2. Once you have the embeddings, upsert them to your index as demonstrated above.
  3. Each time you call upsert, specify the namespace of the school whose curriculum vectors you're upserting - there's a rough sketch of the whole flow just below.
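
Putting those three steps together, a sketch of the flow might look like the following. The folder layout, the `load_docx_text` helper, and the embedding model choice are assumptions to illustrate the pattern, not a prescribed setup:

import os
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("your-index-name")

# Hypothetical mapping of school namespace -> folder of curriculum files
school_folders = {"school-a": "curriculums/school_a", "school-b": "curriculums/school_b"}

for school, folder in school_folders.items():
    records = []
    for filename in os.listdir(folder):
        path = os.path.join(folder, filename)
        text = load_docx_text(path)  # your own text-extraction helper
        # Step 1: embed the document text (for long files you'd chunk first)
        emb = openai_client.embeddings.create(
            input=text,
            model="text-embedding-3-small",
        ).data[0].embedding
        # Step 2: build a record; keep metadata small (there's a per-vector size limit)
        records.append({
            "id": filename,
            "values": emb,
            "metadata": {"source": path, "text": text[:1000]},
        })
    # Step 3: upsert this school's records into its own namespace
    index.upsert(vectors=records, namespace=school)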

Does that make sense? Sit tight and I’ll link you to some of our open-source example Jupyter Notebooks that demonstrate this end to end so you can use them to get started.

Here’s:

I recommend going through the second link end to end to see a good example of how to work with embedding models and vectors together in the context of building a chatbot that can answer with specific information.

Hope this helps!

@zack.p Hey, I was able to upsert two schools' data under different namespaces. However, I'm having issues when querying with namespaces: it returns the full document instead of an answer to the query, and I'm unable to connect GPT-3.5 Turbo for refined final results. Kindly provide assistance.

Hi @taimoorqureshi80,

Could you please provide all the relevant code you’re trying to run? It’s difficult to diagnose what the issue might be from your description without seeing your code.

Best,
Zack

Hey Zack,

I have attached my code. Kindly review it and assist me in getting the right answers as I am going to shift towards the paid version once this prototype is approved by my CEO.

Best Regards
Taimoor Hussain Qureshi

(Attachment PineconeTrial1.ipynb is missing)

Hey zack,

I was not able to attach the py file so here’s my code:

# Install dependencies (consolidated from several cells)
!pip install python-docx pymupdf sentence-transformers pinecone openai -q

from pinecone import Pinecone
from openai import OpenAI
from docx import Document
import fitz  # PyMuPDF
import os

pc = Pinecone(api_key='')
index = pc.Index('tech')

# Initialize the OpenAI client (reads OPENAI_API_KEY from the environment if set)
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY', ''))

def load_docx_text(filename):
  doc = Document(filename)
  text = [paragraph.text for paragraph in doc.paragraphs if paragraph.text]
  return " ".join(text)

def load_pdf_text(filename):
  doc = fitz.open(filename)
  text = ""
  for page in doc:
    text += page.get_text()
  return text

def split_into_chunks(text, max_size=800):
  words = text.split()
  chunks = []
  current_chunk = []
  current_length = 0

  for word in words:
    # When the current chunk reaches max_size words, flush it and start a new one
    if current_length + len(word.split()) > max_size:
      chunks.append(" ".join(current_chunk))
      current_chunk = []
      current_length = 0
    current_chunk.append(word)
    current_length += len(word.split())

  if current_chunk:
    chunks.append(" ".join(current_chunk))

  return chunks

def generate_embeddings(documents):
  embed = []
  for document_text in documents:
    chunks = split_into_chunks(document_text, max_size=800)
    doc_embedding = []
    for chunk in chunks:
      # Embed each chunk of the document
      response = client.embeddings.create(
        input=chunk,
        model="text-embedding-3-large"
      )
      doc_embedding.append(response.data[0].embedding)
    # Average the chunk embeddings into one document-level embedding
    averaged_embedding = [sum(col) / len(col) for col in zip(*doc_embedding)]
    embed.append(averaged_embedding)
  return embed

# Document paths
docx_paths = [
  '/content/drive/MyDrive/celeritas/curriculums/ELA-G3-U5-L2.docx',
  '/content/drive/MyDrive/celeritas/curriculums/ELA-G3-U5-L3.docx',
]

pdf_paths = [
  '/content/drive/MyDrive/celeritas/curriculums/NYCD3_ELA_EL_G3_Core_M3_U1_L0_Overview.pdf',
  '/content/drive/MyDrive/celeritas/curriculums/NYCD3_ELA_EL_G3_Core_M3_U1_L1_LessonPlan.pdf',
]

# Process and upsert DOCX embeddings
docx_texts = [load_docx_text(path) for path in docx_paths]
docx_embeddings = generate_embeddings(docx_texts)
index.upsert(
  vectors=[{"id": f"docx{i}", "values": emb} for i, emb in enumerate(docx_embeddings)],
  namespace="CPS"
)

# Process and upsert PDF embeddings
pdf_texts = [load_pdf_text(path) for path in pdf_paths]
pdf_embeddings = generate_embeddings(pdf_texts)
index.upsert(
  vectors=[{"id": f"pdf{i}", "values": emb} for i, emb in enumerate(pdf_embeddings)],
  namespace="NYD3"
)

!pip install langchain
!pip install langchain_openai
def query_index(query, namespace=None, top_k=5):
  # Generate an embedding for the query
  query_embedding = client.embeddings.create(
    input=query,
    model="text-embedding-3-large"
  ).data[0].embedding

  # Prepare query parameters
  query_params = {
    "vector": query_embedding,
    "top_k": top_k,
    "include_metadata": True
  }
  if namespace:
    query_params["namespace"] = namespace

  # Search the Pinecone index for similar embeddings
  search_results = index.query(**query_params)

  # Retrieve and format the results
  results = []
  for result in search_results['matches']:
    document_id = result['id']
    document_score = result['score']
    results.append((document_id, document_score))

  return results

# Example usage with namespace
query_text = "What are the key concepts in ELA Grade 3?"
namespace = 'CPS' # or 'NYD3' depending on your use case
searched_results = query_index(query_text, namespace)
print(searched_results)

def generate_embeddings(documents):
  embed = []
  text_chunks = []  # Store corresponding text chunks
  for document_text in documents:
    chunks = split_into_chunks(document_text, max_size=800)
    doc_embedding = []
    doc_text_chunks = []  # Corresponding text for each chunk
    for chunk in chunks:
      response = client.embeddings.create(
        input=chunk,
        model="text-embedding-3-large"
      )
      doc_embedding.append(response.data[0].embedding)
      doc_text_chunks.append(chunk)  # Save text chunk
    # Average the chunk embeddings into one document-level embedding
    averaged_embedding = [sum(col) / len(col) for col in zip(*doc_embedding)]
    embed.append(averaged_embedding)
    text_chunks.append(" ".join(doc_text_chunks))  # Save concatenated text for metadata
  return embed, text_chunks

docx_paths = [
  '/content/drive/MyDrive/celeritas/curriculums/ELA-G3-U5-L2.docx',
  '/content/drive/MyDrive/celeritas/curriculums/ELA-G3-U5-L3.docx',
]
# Process and upsert DOCX embeddings with text snippets as metadata
docx_texts = [load_docx_text(path) for path in docx_paths]
docx_embeddings, docx_text_snippets = generate_embeddings(docx_texts)
index.upsert(
  vectors=[{"id": f"docx{i}", "values": emb, "metadata": {"text": txt}} for i, (emb, txt) in enumerate(zip(docx_embeddings, docx_text_snippets))],
  namespace="CPS"
)

pdf_paths = [
  '/content/drive/MyDrive/celeritas/curriculums/NYCD3_ELA_EL_G3_Core_M3_U1_L0_Overview.pdf',
  '/content/drive/MyDrive/celeritas/curriculums/NYCD3_ELA_EL_G3_Core_M3_U1_L1_LessonPlan.pdf',
]
# Process and upsert PDF embeddings with text snippets as metadata
pdf_texts = [load_pdf_text(path) for path in pdf_paths]
pdf_embeddings, pdf_text_snippets = generate_embeddings(pdf_texts)  # Generate embeddings and get text snippets
index.upsert(
  vectors=[{"id": f"pdf{i}", "values": emb, "metadata": {"text": txt}} for i, (emb, txt) in enumerate(zip(pdf_embeddings, pdf_text_snippets))],
  namespace="NYD3"
)

def query_index(query, namespace=None, top_k=5):
  query_embedding = client.embeddings.create(
    input=query,
    model="text-embedding-3-large"
  ).data[0].embedding

  query_params = {
    "vector": query_embedding,
    "top_k": top_k,
    "include_metadata": True
  }
  if namespace:
    query_params["namespace"] = namespace

  search_results = index.query(**query_params)
  results = []
  for result in search_results['matches']:
    document_id = result['id']
    document_score = result['score']
    document_text = result['metadata']['text']  # Retrieve text from metadata
    results.append((document_id, document_score, document_text))

  return results

def main():
  while True:
    query_text = input("Enter your query (or type 'exit' to quit): ")
    if query_text.lower() == 'exit':
      break
    namespace = input("Enter namespace (CPS or NYD3): ")
    results = query_index(query_text, namespace)
    print("Search Results:")
    for doc_id, score, text in results:
      print(f"Document ID: {doc_id}, Score: {score}\nText Snippet: {text}\n")

if __name__ == "__main__":
  main()

@zack.p I'm waiting for your response. How can I get efficient answers and retrieve data effectively?

Hi @taimoorqureshi80,

Thanks for sharing your code - I’ll take a look as soon as I can.

In the meantime, if the Pinecone API key you shared at the beginning is a real key, please delete and recreate it. It's now compromised on this public forum, which would allow other people to make requests against your Pinecone account and your billing method.