Creating an Index but Not Uploading Vectors

I’m creating an index from a large set of PDFs, but the code produces an index with zero vectors. Thoughts?

Code:

# Loading documents from a directory with LangChain
import os
import time
from pinecone import Pinecone, ServerlessSpec
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_pinecone import PineconeVectorStore


from dotenv import load_dotenv

load_dotenv()

directory = 'dummyData'

def load_docs(directory):
  loader = DirectoryLoader(directory)
  documents = loader.load()
  return documents

documents = load_docs(directory)

# Splitting documents
from langchain.text_splitter import RecursiveCharacterTextSplitter
def split_docs(documents,chunk_size=500,chunk_overlap=20):
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
  docs = text_splitter.split_documents(documents)
  return docs

docs = split_docs(documents)

# Creating embeddings
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

query_result = embeddings.embed_query("Hello world")

# Storing embeddings in Pinecone

index_name = "test-index"

pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))

if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)
    
pc.create_index(
    name=index_name,
    dimension=384,
    metric='cosine',
    spec=ServerlessSpec(
        cloud='aws',
        region='us-east-1'
    )
)

# wait for index to be initialized
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)

# connect new index
index = pc.Index(index_name)
time.sleep(1)

vectorStore = PineconeVectorStore.from_documents(docs, embeddings, index_name=index_name)

print (index.describe_index_stats())

Output (warnings plus the printed index stats):

This function will be deprecated in a future release and `unstructured` will simply use the DEFAULT_MODEL from `unstructured_inference.model.base` to set default model name
Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
/Users/gharker/Desktop/dev-playground/devNinjaIndy/docNinja/env/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

Hi @docharker, thanks for your question!

One initial sanity check I’d recommend is printing the result of each step of your pipeline (a short sketch of all of these checks follows the steps below), so:

Add a print statement to show documents.

(In a Jupyter notebook you can just write the name of the variable you want to inspect on its own line, like so:)

documents

Ensure they look correct (LangChain Document objects with source text and some metadata).

Do the same for docs after they are split:

docs

Do the same for embeddings and for vectorStore.
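
If it helps, here is a minimal sketch of those checks as plain print statements. It assumes the variable names from your script (documents, docs, embeddings, vectorStore) and that everything up to from_documents has already run:

# Sanity-check each stage of the pipeline
print(f"Loaded {len(documents)} documents")
if documents:
    print(documents[0].metadata)             # should name the source PDF
    print(documents[0].page_content[:200])   # first 200 characters of extracted text

print(f"Split into {len(docs)} chunks")
if docs:
    print(docs[0].page_content[:200])

# A single query embedding should be a 384-dimensional list of floats
query_result = embeddings.embed_query("Hello world")
print(f"Embedding length: {len(query_result)}")

# If vectors were actually upserted, a quick similarity search should return hits
print(vectorStore.similarity_search("test query", k=1))

If documents or docs comes back empty, that would point at the loading/splitting steps rather than at Pinecone itself.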

Let me know if you see or learn anything new from doing that.

Looking forward to your response.

Best,
Zack

Thank you for responding. I value and appreciate the input. I’ll give it a try.
