Getting a "metadata too big" error, but the metadata isn't big

Hi, I'm facing the following issue:
problem.py (github.com)

Summarized: it tells me the metadata is too big, but it only contains a single “source” field with a web (.html) URL in it.

Could it be because the combined page_content + metadata is too big?

What are the approaches to solving such a problem if I can't load my split data into the vector store?

Hi @derbenedikt.sterra, thanks for your question, and thank you for linking to your code; that makes it much easier for us to help!

Here are some thoughts after reviewing the program you shared:

  1. Metadata size: The error suggests that the metadata is still too large, even after your attempts to reduce it. This could be because: a) The page_content might be getting included in the metadata inadvertently. b) There might be other fields in the metadata that you’re not accounting for.
  2. Check total size: Your logs show that you’re calculating the size of metadata and page_content separately. However, Pinecone’s limit applies to the combination of both. Make sure you’re checking the total size (see the size-check sketch after this list).
  3. Trimming approach: Your current approach tries to trim metadata fields, but it might not be effective enough. Consider a more aggressive trimming strategy.
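
On point 2, the reason the combined size matters is that the LangChain Pinecone integration stores page_content inside the vector’s metadata (under the text key by default), and Pinecone’s per-vector metadata limit is 40 KB. Here is a minimal size-check sketch under those assumptions; the helper name combined_size is just illustrative, and split_docs is assumed to be your list of split Documents:

import json

from langchain_core.documents import Document  # on older versions: from langchain.schema import Document

def combined_size(doc: Document) -> int:
    """Total byte size of a document as Pinecone will see it: JSON-encoded metadata plus page_content."""
    metadata_bytes = len(json.dumps(doc.metadata).encode('utf-8'))
    content_bytes = len(doc.page_content.encode('utf-8'))
    return metadata_bytes + content_bytes

# Flag any split document that would exceed the 40 KB limit
oversized = [doc for doc in split_docs if combined_size(doc) > 40000]
print(f"{len(oversized)} of {len(split_docs)} documents exceed the limit")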

Here are some things you might try next:

  1. Separate content and metadata: Ensure that page_content is not being included in the metadata. Double-check your Document creation process.
  2. Trim page_content: If the combined size of metadata and page_content is too large, you might need to trim the page_content as well. You can do this by truncating it to a maximum length.
  3. More aggressive metadata trimming: Instead of just emptying the ‘text’ field, consider removing all fields except ‘source’, or limiting each field to a maximum length.
  4. Use a smaller chunk size: Reduce your chunk_size in the RecursiveCharacterTextSplitter to create smaller documents (see the splitter sketch after this list).
  5. Implement a size check before insertion: Before trying to insert a document into Pinecone, check its total size and skip or further trim documents that are too large.
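
On point 4, here is a rough sketch of a smaller-chunk configuration; the chunk_size/chunk_overlap values and the docs variable are illustrative placeholders rather than values taken from your program:

from langchain_text_splitters import RecursiveCharacterTextSplitter  # on older versions: from langchain.text_splitter import RecursiveCharacterTextSplitter

# Smaller chunks keep each document (and therefore what ends up in Pinecone's metadata) well under the limit
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk; lower this further if documents are still too large
    chunk_overlap=100,  # small overlap to preserve context across chunk boundaries
)
split_docs = text_splitter.split_documents(docs)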

Here’s some sample code implementing the above approaches you could try out / use for inspiration:

# Module-level imports needed by the snippets below
import json
from typing import List

from langchain_core.documents import Document  # on older versions: from langchain.schema import Document


async def short_string_simple(self, split_docs: List[Document]):
    """Shorten page_content and metadata of each Document so their combined size stays under max_total_bytes."""
    print("SHORT DOCs...")
    max_total_bytes = 40000  # Pinecone's per-vector metadata limit
    max_metadata_bytes = 5000  # reserve part of that budget for metadata
    
    try:
        for doc in split_docs:
            # Convert metadata to a simple dict with only 'source'
            doc.metadata = {'source': doc.metadata.get('source', '')}
            
            metadata_bytes = len(json.dumps(doc.metadata).encode('utf-8'))
            content_bytes = len(doc.page_content.encode('utf-8'))
            total_bytes = metadata_bytes + content_bytes
            
            print(f"Initial sizes - Metadata: {metadata_bytes}, Content: {content_bytes}, Total: {total_bytes}")
            
            if total_bytes > max_total_bytes:
                # Trim metadata if necessary (slicing by characters is close enough here, since URLs are ASCII)
                if metadata_bytes > max_metadata_bytes:
                    doc.metadata['source'] = doc.metadata['source'][:max_metadata_bytes]
                    metadata_bytes = len(json.dumps(doc.metadata).encode('utf-8'))
                
                # Calculate how much we can allow for content
                max_content_bytes = max_total_bytes - metadata_bytes
                
                # Trim content if necessary (truncate on a byte boundary; 'ignore' drops any broken trailing character)
                if content_bytes > max_content_bytes:
                    doc.page_content = doc.page_content.encode('utf-8')[:max_content_bytes].decode('utf-8', 'ignore')
                    content_bytes = len(doc.page_content.encode('utf-8'))
            
            total_bytes = metadata_bytes + content_bytes
            print(f"Final sizes - Metadata: {metadata_bytes}, Content: {content_bytes}, Total: {total_bytes}")
            
            if total_bytes > max_total_bytes:
                print(f"Warning: Document still exceeds maximum size after trimming.")
    
    except Exception as e:
        print(f"ERROR WHILE SHORTING STRING SIMPLE: {e}")
        return None

You would then use this function before creating your vectorstore:

await self.short_string_simple(split_docs)

Finally, when you do create your vectorstore, you may want to add another check to ensure you’re not trying to upsert a payload that exceeds the size limit:

valid_docs = []
for doc in split_docs:
    total_size = len(json.dumps(doc.metadata).encode('utf-8')) + len(doc.page_content.encode('utf-8'))
    if total_size <= 40000:
        valid_docs.append(doc)
    else:
        print(f"Skipping document: total size {total_size} exceeds limit")

vs = await PineconeVectorStore.afrom_documents(
    documents=valid_docs,
    index_name=index.name,
    embedding=embeddings,
    namespace=namespace
)