Pinecone Metadata Limit Exceeded in Haystack

Hi, I’m a newbie trying to build LLM applications by integrating Haystack and Pinecone.

I have about 50 text files totaling around 50 MB, which I imported into Haystack and processed using

from haystack.nodes import PreProcessor

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=150,
    split_respect_sentence_boundary=True,
    split_overlap=50,
)
docs = preprocessor.process(all_docs)
from haystack.document_stores import PineconeDocumentStore

document_store = PineconeDocumentStore(
    api_key=pinecone_key,
    environment='northamerica-northeast1-gcp',
    similarity="dot_product",
    index='scraped-data',
    embedding_dim=1536
)

Then, I’m trying to store the documents in the document_store using

document_store.write_documents(docs)

I’m getting the following error –

ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'application/json', 'date': 'Wed, 07 Jun 2023 19:25:23 GMT',
'x-envoy-upstream-service-time': '2', 'content-length': '115', 'server': 'envoy'})
HTTP response body: {"code":3,"message":"metadata size is 45180 bytes, which exceeds the limit of 40960 bytes per 
vector","details":[]}

Can anyone help out? Thanks!

Hello!
This error occurs when writing documents to the PineconeDocumentStore: the request is rejected with a 400 Bad Request because the metadata attached to at least one vector is larger than Pinecone allows.

In Pinecone, each vector can carry associated metadata, but its size is limited to 40960 bytes per vector. As the error message shows, the metadata of the document you are trying to save is 45180 bytes, which exceeds that limit.
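
If you want to see which documents are the problem, a rough diagnostic like the one below can help. It is only a sketch: it assumes Haystack Documents with a .meta dict of JSON-serializable values, and it includes the split text in the estimate because, depending on your Haystack version, the PineconeDocumentStore may also store the document content as Pinecone metadata.

import json

PINECONE_METADATA_LIMIT = 40960  # bytes per vector

for doc in docs:
    # Approximate the metadata payload; the exact byte count Pinecone sees may differ.
    approx_size = len(json.dumps({**doc.meta, "content": doc.content}).encode("utf-8"))
    if approx_size > PINECONE_METADATA_LIMIT:
        print(f"Document {doc.id}: ~{approx_size} bytes of metadata")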

To resolve this issue, you need to reduce the size of the metadata to stay within the limit. Here are some approaches you can consider:

Reduce the metadata size: only include the information you actually need (for example, fields you filter on) and omit everything else before writing.
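
For example, a minimal sketch that keeps only a whitelist of meta fields (the field names below are hypothetical; keep whatever your pipeline actually uses):

KEEP_KEYS = {"name", "_split_id"}  # hypothetical whitelist of meta fields

for doc in docs:
    doc.meta = {k: v for k, v in doc.meta.items() if k in KEEP_KEYS}

document_store.write_documents(docs)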

Compress the metadata: compress bulky metadata values before storing them and decompress them when needed during retrieval.
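
An illustrative sketch only: if one field is large but compressible (here a hypothetical "raw_html" field), you could zlib-compress it and base64-encode it, since Pinecone metadata values must be strings, numbers, booleans, or lists of strings:

import base64
import zlib

for doc in docs:
    raw = doc.meta.pop("raw_html", None)  # hypothetical bulky field
    if raw is not None:
        compressed = zlib.compress(raw.encode("utf-8"))
        doc.meta["raw_html_z"] = base64.b64encode(compressed).decode("ascii")

# After retrieval, reverse it:
# raw = zlib.decompress(base64.b64decode(doc.meta["raw_html_z"])).decode("utf-8")

Note that compression only helps if the compressed value still fits under the 40960-byte limit.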

Store metadata separately: Instead of storing the metadata directly with the vector, you can store only identifiers (e.g., document IDs) and keep the actual metadata in a separate storage (e.g., a database).
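
A sketch of that pattern using a local SQLite table as the side store (any key-value store would work the same way); only a lookup key travels to Pinecone:

import json
import sqlite3

conn = sqlite3.connect("metadata.db")
conn.execute("CREATE TABLE IF NOT EXISTS doc_meta (doc_id TEXT PRIMARY KEY, meta TEXT)")

for doc in docs:
    conn.execute(
        "INSERT OR REPLACE INTO doc_meta (doc_id, meta) VALUES (?, ?)",
        (doc.id, json.dumps(doc.meta)),
    )
    doc.meta = {"doc_id": doc.id}  # only the lookup key is sent to Pinecone

conn.commit()
document_store.write_documents(docs)

At query time, the retrieved doc.meta["doc_id"] is used to look the full metadata back up from SQLite.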

By implementing one or more of these approaches and ensuring that the metadata size remains within the limit, you should be able to resolve the error.