I’m seeing this error for exceeding the max size on metadata.
However, the only metadata I’m storing is a UUID string.
Any ideas?
Hi @ntkris,
What’s the index you’re using? And which region is it in? I can take a look and see what’s going on.
Thanks Cory, I changed some of the code and the error went away.
I am having a similar issue, @Cory_Pinecone. My metadata is also a single UUID and I’m seeing a similar message. Could you help?
Hi @Ayan,
If you’re storing UUIDs as metadata, the problem is almost certainly due to the cardinality of the data. Indexing metadata in Pinecone takes up resources that should be going to storing vectors; for that reason, we recommend using selective metadata filtering and only filtering on those fields that are crucial to your queries. You can still return those unindexed fields with queries, but they won’t take up extra space this way.
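For instance, here is a minimal sketch of selective metadata indexing with the pod-based Python client; the index name, dimension, and field name are placeholders, not a prescription:

```python
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

# Only the fields listed under "indexed" are indexed for filtering.
# Everything else is still stored and returned with queries, but it
# doesn't consume index resources.
pinecone.create_index(
    "example-index",
    dimension=1536,
    metadata_config={"indexed": ["document_type"]},
)
```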
Also, consider storing your UUIDs as the vector ID itself rather than as metadata. This way, you can fetch those vectors when needed without any filtering involved.
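Sketched with the Python client (the index name, ID, and vector values below are placeholders):

```python
import pinecone

index = pinecone.Index("example-index")

# Use the document's UUID as the vector ID at upsert time...
index.upsert(vectors=[("123e4567-e89b-12d3-a456-426614174000", [0.1, 0.2, 0.3])])

# ...then retrieve it directly by ID, no metadata filter required.
result = index.fetch(ids=["123e4567-e89b-12d3-a456-426614174000"])
```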
My metadata filtering query requires the UUID. The other option we could use is a namespace; would that alleviate the problem?
Also @Cory_Pinecone, could you please explain a bit more about what you mean by the cardinality of the data? I thought that the metadata field was limited to 10240 bytes per vector, so I am a bit confused here.
Cardinality has to do with the uniqueness of the metadata. If you have 10M vectors, and they all have a single metadata field that can only be one of two choices (e.g., a boolean), then you’ll have a very low cardinality of two. But if each one has a unique field, which no other does, your cardinality would be 10 million. That data has to be stored somewhere, so setting aside space in the index to keep track of it necessarily means you have less space available to store your actual vectors.
The per-vector limit applies to the total size of the metadata values being stored. Indexing that data, on the other hand, is a different issue; that’s where concerns about cardinality come into play.
If you segment your vectors into namespaces, keep in mind that your queries will only run on a single namespace at a time. So if you need to compare two vectors associated with different UUIDs, that won’t be possible if they’re in separate namespaces. If you never have to compare vectors with different UUIDs, though, this approach could work. Keep in mind that namespaces carry some overhead as well, so if each namespace only holds a handful of vectors, you may still run up against storage limitations.
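A rough sketch of the namespace approach with the Python client (the index name, IDs, values, and namespace name are placeholders):

```python
import pinecone

index = pinecone.Index("example-index")

# Upsert each document's chunks into a namespace named after its UUID...
index.upsert(
    vectors=[("chunk-1", [0.1, 0.2, 0.3])],
    namespace="123e4567-e89b-12d3-a456-426614174000",
)

# ...and query against that one namespace; a single query never spans namespaces.
results = index.query(
    vector=[0.1, 0.2, 0.3],
    top_k=5,
    namespace="123e4567-e89b-12d3-a456-426614174000",
)
```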
What’s the filter you have to run that requires the UUIDs be in place? And how many vectors do you plan on having associated with a given UUID?
In essence, what I am doing is uploading a lot of vectors that all have a single UUID as metadata. According to your explanation, this should not lead to high cardinality, right?
But the UUID is an identifier for a document in our database, which means that each document has a different UUID. So if I try to upload many documents, might that cause issues?
If they all have the same value, then you’re right; that should not introduce any cardinality problems.
How many vectors per UUID do you anticipate having? And what’s the total count of vectors you expect?
I am not exactly sure about the numbers but a reasonable estimate would be around 450 vectors per UUID. And the total count should run well into the millions.
Would this introduce a high cardinality?
Hey @Cory_Pinecone, I think I found the issue. I was using the langchain library to upsert my data, and here we can see that it actually stores the chunk’s text in the metadata while upserting. This is what causes the large metadata size and messes everything up. I am posting all this for future reference in case anyone runs into a similar issue.
Make sure that any text splitter you use from langchain does not produce oversized chunks. I recommend using the RecursiveCharacterTextSplitter for this.
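A short sketch of that, assuming the same langchain setup discussed in this thread (the chunk_size and chunk_overlap values are illustrative):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

long_document_text = "..."  # placeholder for your document's raw text
chunks = splitter.split_text(long_document_text)

# langchain stores each chunk's text in the vector's metadata, so capping
# the chunk size keeps every vector under Pinecone's metadata limit.
```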
Interesting, good to know this about langchain.
Thanks for sharing @Ayan.
By doing this I get another error:
Pinecone.from_documents(chunks, embeddings, index_name=INDEX_NAME, namespace=NAMESPACE, metadatas=[{'varialbe_name'}])
TypeError: langchain.vectorstores.pinecone.Pinecone.from_texts() got multiple values for keyword argument 'metadatas'
Previously I was using the following and it worked fine, except for the metadata size error:
Pinecone.from_texts(
    texts=[cleaned_text],
    embedding=embeddings,
    # metadatas=[{'source': entry.id}],
    batch_size=16,
    index_name=INDEX_NAME,
    namespace=NAMESPACE,
)
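For future readers, a hedged sketch of one way around that TypeError: from_documents builds the metadatas argument from the Document objects themselves before forwarding everything to from_texts, so passing metadatas= on top of that collides. Attaching the metadata to each Document avoids the duplicate keyword (variable names reused from the snippets above; the 'source' field is illustrative):

```python
from langchain.schema import Document
from langchain.vectorstores import Pinecone

# Attach the metadata to the Document itself instead of passing metadatas=.
docs = [Document(page_content=cleaned_text, metadata={"source": entry.id})]

Pinecone.from_documents(
    docs,
    embeddings,
    index_name=INDEX_NAME,
    namespace=NAMESPACE,
)
```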
I tried out the from_texts method on smaller sentences and documents, and it worked fine.
Actually, I am running a loop to index about 100 different documents, and it throws another error after around 20 of them.
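If it helps to narrow that down, here is a rough diagnostic sketch: measure each chunk’s serialized metadata size before upserting, to spot the documents that blow past the per-vector limit (10240 bytes at the time of this thread; the chunks variable and the metadata layout are placeholders for your own data):

```python
import json

LIMIT_BYTES = 10240  # Pinecone's per-vector metadata limit at the time

for i, chunk in enumerate(chunks):
    # Mirror whatever metadata the upsert path will attach, including the
    # chunk text that langchain adds.
    metadata = {"source": "example-uuid", "text": chunk.page_content}
    size = len(json.dumps(metadata).encode("utf-8"))
    if size > LIMIT_BYTES:
        print(f"chunk {i}: metadata is {size} bytes -- over the limit")
```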