I’m seeing this error for exceeding the max size on metadata.
However, the only metadata I’m storing is a UUID string.
Any ideas?
Hi @ntkris,
What’s the index you’re using? And which region is it in? I can take a look and see what’s going on.
Thanks Cory, I changed some of the code and the error went away.
I am having a similar issue, @Cory_Pinecone. My metadata is also a single UUID and I’m seeing a similar message. Could you help?
Hi @Ayan,
If you’re storing UUIDs as metadata, the problem is almost certainly due to the cardinality of the data. Indexing metadata in Pinecone takes up resources that should be going to storing vectors; for that reason, we recommend using selective metadata filtering and only filtering on those fields that are crucial to your queries. You can still return those unindexed fields with queries, but they won’t take up extra space this way.
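For instance, here is a minimal sketch of selective metadata indexing with the pod-based Python client; the index name, dimension, and field name are placeholders, not a prescription:

```python
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

# Only the fields listed under "indexed" are indexed for filtering.
# Everything else is still stored and returned with queries, but it
# doesn't consume index resources.
pinecone.create_index(
    "example-index",
    dimension=1536,
    metadata_config={"indexed": ["document_type"]},
)
```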
Also, consider storing your UUIDs as the vector ID itself rather than as metadata. This way, you can fetch those vectors when needed without any filtering involved.
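Sketched with the Python client (the index name, ID, and vector values below are placeholders):

```python
import pinecone

index = pinecone.Index("example-index")

# Use the document's UUID as the vector ID at upsert time...
index.upsert(vectors=[("123e4567-e89b-12d3-a456-426614174000", [0.1, 0.2, 0.3])])

# ...then retrieve it directly by ID, no metadata filter required.
result = index.fetch(ids=["123e4567-e89b-12d3-a456-426614174000"])
```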
My metadata filtering query requires the UUID. The other option we could use is a namespace; would that alleviate the problem?
Also @Cory_Pinecone, could you please explain a bit more about what you mean by the cardinality of the data? I thought that the metadata field was limited to 10240 bytes per vector, so I am a bit confused here.
Cardinality has to do with the uniqueness of the metadata. If you have 10M vectors, and they all have a single metadata field that can only be one of two choices (e.g., a boolean), then you’ll have a very low cardinality of two. But if each one has a unique field, which no other does, your cardinality would be 10 million. That data has to be stored somewhere, so setting aside space in the index to keep track of it necessarily means you have less space available to store your actual vectors.
The per-vector limit applies to the total size of the metadata values being stored. Indexing that data, on the other hand, is a different issue; that’s where concerns about cardinality come into play.
If you segment your vectors into namespaces, keep in mind that your queries will only run on a single namespace at a time. So if you need to compare two vectors associated with different UUIDs, that won’t be possible if they’re in separate namespaces. If you never have to compare vectors with different UUIDs, though, this approach could work. Keep in mind that namespaces carry some overhead as well, so if each namespace only holds a handful of vectors, you may still run up against storage limitations.
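A rough sketch of the namespace approach with the Python client (the index name, IDs, values, and namespace name are placeholders):

```python
import pinecone

index = pinecone.Index("example-index")

# Upsert each document's chunks into a namespace named after its UUID...
index.upsert(
    vectors=[("chunk-1", [0.1, 0.2, 0.3])],
    namespace="123e4567-e89b-12d3-a456-426614174000",
)

# ...and query against that one namespace; a single query never spans namespaces.
results = index.query(
    vector=[0.1, 0.2, 0.3],
    top_k=5,
    namespace="123e4567-e89b-12d3-a456-426614174000",
)
```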
What’s the filter you have to run that requires the UUIDs be in place? And how many vectors do you plan on having associated with a given UUID?
In essence, what I am doing is uploading a lot of vectors that all have a single UUID as metadata. According to your explanation, this should not lead to high cardinality, right?
But the UUID is an identifier for a document in our database, which means that each document has a different UUID. So if I try to upload many documents, might that cause issues?
If they all have the same value, then you’re right; that should not introduce any cardinality problems.
How many vectors per UUID do you anticipate having? And what’s the total count of vectors you expect?
I am not exactly sure about the numbers but a reasonable estimate would be around 450 vectors per UUID. And the total count should run well into the millions.
Would this introduce a high cardinality?
Hey @Cory_Pinecone, I think I found the issue. I was using the langchain library to upsert my data, and here we can see that it actually stores the chunk’s text in the metadata while upserting. This is what causes the large metadata size and messes everything up. I am posting all this for future reference in case anyone runs into a similar issue.
Make sure that any text splitter you use from langchain does not produce oversized chunks. I recommend using the RecursiveCharacterTextSplitter for this.
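A short sketch of that, assuming the same langchain setup discussed in this thread (the chunk_size and chunk_overlap values are illustrative):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

long_document_text = "..."  # placeholder for your document's raw text
chunks = splitter.split_text(long_document_text)

# langchain stores each chunk's text in the vector's metadata, so capping
# the chunk size keeps every vector under Pinecone's metadata limit.
```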
Interesting, good to know this about langchain.
Thanks for sharing @Ayan.
By doing this I get another error:
Pinecone.from_documents(chunks, embeddings, index_name=INDEX_NAME, namespace=NAMESPACE, metadatas=[{'varialbe_name'}])
TypeError: langchain.vectorstores.pinecone.Pinecone.from_texts() got multiple values for keyword argument 'metadatas'
Previously I was using the following and it worked fine, except for the metadata size error:
Pinecone.from_texts(
    texts=[cleaned_text],
    embedding=embeddings,
    # metadatas=[{'source': entry.id}],
    batch_size=16,
    index_name=INDEX_NAME,
    namespace=NAMESPACE,
)
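For future readers, a hedged sketch of one way around that TypeError: from_documents builds the metadatas argument from the Document objects themselves before forwarding everything to from_texts, so passing metadatas= on top of that collides. Attaching the metadata to each Document avoids the duplicate keyword (variable names reused from the snippets above; the 'source' field is illustrative):

```python
from langchain.schema import Document
from langchain.vectorstores import Pinecone

# Attach the metadata to the Document itself instead of passing metadatas=.
docs = [Document(page_content=cleaned_text, metadata={"source": entry.id})]

Pinecone.from_documents(
    docs,
    embeddings,
    index_name=INDEX_NAME,
    namespace=NAMESPACE,
)
```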
I tried out the from_texts method on smaller sentences and documents, and it worked fine.
Actually, I am running a loop to index about 100 different documents, and it throws another error after around 20 of them.
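If it helps to narrow that down, here is a rough diagnostic sketch: measure each chunk’s serialized metadata size before upserting, to spot the documents that blow past the per-vector limit (10240 bytes at the time of this thread; the chunks variable and the metadata layout are placeholders for your own data):

```python
import json

LIMIT_BYTES = 10240  # Pinecone's per-vector metadata limit at the time

for i, chunk in enumerate(chunks):
    # Mirror whatever metadata the upsert path will attach, including the
    # chunk text that langchain adds.
    metadata = {"source": "example-uuid", "text": chunk.page_content}
    size = len(json.dumps(metadata).encode("utf-8"))
    if size > LIMIT_BYTES:
        print(f"chunk {i}: metadata is {size} bytes -- over the limit")
```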