I have been using Pinecone to store a large dataset, with metadata attached to each vector. When querying the vectors, I use metadata filters to retrieve specific entries.
For example, a filter like ministerium: "Udenrigsministeriet" returns the expected results. However, I encounter issues when the metadata contains spaces or special characters, such as hyphens. Queries with a filter like ministerium: "Børne- og Undervisningsministeriet" or ministerium: "Indenrigs- og Sundhedsministeriet" yield no results, even though I have confirmed that vectors with this exact metadata exist in the database.
Could there be an issue with how the Pinecone query language handles spaces and special characters within metadata filters? Or is it possible that I may have misunderstood the proper application of the $in operator or other aspects of query formation for such cases?
Any insights into the proper use of metadata filters with spaces and special characters, or documentation on this matter, would be greatly appreciated.
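A minimal sketch of the two filter shapes in question, written as plain Python dicts. As far as I understand the documentation, string filters are compared as exact matches, so spaces and hyphens should need no escaping. The `build_filter` helper is hypothetical (it is not part of the Pinecone client), and the `index` and `query_vector` names in the commented-out call are assumptions.

```python
from typing import Any, Dict, List


def build_filter(field: str, values: List[str]) -> Dict[str, Any]:
    """Build a metadata filter dict (hypothetical helper, not part of the client).

    One value becomes an $eq filter, several values become an $in filter.
    """
    if not values:
        raise ValueError("at least one value is required")
    if len(values) == 1:
        return {field: {"$eq": values[0]}}
    return {field: {"$in": list(values)}}


single = build_filter("ministerium", ["Børne- og Undervisningsministeriet"])
several = build_filter(
    "ministerium",
    ["Udenrigsministeriet", "Indenrigs- og Sundhedsministeriet"],
)

# Assumed usage with an existing index handle and query vector:
# results = index.query(vector=query_vector, top_k=10,
#                       filter=single, include_metadata=True)
```

Because the comparison is exact, the filter value has to match the stored string character for character, including every space.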
It appears that the mismatch between the query results and the expected metadata values comes from an automatic sanitization step in the main query interface, which seems to collapse consecutive spaces into a single space when it displays metadata. A value copied from that display therefore no longer matches the stored value: the metadata in the database contains two consecutive spaces, while the copied value contains only one, so a filter on the copied value returns no results.
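The effect can be reproduced outside Pinecone with a short sketch. The position of the double space in `stored` is an assumption for illustration, and `collapse_spaces` only mimics the display behavior as I observed it; it is not Pinecone code.

```python
import re


def collapse_spaces(text: str) -> str:
    """Mimic the assumed display behavior: runs of spaces become one space."""
    return re.sub(r" {2,}", " ", text)


# Assumed stored value: two spaces after the hyphen.
stored = "Børne-  og Undervisningsministeriet"
displayed = collapse_spaces(stored)

print(repr(stored))         # repr() makes the double space visible
print(repr(displayed))
print(displayed == stored)  # False: an exact-match filter on `displayed` misses
```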
To address this, it may be more user-friendly to implement one of the following solutions:
Strict Validation on Insertion: Enforce a validation rule that prevents the insertion of metadata with consecutive spaces. This can be achieved by either normalizing the metadata to single spaces upon entry or by throwing an error to alert the user to adjust the metadata manually.
Consistent Sanitization Across Interfaces: Ensure that the display of metadata values in the query interface is consistent with the actual stored values. If sanitization is to occur, it should be applied uniformly so that the displayed and stored values match. This would prevent confusion when copying and pasting between the interface and the query input.
The goal is to maintain consistency between what users see, what they query, and what is stored. If metadata is to be presented in a sanitized format, the same rules should apply to the storage and retrieval mechanisms to avoid any mismatches that could lead to zero results in queries that otherwise should return valid entries.
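Until something like this exists on the service side, the first option can be approximated on the client before upserting. This is only a sketch under my own assumptions: `check_metadata` and its `strict` flag are hypothetical names, and only runs of plain spaces in string values are handled.

```python
import re
from typing import Any, Dict

_MULTI_SPACE = re.compile(r" {2,}")


def check_metadata(metadata: Dict[str, Any], strict: bool = False) -> Dict[str, Any]:
    """Return a copy of `metadata` with consecutive spaces handled.

    strict=False: collapse runs of spaces in string values to one space.
    strict=True:  raise ValueError so the value can be fixed by hand.
    """
    cleaned: Dict[str, Any] = {}
    for key, value in metadata.items():
        if isinstance(value, str) and _MULTI_SPACE.search(value):
            if strict:
                raise ValueError(
                    f"metadata field {key!r} contains consecutive spaces: {value!r}"
                )
            value = _MULTI_SPACE.sub(" ", value)
        cleaned[key] = value
    return cleaned


record = {"ministerium": "Indenrigs-  og Sundhedsministeriet", "year": 2023}
print(check_metadata(record))
```

Normalizing silently changes what is stored, so the strict variant may be the safer default when the exact original string matters.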
Upon further investigation into the mismatched query results, it has come to light that the Edit vector interface displays the metadata values without any sanitization, revealing that the actual stored metadata includes two spaces. This crucial detail is obscured on the main query page due to an automatic sanitization process that condenses multiple spaces into one when displaying metadata values.
This inconsistency between the Edit vector display and the query interface is the root cause of the issue. Users relying on the displayed metadata from the query page to construct their queries will face unexpected results due to this sanitization discrepancy.
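When the raw value is available, for example from the Edit vector view or from metadata returned by the client, a small helper can make the spaces visible before the value is reused in a filter. The `show_spaces` function and the sample value are assumptions for illustration.

```python
def show_spaces(value: str) -> str:
    """Replace each space with a visible middle dot so runs stand out."""
    return value.replace(" ", "\u00b7")


# Assumed raw value: two spaces after the hyphen.
raw = "Indenrigs-  og Sundhedsministeriet"
print(show_spaces(raw))
```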
I am not sure whether this is the intended behavior, but it caused a great deal of confusion, because I could not retrieve any vectors using the value copied and pasted from the displayed metadata.