How can I filter based on date in metadata? can’t use integers based on year, month, day because it’s not valid. (e.g. 2024-01-01 should be greater than 2023-12-31, but it will be false based on month and day separately. can’t use string representation because $gte expects number). tried datetime iso format, still same error.
inserted this:
{‘id’: ‘test’,
‘metadata’: {‘expiry_date’: datetime.datetime(2023, 5, 4, 0, 0)},
‘values’: […]
}
but when querying:
ApiException: (400) Reason: Bad Request HTTP response headers: HTTPHeaderDict({‘content-type’: ‘application/json’, ‘date’: ‘Sat, 25 Mar 2023 16:09:43 GMT’, ‘x-envoy-upstream-service-time’: ‘0’, ‘content-length’: ‘119’, ‘server’: ‘envoy’}) HTTP response body: {“code”:3,“message”:“the $gte operator must be followed by a number, got "2022-01-01T00:00:00" instead”,“details”:}
But wouldn’t that void the advice on using low-cardinality datatypes for metadata filters?
High cardinality consumes more memory: Pinecone indexes metadata to allow for filtering. If the metadata contains many unique values — such as a unique identifier for each vector — the index will consume significantly more memory. Consider using selective metadata indexing to avoid indexing high-cardinality metadata that is not needed for filtering.
It would seem that one of the prime chat use cases is “memory recency” … so where you only want to select recent items from memory - or at least to weight those items higher.
you are right but i think it’s a tradeoff. if you are omitting the hour, minute and second parts, it would be maximum 365 a year. (not so high cardinality). here’s the code i used:
@bigrig, it’s a balancing act. Yes, having high cardinality can impact performance. But Unix epochs give you the most flexibility when using value comparisons (gt, gte, lt, lte).
I’ve found if you set the resolution for the timestamps to be larger than by the second you can get the benefits of easy comparison while avoiding problems with cardinality. It depends on your use case but setting the resolution to the minute, quarter hour, hour, or even day could be the right solution.
Thank you for the prompt answer. I had actually checked on that page before. The thing here is that even though I upsert dates as strings and even put the str() around those dates when I fetch them from the db they appear as datetime.date. On the attached prints, look specifically at the parts I circled in red (FYI: ‘data’==‘date’, it’s Portuguese).
Oh… nice! That is actually quite cool if you want to sort the results based on date, but I think this will only work after these vectors are fetched, as you can then cast these vales in your backend code. These values are considered as strings in the metadata, and I am guessing the object is just stringyfied to look like it does. But this isn’t that bad for some implementations. Interesting…
You are right, although it says that date is a datatime.date type in my output, if I search by the initial string date it in fact retrieves the vector I want (‘id’ : ‘4’) in the first place.
However, this only happens because I used ‘$eq’, which can be used for strings. If I were to filter by, say ‘year < 2020’, there wouldn’t be a way to do it, right? Since ‘$lt’, ‘$gt’, $lte’ and ‘$gte’ only work for numbers and not for strings. How could one do that kind of data filtering/sorting in a query?
$eq ‘09/13/2018’ returned the correct vector. Now that is strange and awesome! No idea how this worked, but other operators will not work. I suggest using the Unix epochs trick mentioned above or if you play around with your version and see if any other operators will work.