"Starts with" metadata filter

Jasper · November 2, 2023, 10:55am

A full regex metadata filter would be nice, but a simpler “starts with” would be very nice for a start.

gdj0nes · December 1, 2023, 9:40pm

Thanks for sharing! What sort of use case do you have for “starts_with” queries?

Jasper · December 4, 2023, 9:04am

Hmm, let me see if I can remember. I think I had a metadata field that contained three or four properties that I didn’t want to store as separate metadata fields. So an “ID” field would include document_type, page_number and some other stuff. Then it so happened that I had to update an existing metadata field, but only for some document types (the first part of the “ID” field).

Another instance was that some documents always started with a specific phrase, and I had to update a metadata field on only those documents.

I found workarounds, but at the time it looked like a good feature to have, as niche as it may be.

bruno.camara · March 28, 2024, 1:02pm

One use case where “starts_with” would be helpful is when using Bedrock Knowledge Bases. When importing from the S3 bucket associated with the KB, there’s a metadata field named x-amz-bedrock-kb-source-uri. If my bucket keys are organized into “folders”, it would be great to have the capability to filter using “x-amz-bedrock-kb-source-uri starts with s3://my-bucket-name/myfolde1/”
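Until something like this exists, one possible workaround is to compute the folder prefixes at ingestion time and store them in an extra metadata field, so that an exact-match filter can stand in for “starts with”. This is only a sketch: it assumes you control the metadata that gets written (which may not be true for a managed Bedrock import), the `source_prefixes` field and the `folder_prefixes` helper are hypothetical names, and the file name is made up. It also adds more distinct metadata values, which matters if metadata cardinality is a concern.

```python
def folder_prefixes(uri: str) -> list[str]:
    """Return every folder-level prefix of a URI, each ending in '/'."""
    scheme, _, rest = uri.partition("://")
    parts = rest.split("/")[:-1]  # drop the object name (last path segment)
    prefixes = []
    current = f"{scheme}://"
    for part in parts:
        current += part + "/"
        prefixes.append(current)
    return prefixes


# Hypothetical object key inside the folder from the post above.
uri = "s3://my-bucket-name/myfolde1/report.pdf"

metadata = {
    "x-amz-bedrock-kb-source-uri": uri,
    # Hypothetical extra field: one entry per enclosing "folder".
    "source_prefixes": folder_prefixes(uri),
}

# Exact-match filter that emulates
# "source-uri starts with s3://my-bucket-name/myfolde1/".
query_filter = {"source_prefixes": {"$in": ["s3://my-bucket-name/myfolde1/"]}}

print(metadata["source_prefixes"])
```

The filter dictionary uses the `$in` operator from Pinecone's metadata filter syntax; whether it behaves as intended on a list-valued field is worth checking against the current docs before relying on it.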

norewindz · April 1, 2024, 11:34pm

A bit unrelated but I think also a good idea would be a “starts with” vector id filter.

In the delete docs, Pinecone suggests prefixing ids with a unique id.

If we can delete like this, why can we not query? We don’t need to worry about metadata cardinality this way.

Also, I think it is important that we can query “starts with” against an array of prefixes, so we can search across multiple sets of matching vectors. Basically a metadata filter, but the metadata is the prefixed id of the vectors!

A potential issue with “starts with” is overlapping ids, which could bring in unwanted results, so it isn’t perfect! But a simple fix would be to check “starts with idprefix_”, i.e. the prefix plus a delimiter? Hmmm
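To make the overlap problem concrete, here is a small client-side sketch in plain Python. It is not a Pinecone feature; the ids, the `_` delimiter, and the helper name are assumptions for illustration. It shows why matching on the prefix plus delimiter avoids picking up `doc10_0` when you asked for `doc1`, and how several prefixes can be checked at once.

```python
def matches_any_prefix(vector_id: str, prefixes: list[str], delimiter: str = "_") -> bool:
    """True if vector_id starts with any of the prefixes followed by the delimiter."""
    # str.startswith accepts a tuple, so multiple prefixes are checked in one call.
    return vector_id.startswith(tuple(p + delimiter for p in prefixes))


ids = ["doc1_0", "doc1_1", "doc10_0", "doc2_0"]

# Naive check: "doc1" also matches "doc10_0".
naive = [i for i in ids if i.startswith("doc1")]

# Prefix plus delimiter: only the chunks of doc1.
strict = [i for i in ids if matches_any_prefix(i, ["doc1"])]

# An array of prefixes, as suggested above.
multi = [i for i in ids if matches_any_prefix(i, ["doc1", "doc2"])]

print(naive)
print(strict)
print(multi)
```

This only works if the delimiter never appears inside the prefix itself.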

Would it be computationally expensive to check whether every metadata value or id starts with a given prefix? Maybe this is why we can only apply simple filters on metadata.

norewindz · April 1, 2024, 11:39pm

In the docs for delete with id prefix, there is no “starts with” operation.

Users should store the id prefix and chunk total themselves and assemble the delete batch operation.
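As a rough sketch of what that assembly could look like, assuming chunk ids follow a `<prefix>_<chunk index>` pattern starting at 0 (the id scheme, helper names, and batch size here are all assumptions, not something taken from the docs):

```python
from typing import Iterator


def chunk_ids(prefix: str, chunk_total: int, delimiter: str = "_") -> list[str]:
    """Rebuild the ids of all chunks of a document from the stored prefix and chunk total."""
    return [f"{prefix}{delimiter}{i}" for i in range(chunk_total)]


def batches(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


ids = chunk_ids("doc1", 5)

for batch in batches(ids, 2):
    # In practice each batch would be passed to the client's delete-by-id call,
    # e.g. index.delete(ids=batch) with the Pinecone Python client.
    print(batch)
```

The batch size of 2 is only for readability; pick one that fits whatever per-request limit applies to your index.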

So there is currently no “starts with” support. I think it must be computationally expensive!