Metadata types and (long) lists of booleans

dany.majard1 · January 9, 2024, 5:09pm

Hello,

We need to implement a metadata search that excludes items based on different “perimeters”, which may number in the hundreds or thousands. To do this, we considered the following solutions:

Store a list of uids of the “perimeters” for which the boolean flag is 1 and use the $in operator. Unfortunately, we found that the footprint of the list elements is too big (we cannot reduce it to a few bytes; it seems incompressible at around 44 bytes).
Store a list of booleans and use an index to retrieve the relevant one for filtering. The type is unsupported and the operator does not exist.
Store the booleans in integers using the binary representation, and use bitwise operators to perform updates and filtering. The type is supported but the operator is not.

For example, assuming 4 perimeters and a search pertaining to perimeter 3, the solutions above would look like the following. Consider item X, which is within perimeters 1 and 3:

item X has metadata {perimeters:["1", "3"]}, filter with "3" $in perimeters
item X has metadata {perimeters: [True, False, True, False]}, filter with perimeters[2] is True (assuming indexing from 0)
item X has metadata {perimeters: 5} (i.e. 0101 in binary), filter with perimeters $binaryAND 4 = 4 (i.e. filter is 0100 in binary)
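
To make the encodings concrete, here is a minimal Python sketch of the first and third options. It assumes a convention where bit (n - 1) stands for perimeter n, so an item in perimeters 1 and 3 gets the mask 5 (0101 in binary). The helper names are hypothetical, and the bitwise check runs client-side only, since no such filter operator is available; `in_filter` just builds the dictionary form of a `$in` metadata filter.

```python
from typing import Iterable


def encode_perimeters(perimeters: Iterable[int]) -> int:
    """Pack 1-based perimeter numbers into an integer bitmask.

    Assumed convention: bit (n - 1) is set when the item is in perimeter n.
    """
    mask = 0
    for n in perimeters:
        mask |= 1 << (n - 1)
    return mask


def in_perimeter(mask: int, perimeter: int) -> bool:
    """Client-side version of the wished-for `perimeters $binaryAND bit = bit`."""
    bit = 1 << (perimeter - 1)
    return (mask & bit) == bit


def in_filter(perimeter: int) -> dict:
    """Filter for option 1, where perimeter uids are stored as a list of strings."""
    return {"perimeters": {"$in": [str(perimeter)]}}


item_x = encode_perimeters([1, 3])
print(item_x, format(item_x, "04b"))  # 5 0101
print(in_perimeter(item_x, 3))        # True
print(in_perimeter(item_x, 2))        # False
print(in_filter(3))                   # {'perimeters': {'$in': ['3']}}
```
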

Would you have a solution for this problem? Is there a plan to allow such functionality in the near future?

Unfortunately, none of these is currently possible.
Thank you!

silas · January 9, 2024, 6:38pm

Hi @dany.majard1,

I think it might help to say a bit more about your use case, the entities involved and how they relate to each other, and some examples of high level queries you’re trying to enable.

There may be an alternate way to structure your data in Pinecone to achieve the outcome you’re after.

dany.majard1 · January 10, 2024, 9:50am

Hello @silas !

We have millions of documents and complex, dynamic rules governing which documents each of our users may see. The result is what I called a “perimeter” above. We need to perform ANN on demand over the subset of documents corresponding to a user’s perimeter. We may have hundreds or thousands of users. There are also other parameters by which we wish to narrow the query results.

There is no relationship at all between the perimeters. Does that help?

silas · January 12, 2024, 4:59pm

By perimeter, is that comparable to a geo-fence? If a user is within some geographical boundaries, then it should activate for that user the sub-set of data that is relevant for that region?

(I understand that your perimeter may not be earth-based coordinates – just for analogy)

dany.majard1 · January 12, 2024, 9:12pm

Hello Silas,

This is indeed not that different from geo-fencing. Taking marketing as an example, assume our users define their own regions of interest, within which they can dynamically and individually select ad topics from a large taxonomy and, if they want, explicitly exclude individual brands from a large set of brands.

silas · January 12, 2024, 9:36pm

When performing semantic search, do they need to run the same semantic search query across multiple regions of interest at once and take the top_k matches across those regions, or should those searches be limited to a single region?

Also, are the regions/content the same for all users, and are just selected/activated for a given user based on that user’s preference? Or for any two users would they have different regions and/or different content?

dany.majard1 · January 15, 2024, 9:39am

Thank you for picking this up Silas,

Each user has a different “perimeter” which, as the example showed, can be a union of different geographies. We are currently running a single query for the whole perimeter, not differentiating between the geographies. That may be a future step, if technically possible.

As for the regions + taxonomy (let’s omit the list of brands), the set is fixed but large. It is almost impossible for two users to have the same selection (high cardinality of the taxonomy and low cardinality of the user base).

Does that help?

silas · January 16, 2024, 5:29pm

I’m wondering if it would work for you to store each region’s data in a namespace.
The implication is that because you cannot query across namespaces, you’d need to run a given user query separately against each namespace region. For example, if a user selected 10 regions then you’d need to run 10 queries (in parallel) and combine/aggregate/filter the results on the client.

This would be ok if the number of regions is high (you said it is), as long as the number of regions for any given user is low (say, <10?).
(If a user can belong to hundreds or thousands of regions, this approach becomes less tenable.)
This approach may be more or less effective based on whether the user is running the query and waiting for the results live, or whether the query can be performed in the background ahead of when the user needs to make use of the results.
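
A rough Python sketch of that fan-out and merge, assuming a hypothetical `query_fn(namespace, vector, top_k)` wrapper around whatever client call you use, which returns matches shaped like `{"id": ..., "score": ...}` where a higher score means a closer match:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

Match = Dict[str, object]  # assumed shape: {"id": str, "score": float}
QueryFn = Callable[[str, Sequence[float], int], List[Match]]


def query_regions(query_fn: QueryFn, namespaces: Sequence[str],
                  vector: Sequence[float], top_k: int) -> List[Match]:
    """Run one query per namespace in parallel and keep the global top_k."""
    if not namespaces:
        return []
    # Cap the pool size so a user with many regions does not spawn many threads.
    with ThreadPoolExecutor(max_workers=min(len(namespaces), 16)) as pool:
        per_region = pool.map(lambda ns: query_fn(ns, vector, top_k), namespaces)
        merged = [match for matches in per_region for match in matches]
    # Assumes higher score = better match (e.g. cosine similarity).
    return heapq.nlargest(top_k, merged, key=lambda m: m["score"])


def fake_query(namespace: str, vector: Sequence[float], top_k: int) -> List[Match]:
    """Stand-in for a real per-namespace query, for illustration only."""
    data = {
        "paris": [{"id": "p1", "score": 0.9}, {"id": "p2", "score": 0.4}],
        "lyon": [{"id": "l1", "score": 0.7}],
    }
    return data[namespace][:top_k]


top = query_regions(fake_query, ["paris", "lyon"], [0.1, 0.2], top_k=2)
print([m["id"] for m in top])  # ['p1', 'l1']
```

Each region is asked for the full top_k because, in the worst case, all of the best matches sit in a single namespace.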

dany.majard1 · January 19, 2024, 10:27am

Hello Silas,

Using Namespaces for geographical regions is not a bad idea, and I will propose it to the business. But it doesn’t solve the full “perimeter” problem.

Let me try to describe it differently with a very concrete example:
Assume that we have a low user count, say 1k, making single-region queries for restaurants only. But each query is tailored to the user: it needs to run on all restaurants visited by their social network over the last 2 years, extended to 2 hops (friends and friends of friends). The user can also manually exclude some of their friends from the query, as well as some food distributors that supply restaurants. The result is a very complex mapping user → vectors to consider in the query, where the images of this mapping overlap (there is no use case for namespaces, as restaurants have multiple suppliers and the social networks may overlap as well).

I believe that this describes reasonably well why I would want to store the “include in query for user X” flag in metadata.

silas · January 20, 2024, 12:49am

Yeah, that is a pretty complex use case. I wonder if a graph DB is more in line with your needs?

dany.majard1 · January 24, 2024, 5:07pm

Should I understand that there’s no plan for Pinecone to implement bitwise operators?
There is no social network in our use case, but that was the best way for me to describe the fact that the function customer → queryable vectors that we need to implement is not a union or intersection of a few simple filters. We could have a graphDB storing these relationships plus a certain number of neighbours (a vector DB is a fully connected graphDB, which is silly to implement).

Of course we can implement the filtering post-query, hoping that by casting a large enough net, we can get to a reliable number of neighbours, but that is slower and more (computationally) expensive for a worse result.
Given the management of the customer index, storing these flags in integers and using bitwise operators is the simplest engineering solution we can think of.
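
For reference, the post-query fallback looks roughly like the Python sketch below. It assumes a hypothetical `query_fn(vector, top_k)` that returns matches sorted by descending score, a set of ids the customer is allowed to see, and an assumed `max_top_k` cap on how wide the net may grow. The repeated, widening queries are exactly the extra cost mentioned above.

```python
from typing import Callable, Dict, List, Sequence, Set

Match = Dict[str, object]  # assumed shape: {"id": str, "score": float}


def search_with_post_filter(query_fn: Callable[[Sequence[float], int], List[Match]],
                            vector: Sequence[float], allowed_ids: Set[str],
                            k: int, oversample: int = 4,
                            max_top_k: int = 10_000) -> List[Match]:
    """Over-fetch, filter client-side, and widen the net until k matches survive."""
    top_k = min(k * oversample, max_top_k)
    while True:
        matches = query_fn(vector, top_k)
        kept = [m for m in matches if m["id"] in allowed_ids]
        exhausted = len(matches) < top_k  # the index had nothing more to return
        if len(kept) >= k or exhausted or top_k >= max_top_k:
            return kept[:k]
        top_k = min(top_k * 2, max_top_k)


def fake_query(vector: Sequence[float], top_k: int) -> List[Match]:
    """Stand-in index of 100 documents, d0 (best) to d99 (worst)."""
    return [{"id": f"d{i}", "score": 1 - i / 100} for i in range(min(top_k, 100))]


hits = search_with_post_filter(fake_query, [0.1, 0.2], {"d10", "d50", "d90"}, k=2)
print([m["id"] for m in hits])  # ['d10', 'd50']
```

In this toy run the net has to grow from 8 to 64 results (four queries) before two allowed documents show up.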

silas · January 24, 2024, 5:14pm

“Should I understand that there’s no plan for Pinecone to implement bitwise operators?”

Sorry, I don’t have access to the internal roadmap.
