Multiple indices, or 1 index to rule them all?

karim.wahba · December 5, 2023, 4:10pm

Hi there,
question for the community about index design. Suppose we have different entities we want to embed, for example a product and a supplier. I see two options:
option A: embed each entity in its own index, with appropriate metadata
option B: embed both entities in the same index, and have an entity_type field as a metadata filter, so that one can still restrict to the entity of interest when needed.

What are the pros and cons of each approach in terms of performance and application? We are having an internal debate and can see arguments for both sides. Ultimately any choice is reversible, but I’m wondering if others have faced a similar decision.

silas · December 5, 2023, 6:11pm

The main factors to consider are cost and whether your embeddings are compatible (dimension size and similarity metric).

All vectors in an index must use the same dimension size and the same similarity metric (cosine, Euclidean, etc.). So if your entity use cases differ by embedding model or embedding size, or use a different similarity metric, then you don’t really have a choice: you have to use separate indexes.
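
As a quick way to apply that rule, here is a minimal sketch that checks whether a set of entities could share one index. It is plain Python and does not call any vector database client; `EmbeddingSpec` and `can_share_index` are hypothetical names made up for illustration, and the dimensions shown are example values only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """Hypothetical description of how one entity type is embedded."""
    dimension: int
    metric: str  # e.g. "cosine" or "euclidean"


def can_share_index(specs: dict[str, EmbeddingSpec]) -> bool:
    """Return True if every entity uses the same dimension and metric."""
    distinct = {(s.dimension, s.metric.lower()) for s in specs.values()}
    return len(distinct) <= 1


# Example values only: both entities embedded with the same settings.
same = {
    "product": EmbeddingSpec(dimension=1536, metric="cosine"),
    "supplier": EmbeddingSpec(dimension=1536, metric="cosine"),
}

# Example values only: the supplier embeddings have a different size.
different = {
    "product": EmbeddingSpec(dimension=1536, metric="cosine"),
    "supplier": EmbeddingSpec(dimension=768, metric="cosine"),
}

print(can_share_index(same))       # True
print(can_share_index(different))  # False
```

If the check fails, separate indexes are the only option; if it passes, the decision comes down to cost and operational overhead.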

But if all of your entities use the same dimension size and similarity metric, then you can likely save on cost by using one index. You’ll pay a minimum cost for each index up front, so several indexes that each hold a relatively small amount of data will likely cost more than one index holding all of it.

You’ll also likely want to use namespaces as a logical bucketing mechanism for each entity, as that will make it easier to manage the data for each entity separately from the other entities. There is no limit to the number of namespaces in an index.
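
To make the difference between namespaces and an entity_type metadata filter concrete, here is a toy in-memory sketch. `ToyIndex` is a hypothetical stand-in written for this example, not a real client, and the two-dimensional vectors are made-up values; it only illustrates how the two ways of scoping a query behave within a single index.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyIndex:
    """Hypothetical in-memory index: one dimension, many namespaces."""

    def __init__(self, dimension):
        self.dimension = dimension
        # namespace -> {item_id: (vector, metadata)}
        self.namespaces = defaultdict(dict)

    def upsert(self, namespace, item_id, vector, metadata=None):
        if len(vector) != self.dimension:
            raise ValueError("vector does not match the index dimension")
        self.namespaces[namespace][item_id] = (vector, metadata or {})

    def query(self, namespace, vector, top_k=3, entity_type=None):
        """Search one namespace, optionally filtering on entity_type."""
        scored = []
        for item_id, (vec, meta) in self.namespaces.get(namespace, {}).items():
            if entity_type is not None and meta.get("entity_type") != entity_type:
                continue
            scored.append((cosine(vector, vec), item_id))
        scored.sort(reverse=True)
        return [item_id for _, item_id in scored[:top_k]]


index = ToyIndex(dimension=2)

# One namespace per entity: a query only ever sees that entity's vectors.
index.upsert("products", "p1", [1.0, 0.0])
index.upsert("suppliers", "s1", [0.9, 0.1])

# One shared namespace with an entity_type field used as a filter.
index.upsert("shared", "p1", [1.0, 0.0], {"entity_type": "product"})
index.upsert("shared", "s1", [0.9, 0.1], {"entity_type": "supplier"})

by_namespace = index.query("products", [1.0, 0.0])
by_filter = index.query("shared", [1.0, 0.0], entity_type="supplier")
unfiltered = index.query("shared", [1.0, 0.0])

print(by_namespace)  # ['p1']
print(by_filter)     # ['s1']
print(unfiltered)    # ['p1', 's1']
```

In this sketch, a namespace fixes the scope of a query up front, while the metadata filter leaves open the option of searching across both entities in one query.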

The other factor to consider would be operational overhead of managing multiple indexes, monitoring capacity, monitoring query performance, scaling up, etc. vs just doing all that for one index.

silas · December 7, 2023, 5:04pm

@karim.wahba, hope this helps, but let us know if you have any other questions.