Clarification on how namespaces work

jspicher · February 8, 2023, 1:57pm

Hello,

I have a question regarding the namespace values for vectors:

If I insert a handful of vectors with a namespace of “test-1” and another handful with a namespace of “test-2”,

and then run a query without passing a namespace value, it appears that the test-1 and test-2 namespaces are ignored: their vectors are neither checked for similarity nor returned.

Only after I add the namespace=test-1 parameter to my query does Pinecone return results from the specified namespace. Having said that, what if I want to query all vectors regardless of namespace? Or perhaps all vectors where namespace = test-1 OR test-2? Is that possible?

Cory_Pinecone · February 8, 2023, 2:51pm

Hi @jspicher,

That’s correct, queries can only be run in one namespace at a time. This includes the null or blank namespace, which is separate from any namespaces that are actually named.

We don’t have a mechanism to search across multiple namespaces or to perform a JOIN-like action. But I’m happy to share this with our product team as a potential future enhancement. Can you share more about what you have in mind when searching multiple namespaces? What use case would be improved by that? Any additional color you can share would help with design and prioritization.

jspicher · February 8, 2023, 3:03pm

Hello again Cory,

Thank you ever so much for your time and response. I’m not necessarily suggesting a new feature; I’m new to Pinecone and vector indexing, and I expected that a query with no namespace would search all namespaces within the index, not just the vectors stored with no namespace specified. So really I was just trying to better understand how I should think about my architecture and set up my data within the indexes.

As an example use case, let’s assume my goal is to create a semantic search engine for my website.
Each page has:

Article Content
Title Tag
Keywords Tag
Meta Description Tag

If I wanted to weight each of these elements a little differently (say, the document title weighing more heavily than the content), would you recommend putting each of them in its own namespace, which requires 4 API queries per user search, or should this type of use case be handled within the metadata, for example by adding a “data-type” metadata field?

Thanks again,

Cory_Pinecone · February 8, 2023, 4:30pm

I see. Yeah, in this case, if you’re going to search across multiple tags at once, I would use metadata to separate them. Then you can use the $in operator to filter on several values at once, or skip the filter entirely and search all of them.
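
A minimal sketch of that metadata approach, not an official example. It assumes `index` is a Pinecone index object whose `query` method accepts `vector`, `top_k`, `filter` and `include_metadata` keyword arguments. The “data-type” key comes from the question above; the values such as "title" and "keywords" and both helper names are hypothetical.

```python
from typing import Any


def build_type_filter(data_types: list[str]) -> dict[str, Any]:
    """Build a metadata filter that matches any of the given data-type values."""
    if not data_types:
        # An empty dict means "no filter": every vector in the namespace is searched.
        return {}
    return {"data-type": {"$in": list(data_types)}}


def search(index: Any, vector: list[float], data_types: list[str], top_k: int = 5) -> Any:
    """Run one query, restricted to the requested data types (hypothetical helper)."""
    kwargs: dict[str, Any] = {"vector": vector, "top_k": top_k, "include_metadata": True}
    metadata_filter = build_type_filter(data_types)
    if metadata_filter:
        # Only pass a filter when there is something to filter on.
        kwargs["filter"] = metadata_filter
    return index.query(**kwargs)
```

Calling `search(index, embedding, ["title", "keywords"])` would look only at title and keyword vectors, while `search(index, embedding, [])` searches all of them in a single request.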

alberduris · May 17, 2023, 3:31pm

Hi @Cory_Pinecone,

I’m highly interested in this feature. Is there any update? Please, let me know.

Sean · May 17, 2023, 5:37pm

@Cory_Pinecone - If I am storing data in a multi-tenant environment (basically I store vectors for CustomerA, CustomerB, CustomerC and their data must NEVER overlap) are Namespaces the recommended approach to store the data (querying by namespace=CustomerA, for instance), or should I use Metadata ({‘customer’: ‘CustomerA’})?

jspicher · May 17, 2023, 5:56pm

While there is no limit to the number of namespaces an index can have (source: https://docs.pinecone.io/docs/limits#:~:text=request%20is%201%2C000.-,Namespaces,number%20of%20namespaces%20per%20index.) I would recommend metadata. I think you’ll run into scaling issues if using namespaces.

alberduris · May 17, 2023, 6:00pm

I’m curious why you think you’ll have scaling issues using namespaces @jspicher.

galarcon · May 19, 2023, 5:25am

Hi, guys. I’m facing the same problem. I’m not sure whether to create an index per customer or a namespace per customer, given that every customer will have billions of vectors in their space/DB. For now, I’ll use one index per customer, with one global namespace for all of the customer’s documents (vectors) and one namespace for each special search category.

If you have any recommendations on this topic, please reply.

diego · May 31, 2023, 9:03pm

Great answer @jspicher! Thank you, this sheds light on what could be best practices. In @Sean’s example, would a namespace of “Customers” be ideal, or would “Customers” just be another metadata field?

colemanng · June 2, 2023, 9:14am

Hi @jspicher, thanks for your great answer. I face the same problem too.

Our planned use case is one namespace per customer for their knowledge base (say, cust1, cust2, etc.), and we will also use the “null” namespace to store general information. If searching across multiple namespaces were possible, we could search “cust1” + “null” or “cust2” + “null”, etc.

I’d appreciate any alternative suggestions. Thank you.

jspicher · June 2, 2023, 2:39pm

Disclaimer: I am by no means a Pinecone or vector index expert. I’m just as new to this stuff as most of you. My area of expertise is relational databases; I have worked with SQL as a DBA for the past ~20 years.

Having said that, I generally think of namespaces as individual tables and metadata as the columns of those tables, even though a vector index is much more similar to a document database (such as MongoDB) than to a classical relational database (MySQL, etc.). This may well be an oversimplified and dated way of looking at the two features, and it would be helpful if someone with intimate knowledge of exactly how they work within the Pinecone ecosystem were to speak up on the subject.

Having said all of that, here’s my take. In my example (Clarification on how namespaces work - #3 by jspicher), metadata is the way to go. If I were to put each of my data points (Article Content, Title Tag, Keywords Tag, Meta Description Tag) into its own namespace, each search would require multiple API calls (4 in my case). This is where my scalability concern comes from, even without knowing exactly how Pinecone handles namespaces in the backend.

However, with the data points stored as metadata, each search would require only a single API call.
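
As a rough illustration of how the per-element weighting could then be applied to the results of that single call, here is a sketch of a client-side re-ranking step. It is only an assumption about how one might do it, not something Pinecone provides: the weights, the “data-type” values and the function name are all made up, and each match is assumed to be a dict with a `score` and optional `metadata`.

```python
from typing import Any

# Hypothetical weights: a title hit counts for more than a body-content hit.
FIELD_WEIGHTS = {"title": 1.5, "keywords": 1.2, "description": 1.1, "content": 1.0}


def rerank_by_field(
    matches: list[dict[str, Any]],
    weights: dict[str, float] = FIELD_WEIGHTS,
) -> list[dict[str, Any]]:
    """Scale each match's similarity score by the weight of its data-type."""
    rescored = []
    for match in matches:
        field = match.get("metadata", {}).get("data-type")
        weight = weights.get(field, 1.0)  # unknown or missing types keep their raw score
        rescored.append({**match, "weighted_score": match["score"] * weight})
    # Assumes higher scores are better (e.g. cosine or dot-product similarity).
    return sorted(rescored, key=lambda m: m["weighted_score"], reverse=True)
```

With these example weights, a title match scoring 0.6 would rank above a content match scoring 0.8.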

Now in @Sean’s example (Clarification on how namespaces work - #4 by Cory_Pinecone), where he made it clear he would NEVER search across more than a single customer’s records at a time, putting each customer into its own namespace would work. Though Pinecone places no limit on the number of namespaces an index can have, I have no idea whether you would run into scaling issues with millions of namespaces.

But good luck to him if he ever has to search for a customer by something other than the namespace ID, for example a customer named “John Smith”. That would require iterating over every customer namespace, which doesn’t seem very scalable to me. Anyone else?

diego · June 2, 2023, 8:30pm

Thanks @jspicher. I’ve also conceptually tied namespaces to tables. I’m sure we’re wrong here, but it helps to untangle this vector concept.

The issue I’m having is that I’m NOT seeing any actual use case for namespaces. Even taking security concerns into account, they don’t make sense to me. The way I see it, a metadata approach is scalable and doesn’t limit the model in any way, given that a search can only be run in a single namespace. So @colemanng, I’m not sure how you would search for “cust1” + “null”. I might be mistaken here, but more detail can be found in the docs: Using namespaces

jspicher · June 2, 2023, 8:48pm

In my use case, I was creating a semantic search engine for our network of websites.
In that instance, I put each website into its own namespace.

While of course this could all have been done with metadata storing a name/value for each of the sites, doing it with namespaces seemed to be the proper way to go.

It is clear from the documentation that the more “unique” metadata values you have within your index, the slower the reads will be. Nothing similar is said about having a vast number of namespaces.

jamahlmd · June 8, 2023, 6:35pm

@jspicher With that approach of storing every customer’s or website’s data in a separate namespace,

do you ever find that the data still seems to overlap?

To test I stored text in namespace “773001911801952213043-1” saying: “The company name is ABC”
And I stored text in namespace “773001911801952213043-3” saying: “The company name is XYZ”

When querying the Pinecone store with both namespaces, it always seems to return the company as ABC…

Cory_Pinecone · October 17, 2023, 5:53pm

Hi all, just wanted to circle back on this one.

While metadata is more flexible than namespaces because you can mix and match your metadata, using namespaces for multitenant hosting is always the recommended best practice. There are a couple of reasons for this.

Namespaces ensure no mixing of customer data

If you have multiple customers with their own unique sets of data, you want to make sure that any queries run for Alice don’t sample data held for Bob. Using a namespace gives this guarantee since a query can only operate in a single one at a time.

If you do need to merge or mix data from multiple customers, you can do so in your app with post-query processing.
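
A minimal sketch of that kind of post-query processing, under a few assumptions: `index` is a Pinecone index object whose `query` method accepts `vector`, `top_k` and `namespace`, the response can be read like a dict with a `matches` list, and higher scores mean closer matches. The function name and the namespace names in the usage note are hypothetical.

```python
from typing import Any


def query_namespaces(
    index: Any,
    vector: list[float],
    namespaces: list[str],
    top_k: int = 5,
) -> list[dict[str, Any]]:
    """Query each namespace separately, then merge the results in the app."""
    merged: list[dict[str, Any]] = []
    for namespace in namespaces:
        # One request per namespace: a single query never spans namespaces.
        result = index.query(vector=vector, top_k=top_k, namespace=namespace)
        for match in result["matches"]:
            merged.append(
                {"id": match["id"], "score": match["score"], "namespace": namespace}
            )
    # Keep the overall best matches, highest score first.
    merged.sort(key=lambda m: m["score"], reverse=True)
    return merged[:top_k]
```

For example, `query_namespaces(index, embedding, ["cust1", ""])` would combine one customer’s namespace with the blank namespace, at the cost of one API call per namespace.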

Metadata cardinality can impact performance

One of the lesser-known issues with metadata, which I don’t see mentioned anywhere on this post, is that of metadata cardinality impacting performance. Essentially, the more unique values you have for a given metadata key, the more space the internal metadata index has to consume. And that’s space that is not available for your vectors, limiting how many you can store in each pod and possibly increasing your costs in the process.

Also, a high degree of cardinality can slow writes, as each upsert or update to a vector has to update the metadata index. As a result, data can take seconds to become fresh and available, and reads may return 504 errors until the write queue has finished processing.

These issues can be mitigated by using selective metadata filtering, and only indexing those fields you need for filtering.

For these reasons and more, we recommend using namespaces over metadata for multitenant applications.

jad.eljerdy · April 25, 2024, 7:43pm

Late to the party, but I’m planning on using serverless for my SaaS to scale (what a technology you guys created there!). There’s a limit on namespaces (10K) for the standard plan, and I’m planning on giving every tenant a namespace. I know I can dynamically create indexes when namespaces max out, and given the limit of 20 indexes per project, I will have roughly 200,000 namespace allocations (200K users). There’s also a limit of 20 projects per organization, but there seems to be no way to create projects programmatically to avoid this restriction. What do you advise here? Is your recommendation to use namespaces for multi-tenancy still valid, or should we stick with metadata given the new limits?
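
The arithmetic above, and one way tenants could be spread across indexes, can be sketched as follows. This is only an illustration using the limits quoted in this post (10K namespaces, 20 indexes per project); the sequential tenant number, the index slot and the `tenant-N` namespace naming are assumptions.

```python
NAMESPACES_PER_INDEX = 10_000  # limit quoted above for the standard plan
INDEXES_PER_PROJECT = 20       # limit quoted above


def capacity() -> int:
    """Total tenant namespaces available in one project."""
    return NAMESPACES_PER_INDEX * INDEXES_PER_PROJECT


def place_tenant(tenant_number: int) -> tuple[int, str]:
    """Map a zero-based tenant number to (index slot, namespace name)."""
    if not 0 <= tenant_number < capacity():
        raise ValueError("tenant number does not fit in a single project")
    # Fill the first index completely before moving on to the next one.
    index_slot = tenant_number // NAMESPACES_PER_INDEX
    return index_slot, f"tenant-{tenant_number}"
```

Here `capacity()` gives 200,000, and tenant number 10,000 would be the first tenant placed in the second index (slot 1).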

Also, are there any plans to raise these limits in the long run?

Thanks!

patrick1 · April 30, 2024, 10:32am

Hello @jad.eljerdy,

We would still recommend using namespaces. You can see all of the options in the documentation: understanding-multitenancy

If you’re reaching the limits, please contact us and we can talk through the options.

Regarding the ability to create projects programmatically, this is something we are looking into.