Hi All - I’m new to Pinecone and looking for additional guidance on the best way to structure my indexes for a multi-tenant solution.
Key requirements:
Multi-tenant: Segment customer database to isolate their data from other customers’ data.
Cross-tenant search: There is no need to search across multiple customers’ data.
Workspace: Each customer can have any number of workspaces, each grouping related items around a topic. Conversations, documentation, and other items are grouped under a workspace to provide context and supporting data. Each workspace supports CRUD operations and would see a lot of updates.
Based on this, I put together a high-level drawing to validate whether I am taking the best approach.
In general, I think using the namespace as a segmenting construct is appropriate along with some sort of logical naming convention to isolate customer data.
But there are a few considerations/restrictions.
Namespaces are contained in an index, not the other way around.
In your diagram it appears that a namespace would contain multiple indexes, but this is not possible.
It is also not possible to “nest” a namespace within another namespace.
Each index can only contain vectors of the same size and can only support one similarity metric, so you need to be sure that any data you’re planning to store in the index would be compatible. So if one customer needed 1028-dimensional embeddings and another needed 1536-dimensional ones, you would need to put them in different indexes, not just different namespaces.
Metadata is primarily used for filtering the results of a similarity search, but it sounds like you may have something else in mind. Can you elaborate on what you mean by the workspace being CRUD? The only reason to store data in Pinecone is if you plan to do vector search over the data to find matching items. If you’re going to look up data in Pinecone by its unique ID, you probably just want to keep that data in a more traditional (and cheaper) datastore.
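To make the index and namespace constraints above concrete, here is a minimal Python sketch. It does not call the Pinecone client at all; `IndexSpec`, `tenant_namespace`, `pick_index`, and the index names are hypothetical names made up for illustration. It only models the rule that one index holds one vector size and one metric, while customers are separated by flat namespaces inside it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexSpec:
    """Hypothetical description of an index: one dimension, one metric."""
    name: str
    dimension: int
    metric: str


def tenant_namespace(customer_id: str) -> str:
    """Build a flat namespace name for a customer (namespaces cannot nest)."""
    if not customer_id:
        raise ValueError("customer_id must not be empty")
    return f"customer-{customer_id}"


def pick_index(specs: list[IndexSpec], dimension: int, metric: str) -> IndexSpec:
    """Return the index whose dimension and metric match the embedding."""
    for spec in specs:
        if spec.dimension == dimension and spec.metric == metric:
            return spec
    raise LookupError(f"no index for dimension={dimension}, metric={metric}")


# Assumed setup: two indexes because two embedding sizes are in use.
specs = [
    IndexSpec("docs-1028", 1028, "cosine"),
    IndexSpec("docs-1536", 1536, "cosine"),
]

target = pick_index(specs, dimension=1536, metric="cosine")
namespace = tenant_namespace("acme")
print(target.name, namespace)
```

The point of the design is that the embedding size picks the index and the customer picks the namespace, so two customers with the same embedding model can share one index without seeing each other's data.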
Thank you for your feedback on my Pinecone vector database design. I have revised the design to take into account your recommendations.
The revised design is as follows:
Each index is dedicated to a single type of data. This makes it easier to find the specific data I seek. For example, I have indexes for lean canvases, strategies, documents, and conversations.
Each index has its own namespaces, one for each company that uses my application. This isolates the data for each company.
The metadata field stores additional information about the data, such as the workspace name or the conversation ID. This information can be used to filter the results of a similarity search. For example, I can use the workspace name to find documents and conversations for a particular workspace.
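As a rough sketch of how the three points above fit together, the helper below assembles the arguments for one query: the company selects the namespace and the workspace name becomes a metadata filter. The function name `build_workspace_query`, the `company-` prefix, and the `workspace_name` metadata key are assumptions for illustration. The returned keys mirror the parameters I would expect to pass to the Pinecone client's query call (`vector`, `top_k`, `namespace`, `filter`, `include_metadata`), but they should be checked against the client version in use.

```python
def build_workspace_query(company_id, workspace_name, embedding, top_k=5):
    """Assemble keyword arguments for a namespace-scoped, filtered query."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    return {
        # One namespace per company keeps tenants isolated inside the index.
        "namespace": f"company-{company_id}",
        "vector": list(embedding),
        "top_k": top_k,
        # Metadata filter: only match vectors tagged with this workspace.
        "filter": {"workspace_name": {"$eq": workspace_name}},
        "include_metadata": True,
    }


query_args = build_workspace_query("acme", "growth-plan", [0.1, 0.2, 0.3])
print(query_args["namespace"], query_args["filter"])
```

Running the same query against the documents index and the conversations index, with the same namespace and filter, would collect the context for one workspace.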
This revised design is more scalable and efficient than the original one. I am eager to experiment with it and see how it performs in my application.
I would also like to explore your additional feedback and suggestions. For example, you mentioned that I should avoid defining the metadata of workspace_name-uuid in the workspaces index. I am open to your thoughts on this.
Workspace-lean-canvas and workspace-strategy are examples of how I represent records from my traditional database. The goal is for an AI to have all the context available within a single workspace. For instance, if a user is working on a lean canvas, the AI should be able to access the relevant documents, conversations, and strategies for that workspace.
The main concern I’d still have is whether the primary method by which you retrieve a given workspace-* record is via a primary key lookup vs. a semantic search. If semantic search, Pinecone makes sense. If primary key, then another DB like Dynamo, etc. would likely be more appropriate (and much cheaper).
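One way to picture that split is the sketch below, which is only an illustration under assumptions: a plain dict stands in for the traditional datastore, `vector_search` is a hypothetical callable standing in for a vector database query that returns matching record IDs, and `fetch_records` is a made-up helper name.

```python
from typing import Callable, Optional, Sequence


def fetch_records(
    records: dict,
    vector_search: Callable[[Sequence[float]], list],
    record_id: Optional[str] = None,
    query_embedding: Optional[Sequence[float]] = None,
) -> list:
    """Route a read: key lookups hit the datastore, semantic queries hit the vector search."""
    if record_id is not None:
        # Primary key lookup: no vector search involved.
        return [records[record_id]]
    if query_embedding is not None:
        # Semantic search returns IDs; full records still come from the datastore.
        return [records[i] for i in vector_search(query_embedding) if i in records]
    raise ValueError("provide record_id or query_embedding")


# Stand-in data and a fake search that always returns the same ID.
records = {
    "ws-1": {"type": "lean-canvas", "title": "Canvas A"},
    "ws-2": {"type": "strategy", "title": "Strategy B"},
}
fake_search = lambda embedding: ["ws-2"]

print(fetch_records(records, fake_search, record_id="ws-1"))
print(fetch_records(records, fake_search, query_embedding=[0.1, 0.2]))
```

If every read in the application ends up on the `record_id` branch, that is a sign the data does not need to live in a vector index.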