Pinecone Clarifications

We are exploring Pinecone for our organization's search platform. I need the clarifications below; they will help us proceed with our POC activities.

  1. I have user data available in our database, which includes images, PDFs, etc. We need to move all of that data into Pinecone to perform search. If I want to send the data below to Pinecone, do I need to convert it into vector embeddings using ML libraries and then store it in Pinecone? If yes, how do I embed all of these fields in one go, and what ML library can we use here? Please share references. I have seen a few Pinecone YouTube videos where they used random.random() to generate dummy vectors for each record. But what do I need to do in a real application?

"metadata": {
"first Name": "John",
"last Name": "Smith",
"address1": "7631 Wildwood",
"address 2": "APT 1234567",
"state": "Utah",
"country": "USA",
"zip": "940958465",
"time_stamp": 0
}

  2. After inserting the above record, I have to perform a search on it. So when a user searches for the first name "John", we have to return the results. In this case, do we need to convert the input string to a vector embedding and then send that to Pinecone to retrieve the results? Please correct me if I am wrong, and please suggest some documentation; I don't find much information on the web.

  3. What about high availability? Is this fully managed by Pinecone? If yes, may I know how we can ensure high availability of the system?

  4. If we want to upload/index a huge amount of data into Pinecone, what is the procedure? Please share examples. Also, the Pinecone website says that it supports only Python at this moment, with Java and Go available soon. Is Java available now?

  5. Can we perform the search below on Pinecone? A user made lots of transactions at different places using his credit card; when he comes and searches for the word "pizza", the system should return all the transactions he made at pizza shops like Domino's Pizza, Pizza Hut, etc.

  6. What kind of maintenance activities do we need to perform on indexes?

  7. What kind of index management activities can we do? Also, how does Pinecone ensure high availability, for example by taking snapshots, recovering from failures, etc.?

Could you please answer the above questions so we can better understand the product?


Hey @sprabakar01, great to hear that you're considering Pinecone!

  1. For any object you'd like to search by similarity, you will need to convert that object into a dense vector. There are different embedding models for doing this, depending on the use case. CNNs are usually used for images, while PDFs are typically broken down into small "chunks" of text (roughly a paragraph long in many cases) and then embedded using sentence transformers (see the sketch below). As for the metadata, you can upload it into Pinecone alongside your vectors and later use it for metadata filtering. You do not need to embed metadata fields.
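Here is a minimal sketch of that flow, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (384 dimensions); the index name, credentials, and field names are placeholders:

import pinecone
from sentence_transformers import SentenceTransformer

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")  # placeholder credentials
index = pinecone.Index("example-index")  # assumes an index created with dimension=384

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim sentence embeddings

# One text chunk per record; metadata rides alongside the vector.
chunks = ["John Smith lives at 7631 Wildwood, Utah, USA."]
metadata = [{"firstName": "John", "lastName": "Smith", "state": "Utah"}]

embeddings = model.encode(chunks).tolist()
index.upsert(vectors=[
    (f"user-{i}", values, meta)  # (id, values, metadata) tuples
    for i, (values, meta) in enumerate(zip(embeddings, metadata))
])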

  2. Use the metadata filter, like so:

query_response = index.query(
    queries=[
        (vector, {"firstName": {"$eq": "John"}}),  # Replace vector with any vector embedding; $eq takes a single value.
    ],
    top_k=10,
    include_values=True,  # Optional. Indicates whether vector values are included in the response.
    include_metadata=True  # Optional. Indicates whether metadata is included in the response along with the IDs.
)

The query() method does require a query vector. In this case, since you only care about the metadata filter, you can use any vector you want, or even a dummy value like [0, 0, 0, …]. Just be sure the length of that array (i.e., the dimensionality of the query vector) exactly matches the dimension of the other vectors in the index.

See the query() API reference for details.
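For example, a metadata-only lookup against a hypothetical 384-dimension index might look like:

dummy = [0.0] * 384  # must exactly match the index's dimension
query_response = index.query(
    queries=[
        (dummy, {"firstName": {"$eq": "John"}}),
    ],
    top_k=10,
    include_metadata=True
)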

  3. Yes, Pinecone is a fully managed service. We keep things running smoothly and securely so you don't have to worry about the infrastructure. You do, however, need to run a sufficient number of replicas. On the Standard plan, Pinecone uses anti-affinity so that replica pods are spread across availability zones. In the event of a failure, the remaining replicas take up the traffic, so they must have sufficient capacity to handle your throughput. Customers who require an SLA for availability should consider the Dedicated plan and contact us to talk about their requirements.

  4. For uploading/indexing large amounts of data, we recommend using the gRPC client and parallel upserts, as in the sketch below. The Java client is not available yet. However, our API follows the OpenAPI standard, so anyone can build clients on top of it.
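A rough sketch of that pattern, assuming the pinecone-client[grpc] package (placeholder credentials and index name; all_vectors stands in for your prepared (id, values, metadata) tuples):

import itertools
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")  # placeholders
index = pinecone.GRPCIndex("example-index")

def batches(iterable, size=100):
    # Yield successive fixed-size batches from an iterable.
    it = iter(iterable)
    while batch := list(itertools.islice(it, size)):
        yield batch

# Send batches without blocking, then wait for all of them to complete.
async_results = [
    index.upsert(vectors=batch, async_req=True)
    for batch in batches(all_vectors)
]
[result.result() for result in async_results]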

  5. Yes. Semantic similarity models, like those from the sentence-transformers library, map semantically similar words and phrases to nearby vectors. So when searching for "pizza", similar items like "Pizza Hut", "Domino's", "pizza restaurant", or "pizzeria" return higher similarity scores with a well-trained model (see the quick check below). You can then use metadata filtering to show only transactions for that customer, or within a certain timeframe, or under a certain amount, and so on.
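As a quick check, assuming the all-MiniLM-L6-v2 model from sentence-transformers:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
query = model.encode("pizza")
candidates = ["Pizza Hut", "Domino's Pizza", "pizzeria", "hardware store"]
scores = util.cos_sim(query, model.encode(candidates))
# The pizza-related candidates should score well above "hardware store".
print(dict(zip(candidates, scores[0].tolist())))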

  6. We monitor and take care of index health. If you try to add more data than an index can hold (roughly 1M 768-dim vectors on p1 pods and 5M on s1 pods, as of this writing), you will get an error. Just create an index with enough pods to hold your data, and that's it.

  7. Index management is quite simple: you can create, delete, and describe an index. More management and monitoring options are coming soon. As a managed service, we take care of things like monitoring, fault tolerance, failure recovery, availability (see the note about replicas above), security, and so on.

These are great questions!

Feb 8: Edited to say p1 and s1 pods hold 1M and 5M vectors, respectively, and not 1GB and 5GB of vectors.


Hello Greg,

Thank you for the response. Could you please share your thoughts on the follow-up questions below as well?

  1. If we want to store the data below in Pinecone, do we need to create vector embeddings for the values (John, Smith, etc.) before inserting it into Pinecone?

{
"first Name": "John",
"last Name": "Smith",
"address1": "7631 Wildwood",
"address 2": "APT 1234567",
"state": "Utah",
"country": "USA",
"zip": "940958465",
"time_stamp": 0
}

  2. Also, when we are retrieving records, do we need to convert the input parameters to vector embeddings and pass them to Pinecone? You mentioned that we can pass a dummy value like [0, 0, 0, …]. If we pass a dummy value, how will Pinecone identify the records without vector embeddings? We may need to store billions of records in some indexes.

  3. Our scenario is: first we will load a huge amount of existing data into Pinecone. Then we will get incremental data (millions of records) on a daily basis, which we need to insert into Pinecone as well. How should we create the indexes to accommodate future data, given that we need to define the number of dimensions, pods, replicas, etc.? We may need to increase the dimensions and pods at a later point. Is that possible? How do we manage these scenarios?


Hey @sprabakar01, whatever you want to search semantically, you have to turn into a vector embedding.

If you want to semantically search through transaction descriptions (e.g., "transaction_description": "Pizza Hut"), that's what you turn into a vector embedding, and that's the primary thing that Pinecone indexes. All the other data, such as customer details and transaction amounts, can be added as metadata for filtering purposes.

Then, to semantically search through the descriptions, you also turn the query ("Pizza") into a vector embedding and query the Pinecone index. If you want to limit your search results by customer or any other metadata, you apply filters to the query, as in the sketch below.

The dummy suggestion was for returning all items that match the filter only, without a query. For example, if you wanted to show all transactions from a particular user before any query is submitted.
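Putting that together, a sketch of the "Pizza" search with a per-customer filter (the model and index are reused from the examples above; the customer_id field and its value are hypothetical):

query_vector = model.encode("Pizza").tolist()
results = index.query(
    queries=[
        (query_vector, {"customer_id": {"$eq": "cust-42"}}),  # hypothetical customer field
    ],
    top_k=10,
    include_metadata=True
)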

How should we create the indexes to accommodate future data, given that we need to define the number of dimensions, pods, replicas, etc.? We may need to increase the dimensions and pods at a later point. Is that possible? How do we manage these scenarios?

At the moment you must define the number of pods when creating the index. Some users overprovision so they have room to add vectors; then, as they approach the index limit, they create a second, larger index and swap. We're actively working on making it easier to scale an index up or down without having to swap. In the meantime, we can help you make the swap.

Our usual guideline is that each p1 pod holds around 1M 768-dim vectors and each s1 pod holds around 5M. However, if you plan to index billions of vectors, you should talk to us about special configuration and pricing.
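As an illustration, provisioning an index for roughly 20M 768-dim vectors on s1 pods might look like this (the index name and counts are assumptions; size for your own data):

import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")  # placeholders
pinecone.create_index(
    "transactions",   # illustrative index name
    dimension=768,    # fixed at creation time; must match your embedding model
    pod_type="s1",
    pods=4,           # ~5M 768-dim vectors per s1 pod -> ~20M capacity
    replicas=2        # extra replicas for availability and throughput
)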

Thank you for the clarifications. We will post again if anything else comes up.


Just +1 to the question about the IDs here^
Should we generate a UUID for them or… what’s the best practice?
thx

Hi @bigrig

It depends on what you might use the IDs for. You can encode meaningful information in the ID itself so you can fetch a vector on demand without running a query (say you are storing book pages in Pinecone and know you want to show the first page: fetch the vector with ID book-1).

If you are not using the IDs for anything in particular, then UUIDs are what I've seen used most commonly. Most libraries use Python's uuid module (e.g., LangChain).
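For example, with Python's built-in uuid module:

import uuid

# Random, collision-safe IDs when the ID carries no meaning:
ids = [str(uuid.uuid4()) for _ in range(3)]

# Or encode meaning so you can fetch directly later, e.g. index.fetch(ids=["book-1-page-1"]):
page_ids = [f"book-1-page-{n}" for n in range(1, 4)]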

Hope this helps
