Does Pinecone allow for training the semantic search?

Can I train the underlying language models used for the semantic search so that the returned documents are a better match to my query term? Or am I bound to use the already trained language models?


Hey @zaradana, Pinecone does not have any language models built in, so you're free to use any model you want (trained to your liking) to generate your vector embeddings. Once you have the vectors, you can upload them into Pinecone. We support any vector dimensionality up to 20,000, which is enough for vectors from the vast majority of models, even the latest ones from OpenAI (dim = 12,000).
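To make that concrete, here's a minimal sketch of the "bring your own embeddings" workflow. The `embed` function below is a toy placeholder standing in for whatever model you choose (a sentence-transformers model, an OpenAI endpoint, your own fine-tuned network), and the upsert call is only shown in a comment:

```python
import hashlib

def embed(text, dim=8):
    """Toy deterministic 'embedding' used here only as a stand-in
    for a real model's encoder output."""
    digest = hashlib.sha256(text.encode()).digest()
    # Map bytes into floats in [0, 1] to mimic a dense vector.
    return [b / 255.0 for b in digest[:dim]]

docs = {"doc1": "how to train a model", "doc2": "vector databases"}

# Build (id, vector) pairs in the shape Pinecone's upsert expects.
items = [(doc_id, embed(text)) for doc_id, text in docs.items()]

# With the Pinecone client, uploading would then be roughly:
#   index.upsert(vectors=items)
for doc_id, vector in items:
    print(doc_id, len(vector))
```

Because Pinecone never sees the model itself, swapping in a better-trained encoder only changes what `embed` returns; the upload step stays the same.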

We do offer three similarity metrics to choose from (euclidean, dot product, cosine), and your choice can affect the quality of your results.
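For intuition, all three metrics are easy to compute by hand. This quick sketch (with made-up vectors) shows how the same pair of vectors can score very differently under each metric:

```python
import math

def euclidean(a, b):
    # Distance: lower means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    # Higher means more similar; sensitive to vector magnitude.
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Higher means more similar; ignores magnitude (angle only).
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot_product(a, b) / (norm_a * norm_b)

# Same direction, different length:
a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(euclidean(a, b))    # ~3.742 (nonzero distance)
print(dot_product(a, b))  # 28.0
print(cosine(a, b))       # 1.0 (identical under cosine)
```

Cosine normalizes away magnitude, so if your model does not produce unit-length embeddings, dot product and cosine can rank results differently; it's worth matching the metric to how your model was trained.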


Hey @greg, related to the previous question, how straightforward would it be to change the model once the embeddings are inserted? I imagine this could happen fairly frequently as better models are trained.

Fairly straightforward, @dmlls: just create a new index and upload the new vectors there. Or partition the existing index into namespaces, one for each model/version you're using (e.g., namespace='all-mpnet-base-v2', namespace='all-distilroberta-v1'). Then you can limit your queries to a specific namespace:

# Only search vectors upserted under the 'all-distilroberta-v1' namespace.
index.query(queries=[[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]],
            top_k=1,
            namespace='all-distilroberta-v1')

The namespacing would only work if the new model has the same number of dimensions as the previous model. If the dimensions are different, you’ll need to create a separate index.

Yet another option is to store the model name/version as metadata along with each vector, then filter queries by that field.


I see, thanks! Didn't know about namespaces, that's pretty handy 👌