Embedding model for hybrid searches?

winston1 · May 31, 2024, 2:33pm

In this doc: Vector Similarity Explained | Pinecone, it says:

The basic rule of thumb in selecting the best similarity metric for your Pinecone index is to match it to the one used to train your embedding model.

My understanding is that the following two things are true:

OpenAI models are trained on cosine similarity.
Hybrid indexes must use dotproduct metric.

Should I switch to a different embedding model that was trained on dotproduct? Are there any such models? The score values I’ve seen for reasonably good matches are around 30 or 40, which seems like a weird scale.

Veronica_Pinecone · June 4, 2024, 7:45pm

Hi @winston1, thanks for your post! Yes, a dotproduct-trained embedding model will perform better with hybrid search, since hybrid search requires the dotproduct similarity metric.

Not all of the OpenAI models are cosine-trained, though. From our Model Gallery:

text-embedding-3-large uses cosine or dotproduct
text-embedding-3-small uses cosine or dotproduct
CLIP uses cosine or dotproduct
clip-ViT-B-32-multilingual-v1 uses cosine or dotproduct
text-embedding-ada-002 uses cosine or dotproduct

Please note the above is a nonexhaustive list.
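
As a rough illustration of why a model can be listed under both cosine and dotproduct: when embeddings are unit length (OpenAI documents that its embeddings are normalized to length 1), the dot product and the cosine similarity of two vectors are the same number. The sketch below uses small made-up vectors rather than real embeddings, and the helper names are only for illustration.

```python
import math

def dot(a, b):
    # Plain dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean length of a vector.
    return math.sqrt(dot(a, a))

def cosine(a, b):
    # Cosine similarity: dot product divided by both lengths.
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    # Scale a vector to unit length.
    n = norm(a)
    return [x / n for x in a]

# Made-up 3-dimensional vectors standing in for embeddings (an assumption
# for illustration; real embeddings have hundreds or thousands of dimensions).
u = normalize([3.0, 4.0, 0.0])
v = normalize([1.0, 2.0, 2.0])

# For unit-length vectors, dotproduct and cosine agree.
assert math.isclose(dot(u, v), cosine(u, v))

# For a vector that is not unit length, the dot product grows with its
# length while the cosine similarity stays the same.
scaled = [10 * x for x in u]
assert math.isclose(dot(scaled, v), 10 * dot(u, v))
assert math.isclose(cosine(scaled, v), cosine(u, v))
```

So if the dense vectors you upsert are already unit length, a dotproduct index should rank them the same way a cosine index would. Scores can still rise above 1 whenever some part of the vector is not unit length.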

Here are some additional resources for hybrid search: