Results score cut-off / distance change as search space grows?

I am building an application where I want to limit the results not by an absolute number but instead by their similarity to the query.

From the documentation it seems that the only option is top_k (a fixed number of results that meet a threshold similarity score). I am wondering how that similarity score is determined, and whether it may change as more upserts are indexed.

In other words, suppose I had an index with lots of statements about industrial safety and determined that, for my application, similarity needs to be < 0.1 in order to be useful. If I later added a bunch of statements about acrylic painting, would that change the relative scores of results such that I would have to adjust my in-app cut-off from < 0.1 to, say, < 0.15?

Assume that I am getting embeddings from OpenAI and using the same model version each time. (The embedding model isn’t changing; I am just asking about how Pinecone calculates the score in query results.)

Hi @shore,

We have a roadmap item to do this kind of similarity searching (“return all vectors within x distance of this one”), but we don’t have a release date set for it yet. It’s still in planning and development.

For now, top_k is the way to go. As you said, we return a fixed number of vectors, ordered from closest to furthest. There’s no limit on how far away a returned vector can be, so if your top_k is 10 and the 10 closest vectors are all at a distance of around 0.05, those are the ones returned.
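
In the meantime, one possible workaround is to request a generous top_k and apply the cut-off in your own application. Here is a minimal sketch of that idea, not an official API: it assumes each match is a dict with `id` and `score` keys, and that a lower score means closer (as with a distance metric). The function name `filter_by_cutoff`, the `lower_is_closer` flag, and the sample data are all hypothetical.

```python
from typing import Dict, List


def filter_by_cutoff(
    matches: List[Dict], cutoff: float, lower_is_closer: bool = True
) -> List[Dict]:
    """Keep only the matches whose score passes the cut-off.

    Assumption: each match is a dict with "id" and "score" keys.
    With a distance metric, lower scores are closer; with a
    similarity metric, higher scores are closer, so flip the flag.
    """
    if lower_is_closer:
        return [m for m in matches if m["score"] < cutoff]
    return [m for m in matches if m["score"] > cutoff]


# Hypothetical results from a query with a large top_k.
sample_matches = [
    {"id": "a", "score": 0.03},
    {"id": "b", "score": 0.08},
    {"id": "c", "score": 0.12},
]

kept = filter_by_cutoff(sample_matches, cutoff=0.1)
print([m["id"] for m in kept])  # ['a', 'b']
```

If every returned match passes the cut-off, there may be more qualifying vectors beyond top_k, so you would need to raise top_k and query again.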

Is there an update on this feature?

Would love to have this feature too 🙂

+1, Definitely something cool to have!
