Results score cut-off / distance change as search space grows?

I am building an application where I want to limit the results not by an absolute number but instead by their similarity to the query.

From the documentation it seems that the only option is top_k (a fixed number of results that meet a threshold similarity score). I am wondering how that similarity score is determined, and whether it may change as more upserts are indexed.

In other words, if I had an index with lots of statements about industrial safety and determined for my application that similarity needs to be < 0.1 in order to be useful, and then later added a bunch of statements about acrylic painting, would it change the relative score of results such that I would have to adjust my in-app cut-off from < 0.1 to, say, < 0.15?

Assume that I am getting embeddings from OpenAI with the same model version each time. (The embedding model isn’t changing; I am just asking about how Pinecone calculates the score of query results.)

Hi @shore,

We have a roadmap item to do this kind of similarity searching (“return all vectors within x distance of this one”), but we don’t have a release date set for it yet. It’s still in planning and development.

For now, top_k is the way to go. As you said, we return a fixed number of vectors going from closest to furthest. There’s no limit on how far a vector can be in the return, so if your top_k is 10 and the 10 closest vectors are all around a distance of .05, those would be the ones returned.
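
Until a threshold query exists, one workaround is to request a generous top_k and drop the results that miss your cut-off on the client side. Below is a minimal sketch of that filtering step only. It assumes each match is a dict with `id` and `score` keys; the `filter_by_score` helper, the `lower_is_better` flag, and the sample data are all hypothetical and not part of any Pinecone client. Whether a lower or a higher score means "closer" depends on the metric your index uses, so set the flag accordingly.

```python
from typing import Dict, List


def filter_by_score(
    matches: List[Dict], cutoff: float, lower_is_better: bool = True
) -> List[Dict]:
    """Keep only the matches whose score passes the cutoff.

    Hypothetical helper: `matches` is assumed to be a list of dicts
    that each carry an "id" and a numeric "score".
    """
    if lower_is_better:
        # Distance-style scores: smaller means closer to the query.
        return [m for m in matches if m["score"] < cutoff]
    # Similarity-style scores: larger means closer to the query.
    return [m for m in matches if m["score"] > cutoff]


# Made-up sample data standing in for the matches of a top_k=4 query.
sample_matches = [
    {"id": "safety-1", "score": 0.03},
    {"id": "safety-2", "score": 0.05},
    {"id": "safety-3", "score": 0.09},
    {"id": "painting-1", "score": 0.42},
]

kept = filter_by_score(sample_matches, cutoff=0.1)
print([m["id"] for m in kept])
```

If every returned match passes the cut-off, there may be more qualifying vectors beyond top_k, so you would need to raise top_k and query again.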

Is there an update on this feature?

Would love to have this feature too :slightly_smiling_face:

+1, Definitely something cool to have!

+1, any update on this?

For this proposed feature, would top_k=10000 suffice, or do you want to scan the whole index and return all results within a certain similarity?

If the latter, can you please share some color on why and how you intend to utilize potentially > 10k results?

I’m seriously thinking of switching to Weaviate just because of this one thing. This has been requested by multiple users, multiple times, all over the community, on Stack Overflow, and elsewhere. If you check your customer service tickets, they will mention this as one of the top items too.

So when will we get this feature of returning vectors not just by top_k but also above a certain threshold? The power of Pinecone is vector search, and if I cannot do it well or have to do post-processing, then why even bother with it?