Score threshold to filter out irrelevant results


I’ve recently been thinking about how to make sure we only return “sufficiently relevant” results to our users / downstream services.

The number of results returned from a query is the minimum of top_k and the number of documents matching the metadata filter. This holds regardless of how relevant those documents actually are.

Let’s assume top_k = 10. If a user submits a query that closely matches many documents, they will get back 10 great results - awesome :tada:! However, if they submit a query that doesn’t closely match any documents, they will get back 10 irrelevant results - disappointing, confusing, and breaks down trust :confused:. For downstream services (e.g. in a retrieval-augmented chatbot), this is also not ideal.

Each result has a similarity score. Ok, so we can discard results with a similarity score below some threshold. But… how do you determine the value of that threshold?

Back in the days of pure dense embeddings, we used a SentenceTransformer trained with cosine similarity. Cosine similarity between any embeddings generated by such a model falls within the interval [-1, 1] (and for unit-normalized embeddings, dot product gives the same value). So we transformed the score into the unit interval with (score + 1) / 2 and interpreted it as “% similarity” (bit dodgy, I know - but you can show pretty bars to users!!). Then I just played around with the threshold and found that for our particular model and data, a value of 0.7 seemed to work pretty well.
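
For anyone who wants to see that step spelled out, here is a minimal sketch of the rescale-and-filter logic, assuming each match is a dict with a raw cosine "score" in [-1, 1]. The function names, ids and scores are made up for illustration, and 0.7 is just the value that worked for our model and data.

```python
def to_unit_interval(cosine_score: float) -> float:
    """Map a cosine similarity from [-1, 1] onto [0, 1]."""
    return (cosine_score + 1) / 2


def filter_by_threshold(matches, threshold=0.7):
    """Keep only matches whose rescaled score reaches the threshold."""
    return [m for m in matches if to_unit_interval(m["score"]) >= threshold]


# Hypothetical query results: ids and raw cosine scores are invented.
matches = [
    {"id": "doc-a", "score": 0.82},
    {"id": "doc-b", "score": 0.41},
    {"id": "doc-c", "score": 0.10},
]

kept = filter_by_threshold(matches)
print([m["id"] for m in kept])
```

Note the parentheses in (score + 1) / 2. Without them the expression is score + 0.5, which does not land in the unit interval.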

Then came hybrid / sparse-dense embeddings. Btw, they’re pretty awesome! We kept the cosine-trained SentenceTransformer and opted to use SPLADE for our sparse embeddings (because hype!!). Little did I realize that SPLADE was trained with dot product similarity, and therefore returns non-unit sparse vectors. So SPLADE similarity scores are not constrained to the interval [-1, 1]. (Instead they fall in an interval [0, some_number_definitely_bigger_than_1] given that values are never negative - correct me if I’m wrong.) So that threshold of 0.7 no longer works.

I’m now trying to figure out an appropriate threshold for queries on our sparse-dense index. Initially I was thinking of normalizing the SPLADE embeddings, but I have been informed this is not a good idea. I think it’s further complicated by the fact that one can vary the alpha parameter, which changes the contributions of the dense and sparse components. With alpha = 1 (aka pure dense retrieval), threshold = 0.7 should yield the same results as before. With alpha = 0 (aka pure sparse retrieval), I suppose I could figure out an appropriate threshold using the previous approach. Is it reasonable to then interpolate between threshold_sparse and threshold_dense using the same convex combination logic:

threshold_hybrid = threshold_sparse * (1 - alpha) + alpha * threshold_dense
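
To make the idea concrete, here is a rough sketch. It assumes the hybrid score is itself the convex combination alpha * dense_score + (1 - alpha) * sparse_score, and that both thresholds are expressed on the raw score scale (so 0.7 on the rescaled "% similarity" scale corresponds to a raw cosine of 0.4). The sparse threshold of 12.0 and all the scores are made-up numbers, not measured values.

```python
def hybrid_score(dense_score: float, sparse_score: float, alpha: float) -> float:
    """Assumed hybrid scoring: convex combination of dense and sparse scores."""
    return alpha * dense_score + (1 - alpha) * sparse_score


def hybrid_threshold(threshold_dense: float, threshold_sparse: float, alpha: float) -> float:
    """Interpolate the two thresholds with the same convex combination."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return threshold_sparse * (1 - alpha) + alpha * threshold_dense


THRESHOLD_DENSE = 0.4    # raw cosine; equals 0.7 after (score + 1) / 2
THRESHOLD_SPARSE = 12.0  # hypothetical value, would need tuning on real data

alpha = 0.5
cutoff = hybrid_threshold(THRESHOLD_DENSE, THRESHOLD_SPARSE, alpha)

# Clears both individual thresholds, so it also clears the interpolated one.
both_good = hybrid_score(0.8, 20.0, alpha) >= cutoff

# Fails the dense threshold, but a large sparse score still carries it over.
dense_bad = hybrid_score(0.1, 30.0, alpha) >= cutoff

print(cutoff, both_good, dense_bad)
```

Under these assumptions, any result that clears both individual thresholds also clears the interpolated one, but the reverse does not hold: because sparse scores are unbounded, a big sparse score can compensate for a weak dense score. Whether that is acceptable probably has to be checked empirically.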

This is how I’m currently thinking about the problem, but my reasoning is likely flawed as I’m not an expert. Would really love to gather some insights from people who actually understand this stuff!



Indeed, when using SPLADE, the scores are not normalized, and normalization is not generally recommended. However, any solution you can think of may work empirically.

You could consider dividing your SPLADE scores by the norm of the query embedding (equivalently, normalizing only the query embedding). Because this rescales every score for a given query by the same positive constant, the ranking you obtain will remain unchanged, but a threshold may perform better with this adjustment. Another option is to train a dedicated classifier that operates on your top-k results for this purpose (predicting relevance with calibrated logits), or to use logits from a re-ranker/reader model. In fact, using a re-ranker may also improve the ranking itself.
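
As an illustration of the query-norm suggestion, here is a small sketch with toy non-negative vectors standing in for SPLADE output. The vectors, ids and helper function are invented, and real SPLADE vectors would be sparse and much higher dimensional.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy stand-ins for SPLADE embeddings: non-negative and not unit length.
query = [0.0, 1.8, 0.0, 2.4]
docs = {
    "doc-a": [0.0, 2.0, 0.5, 1.0],
    "doc-b": [1.2, 0.0, 0.0, 0.3],
    "doc-c": [0.0, 0.4, 0.0, 3.0],
}

query_norm = math.sqrt(dot(query, query))

raw = {doc_id: dot(vec, query) for doc_id, vec in docs.items()}
scaled = {doc_id: score / query_norm for doc_id, score in raw.items()}

ranking_raw = sorted(raw, key=raw.get, reverse=True)
ranking_scaled = sorted(scaled, key=scaled.get, reverse=True)

print(ranking_raw == ranking_scaled)
```

The order is the same either way. What changes is that the scaled scores no longer grow with the length of the query vector, which is why a single threshold may transfer better across queries.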
