Heya,

I’ve recently been thinking about how to make sure we only return “sufficiently relevant” results to our users / downstream services.

The number of results returned from a query is the minimum of `top_k` and the number of documents matching the metadata. This is true no matter the actual relevance of those documents.

Let’s assume `top_k = 10`. If a user submits a query that closely matches many documents, they will get back 10 great results - awesome! However, if they submit a query that doesn’t closely match any documents, they will still get back 10 irrelevant results - disappointing, confusing, and trust-eroding. For downstream services (e.g. in a retrieval-augmented chatbot), this is also not ideal.

Each result has a similarity score. Ok, so we can discard results with a similarity score below some threshold. But… how do you determine the value of that threshold?
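To make the idea concrete, here’s a minimal sketch of threshold filtering (the result format and `score` field are hypothetical, just for illustration):

```python
def filter_by_score(results, threshold):
    """Keep only results whose similarity score meets the threshold."""
    return [r for r in results if r["score"] >= threshold]

# Toy results: with threshold 0.7, only doc1 and doc3 survive.
results = [
    {"id": "doc1", "score": 0.92},
    {"id": "doc2", "score": 0.41},
    {"id": "doc3", "score": 0.77},
]
kept = filter_by_score(results, 0.7)
```

The hard part, of course, is not the filtering itself but choosing the threshold.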

Back in the days of pure dense embeddings, we used a SentenceTransformer trained with cosine similarity. Cosine / dot-product similarity between any embeddings generated by such a model falls within the interval `[-1, 1]`. So we transformed the score into the unit interval with `(score + 1) / 2` and interpreted it as “% similarity” (bit dodgy, I know - but you can show pretty bars to users!!). Then I just played around with the threshold and found that, for our particular model and data, a value of `0.7` seems to work pretty well.
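For clarity, the rescaling is just an affine map from `[-1, 1]` onto `[0, 1]`:

```python
def to_percent_similarity(cosine_score):
    """Map a cosine similarity in [-1, 1] onto the unit interval [0, 1]."""
    return (cosine_score + 1) / 2

# Endpoints map as expected:
p_max = to_percent_similarity(1.0)   # identical direction -> 1.0
p_min = to_percent_similarity(-1.0)  # opposite direction  -> 0.0
p_thr = to_percent_similarity(0.4)   # raw 0.4 lands at 0.7, the threshold
```

So a rescaled threshold of `0.7` corresponds to a raw cosine similarity of `0.4`.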

Then came hybrid / sparse-dense embeddings. Btw, they’re pretty awesome! We kept the cosine-trained SentenceTransformer and opted to use SPLADE for our sparse embeddings (because hype!!). Little did I realize that SPLADE was trained with dot-product similarity, and therefore returns non-unit sparse vectors. So SPLADE similarity scores are not constrained to the interval `[-1, 1]`. (Instead they fall in an interval `[0, some_number_definitely_bigger_than_1]`, given that the values are never negative - correct me if I’m wrong.) So that threshold of `0.7` no longer works.
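A quick toy example of why: with non-negative, non-unit sparse vectors, the dot product can easily exceed 1 (the token IDs and weights below are made up, not real SPLADE output):

```python
def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as {index: weight} dicts."""
    # Iterate over the smaller vector and look up matches in the larger one.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[i] for i, w in u.items() if i in v)

# Toy SPLADE-like outputs: non-negative weights, not normalized to unit length.
query = {101: 1.8, 2054: 0.9, 3000: 0.3}
doc   = {101: 2.1, 3000: 1.2, 4096: 0.7}
score = sparse_dot(query, doc)  # 1.8*2.1 + 0.3*1.2 = 4.14, well outside [-1, 1]
```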

I’m now trying to figure out an appropriate threshold for queries on our sparse-dense index. Initially I was thinking of normalizing the SPLADE embeddings, but I have been informed this is not a good idea. I think it’s further complicated by the fact that one can vary the `alpha` parameter, which changes the relative contributions of the dense and sparse components. Set `alpha = 1` (aka pure dense retrieval) and `threshold = 0.7` should yield the same result as before. Set `alpha = 0` (aka pure sparse retrieval) and I suppose I could figure out an appropriate threshold using the previous approach. Is it reasonable to then interpolate between `threshold_sparse` and `threshold_dense` using the same convex combination logic:

```
threshold_hybrid = threshold_sparse * (1 - alpha) + alpha * threshold_dense
```
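Here’s that interpolation as a sketch. The `threshold_sparse = 5.0` value is a made-up placeholder - it would need to be tuned empirically on the pure-sparse setting, the way `0.7` was for dense:

```python
def hybrid_threshold(alpha, threshold_dense=0.7, threshold_sparse=5.0):
    """Interpolate between per-modality thresholds with the same convex
    combination used to mix the dense and sparse scores.

    threshold_sparse=5.0 is a hypothetical placeholder, not a recommendation.
    """
    return alpha * threshold_dense + (1 - alpha) * threshold_sparse

t_dense  = hybrid_threshold(1.0)  # pure dense: recovers 0.7
t_sparse = hybrid_threshold(0.0)  # pure sparse: recovers 5.0
t_mid    = hybrid_threshold(0.5)  # halfway: 2.85
```

This only seems defensible if the hybrid score itself is the same convex combination of the dense and sparse scores - if the index mixes them differently, the interpolated threshold wouldn’t line up.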

This is how I’m currently thinking about the problem, but my reasoning is likely flawed as I’m not an expert. Would really love to gather some insights from people who actually understand this stuff!

Cheers!