We are also using hybrid search retrieval with a single hybrid index, and we use a hybrid score normalization method similar to yours.
These are the issues we have been facing when we set alpha < 0.5:

- When alpha is set to less than 0.5, relevance should ideally lean towards keyword search, but the top chunks are always returned based on semantic match.
- Even when a chunk has no keyword match, it is returned at the top with a higher score than chunks that do contain the keywords, so the ranking still favors semantic similarity.
- Chunks that have both some semantic match and keywords from the query are still ranked lower than chunks with only a semantic match.
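The behavior above is consistent with a score-scale mismatch: with a single hybrid index, the final query score is the sum of the weighted dense and sparse dot products, so if the dense similarities sit on a larger numeric scale than the raw BM25 values, the dense term can dominate even when alpha < 0.5. A toy calculation with made-up score magnitudes (the numbers are purely illustrative, not taken from our index):

```python
alpha = 0.3  # intended to favor keyword (sparse) relevance

# Hypothetical per-chunk scores before any rescaling:
# dense cosine similarities typically land in [0, 1], while raw BM25
# magnitudes depend heavily on corpus statistics and term frequencies.
dense_score = 0.85   # strong semantic match, no keyword overlap
sparse_score = 0.10  # weak BM25 score on a small-magnitude scale

dense_part = alpha * dense_score          # ~0.255
sparse_part = (1 - alpha) * sparse_score  # ~0.07
combined = dense_part + sparse_part

# Despite alpha = 0.3, the dense term contributes most of the score.
print(dense_part > sparse_part)
```

If this is the cause, normalizing (or otherwise rescaling) the BM25 values onto a range comparable to the dense similarities before applying alpha would make the weighting behave as intended.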
We use BM25 to encode the sparse vectors and text-embedding-3-large to encode the dense vectors in a single hybrid index. To apply a linear weighting across both vector types, we use this approach:
```python
def hybrid_score_norm(dense, sparse, alpha: float):
    """Hybrid score using a convex combination

    alpha * dense + (1 - alpha) * sparse

    Args:
        dense: Array of floats representing the dense vector
        sparse: a dict of `indices` and `values`
        alpha: scale between 0 and 1
    """
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hs = {
        'indices': sparse['indices'],
        'values': [v * (1 - alpha) for v in sparse['values']]
    }
    return [v * alpha for v in dense], hs
```
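For reference, a minimal runnable sketch of the function above with toy inputs (the vector values and sparse indices are made up for illustration):

```python
def hybrid_score_norm(dense, sparse, alpha: float):
    """Convex combination: alpha * dense + (1 - alpha) * sparse."""
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hs = {
        'indices': sparse['indices'],
        'values': [v * (1 - alpha) for v in sparse['values']]
    }
    return [v * alpha for v in dense], hs

# Toy query vectors (illustrative only)
dense = [0.2, 0.4, 0.6]
sparse = {'indices': [3, 17], 'values': [1.0, 2.0]}

# alpha = 0.25 should favor the sparse (keyword) side
d, s = hybrid_score_norm(dense, sparse, alpha=0.25)
print(d)            # [0.05, 0.1, 0.15]
print(s['values'])  # [0.75, 1.5]
```

Note that this scales the vector components, not the final scores: the weighting only behaves as expected if the unweighted dense and sparse dot products are on comparable scales to begin with.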
So tuning hybrid search with alpha values < 0.5 isn't working well for us; the relevance of the results is not satisfactory.