General Overview: We are currently encountering issues while fetching chunks from the Pinecone database using a hybrid-search approach. The following code snippet illustrates the method we are using to scale our hybrid search query:
def hybrid_scale(self, query, alpha: float):
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hsparse = {
        "indices": query["sparse_vector"]["indices"],
        "values": [v * (1 - alpha) for v in query["sparse_vector"]["values"]],
    }
    hdense = [v * alpha for v in query["vector"]]
    return {"vector": hdense, "sparse_vector": hsparse}
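As a sanity check, here is the same scaling as a standalone function applied to a toy query (all vector values are made up for illustration):

```python
def hybrid_scale(query, alpha: float):
    """Standalone copy of the method above, for illustration."""
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hsparse = {
        "indices": query["sparse_vector"]["indices"],
        "values": [v * (1 - alpha) for v in query["sparse_vector"]["values"]],
    }
    hdense = [v * alpha for v in query["vector"]]
    return {"vector": hdense, "sparse_vector": hsparse}

# Toy query: a 2-dim dense vector and a 2-entry BM25 sparse vector
query = {
    "vector": [0.2, 0.4],
    "sparse_vector": {"indices": [3, 7], "values": [1.0, 2.0]},
}

scaled = hybrid_scale(query, alpha=0.5)
# At alpha = 0.5 both sides are halved:
# dense [0.1, 0.2], sparse values [0.5, 1.0]
```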
The Problem: Our pipeline uses the BM25 model for sparse embeddings and the jinaai/jina-embeddings-v2-base-en model for dense embeddings.
Given a user query like: “I want to know more about the features introduced in v1.42,” we aim to retrieve all chunks related to v1.42 that discuss the features introduced.
However, we are experiencing the following issues:
Imbalance in Relevance: When using an alpha value of 0.1, the retrieved chunks are related to feature updates in general, but they do not specifically pertain to v1.42.
Loss of Semantic Relevance: When we decrease the alpha value to 0.01, we do get chunks related to v1.42, but they lack semantic relevance, thus failing to capture the contextual information we need.
Irrelevant Chunks: When using an alpha value greater than 0.1, the retrieved chunks do not discuss v1.42 at all, losing the keyword matching capability entirely.
In essence, adjusting the alpha value slightly leads to a significant trade-off between keyword matching and semantic relevance.
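One way to see why tiny alpha changes can flip the ranking: the combined score is a convex combination of raw scores on very different scales, so the crossover between "keyword wins" and "semantic wins" can sit at a very small alpha. A toy calculation with made-up scores (not from our data) reproduces the knife-edge pattern:

```python
# Made-up raw scores for two hypothetical chunks (illustration only):
# chunk A mentions "v1.42" (keyword hit, small raw sparse score here),
# chunk B is only semantically related (no keyword hit at all).
scores = {
    "A": {"dense": 0.30, "sparse": 0.05},
    "B": {"dense": 0.85, "sparse": 0.00},
}

def hybrid(chunk, alpha):
    s = scores[chunk]
    return alpha * s["dense"] + (1 - alpha) * s["sparse"]

for alpha in (0.10, 0.01):
    winner = max(scores, key=lambda c: hybrid(c, alpha))
    print(alpha, winner)
# With these numbers the crossover sits near alpha ~ 0.08, so
# alpha = 0.10 ranks B (semantic-only) first while alpha = 0.01
# ranks A (keyword hit) first: a tiny alpha change flips the result.
```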
Expected Behavior: We expect the hybrid search to balance both the keyword matching (v1.42) and the semantic relevance (features introduced) efficiently, without compromising one for the other.
Request for Support: We are seeking guidance on how to fine-tune our hybrid search parameters or any alternative approach to achieve a better balance between sparse and dense embeddings. Any insights, suggestions, or examples of how others have tackled similar issues would be greatly appreciated.
We are also using hybrid-search retrieval with a single hybrid index, and we use a hybrid score normalization method similar to yours.
These are the issues we have been facing when we set alpha < 0.5:
When alpha is set below 0.5, the ranking should ideally favor keyword search, but the top chunks are always returned based on semantic match.
Even when a chunk contains no keyword match at all, it is returned on top with a higher score than chunks that do contain the keywords, so semantic matching is effectively given more weight.
Chunks that have both some semantic match and keywords from the query are still ranked lower than chunks with only a semantic match.
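A useful first diagnostic for this symptom is to isolate each side. With the convex-combination scaling used in the snippet further down, alpha = 0 zeroes out the dense vector (pure keyword search) and alpha = 1 zeroes out the sparse values (pure semantic search). A sketch with made-up vectors:

```python
def hybrid_score_norm(dense, sparse, alpha: float):
    # Standalone copy of the scaling function shown below, for illustration.
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hs = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return [v * alpha for v in dense], hs

dense = [0.2, 0.4]                                   # made-up dense query vector
sparse = {"indices": [3, 7], "values": [1.0, 2.0]}   # made-up BM25 vector

# alpha = 0: dense side becomes all zeros -> pure keyword search
hdense0, hsparse0 = hybrid_score_norm(dense, sparse, alpha=0.0)
# alpha = 1: sparse values become all zeros -> pure semantic search
hdense1, hsparse1 = hybrid_score_norm(dense, sparse, alpha=1.0)

# Querying the index with each extreme shows whether the sparse side alone
# actually ranks the keyword-bearing chunks on top; if it does, the blending
# (not the BM25 encoding) is what needs tuning.
```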
We are using BM25 encoding to embed sparse vectors and text-embedding-3-large to embed dense vectors in a single hybrid index.
To apply a linear weighting to both vector types, we use this approach:
def hybrid_score_norm(dense, sparse, alpha: float):
    """Hybrid score using a convex combination

    alpha * dense + (1 - alpha) * sparse

    Args:
        dense: Array of floats representing the dense vector
        sparse: a dict of `indices` and `values`
        alpha: scale between 0 and 1
    """
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hs = {
        'indices': sparse['indices'],
        'values': [v * (1 - alpha) for v in sparse['values']]
    }
    return [v * alpha for v in dense], hs
So tuning the hybrid search with alpha values < 0.5 isn't working well for us; the relevance of the results is not satisfactory.
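One idea we are considering, in case it helps frame answers: L2-normalizing the raw BM25 values before blending, so both sides contribute on a comparable scale (dense similarities are bounded, while raw BM25 values are not). A sketch of this variant, not validated on our data:

```python
import math

def hybrid_score_norm_l2(dense, sparse, alpha: float):
    """Variant of hybrid_score_norm that L2-normalizes the sparse values
    before blending, so BM25's unbounded magnitudes don't swamp (or get
    swamped by) the dense similarities. A suggestion to try, not part of
    the original pipeline."""
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    norm = math.sqrt(sum(v * v for v in sparse["values"])) or 1.0
    hs = {
        "indices": sparse["indices"],
        "values": [(v / norm) * (1 - alpha) for v in sparse["values"]],
    }
    return [v * alpha for v in dense], hs
```

With this change, alpha behaves more like a true mixing dial, since both score components live on similar scales.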
What type of indexing were you using to store the dense and sparse vectors on Pinecone?
Pinecone recommends using separate indexes for sparse and dense vectors: search each index separately, combine and deduplicate the results, and use a reranking model to rank the chunks.
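That recommendation can be sketched roughly as follows; the merge step is pure Python, while the two index queries and the reranker are placeholders for whatever client and model you use:

```python
def merge_and_dedupe(sparse_hits, dense_hits, top_k=5):
    """Combine results from a sparse index and a dense index, dedupe by
    chunk id, and return the top candidates for reranking.
    Assumed hit format: {"id": str, "text": str, "score": float}."""
    seen = {}
    for hit in sparse_hits + dense_hits:
        # Keep the best score seen for each chunk id
        if hit["id"] not in seen or hit["score"] > seen[hit["id"]]["score"]:
            seen[hit["id"]] = hit
    candidates = list(seen.values())
    # Placeholder for a real reranker: a cross-encoder would score each
    # (query, chunk) pair here; we just sort by the kept retrieval score.
    candidates.sort(key=lambda h: h["score"], reverse=True)
    return candidates[:top_k]
```

In practice the sort at the end would be replaced by the reranker's scores, since raw sparse and dense scores are not directly comparable.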
Unfortunately, embeddings are a black box, so it is very difficult to determine what's going wrong here. I would suggest trying different sparse embeddings, such as SPLADE instead of BM25; SPLADE has given us better performance.
We also ended up fixing the alpha value at 0.5, which gave us the best performance.
It'll probably be easier to debug this issue if you can share a few examples of your chunks.