Understanding BM25 Parameters and Hybrid Search Logic with Sparse-Dense Vectors in Pinecone

Hello Pinecone Community,

I’m currently exploring advanced search capabilities within Pinecone and have a couple of questions I hope to gain deeper insight into:

  1. BM25 Parameters in BM25Encoder: I’m interested in the specifics of the BM25Encoder in pinecone_text, particularly the k1 and b parameters. Could anyone share what the default values are for these parameters?
  2. Hybrid Search Logic Using Sparse-Dense Vectors: Moving on to hybrid search strategies, I am curious about the methodology behind integrating sparse and dense vectors, especially regarding the computation of BM25 scores in such scenarios. How does Pinecone handle this blend, and what is the logic behind the BM25 score calculation? If normalization is used, what kind is employed?

Any insights or detailed explanations on these topics would be greatly appreciated, as they would significantly contribute to optimizing my implementation and understanding of Pinecone’s search functionalities.

Thank you in advance for your time and help.


Hello @jhyim and welcome to the community.

pinecone-text is open source, so feel free to have a look at the code yourself if you like.

I’m interested in the specifics of the BM25Encoder in pinecone_text, particularly the k1 and b parameters. Could anyone share what the default values are for these parameters?

  • b: Controls document length normalization (default: 0.75).
  • k1: Controls term frequency saturation (default: 1.2).

You can change the defaults - bm25-parameters
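To make the roles of the two parameters concrete, here is a minimal pure-Python sketch of the classic Okapi BM25 scoring formula (not the pinecone_text implementation itself, just the standard formula those defaults plug into). The function name and arguments are illustrative:

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Score one term in one document with the classic BM25 formula.

    tf: term frequency in the document
    df: number of documents containing the term
    n_docs: total number of documents in the corpus
    doc_len / avg_doc_len: document length and corpus-average length
    k1: term-frequency saturation; b: document-length normalization
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # b controls how strongly long documents are penalized;
    # k1 controls how quickly repeated terms stop adding to the score.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + norm)

# For a document longer than average, raising b lowers the score.
score = bm25_term_score(tf=3, df=10, n_docs=1000, doc_len=120, avg_doc_len=100)
```

With b = 0 document length is ignored entirely; with b = 1 it is fully normalized. Increasing k1 delays the saturation of repeated terms.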

Hi @patrick1
Thank you for the detailed response.
I appreciate you sharing the default values for the BM25Encoder parameters.
Thanks again for taking the time to help me out!

Anytime! On your second point, if you have not already, I would read over hybrid search intro blog.

The GitHub repo touches on this at the bottom, but the blog explains it well.

If you do need more details please let us know.

I am also interested in the “fusion” technique used to combine sparse and dense vectors. Is it a linear combination, or is it RRF (Reciprocal Rank Fusion) or some other algorithm? Simply put: how do the two results get combined into one?

Edit:
The blog mentioned by Patrick only mentions an alpha/weighting parameter. Any deeper insight into how the calculation is done?

It uses a linear combination, also known as a convex combination. Here is the function in the code.
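For anyone who wants to see the idea spelled out, here is a minimal sketch of that convex combination along the lines of the helper shown in Pinecone’s hybrid search docs (the function name is illustrative): the dense query vector is scaled by alpha and the sparse values by 1 - alpha before querying, so alpha = 1.0 is pure dense search and alpha = 0.0 is pure sparse:

```python
def hybrid_score_norm(dense, sparse, alpha):
    """Scale a dense vector and a sparse vector by a convex combination.

    dense: list of floats (dense embedding)
    sparse: dict with "indices" and "values" lists (e.g. BM25 output)
    alpha: weight in [0, 1]; 1.0 = pure dense, 0.0 = pure sparse
    """
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    # Sparse side gets weight (1 - alpha); indices are unchanged.
    hs = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    # Dense side gets weight alpha.
    hd = [v * alpha for v in dense]
    return hd, hs

dense_q = [0.2, 0.4, 0.6]
sparse_q = {"indices": [3, 17], "values": [1.5, 0.5]}
hd, hs = hybrid_score_norm(dense_q, sparse_q, alpha=0.75)
```

Because the index computes the final score as dense dot product plus sparse dot product, pre-scaling the query vectors this way yields alpha * dense_score + (1 - alpha) * sparse_score, i.e. a convex combination of the two scores rather than a rank-based fusion like RRF.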
