BM25 sparse encoding for hybrid search

I need clarification on how to correctly use BM25 sparse encoding.
Suppose I create sparse and dense embeddings for text extracted from PDFs to perform hybrid search.
I initially start with only three PDFs and, for simplicity, suppose this results in an initial corpus such as:

corpus_example = [
    "The IEMAP (Italian Energy Materials Acceleration Platform) project",
    "intends to create an experimental and computational infrastructure",
    "for the accelerated design and selection of advanced materials for energy"]

To compute the embeddings I have to:

  1. fit the encoder on the corpus
  2. call the .encode_documents method on the fitted encoder to compute the sparse vectors
  3. each sparse vector is a dictionary with keys "indices" and "values", where the "indices" are the unique identifiers of words in the vocabulary extracted from the corpus, and the "values" represent the importance of each word according to the BM25 algorithm.
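The steps above can be sketched in plain Python. This is only a minimal illustration of the BM25 idea and of the {indices, values} output shape; it is not the pinecone-text implementation, which has its own tokenizer and token-to-index mapping:

```python
import math
from collections import Counter

def fit_bm25(corpus, k1=1.2, b=0.75):
    """Collect corpus statistics: vocabulary, document frequencies, average length."""
    tokenized = [doc.lower().split() for doc in corpus]  # naive whitespace tokenizer
    vocab = {}                # token -> integer index
    doc_freq = Counter()      # token -> number of documents containing it
    for tokens in tokenized:
        for tok in set(tokens):
            doc_freq[tok] += 1
            vocab.setdefault(tok, len(vocab))
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    return {"vocab": vocab, "doc_freq": doc_freq, "avgdl": avgdl,
            "n_docs": len(tokenized), "k1": k1, "b": b}

def encode_document(model, text):
    """Return a sparse vector {'indices': [...], 'values': [...]} for one chunk."""
    tokens = text.lower().split()
    tf = Counter(tokens)
    dl = len(tokens)
    indices, values = [], []
    for tok, freq in tf.items():
        if tok not in model["vocab"]:
            continue  # token unseen at fit time: it has no index in the vocabulary
        df = model["doc_freq"][tok]
        idf = math.log(1 + (model["n_docs"] - df + 0.5) / (df + 0.5))
        score = idf * freq * (model["k1"] + 1) / (
            freq + model["k1"] * (1 - model["b"] + model["b"] * dl / model["avgdl"]))
        indices.append(model["vocab"][tok])
        values.append(score)
    return {"indices": indices, "values": values}
```

Note how tokens absent from the fitting corpus are silently dropped, which is exactly why the question about adding new documents matters.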

I came to the following function:

from typing import List

from pinecone_text.sparse import BM25Encoder

def get_bm25_embeddings_using_pinecone(corpus: List[str], text_chunks: List[str]) -> List[dict]:
    """
    Generate BM25 sparse embeddings for a list of text chunks after fitting
    the BM25 encoder on the corpus.

    Args:
        corpus (List[str]): A list of text chunks used to fit the BM25 encoder.
        text_chunks (List[str]): A list of text chunks to be embedded.

    Returns:
        List[dict]: A list of BM25 sparse embeddings, one per text chunk,
        each in the format {"indices": [...], "values": [...]}.
    """
    # Initialize the BM25 encoder
    encoder = BM25Encoder()

    # Fit the BM25 encoder on the corpus
    encoder.fit(corpus)

    # Generate BM25 embeddings for each text chunk after fitting
    bm25_embeddings = [encoder.encode_documents(chunk) for chunk in text_chunks]

    return bm25_embeddings

Then, to create embeddings for the first three documents, I invoke the function with the previously defined corpus, passing each sentence of the initial corpus as text_chunks.
Suppose that later I want to add a new document containing words not present in the initial corpus: how should I proceed?
Do I have to add the new document to the initial corpus and refit the encoder? Or can I simply keep using the same encoder fitted only on the initial corpus?

Hey, I’m also working on a hybrid search implementation with pinecone.

Do I have to add the new document to the initial corpus and refit the encoder? Or can I simply keep using the same encoder fitted only on the initial corpus?

Short answer: Yes, you have to refit the encoder, because it only stores the tokens that were present in the corpus during the initial fit.

For my application, I have a corpus of 100K+ texts and I add new documents every week. A simple solution is to re-fit the encoder every so often, once you've added a significant number of new texts.

Another option is to tokenize your new documents, check whether they contain any tokens that don't yet exist in your encoder, and re-fit only then.

Yet another option, which I haven't tried yet but which I think should work fine, is to:

  1. export your BM25 model to JSON after the initial fit,
  2. tokenize your new texts,
  3. check for new tokens that don't exist in your original JSON,
  4. then simply add those new tokens to the JSON and load this updated model.

I think the last method would be the most efficient at scale, since you don't have to re-fit over the entire corpus every time you want to update.
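Steps 2 and 3 could look something like this sketch, assuming you keep your own JSON file listing the tokens seen at fit time. That file layout is a hypothetical convention for illustration, not pinecone-text's own dump format, which should be checked before relying on it:

```python
import json

def find_new_tokens(vocab_path: str, new_texts: list) -> set:
    """Return tokens from new_texts that are missing from the stored vocabulary.

    vocab_path points to a JSON file containing a plain list of known tokens
    (a hypothetical convention for this sketch). Tokenization here is naive
    whitespace splitting; a real pipeline should reuse the encoder's tokenizer.
    """
    with open(vocab_path) as f:
        known = set(json.load(f))
    seen = {tok for text in new_texts for tok in text.lower().split()}
    return seen - known
```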


Thanks for your reply. Serializing the fitted BM25 encoder is probably the way to go, perhaps using a dedicated index in Pinecone (serialize the BM25 model into a metadata field and use a mocked dense vector of very small values). Instead of re-fitting every time a single new token is added, one can set a threshold, say 50 tokens, and refit once that threshold is exceeded.
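The threshold idea can be expressed as a small helper. This is only a sketch: the 50-token default and the whitespace tokenizer are just the assumptions discussed in this thread, not part of any library API:

```python
def should_refit(known_vocab, new_texts, threshold: int = 50) -> bool:
    """Return True once the number of unseen tokens in new_texts reaches the
    threshold, signalling that the BM25 encoder should be re-fitted on the
    extended corpus. Uses naive whitespace tokenization for illustration.
    """
    seen = {tok for text in new_texts for tok in text.lower().split()}
    return len(seen - set(known_vocab)) >= threshold
```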