I need clarification on how to use BM25 sparse encoding correctly.
I want to create sparse and dense embeddings for text extracted from PDFs in order to perform hybrid search.
I start with only three PDFs and, for simplicity, suppose this yields the following initial corpus:
corpus_example = [
    "The IEMAP (Italian Energy Materials Acceleration Platform) project",
    "intends to create an experimental and computational infrastructure",
    "for the accelerated design and selection of advanced materials for energy",
]
To compute the embeddings I have to:
- fit the encoder on the corpus
- call .encode_documents on the fitted encoder to compute the sparse vector
- this sparse vector is a dictionary with the keys "indices" and "values", where the "indices" correspond to the unique identifiers of words in the vocabulary extracted from the corpus, while the "values" represent the importance of each word according to the BM25 algorithm.
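The three steps above can be sketched with a toy encoder. This is a minimal illustration in pure Python, not the pinecone-text implementation: the class name `ToyBM25Encoder`, the whitespace tokenizer, the use of `zlib.crc32` to turn a token into an integer index, and the defaults `k1=1.2`, `b=0.75` are all assumptions made for the example.

```python
import zlib
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Toy tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


class ToyBM25Encoder:
    """Illustrative stand-in, NOT the pinecone-text BM25Encoder."""

    def __init__(self, k1: float = 1.2, b: float = 0.75) -> None:
        self.k1 = k1
        self.b = b
        self.avgdl = 0.0  # average document length, set by fit()

    def fit(self, corpus: List[str]) -> "ToyBM25Encoder":
        # Step 1: learn corpus-level statistics (here only the average length).
        lengths = [len(tokenize(doc)) for doc in corpus]
        self.avgdl = sum(lengths) / len(lengths)
        return self

    def encode_document(self, text: str) -> Dict[str, list]:
        # Step 2: BM25 term-frequency weight for every distinct token.
        tf = Counter(tokenize(text))
        doc_len = sum(tf.values())
        norm = self.k1 * (1.0 - self.b + self.b * doc_len / self.avgdl)
        # Step 3: return the {indices, values} dictionary.
        indices = [zlib.crc32(tok.encode("utf-8")) for tok in tf]
        values = [count / (count + norm) for count in tf.values()]
        return {"indices": indices, "values": values}


corpus_example = [
    "The IEMAP (Italian Energy Materials Acceleration Platform) project",
    "intends to create an experimental and computational infrastructure",
    "for the accelerated design and selection of advanced materials for energy",
]
encoder = ToyBM25Encoder().fit(corpus_example)
sparse = encoder.encode_document(corpus_example[0])
```

In this toy version a token's index comes from hashing the token itself, so the index does not depend on which corpus the encoder was fitted on; only the values do, through the average document length.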
I came up with the following function:
from typing import List

from pinecone_text.sparse import BM25Encoder


def get_bm25_embeddings_using_pinecone(corpus: List[str], text_chunks: List[str]) -> List[dict]:
    """
    Generate BM25 sparse embeddings for a list of text chunks after fitting the BM25 encoder to the corpus.

    Args:
        corpus (List[str]): A list of text chunks to fit the BM25 encoder.
        text_chunks (List[str]): A list of text chunks to be embedded.

    Returns:
        List[dict]: A list of BM25 sparse embeddings for each text chunk in the format {indices, values}.
    """
    # Initialize the BM25 encoder
    encoder = BM25Encoder()
    # Fit the BM25 encoder to the corpus
    encoder.fit(corpus)
    # Generate BM25 embeddings for each text chunk after fitting
    bm25_embeddings = [encoder.encode_documents(chunk) for chunk in text_chunks]
    return bm25_embeddings
Then, to create the embeddings for the first three documents, I call the function with the corpus defined above, passing each sentence of that corpus as text_chunks.
Suppose that later I want to add a new document containing words that are not in the initial corpus. How should I proceed?
Do I have to add the new document to the initial corpus and refit the encoder, or can I keep using the encoder fitted only on the initial corpus?
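To make the question concrete, the sketch below computes the corpus-level statistics that BM25 in general depends on (document count, average document length, per-term document frequency) before and after a document is added. It uses a toy whitespace tokenizer, and both the helper `corpus_stats` and the new document are hypothetical; nothing here should be read as a description of how the Pinecone encoder stores or updates its state.

```python
from collections import Counter
from typing import Any, Dict, List


def corpus_stats(corpus: List[str]) -> Dict[str, Any]:
    # Toy tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in corpus]
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    return {
        "n_docs": len(tokenized),
        "avgdl": sum(len(doc) for doc in tokenized) / len(tokenized),
        "doc_freq": doc_freq,
    }


initial_corpus = [
    "The IEMAP (Italian Energy Materials Acceleration Platform) project",
    "intends to create an experimental and computational infrastructure",
    "for the accelerated design and selection of advanced materials for energy",
]
new_document = "perovskite solar cells for energy"  # hypothetical new text

before = corpus_stats(initial_corpus)
after = corpus_stats(initial_corpus + [new_document])

# Compare the statistics that a fit on each corpus would see.
for key in ("n_docs", "avgdl"):
    print(key, before[key], "->", after[key])
for term in ("energy", "perovskite"):
    print(term, before["doc_freq"][term], "->", after["doc_freq"][term])
```

With only three documents a single addition moves every statistic noticeably; on a large corpus the same addition would shift them far less.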