I am developing a RAG-based chatbot application for enterprise digital transformation. Similarity search that relies only on semantic (dense) search has not been accurate enough. For example, when I search for the operating profit of Company A, chunks about the operating profit of Company B are ranked higher.
To address this issue, I am investigating hybrid search. However, hybrid search is not performing as expected. Specifically, even when alpha is set close to 1, which in Pinecone's convention should weight the results toward the dense (semantic) vectors, the search results are almost identical to those of keyword search.
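For context, the alpha weighting referred to here is assumed to be the usual convex combination, where alpha = 1 keeps only the dense vector and alpha = 0 keeps only the sparse vector. The sketch below is a minimal illustration of that assumption; the helper name `hybrid_scale` and the sample vectors are hypothetical and are not part of pinecone-client or pinecone-text.

```python
def hybrid_scale(dense, sparse, alpha):
    """Weight a dense/sparse vector pair by a convex factor alpha in [0, 1].

    alpha = 1.0 -> dense only, alpha = 0.0 -> sparse only (assumed convention).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1.0 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse


# Hypothetical sample vectors, for illustration only.
dense = [0.1, 0.2, 0.3]
sparse = {"indices": [10, 45], "values": [0.5, 0.5]}

dense_only, sparse_zeroed = hybrid_scale(dense, sparse, alpha=1.0)
dense_zeroed, sparse_only = hybrid_scale(dense, sparse, alpha=0.0)
```

If the weighting works this way, a query sent with alpha close to 1 should carry almost no sparse weight, which is why results that still match keyword search are surprising.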
I have two questions:
- Is the following process for preparing hybrid search correct?
- Are there any considerations for performing high-accuracy hybrid search? (e.g., how much data is required to create a high-accuracy BM25 model, and is it better to prepare data from various domains?)
Current process (I am testing with Japanese documents):
- Tokenize all documents using a Japanese tokenizer (e.g., SudachiPy).
- Fit a BM25 model using the created tokens as a corpus.
- Vectorize each chunk using the fitted BM25 model and upsert the vectors to the index.
- Vectorize the search query using the BM25 model and perform a similarity search.
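The four steps above can be sketched in plain Python as follows. This is an illustrative stand-in, not the pinecone-text implementation: the `TinyBM25` class, the whitespace `tokenize` function (standing in for a Japanese tokenizer such as SudachiPy), and the English sample chunks are all assumptions made for the example. The split between term-frequency weights on the document side and normalized IDF weights on the query side is one common way to express BM25 as a sparse dot product.

```python
import math
from collections import Counter


def tokenize(text):
    # Placeholder tokenizer: whitespace split. A real pipeline for Japanese
    # text would call a morphological tokenizer here instead.
    return text.lower().split()


class TinyBM25:
    """Simplified BM25 sparse encoder (illustrative, not pinecone-text)."""

    def __init__(self, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.doc_freq = Counter()
        self.n_docs = 0
        self.avgdl = 0.0

    def fit(self, corpus):
        # Step 2: collect document frequencies and the average document length.
        total_len = 0
        for doc in corpus:
            tokens = tokenize(doc)
            total_len += len(tokens)
            self.doc_freq.update(set(tokens))
        self.n_docs = len(corpus)
        self.avgdl = total_len / self.n_docs
        return self

    def encode_document(self, doc):
        # Step 3: length-normalized term-frequency weight per token.
        tf = Counter(tokenize(doc))
        dl = sum(tf.values())
        norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
        return {t: f / (f + norm) for t, f in tf.items()}

    def encode_query(self, query):
        # Step 4: IDF weight per query token, normalized to sum to 1.
        terms = set(tokenize(query))
        idf = {t: math.log((self.n_docs + 1) / (self.doc_freq[t] + 0.5))
               for t in terms}
        total = sum(idf.values())
        return {t: v / total for t, v in idf.items()}


def sparse_dot(query_vec, doc_vec):
    # Dot product over the tokens shared by the query and the document.
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())


# Hypothetical chunks, already "tokenizable" by whitespace (step 1).
chunks = [
    "company a operating profit rose in fiscal 2023",
    "company b operating profit fell in fiscal 2023",
    "company a opened a new office in osaka",
]
bm25 = TinyBM25().fit(chunks)
doc_vecs = [bm25.encode_document(c) for c in chunks]
query_vec = bm25.encode_query("company a operating profit")
scores = [sparse_dot(query_vec, v) for v in doc_vecs]
best = max(range(len(chunks)), key=scores.__getitem__)
```

In this toy corpus the keyword side does separate Company A from Company B, because the token `a` appears in the query and only in the Company A chunks. Whether the same holds for real Japanese text depends on how the tokenizer segments company names.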
The versions of the libraries I am using are as follows:
pinecone-client==2.2.4
pinecone-text==0.8.0
Thank you for your time and consideration.