How to improve hybrid search accuracy?

I am developing a RAG-based chatbot application for enterprise digital transformation. I have run into accuracy limitations with semantic search alone: for example, when searching for Company A's operating profit, chunks about Company B's operating profit are ranked higher.

To address this, I am investigating hybrid search. However, it is not performing as expected: even with alpha set close to 1, which should weight the semantic side heavily, the results are almost identical to plain keyword search.
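(For context, by alpha I mean the usual convex weighting between the dense and sparse query vectors, where alpha=1 is pure semantic search. A minimal sketch of that scaling pattern; `index`, `dense_query`, and `sparse_query` are illustrative names for objects that already exist elsewhere:)

```python
def hybrid_scale(dense, sparse, alpha: float):
    """Convex combination of dense and sparse query vectors.
    alpha=1.0 -> pure semantic (dense), alpha=0.0 -> pure keyword (sparse)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1.0 - alpha) for v in sparse["values"]],
    }
    scaled_dense = [v * alpha for v in dense]
    return scaled_dense, scaled_sparse

# dense_query comes from the embedding model, sparse_query from BM25.
# If this scaling step is skipped, raw sparse scores can dominate and
# the results end up looking like plain keyword search.
dense_q, sparse_q = hybrid_scale(dense_query, sparse_query, alpha=0.8)
results = index.query(vector=dense_q, sparse_vector=sparse_q, top_k=5)
```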

I have two questions:

  1. Is the following process for preparing hybrid search correct?
  2. Are there any considerations for performing high-accuracy hybrid search? (e.g., how much data is required to create a high-accuracy BM25 model, and is it better to prepare data from various domains?)

Current process (I am testing with Japanese documents; a code sketch follows the list):

  1. Tokenize all documents using a Japanese tokenizer (e.g., SudachiPy).
  2. Fit a BM25 model using the created tokens as a corpus.
  3. Vectorize each chunk using the fitted BM25 model and upsert the vectors to the index.
  4. Vectorize the search query using the BM25 model and perform a similarity search.
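
In code, these steps look roughly like the following (a minimal sketch using pinecone-text's BM25Encoder; the sample chunks and query string are illustrative):

```python
from pinecone_text.sparse import BM25Encoder
from sudachipy import dictionary, tokenizer

# SudachiPy tokenizer (SplitMode.C = longest word units; A/B are finer)
sudachi = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C

def tokenize_ja(text: str) -> str:
    # Pre-tokenize Japanese and re-join with spaces so the encoder's
    # whitespace-based splitting sees real word boundaries.
    return " ".join(m.surface() for m in sudachi.tokenize(text, mode))

# Illustrative chunks; in practice these are the document chunks.
chunks = [
    "A社の営業利益は前年比10%増の50億円となった。",
    "B社の営業利益は横ばいで推移した。",
]
tokenized = [tokenize_ja(c) for c in chunks]

# Steps 1-2: fit BM25 on the tokenized corpus. Note that BM25Encoder's
# default preprocessing (lowercasing, English stemming/stopwords) is
# designed for English; it may need to be disabled for Japanese.
bm25 = BM25Encoder()
bm25.fit(tokenized)

# Step 3: sparse vectors for upsert; step 4: encode the query the same way.
doc_sparse = bm25.encode_documents(tokenized)
query_sparse = bm25.encode_queries(tokenize_ja("A社の営業利益"))
```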

The versions of the libraries I am using are as follows:
pinecone-client==2.2.4
pinecone-text==0.8.0

Thank you for your time and consideration.

Hybrid search for Japanese is sensitive to chunking and tokenization.
Make sure your chunks are large enough to be meaningful; otherwise you end up vectorizing at the word level, which would explain why your results are almost identical to keyword search.
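
For example, a simple character-window chunker with overlap (the numbers are placeholders, not recommendations):

```python
def chunk_by_chars(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    # Slide a fixed-width character window; the overlap keeps sentences
    # that straddle a boundary intact in at least one chunk.
    assert 0 <= overlap < size
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```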

Thank you for your reply.

We are currently validating with 1,000 characters per chunk.
I know it depends on the document content, but in your experience, roughly how many characters per chunk work well for Japanese?