Should I preprocess my query?

I’m quite new to the concepts around similarity search and vector DBs, so please forgive me!

I’m building a Q&A app where users can ask questions about a large set of tax-related documents. I’m using an OpenAI embeddings model to perform a similarity search on my Pinecone index, but it seems like I could improve this process to retrieve more relevant documents.

One consideration I’ve had is whether I should query documents using the entire question (e.g. “How do I apply for a tax deduction on my charity donations?”) or whether I should preprocess the question in some way before passing it to the embeddings model, for example by reducing it to keywords (e.g. “apply, tax deduction, charity, donations”).
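
Roughly, here is the kind of comparison I’ve been running (just a minimal sketch; the index name “tax-docs”, the metadata fields, and the embedding model are placeholders for whatever you actually use):

```python
# Minimal sketch: embed the full question vs. a keyword-only version and
# compare what the same Pinecone index returns for each.
import os
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("tax-docs")  # placeholder index name

def embed(text: str) -> list[float]:
    # One embedding vector per query string.
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

full_query = "How do I apply for a tax deduction on my charity donations?"
keyword_query = "apply, tax deduction, charity, donations"

for query in (full_query, keyword_query):
    result = index.query(vector=embed(query), top_k=5, include_metadata=True)
    print(query)
    for match in result.matches:
        print(f"  {match.score:.3f}  {match.id}")
```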

In my experiments I’ve had some success, but I’m still not convinced. I’d like to hear from someone with a deeper understanding of what’s going on “under the hood” whether this approach is worth evaluating further.

Kind regards,
Adam

Hi!

I have some experience with this kind of work (legal domain), but I found the opposite to be true: the whole query provided the most information to search with. By trimming the query down to keywords, I think you are essentially throwing away the semantic information the embedding model can use. If keywords were enough, why not just search the tax documents by those keywords and their synonyms (if they exist)? :smiley:

What I would look into instead is how you are embedding the documents. How big are the chunks you created vectors from? And what do you actually want to return? Do you want the Q&A to answer the question exactly, or do you want to give the user the relevant document/paragraph? Splitting into shorter texts proved beneficial for me, as the shorter texts produced vectors that matched specific questions better.
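
For the chunking side, a rough sketch of what I mean (I’m assuming the openai and pinecone Python clients; the file name, index name, and size threshold are only placeholders):

```python
# Rough sketch: split documents into paragraph-sized chunks, embed the chunks,
# and upsert them with their text as metadata so it can be returned later.
import os
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("tax-docs")

def chunk_by_paragraph(text: str, min_chars: int = 200) -> list[str]:
    # Split on blank lines and drop fragments too short to stand on their own.
    paragraphs = [p.strip() for p in text.split("\n\n")]
    return [p for p in paragraphs if len(p) >= min_chars]

documents = {"doc-1": open("tax_guide.txt").read()}  # placeholder corpus

for doc_id, text in documents.items():
    chunks = chunk_by_paragraph(text)
    # The embeddings endpoint accepts a list of inputs and returns them in order.
    resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    index.upsert(vectors=[
        {"id": f"{doc_id}-{i}",
         "values": item.embedding,
         "metadata": {"doc_id": doc_id, "text": chunk}}
        for i, (item, chunk) in enumerate(zip(resp.data, chunks))
    ])
```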

What I’ve done is split the legal documents into articles or paragraphs and then embed those shorter texts. From there I went two ways: one was returning the text that was embedded directly; the other was taking the best-matching vector’s text and feeding it into ChatGPT as context, with a prompt along the lines of “You are a legal Q&A bot. Answer the question below from the given text. {context} {question}”.
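
In rough code, that second route looked something like this (the model and index names are placeholders, and I’m assuming the chunk text was stored as a “text” metadata field at upsert time):

```python
# Sketch of the retrieve-then-ask route: embed the question, pull the
# best-matching chunk, and hand its text to the chat model as context.
import os
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("tax-docs")

def answer(question: str) -> str:
    # Embed the question with the same model used for the documents.
    q_vec = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    # Take the single best match and use its stored text as context.
    best = index.query(vector=q_vec, top_k=1, include_metadata=True).matches[0]
    context = best.metadata["text"]

    prompt = (
        "You are a legal Q&A bot. Answer the question below from the given text.\n\n"
        f"Text:\n{context}\n\nQuestion: {question}"
    )
    chat = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return chat.choices[0].message.content

print(answer("How do I apply for a tax deduction on my charity donations?"))
```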

Hope this helps.
