Dears,
What is the best embedding model for Arabic datasets? The answers I currently get from my "chat with your website" LLM application are not correct.
I am currently using:
1- "text-embedding-ada-002" as the embedding model
2- Pinecone as the vector store, with cosine similarity to retrieve the best context for answering the query
3- "gpt-3.5-turbo-instruct" as the model that answers the query from the retrieved context
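For concreteness, here is a minimal, self-contained sketch of the retrieval and prompt step described above. The real pipeline calls OpenAI for embeddings and Pinecone for the vector search; here those are replaced with toy 2-dimensional vectors and placeholder chunk names (all hypothetical) so the cosine-ranking and prompt-assembly logic can be shown on its own:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors, as Pinecone computes it."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy stand-ins for ada-002 vectors (real ones have 1536 dimensions).
chunk_vectors = {
    "term no. 19 ...": [0.9, 0.1],
    "term no. 29 ...": [0.8, 0.2],
    "term no. 9 ...":  [0.5, 0.5],
}
query_vec = [1.0, 0.0]

# Rough equivalent of index.query(vector=query_vec, top_k=3) in Pinecone.
top = sorted(chunk_vectors,
             key=lambda c: cosine(chunk_vectors[c], query_vec),
             reverse=True)[:3]

def build_prompt(question, chunks):
    """Assemble the context + question prompt for gpt-3.5-turbo-instruct."""
    return ("Answer using only the context below.\n\nContext:\n"
            + "\n\n".join(chunks) + f"\n\nQuestion: {question}\nAnswer:")

prompt = build_prompt("What are the details of term no. 9?", top)
```

Note how, with these toy vectors, the wrong chunk ranks first even though the right one is in the index; whichever chunks land in the top-k are pasted into the prompt verbatim.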
The exact problem is the following (it is in Arabic, but I will explain it in English):
I asked: "In the personal data protection law, what are the details of term no. 9?"
Based on cosine similarity, the top three chunks correspond to term no. 19 (nineteen), 29 (twenty-nine), and 39 (thirty-nine), so the answer from GPT is wrong because the whole context is wrong.
By cosine similarity score, the right chunk, term no. 9, is only the sixth chunk, and I am building the answer from the first three chunks only.
Is there a tokenizer that is Arabic-oriented?
Is there an embedding model other than ada-002 that is also Arabic-oriented?
If yes, what is it, and how can I get its API to use it?
If changing the embedding/tokenization model is not the right solution for this problem, can you please propose any other solutions?
AraVec: a paper and associated GitHub repo with code, described as "…a pre-trained distributed word representation (word embedding) open source project which aims to provide the Arabic NLP research community with free to use and powerful word embedding models."
An article on Medium that explores fine-tuning RAG systems to improve Arabic embeddings.
In addition, it would help if you could share your relevant code, either here in the forum post or by linking to your repo if it's open-source.
Regarding your particular problem with المادة 9 (Article 9): I faced a similar problem. Semantic embeddings don't help with this kind of search, because the numeric identity "9" gets dissolved into the embedding of its surrounding context, so it cannot sharpen the hits, and you may only find the right result at later (k) ranks. If this type of request is frequent, i.e. asking for items by number, you can either resort to a tabular structure (a dataframe with the item numbers in a separate column) for SQL-type queries, or use a hybrid lexical (BM25) + semantic search to catch such queries.