Dears,
What is the best embedding model for Arabic datasets? The answers I currently get from my "chat with your website" LLM application are not correct.
I am currently using:
1- "text-embedding-ada-002" as the embedding model
2- Pinecone as the vector store, with cosine similarity to retrieve the best context for answering the query
3- "gpt-3.5-turbo-instruct" as the model that answers the query from the retrieved context
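For concreteness, here is a minimal, self-contained sketch of the retrieval and prompt step described above. The real pipeline calls OpenAI for embeddings and Pinecone for the vector search; here those are replaced with toy 2-dimensional vectors and placeholder chunk names (all hypothetical) so the cosine-ranking and prompt-assembly logic can be shown on its own:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors, as Pinecone computes it."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy stand-ins for ada-002 vectors (real ones have 1536 dimensions).
chunk_vectors = {
    "term no. 19 ...": [0.9, 0.1],
    "term no. 29 ...": [0.8, 0.2],
    "term no. 9 ...":  [0.5, 0.5],
}
query_vec = [1.0, 0.0]

# Rough equivalent of index.query(vector=query_vec, top_k=3) in Pinecone.
top = sorted(chunk_vectors,
             key=lambda c: cosine(chunk_vectors[c], query_vec),
             reverse=True)[:3]

def build_prompt(question, chunks):
    """Assemble the context + question prompt for gpt-3.5-turbo-instruct."""
    return ("Answer using only the context below.\n\nContext:\n"
            + "\n\n".join(chunks) + f"\n\nQuestion: {question}\nAnswer:")

prompt = build_prompt("What are the details of term no. 9?", top)
```

Note how, with these toy vectors, the wrong chunk ranks first even though the right one is in the index; whichever chunks land in the top-k are pasted into the prompt verbatim.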
The exact problem is the following (it is in Arabic, but I will explain it in English):
I asked: "In the personal data protection law, what are the details of term no. 9?"
Based on cosine similarity, the top three chunks correspond to term no. 19 (nineteen), 29 (twenty-nine), and 39 (thirty-nine), so the answer from GPT is wrong because the whole context is wrong.
By cosine similarity score, the right chunk, term no. 9, is only the sixth chunk, and I am building the answer from the first three chunks only.
Is there a tokenizer that is Arabic-oriented?
Is there an embedding model other than ada-002 that is also Arabic-oriented?
If yes, what is it, and how can I get its API to use it?
If changing the embedding/tokenization model is not the right solution for this problem, can you please propose any other solutions?
AraVec: a paper and associated GitHub repo with code, described as "…a pre-trained distributed word representation (word embedding) open source project which aims to provide the Arabic NLP research community with free to use and powerful word embedding models."
An article on Medium that explores fine-tuning RAG systems to improve Arabic embeddings.
In addition, it would help if you could share your relevant code, either here in the forum post or by linking to your repo if it's open-source.
Regarding your particular problem with المادة 9 (Article 9): I faced a similar problem. Semantic embeddings don't help with this kind of search, because the numeric identity "9" gets dissolved into the embedding of its surrounding context, so it cannot sharpen the hits, and you may only find the right result at later (k) ranks. If this type of request is frequent, i.e. asking for items by number, you can either resort to a tabular structure (a dataframe with the item numbers in a separate column) for SQL-type queries, or use a hybrid lexical (BM25) + semantic search to catch such queries.