What are the best Arabic embedding and tokenization models?

What is the best embedding model for Arabic datasets? The answers I currently get from my “chat with your website” LLM application are not correct.

I am currently using:
1- “text-embedding-ada-002” as the embedding model
2- Pinecone as the vector store, with cosine similarity to select the best context for answering the query
3- “gpt-3.5-turbo-instruct” as the model that answers the query from the retrieved context
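For context, step 2 can be illustrated with a toy cosine-similarity ranking. The vectors below are made up for illustration; in the real pipeline they would come from text-embedding-ada-002 and be stored in Pinecone:

```python
# Toy illustration of the retrieval step: rank stored chunk vectors by
# cosine similarity to the query vector and keep the top 3 (top_k=3).
# The vectors are invented; real ones come from the embedding model.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

chunks = {
    "article 19": [0.9, 0.1, 0.3],
    "article 29": [0.8, 0.2, 0.3],
    "article 9":  [0.7, 0.1, 0.6],
}
query_vec = [0.85, 0.15, 0.35]

# Rank chunks by similarity, most similar first.
ranked = sorted(chunks, key=lambda c: cosine(chunks[c], query_vec), reverse=True)
top_3 = ranked[:3]
print(top_3)
```

With only three chunks everything lands in the top 3, but in a real index only the highest-scoring chunks survive the cut, which is exactly where the problem below arises.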

The exact problem is the following (the text is in Arabic, but I will explain it in English):
I asked: “In the Personal Data Protection Law, what are the details of article number nine?”
Based on cosine similarity, the top three chunks are articles nineteen (19), twenty-nine (29), and thirty-nine (39), so the answer from GPT is wrong because the whole context is wrong.

By cosine similarity score, the right chunk (article 9) only ranks sixth, and I am using the answer from the first three chunks.

Is there a tokenizer that is Arabic-oriented?
Is there an embedding model other than ada-002 that is also Arabic-oriented?
If yes, what is it, and how can I get its API to use it?

If changing the embedding/tokenization model is not the right solution to this problem, can you please propose any other solutions?

Omran Badarneh

Hi @obadarneh99, and welcome to the Pinecone forums! Thanks for your question.

I did a little research and here’s what I found - hopefully some of this is useful to you to get unblocked:

In addition, it would also help if you could share your relevant code, either here in the forum post or by linking to your repo if it’s open-source.

I hope this is helpful!



For small, compact implementations, Microsoft’s E5 is the best for Arabic. Here is my recent research paper on the topic, “Semantic Embeddings for Arabic Retrieval Augmented Generation (ARAG)”: https://thesai.org/Publications/ViewPaper?Volume=14&Issue=11&Code=IJACSA&SerialNo=135
However, recently I have been doing a project with the new OpenAI models, text-embedding-3-small and text-embedding-3-large, and they are producing very impressive results for Arabic.


However, regarding your particular problem with المادة 9 (“article 9”): I faced a similar problem. Semantic embeddings do not help with this kind of search, because the numeric identity “9” gets dissolved into the embedding context, so the embedding cannot sharpen the hits, and you may only find the right result among the later (k) hits. If this type of request is frequent, i.e. asking for items by number, then you should either resort to a tabular structure (a dataframe with a separate item-number column) for SQL-style queries, or use a hybrid lexical (BM25) + semantic search to catch such queries.
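A minimal sketch of the hybrid idea, with a hand-written BM25 for the lexical leg and reciprocal rank fusion to combine it with a semantic ranking. Everything here is a toy (the corpus is three invented lines, and the semantic ranking is hard-coded to mimic the failure described above); in production you would use Pinecone’s sparse-dense/hybrid queries or a library such as rank_bm25:

```python
# Hybrid search sketch: BM25 (lexical) + a precomputed semantic ranking,
# fused with reciprocal rank fusion (RRF). The lexical leg matches the
# literal token "9", which the embedding leg loses.
import math
from collections import Counter

def bm25_rank(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Return doc indices ranked by BM25 score for the query."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))
    def score(d):
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        return s
    return sorted(range(n), key=lambda i: score(docs_tokens[i]), reverse=True)

def rrf(rankings, k=60):
    """Reciprocal rank fusion of several rankings of the same doc indices."""
    scores = Counter()
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] += 1.0 / (k + rank + 1)
    return [doc for doc, _ in scores.most_common()]

docs = [
    "المادة 19 حماية البيانات",  # article 19
    "المادة 29 حماية البيانات",  # article 29
    "المادة 9 حماية البيانات",   # article 9
]
query = "المادة 9".split()
lexical = bm25_rank(query, [d.split() for d in docs])
semantic = [0, 1, 2]  # pretend the embeddings preferred 19 and 29, as above
fused = rrf([lexical, semantic])
print(lexical, fused)
```

BM25 puts article 9 (index 2) first because the token “9” is a rare exact match, and the fusion lifts it back into the top results even though the semantic leg ranked it last.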

Terrific - thank you so much for sharing!
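The tabular/metadata route mentioned above can also be sketched without changing the embedding model at all: extract the article number from each chunk at indexing time and store it as metadata, then filter on it at query time. The regex, function name, and metadata field below are illustrative, not part of any API:

```python
# Tag each chunk with its article number so retrieval can filter on it,
# instead of hoping the embedding preserves the digit. Handles both
# Western (9) and Arabic-Indic (٩) digits after the word "المادة".
import re

ARTICLE_RE = re.compile(r"المادة\s+([0-9\u0660-\u0669]+)")
ARABIC_INDIC = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

def article_number(text: str):
    """Return the article number found in a chunk (as an int), or None."""
    m = ARTICLE_RE.search(text)
    if m is None:
        return None
    return int(m.group(1).translate(ARABIC_INDIC))

print(article_number("المادة 9 تنص على حماية البيانات"))
print(article_number("المادة ٢٩ تنص على حماية البيانات"))
```

At query time you could parse the same pattern out of the user’s question and pass it as a Pinecone metadata filter, e.g. something like `filter={"article": {"$eq": 9}}` in `index.query(...)`, so only chunks tagged with article 9 are scored semantically.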