Indexing and querying mixed-language resources

wim.vandebrug · April 16, 2024, 5:41pm

I am developing a RAG solution. For a certain topic in healthcare I have a large set of digital books, some written in Dutch and some in English. I have ingested these books into a Pinecone vector database. When I query the vector database in Dutch, I only get references to Dutch docs, and when I query in English, I only get references to English docs.
In a previous version I used FAISS instead of Pinecone, and query results came back in both languages.
Is Pinecone indexing language sensitive/dependent?

gdj0nes · April 16, 2024, 5:55pm

Results from Pinecone and FAISS should be very similar. Did anything change, such as the distance metric or any pre-processing steps?
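
To illustrate why the distance metric matters here, this is a minimal sketch in plain Python with made-up 2-D vectors. The names `doc_nl` and `doc_en` are hypothetical labels, not real embeddings, and nothing below calls Pinecone or FAISS. It only shows that the same vectors can rank differently under dot product and cosine similarity when the vectors are not normalized.

```python
import math

def dot(a, b):
    """Inner (dot) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def rank(query, docs, score):
    """Return document names ordered from highest to lowest score."""
    return sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)

# Toy vectors (assumed values, chosen only to make the effect visible).
query = [1.0, 0.0]
docs = {
    "doc_nl": [0.9, 0.1],  # almost the same direction as the query, small norm
    "doc_en": [3.0, 3.0],  # 45 degrees away from the query, large norm
}

by_dot = rank(query, docs, dot)        # favors the large-norm vector
by_cosine = rank(query, docs, cosine)  # favors the best-aligned vector

print("dot product:", by_dot)
print("cosine:     ", by_cosine)
```

If the embeddings are normalized to unit length before indexing, the two rankings coincide, so it is worth checking whether normalization was applied in one setup but not the other.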

wim.vandebrug · April 17, 2024, 7:02am

Apart from running a new chunking and indexing process with Pinecone on an extended dataset, nothing else changed.
However, after reading some articles online, my guess is that FAISS translates chunked documents into English before embedding and ingesting them.