After switching to the text-embedding-3-large model, the document scores returned by my knowledge base decreased significantly

wangguoqing10110001 · March 16, 2024, 4:48am

After switching our application from text-embedding-ada-002 to the latest text-embedding-3-large model, the scores returned by Pinecone decreased significantly, which made the threshold I had previously set unusable.

First, we rebuilt the index data with the new embedding model, so the query vectors and the document vectors are generated by the same model.

Then I computed normalized scores for cosine similarity, Euclidean distance, and dot product between the query text's vector and the vectors returned in the Pinecone search results, and the results surprised me.

The score returned by Pinecone is almost identical to the score from my own dot product calculation, so it appears that Pinecone's score is computed as a dot product. Although I set the index metric to cosine, it seems to have no effect: when I compare the scores returned for the documents I retrieved from the index with the scores I calculated myself, the results clearly match the dot product. What is the reason for this? And is there a way to explicitly specify the scoring metric at query time?

wangguoqing10110001 · March 16, 2024, 5:31am

@gdj0nes Hi, can you help me?

gdj0nes · March 16, 2024, 2:16pm

I believe OpenAI normalizes the embedding vectors to unit length, which means the dot product and the cosine similarity will be the same.
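
To illustrate the point about unit-length vectors, here is a minimal sketch in plain Python. The vectors are random stand-ins, not real embeddings, and the helper names (`dot`, `cosine`, `euclidean`, `normalize`) are made up for this example. It only shows that once both vectors have length 1, the dot product equals the cosine similarity, and the squared Euclidean distance is 2 - 2 * cosine.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


random.seed(0)
# Hypothetical stand-ins for a query embedding and a document embedding,
# scaled to unit length (assumed here to mimic normalized embeddings).
query = normalize([random.gauss(0, 1) for _ in range(8)])
doc = normalize([random.gauss(0, 1) for _ in range(8)])

dot_score = dot(query, doc)
cos_score = cosine(query, doc)
dist = euclidean(query, doc)

# For unit vectors the two scores agree up to floating-point error.
print(math.isclose(dot_score, cos_score, abs_tol=1e-12))
# The squared Euclidean distance is then determined by the cosine as well.
print(math.isclose(dist ** 2, 2 - 2 * cos_score, abs_tol=1e-12))
```

To check this on real data, compute `norm(...)` for a few vectors fetched from the index. If the values are all very close to 1.0, a cosine index and a dot product calculation are expected to give matching scores.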