Training Sentence Transformers with MNR Loss

Transformer-produced sentence embeddings have come a long way in a very short time. Starting with the slow but accurate similarity prediction of BERT cross-encoders, the world of sentence embeddings was ignited with the introduction of SBERT in 2019 [1]. Since then, many more sentence transformers have been introduced. These models quickly made the original SBERT obsolete.

How did these newer sentence transformers manage to outperform SBERT so quickly? The answer is multiple negatives ranking (MNR) loss.

This is a companion discussion topic for the original entry at

Thanks for the interesting and useful article!
We applied the approach from the Fast Fine-Tuning section to our data and got some nice results.
Why did you use only one epoch? Do you recommend increasing the number of epochs to get even better results?


That’s great! It depends on your data and use case, but when you start from a pretrained transformer, you can usually fine-tune for just one epoch and get optimal performance. A single epoch is the standard for many of the sentence transformer models, but not always, so it’s best to experiment.
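To give a rough idea, a one-epoch fine-tune with MNR loss in sentence-transformers looks something like the sketch below; the base model name and the example pairs are just placeholders for your own setup.

```python
# Minimal sketch: one-epoch fine-tuning with MNR loss (placeholder model and data).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-mpnet-base-v2")  # any pretrained sentence transformer

# MNR loss only needs (anchor, positive) pairs; for a given anchor, every other
# positive in the batch is automatically treated as a negative.
train_examples = [
    InputExample(texts=["how do I bake bread?", "steps for baking a loaf of bread"]),
    InputExample(texts=["what is MNR loss?", "multiple negatives ranking loss explained"]),
    # ... your own pairs
]

loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)

# Starting from a pretrained model, a single epoch is often enough;
# add epochs only if evaluation shows further gains.
model.fit(
    train_objectives=[(loader, loss)],
    epochs=1,
    warmup_steps=100,
)
```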

Thanks! We may try other configurations.


Pretty helpful guide. Thank you for sharing. A couple of questions:

  1. When we are fine-tuning with sentence-transformers, we are not explicitly training a feed-forward (FF) network. Is that done in the backend by the library?
  2. What would our approach be when numbers are present? For example,
    “A’s height is 1 ft more than B’s height” compared to
    “A’s height is 2 ft more than B’s height”

Hey Prashant, good questions:

  1. Yes, the transformer plus the added layers (such as pooling) are assembled automatically by the sentence-transformers library and trained in the backend for you (see the sketch below).
  2. Most training data wouldn’t suit numerical comparisons well, so the model would probably not give the desired results. To get good results with this type of comparison, you would need numerical comparisons to be represented in your training dataset; then the model should perform better.
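To make the first point a little more concrete, this is roughly the module stack the library assembles for you behind the scenes; the base model name here is only a placeholder.

```python
# Rough sketch of the modules sentence-transformers wires together:
# a transformer that produces token embeddings, plus a pooling layer
# that turns them into a single sentence embedding.
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",  # mean pooling over token embeddings
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```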

I hope that helps, let us know if you have any more questions!

Thank you for the insight, James.
At the moment, can you point out any limitations of MNR for sentence embeddings?
And what would you say is the future of MNR?

Hi Pezaro, yes, MNR has a few limitations, primarily:

  • It needs hard negatives for best performance, and these are harder to gather than positive pairs alone (see the sketch after this list for how they are passed in).
  • Other methods with more granular labels, such as cosine similarity loss, can outperform the simple positives-vs-negatives setup of MNR because a continuous range of similarity scores lets the model capture more nuance.
  • You still need to gather positive pairs, and the data should be structured so that a batch is unlikely to contain more than one positive for any given anchor; all other items in the batch are treated as negatives, so extra positives will cause issues during training. An example would be relying on clusters to identify positive pairs when you have only a few clusters in the dataset.
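On the first point, if you do manage to mine hard negatives, they can be added as a third text in each training example and MNR will use them on top of the in-batch negatives; a rough sketch with placeholder data:

```python
# Rough sketch: MNR loss with (anchor, positive, hard negative) triplets.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-mpnet-base-v2")  # placeholder model

train_examples = [
    InputExample(texts=[
        "how tall is the Eiffel Tower?",           # anchor
        "the Eiffel Tower is about 330 m tall",    # positive
        "the Eiffel Tower is in Paris",            # hard negative: related, but not an answer
    ]),
    # ... more triplets
]

loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)  # accepts pairs or triplets
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```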

In the future, as a guess, I think training will improve further through multi-modality; we already see this with models like CLIP. Beyond that, judging from the general direction of NLP, we may eventually see training methods that use reinforcement learning techniques.