Training Sentence Transformers

q1) In the course Unsupervised Training for Sentence Transformers | Pinecone you have explained three training methods: Training Sentence Transformers with Softmax Loss, Training Sentence Transformers with MNR Loss, and How TSDAE Works… there is also a way to optimize CosineSimilarityLoss (sentence-transformers/training_stsbenchmark_continue_training.py at master · UKPLab/sentence-transformers · GitHub)…

Which one is the best? Or when should each of them be used? …please educate.

q2) Why is the pooler required on the page Training Sentence Transformers with MNR Loss | Pinecone? Is your sole goal to reduce the length of the sentence embedding using the pooler? Could we keep only the bert = models.Transformer('bert-base-uncased') part? In that case, would the sentence embedding be longer than after using the pooler? I came across this Selecting layer with SBERT model - Python sentence-transformers thread and now I am more confused about pooling. What is its purpose?

from sentence_transformers import models, SentenceTransformer

# transformer backbone: outputs one 768-d embedding per token
bert = models.Transformer('bert-base-uncased')
# mean-pooling layer: averages the token embeddings into a single sentence embedding
pooler = models.Pooling(
    bert.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True
)
model = SentenceTransformer(modules=[bert, pooler])
model

q3) Why didn't you use all_datasets_v3_mpnet-base and instead used bert-base-uncased? Isn't all_datasets_v3_mpnet-base better, as it is specially designed for producing sentence embeddings?

q4) I understand that EmbeddingSimilarityEvaluator takes two sentences, finds their embeddings, returns cosine similarity scores, and then does some additional processing. Would we get similar results if we used model.encode(), passed in two sentences, and got their sentence embeddings?

Appreciate your guidance on these… your inputs have been really useful 🙂


Q1) It depends on your input data. For NLI data, MNR loss is usually best (but Softmax loss is an option). For data with just positive pairs, MNR loss is best again (and Softmax loss is not an option). If you don't have labeled data, TSDAE is ideal. CosineSimilarityLoss is good when you have continuous similarity scores, as in the STS benchmark script you linked.
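For reference, here is a minimal sketch of how each of those losses is set up with the sentence-transformers losses module (the model name is just a placeholder and the training loop is omitted, so treat it as an illustration rather than the course code):

from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer('bert-base-uncased')  # plain BERT + default mean pooling

# positive pairs or NLI-style (anchor, positive, negative) triplets -> MNR loss
mnr_loss = losses.MultipleNegativesRankingLoss(model)

# NLI pairs with entailment/neutral/contradiction labels -> softmax loss
softmax_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3
)

# pairs with continuous similarity scores (e.g. STS benchmark) -> cosine similarity loss
cosine_loss = losses.CosineSimilarityLoss(model)

# unlabeled sentences -> TSDAE denoising autoencoder loss
tsdae_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path='bert-base-uncased', tie_encoder_decoder=True
)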

Q2) Using only a transformer model (as with models.Transformer('bert-base-uncased')) will output token embeddings: in the case of BERT that is up to 512 embeddings (one per token), each of dimensionality 768. These 512 embeddings cannot be used as a single sentence embedding because they are not a single vector, so we pool (e.g., average, in mean pooling) all 512 token embeddings into a single sentence embedding (maintaining the vector dimensionality of 768).
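If it helps to see that pooling step spelled out, below is a rough sketch of mean pooling done by hand with transformers and PyTorch (the input sentence and the padding to 512 tokens are arbitrary choices for illustration); the attention mask keeps padding tokens out of the average:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert = AutoModel.from_pretrained('bert-base-uncased')

tokens = tokenizer(
    'an example sentence', return_tensors='pt',
    padding='max_length', max_length=512, truncation=True
)
with torch.no_grad():
    token_embeddings = bert(**tokens).last_hidden_state  # (1, 512, 768) token embeddings

# mean pooling: average over the token axis, ignoring padding positions
mask = tokens['attention_mask'].unsqueeze(-1)             # (1, 512, 1)
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)
print(sentence_embedding.shape)                           # torch.Size([1, 768])

The models.Pooling module in the earlier snippet does essentially this averaging for you inside the SentenceTransformer pipeline.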

Q3) all_datasets_v3_mpnet-base is the transformer microsoft/mpnet-base fine-tuned for producing sentence embeddings, (most likely) with MNR loss training. I use bert-base-uncased in many examples precisely because it has not been fine-tuned for sentence embeddings yet. It is like starting with a blank slate, so we can learn how to take a normal transformer model and turn it into a sentence transformer.

Q4) Yes, it is the same as if we used model.encode() to get the two embeddings, then calculated the cosine similarity between the two vectors manually.
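As a quick check, something like the sketch below (using model.encode and util.cos_sim from sentence-transformers; the model name is only an example) gives the same per-pair score the evaluator computes:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bert-base-uncased')  # or any fine-tuned sentence transformer
emb = model.encode(['the first sentence', 'the second sentence'])
score = util.cos_sim(emb[0], emb[1])  # cosine similarity of the two sentence embeddings
print(float(score))

The evaluator repeats this over every pair in the evaluation set and then reports the correlation between those scores and the gold labels; that is the additional processing you mentioned.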

Hope that helps!

What is this community? Do you own this company? …just curious, and thanks for answering my questions.


Hi, you joined us early! The community hasn't 'officially' launched yet, but it will be a forum to discuss all things vector search. I added your question here as I think it is a good question that will help others; unfortunately the Udemy Q&A is not found by Google, but this will be 🙂

Pinecone is a managed vector database. I work here as a developer advocate, putting together articles, videos, and other things like the course, essentially sharing everything I can about vector search. I use Pinecone in the chapter on Q&A retrieval if you're interested in what we do.

You have been super helpful and provided very good guidance… appreciate your inputs, and looking forward to learning from you!
