Cross-Entropy Loss

Have you ever wondered what happens under the hood when you train a neural network? You’ll run the gradient descent optimization algorithm to find the optimal parameters (weights and biases) of the network. In this process, there’s a loss function that tells the network how good or bad its current prediction is. The goal of optimization is to find those parameters that minimize the loss function: the lower the loss, the better the model.

This is a companion discussion topic for the original entry at

Dear author of the post. Thanks for your work.

Unfortunately, there are two possible mistakes. Whether they count as mistakes depends on your math culture. If you want to talk with people who know and use math, it is worth reading the points below.

  1. Minor. In math, cross-entropy rarely uses the natural logarithm. CE, KL, and H are connected, and using a logarithm base other than 2 for cross-entropy is not traditional, because entropy (H) with a base-2 logarithm gives the average number of bits required to encode the random variable.

An excellent book about information theory is the one by Thomas M. Cover:
[Elements of Information Theory] (
On page 5, the authors explicitly say that the logarithm has base 2.

Please, whenever someone teaches you that these quantities should be defined with the natural logarithm, tell those people that they are wrong… (First, T. Cover is a significant person in the subject; second, it is very unnatural to use any base other than 2.)

Sometimes people use math terms incorrectly. One example is "convolution kernel" when the operation is in fact cross-correlation, and there are other incorrect terms that only confuse people. But that should be pointed out by the people who use them.

In any case, logarithms of different bases differ only by a positive multiplicative factor, so at the end of the day the base does not matter for the formulation of the optimization problem if CE is the only objective function.
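To make the base-conversion point concrete, here is a minimal sketch in plain Python. The helper name `cross_entropy` and the example distributions `p` and `q` are made up for illustration; they do not come from the original tutorial.

```python
import math

def cross_entropy(p, q, base=math.e):
    """Cross-entropy H(p, q) = -sum_i p_i * log_base(q_i).

    Terms with p_i == 0 contribute nothing and are skipped.
    """
    return -sum(pi * math.log(qi) / math.log(base)
                for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example: a one-hot "true" label and a predicted distribution.
p = [1.0, 0.0, 0.0]
q = [0.7, 0.2, 0.1]

ce_nats = cross_entropy(p, q)          # natural logarithm (nats)
ce_bits = cross_entropy(p, q, base=2)  # base-2 logarithm (bits)

# The two values differ only by the positive constant ln(2).
print(ce_nats, ce_bits, ce_nats / ce_bits, math.log(2))
```

Because the ratio is the same positive constant for every `q`, the minimizer of the loss is the same in either base; only the scale of the loss and of its gradient changes.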

  2. Major/Minor. “The cross-entropy between two discrete probability distributions signifies how different the two distributions are.” It does not.

The similarity (or, in math terms, the divergence) of two distributions can be measured by the KL divergence, which is a notion of distance between two distributions (even though KL is not symmetric and does not obey the triangle inequality). Still, it at least satisfies the property d(p,q)=0 \iff p=q.

CE does not have that property. Counter-example:
Assume you have two random variables (r.v.) with the same probability mass function (p.m.f.) [0.2,0.2,0.2,0.2,0.2].
The CE between them is 2.3219 (if you use the base-2 logarithm), which is not 0 even though the p.m.f.s are identical and valid.
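The counter-example can be checked numerically. The sketch below assumes a uniform p.m.f. over five outcomes, since that is the case that sums to 1 and reproduces the value 2.3219; the function names are illustrative only.

```python
import math

def cross_entropy_bits(p, q):
    """H(p, q) = -sum_i p_i * log2(q_i), in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence_bits(p, q):
    """KL(p || q) = sum_i p_i * log2(p_i / q_i), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Assumed p.m.f.: uniform over five outcomes (sums to 1).
p = [0.2] * 5

ce = cross_entropy_bits(p, p)   # cross-entropy of p with itself
kl = kl_divergence_bits(p, p)   # KL divergence of p from itself

print(round(ce, 4))  # 2.3219, i.e. log2(5), the entropy of p
print(kl)            # 0.0
```

For identical distributions, cross-entropy reduces to the entropy H(p), which is zero only when p puts all its mass on a single outcome. KL divergence is zero whenever the two distributions match.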

Thanks. Konstantin Burlachenko (

P.S. I also started out receiving remarks like these, and they helped me a lot in the past to structure my thinking. But if you want to mix code, text, and math, it is better to have a good platform for expressing yourself. If a platform does not let you insert formulas nicely with LaTeX, maybe it is not for you.

  • Maybe start writing in LaTeX right away and post compiled PDF files instead?
  • Or maybe find a website that supports inserting code snippets, math equations, and pictures easily?

Good luck.

Hello Konstantin,
Thank you for reading through the tutorial and for taking the time to leave your detailed feedback.

To address the raised concerns:

  1. Yes, I’m aware that in theory, we use log to the base 2. However, as you know, the PyTorch and TensorFlow implementations of the binary and categorical cross-entropy losses use the natural logarithm (base e). That’s the only reason we’ve chosen the natural logarithm over the base-2 logarithm. Also, the two quantities are merely scaled versions of each other, since you can switch from one base to the other (log_2 x = ln x / ln 2).

  2. Yes, KL divergence and total variation distance (d_TV = 0.5 * L_1 norm) both capture how close two distributions are. However, we do know that cross-entropy and KL divergence are identical up to an additive constant: H(p,q) = H(p) + KL(p||q), where H(p), the entropy of the true distribution, evaluates to a constant. The post has been updated for contextual clarity. Ultimately, we would like the reader to intuitively understand how cross-entropy works as a loss function by penalizing estimated probabilities that are far from the true probabilities, which I believe has been communicated.
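The identity H(p,q) = H(p) + KL(p||q) mentioned in point 2 can be verified with a short sketch. The distributions below are arbitrary values chosen for illustration, and the helper names are hypothetical.

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i * ln(p_i), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * ln(q_i), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * ln(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Fixed "true" distribution and several candidate predictions (made-up values).
p = [0.5, 0.3, 0.2]
candidates = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [0.1, 0.1, 0.8]]

# For every q, the gap H(p, q) - KL(p || q) is the same constant H(p).
gaps = [cross_entropy(p, q) - kl_divergence(p, q) for q in candidates]
print(entropy(p), gaps)

# With a one-hot target, H(p) = 0, so cross-entropy and KL coincide.
one_hot = [0.0, 1.0, 0.0]
q = [0.2, 0.7, 0.1]
print(cross_entropy(one_hot, q), kl_divergence(one_hot, q))
```

Since the gap does not depend on q, minimizing cross-entropy over q is equivalent to minimizing KL(p||q) when p is held fixed, which is the usual situation with fixed training labels.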

Thanks again for your feedback.

