Cross-Entropy Loss

Have you ever wondered what happens under the hood when you train a neural network? You’ll run the gradient descent optimization algorithm to find the optimal parameters (weights and biases) of the network. In this process, there’s a loss function that tells the network how good or bad its current prediction is. The goal of optimization is to find those parameters that minimize the loss function: the lower the loss, the better the model.

This is a companion discussion topic for the original entry at

Dear author of the post. Thanks for your work.

Unfortunately, there are two possible mistakes. Whether they count as mistakes depends on your math culture. If you want to hang out with people who know and use math, it’s worth reading the points below.

  1. Minor. In math, cross-entropy rarely uses the natural logarithm. CE, KL, and H are connected, and using a logarithm with a base other than 2 for cross-entropy is fairly non-traditional, because entropy (H) with a base-2 logarithm gives the average number of bits required to encode the random variable.

An excellent book on information theory is the one by Thomas M. Cover.
[Elements of Information Theory] (
On page 5, the authors explicitly say that the logarithm has base 2.

Please, wherever you learned that cross-entropy should be based on the natural logarithm, tell those people that they are wrong… (First, T. Cover is a significant figure in the subject, and second, it is very unnatural to use any base other than 2.)

Sometimes people use math terms incorrectly. One example is the convolution kernel, which in fact computes a cross-correlation, and there are other incorrect terms that only confuse people. But that should be pointed out by the people who use them.

In any case, logarithms of different bases differ only by a positive multiplicative factor, so at the end of the day the base does not matter for the formulation of the optimization problem if CE is the only objective function.
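To illustrate the point about bases, here is a minimal sketch in Python. The function name `cross_entropy` and the two example distributions `p` and `q` are made up for illustration; they are not from the original post. It only shows that the value in nats is the value in bits multiplied by ln 2.

```python
import math

def cross_entropy(p, q, base=2.0):
    """Cross-entropy H(p, q) = -sum_i p_i * log_base(q_i) for discrete p.m.f.s."""
    # Terms with p_i == 0 contribute nothing, so they are skipped.
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions, chosen only for illustration.
p = [0.5, 0.25, 0.25]   # "true" distribution
q = [0.4, 0.4, 0.2]     # model prediction

ce_bits = cross_entropy(p, q, base=2.0)      # base-2 logarithm (bits)
ce_nats = cross_entropy(p, q, base=math.e)   # natural logarithm (nats)

# The ratio is the constant ln(2), independent of p and q.
print(ce_nats / ce_bits, math.log(2))
```

Since the factor is a positive constant, both versions have the same minimizer when CE is the only objective.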

  2. Major/Minor. “The cross-entropy between two discrete probability distributions signifies how different the two distributions are.” It does not.

The similarity (or, in math terms, divergence) of two distributions can be measured by the KL divergence, which is some notion of distance between two distributions (even though KL is not symmetric and does not obey the triangle inequality). Still, it at least obeys the property d(p,q)=0 \iff p=q.

And CE does not have that property. Counter-example:
Assume you have two random variables (r.v.) with the same probability mass function (p.m.f.) [0.2,0.2,0.2,0.2,0.2].
You can compute that the CE for them is 2.3219 (if you use the base-2 logarithm). It is not 0, even though the p.m.f.s are identical and valid.
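The counter-example can be checked with a short sketch. It assumes a uniform p.m.f. over five outcomes, each with probability 0.2, so that the probabilities sum to 1; the helper names `cross_entropy` and `kl_divergence` are illustrative only.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log2(q_i), measured in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log2(p_i / q_i), measured in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Assumed uniform p.m.f. over five outcomes (sums to 1).
p = [0.2] * 5

ce = cross_entropy(p, p)   # equals the entropy H(p) = log2(5)
kl = kl_divergence(p, p)   # zero, because the distributions are identical

print(round(ce, 4))  # 2.3219
print(kl)            # 0.0
```

For identical distributions, CE reduces to the entropy H(p), which is zero only for a degenerate distribution, while KL is exactly zero.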

Thanks. Konstantin Burlachenko (

P.S. I also started with notes like these, and they helped me a lot in the past to structure my thinking. But if you want to mix code, text, and math, it’s better to have a nice platform to express yourself. If this platform does not allow you to insert formulas nicely with LaTeX, maybe it’s not for you.

  • Maybe start writing in LaTeX right away and post compiled PDF files instead?
  • Or maybe find a website that supports easy insertion of code snippets, math equations, and pictures?

Good luck.