Cross-Entropy Loss
Cross-entropy loss is the standard loss function for classification problems in machine learning. It is not an arbitrary choice — it is a direct consequence of Maximum Likelihood Estimation (MLE). When you minimize cross-entropy, you are mathematically finding the model parameters that make your observed training data the most probable outcome.
I. Background — Where Does It Come From?
Entropy
Entropy (from information theory) measures the uncertainty of a probability distribution. A distribution that is perfectly certain (one outcome has probability 1.0) has zero entropy. A uniform distribution (all outcomes equally likely) has maximum entropy.
Cross-Entropy
Cross-entropy measures the difference between two probability distributions:

$$H(P, Q) = -\sum_{i} P(i) \log Q(i)$$

where:
- $P$ is the true distribution (what actually happened)
- $Q$ is the predicted distribution (what your model thinks)
When the model's predicted distribution perfectly matches the truth, cross-entropy equals the entropy of the true distribution — its theoretical minimum. Any deviation from the truth increases the loss. This is exactly what we want: a function that is minimized only when predictions match reality.
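These properties can be checked numerically with a few lines of plain Python (the function names here are my own, not from any particular library):

```python
import math

def entropy(p):
    """Entropy H(P) = -sum p_i * log(p_i); skip zero-probability terms."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) = -sum p_i * log(q_i)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

certain = [1.0, 0.0, 0.0]       # perfectly certain distribution
uniform = [1/3, 1/3, 1/3]       # maximally uncertain over 3 outcomes

print(entropy(certain))                  # zero entropy: no uncertainty
print(entropy(uniform))                  # log(3) ≈ 1.0986, the maximum for 3 outcomes
# When Q matches P, cross-entropy equals the entropy of P (its minimum):
print(cross_entropy(uniform, uniform))   # ≈ 1.0986
# Any mismatched Q gives a strictly larger value:
print(cross_entropy(uniform, [0.5, 0.25, 0.25]))
```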
Connection to MLE
From the MLE derivation, the negative log-likelihood for a classification model is:

$$\text{NLL}(\theta) = -\sum_{i=1}^{N} \log P(y_i \mid x_i; \theta)$$
Cross-entropy loss is this formula, expressed in information-theoretic language. The two are equivalent. This is why cross-entropy is the principled, correct loss function for classification — not just a heuristic.
II. Multiclass Classification (3+ Classes)
Used when predicting one of three or more mutually exclusive outcomes (e.g., classifying an ETF as Growth, Value, or Dividend).
Step 1 — Representing the Truth: One-Hot Encoding
The ground truth label $y$ is encoded as a one-hot vector: a 1 at the index of the correct class and 0 everywhere else.

Example: For 3 classes and a true label of Value (class index 1):

$$y = [0, 1, 0]$$
Step 2 — Generating Probabilities: Softmax
The model outputs raw scores called logits ($z$), which the Softmax function converts into a probability distribution:

$$\hat{y}_j = \frac{e^{z_j}}{\sum_{k=1}^{C} e^{z_k}}$$

where:
- $z_j$: Raw logit score for class $j$
- $C$: Total number of classes

Example output: $\hat{y} = [0.10, 0.85, 0.05]$ (positive values that sum to 1).
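A minimal Softmax sketch in plain Python (the logit values are illustrative; subtracting the max logit is a standard trick to avoid overflow in `exp`):

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that are positive and sum to 1."""
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)        # roughly [0.66, 0.24, 0.10]
print(sum(probs))   # 1.0 (up to float rounding)
```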
Step 3 — Computing the Loss: Categorical Cross-Entropy (CCE)
Single Sample

$$L = -\sum_{j=1}^{C} y_j \log(\hat{y}_j)$$
Why this formula is elegant:
Because $y$ is one-hot encoded, it acts as a selector. Every term where $y_j = 0$ vanishes. Only the term for the correct class $c$ survives: $L = -\log(\hat{y}_c)$. The model is evaluated solely on the probability it assigned to the right answer. If it assigned 0.85, the loss is $-\log(0.85) \approx 0.16$. If it assigned 0.05, the loss is $-\log(0.05) \approx 3.0$ — a severe penalty.
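The selector behavior is easy to verify directly (function name is mine; the 0.85 / 0.05 values match the example above, using the natural log):

```python
import math

def categorical_cross_entropy(y_true, y_pred):
    """Single-sample CCE: -sum y_j * log(yhat_j).
    With a one-hot y_true, only the correct class's term survives."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y > 0)

y_true = [0, 1, 0]           # one-hot: true class is index 1
good = [0.10, 0.85, 0.05]    # confident and correct
bad  = [0.90, 0.05, 0.05]    # confident and wrong

print(categorical_cross_entropy(y_true, good))  # -log(0.85) ≈ 0.163
print(categorical_cross_entropy(y_true, bad))   # -log(0.05) ≈ 2.996
```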
Full Dataset — Mean CCE (Standard)

$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log(\hat{y}_{ij})$$

Dividing by $N$ keeps the loss on a per-sample scale, so its magnitude (and a suitable learning rate) does not depend on batch or dataset size.
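A short sketch of the mean reduction over a small batch (function name and values are illustrative):

```python
import math

def mean_cce(Y_true, Y_pred):
    """Mean categorical cross-entropy over a batch of one-hot labels."""
    total = 0.0
    for y_true, y_pred in zip(Y_true, Y_pred):
        total += -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y > 0)
    return total / len(Y_true)

Y_true = [[0, 1, 0], [1, 0, 0]]
Y_pred = [[0.10, 0.85, 0.05], [0.70, 0.20, 0.10]]
print(mean_cce(Y_true, Y_pred))  # (-log(0.85) - log(0.70)) / 2 ≈ 0.260
```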
Full Dataset — Sum CCE (Special Cases)

$$L = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log(\hat{y}_{ij})$$

Used in specific engineering contexts:

- Multi-GPU / distributed training: Each GPU computes its local sum, all sums are aggregated centrally, and then divided by the global $N$ once — giving the true mean across an uneven data split.
- Pure statistical likelihood: When the objective must remain a proper log-likelihood (e.g., for second-order optimizers), dividing by $N$ would change the mathematical meaning from total dataset probability to per-sample expected probability.
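The distributed-training point can be sketched in plain Python. Two hypothetical workers hold an uneven split (3 samples vs. 1); summing locally and dividing by the global $N$ once gives the true mean, while naively averaging per-worker means does not:

```python
import math

def sum_cce(Y_true, Y_pred):
    """Sum-reduced CCE over one-hot labels (no division by N)."""
    return sum(-math.log(p[t.index(1)]) for t, p in zip(Y_true, Y_pred))

# Uneven split across two hypothetical workers:
worker_a = ([[0, 1, 0]] * 3, [[0.1, 0.8, 0.1]] * 3)   # 3 samples
worker_b = ([[1, 0, 0]], [[0.6, 0.3, 0.1]])           # 1 sample

local_sums = [sum_cce(*worker_a), sum_cce(*worker_b)]
global_n = 3 + 1
true_mean = sum(local_sums) / global_n   # divide by global N exactly once

# Averaging per-worker means would weight the 1-sample worker too heavily:
naive_mean = (local_sums[0] / 3 + local_sums[1] / 1) / 2

print(true_mean)    # ≈ 0.295
print(naive_mean)   # ≈ 0.367 (wrong)
```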
👉 Summary: Softmax converts logits into a probability distribution. Categorical Cross-Entropy then evaluates the model by isolating and penalizing the probability assigned to the correct class.
III. Binary Classification (2 Classes)
Used when predicting one of two outcomes (e.g., market up or down, spam or not spam).
Step 1 — Representing the Truth
The label $y$ is a single scalar: $y = 1$ for the positive class, $y = 0$ for the negative class.
Step 2 — Generating Probabilities: Sigmoid
Instead of Softmax, a Sigmoid function maps the model's single logit $z$ to a probability between 0 and 1:

$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$

The model outputs a single value $\hat{y}$: the probability that the label is 1.
Step 3 — Computing the Loss: Binary Cross-Entropy (BCE)
Single Sample

$$L = -\left[\, y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \,\right]$$

This formula handles both cases with a single expression:

- When $y = 1$: The second term drops out. Loss $= -\log(\hat{y})$. A prediction of 0.99 gives near-zero loss. A prediction of 0.01 gives a massive penalty.
- When $y = 0$: The first term drops out. Loss $= -\log(1 - \hat{y})$. The model is penalized for assigning high probability to the wrong class.
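Both cases can be exercised with a small sketch (function names are mine; the 0.99 / 0.01 values match the discussion above):

```python
import math

def sigmoid(z):
    """Map a single logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, y_hat):
    """Single-sample binary cross-entropy."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(sigmoid(0.0))    # 0.5: a zero logit means maximum uncertainty

print(bce(1, 0.99))    # ≈ 0.010: confident and correct
print(bce(1, 0.01))    # ≈ 4.605: confident and wrong, massive penalty
print(bce(0, 0.01))    # ≈ 0.010: correct for the negative class
```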
Full Dataset — Mean BCE (Standard)

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[\, y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \,\right]$$
👉 Summary: BCE is a special case of CCE for $C = 2$. Sigmoid replaces Softmax, and the one-hot summation collapses into the two-term expression above.
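This equivalence is quick to confirm numerically: write the same sample once as a scalar binary label and once as a 2-class one-hot problem, and the two losses match (function names are mine):

```python
import math

def bce(y, y_hat):
    """Binary cross-entropy for one sample (scalar label)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def cce(y_true, y_pred):
    """Categorical cross-entropy for one sample (one-hot label)."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y > 0)

y, y_hat = 1, 0.85
one_hot = [0, 1]                  # same label as a 2-class one-hot vector
two_probs = [1 - y_hat, y_hat]    # [P(class 0), P(class 1)]

print(bce(y, y_hat))              # ≈ 0.163
print(cce(one_hot, two_probs))    # ≈ 0.163, identical
```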
IV. Why Minimize Cross-Entropy? — The Negative Sign
The log of a probability (a number between 0 and 1) is always negative. To turn log-likelihood into a proper loss (something to minimize), we multiply by $-1$: maximizing the log-likelihood is equivalent to minimizing its negative.
This is why Cross-Entropy Loss = Negative Log-Likelihood. The names differ by convention (information theory vs. statistics), but the math is identical.
V. Quick Reference
| | Binary | Multiclass |
|---|---|---|
| Classes | 2 | 3+ |
| Output activation | Sigmoid | Softmax |
| Truth format | Scalar $y \in \{0, 1\}$ | One-hot vector $y$ |
| Loss function | Binary Cross-Entropy | Categorical Cross-Entropy |
| Reduces to | $-\log(\hat{y})$ if $y = 1$, $-\log(1 - \hat{y})$ if $y = 0$ | $-\log(\hat{y}_c)$ for true class $c$ |