Cross-Entropy Loss

Cross-entropy loss is the standard loss function for classification problems in machine learning. It is not an arbitrary choice — it is a direct consequence of Maximum Likelihood Estimation (MLE). When you minimize cross-entropy, you are mathematically finding the model parameters that make your observed training data the most probable outcome.


I. Background — Where Does It Come From?

Entropy

Entropy (from information theory) measures the uncertainty of a probability distribution. A distribution that is perfectly certain (one outcome has probability 1.0) has zero entropy. A uniform distribution (all outcomes equally likely) has maximum entropy.

Cross-Entropy

Cross-entropy measures the difference between two probability distributions:

H(y, \hat{y}) = -\sum_{c} y_c \log(\hat{y}_c)

When the model's predicted distribution perfectly matches the truth, cross-entropy equals the entropy of the true distribution — its theoretical minimum. Any deviation from the truth increases the loss. This is exactly what we want: a function that is minimized only when predictions match reality.
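The definition above can be checked with a few lines of Python. This is a minimal sketch (the function name `cross_entropy` and the example distributions are illustrative, not from any library); terms where the true probability is zero are skipped, since they contribute nothing to the sum:

```python
import math

def cross_entropy(y, y_hat):
    # H(y, y_hat) = -sum_c y_c * log(y_hat_c); terms with y_c = 0 contribute nothing.
    return -sum(p * math.log(q) for p, q in zip(y, y_hat) if p > 0)

truth = [0.0, 1.0, 0.0]                       # a perfectly certain distribution
print(cross_entropy(truth, truth))            # zero: matches the entropy of truth
print(cross_entropy(truth, [0.1, 0.8, 0.1]))  # ≈ 0.223: deviation raises the loss
```

Note that the cross-entropy of the certain distribution with itself is zero, exactly its entropy, while any deviation pushes the value above that floor.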

Connection to MLE

From the MLE derivation, the negative log-likelihood for a classification model is:

J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i \mid x_i; \theta)

Cross-entropy loss is this formula, expressed in information-theoretic language. The two are equivalent. This is why cross-entropy is the principled, correct loss function for classification — not just a heuristic.


II. Multiclass Classification (3+ Classes)

Used when predicting one of three or more mutually exclusive outcomes (e.g., classifying an ETF as Growth, Value, or Dividend).

Step 1 — Representing the Truth: One-Hot Encoding

The ground truth label y is encoded as a vector of zeros with a single 1 at the correct class index. This is called one-hot encoding.

Example: For 3 classes and a true label of Value (class c = 2):

y=[0, 1, 0]

Step 2 — Generating Probabilities: Softmax

The model outputs raw scores called logits (zc) for each class. The Softmax function converts them into a valid probability distribution that sums to 1.0:

\hat{y}_c = \frac{e^{z_c}}{\sum_{j=1}^{K} e^{z_j}}

Example output: ŷ = [0.10, 0.85, 0.05] — the model is 85% confident it's a Value ETF.
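A minimal NumPy sketch of the Softmax step (the logit values are made up for illustration). Subtracting the maximum logit before exponentiating is a standard numerical-stability trick that leaves the result mathematically unchanged:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([1.0, 3.1, 0.3])  # hypothetical raw scores for 3 classes
probs = softmax(logits)
print(probs.sum())     # sums to 1 (up to float rounding)
print(probs.argmax())  # index 1: class 2 gets the highest probability
```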

Step 3 — Computing the Loss: Categorical Cross-Entropy (CCE)

Single Sample

L_i = -\sum_{c=1}^{K} y_{i,c} \log(\hat{y}_{i,c})

Why this formula is elegant:
Because y is one-hot encoded, it acts as a selector. Every term where y_{i,c} = 0 vanishes. Only the term for the correct class survives:

L_i = -\log(\hat{y}_{i,\text{correct}})

The model is evaluated solely on the probability it assigned to the right answer. If it assigned 0.85, the loss is -log(0.85) ≈ 0.16. If it assigned 0.05, the loss is -log(0.05) ≈ 3.0 — a severe penalty.
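The selector behavior is easy to verify numerically. A sketch using the running example (one-hot truth for class 2, the predicted distribution from the Softmax step above); the full summation and the single surviving term give the same number:

```python
import numpy as np

y = np.array([0.0, 1.0, 0.0])         # one-hot truth: class 2 ("Value")
y_hat = np.array([0.10, 0.85, 0.05])  # model's predicted distribution

loss_full = -np.sum(y * np.log(y_hat))  # the full CCE formula
loss_short = -np.log(y_hat[1])          # only the correct-class term survives
print(loss_full, loss_short)            # both ≈ 0.1625
```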

Full Dataset — Mean CCE (Standard)

L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{K} y_{i,c} \log(\hat{y}_{i,c})

Dividing by N normalizes the loss per sample. This is the default in virtually all ML frameworks because it keeps the loss scale — and therefore the gradient scale — independent of batch size. Without it, a batch of 256 samples would produce 8× larger gradients than a batch of 32, forcing the learning rate to be retuned for every batch size and making parameter updates erratic.
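The batch-size independence can be demonstrated directly. A sketch with a hypothetical batch where every sample shares the same truth and prediction (helper names `mean_cce` and `make_batch` are illustrative): the mean-reduced loss is identical whether the batch holds 32 or 256 samples.

```python
import numpy as np

def mean_cce(y_onehot, y_hat):
    # Per-sample CCE, averaged over the batch (the 1/N term).
    return -np.mean(np.sum(y_onehot * np.log(y_hat), axis=1))

def make_batch(n):
    # Hypothetical batch: every sample has the same truth and prediction.
    y = np.tile([0.0, 1.0, 0.0], (n, 1))
    y_hat = np.tile([0.10, 0.85, 0.05], (n, 1))
    return y, y_hat

small = mean_cce(*make_batch(32))
large = mean_cce(*make_batch(256))
print(small, large)  # same per-sample loss regardless of batch size
```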

Full Dataset — Sum CCE (Special Cases)

L_{\text{sum}} = -\sum_{i=1}^{N} \sum_{c=1}^{K} y_{i,c} \log(\hat{y}_{i,c})

Used in specific engineering contexts where the total loss over the dataset, rather than the per-sample average, is the quantity of interest.

👉 Summary: Softmax converts logits into a probability distribution. Categorical Cross-Entropy then evaluates the model by isolating and penalizing the probability assigned to the correct class.


III. Binary Classification (2 Classes)

Used when predicting one of two outcomes (e.g., market up or down, spam or not spam).

Step 1 — Representing the Truth

The label y_i ∈ {0, 1} is a single scalar. No one-hot encoding needed.

Step 2 — Generating Probabilities: Sigmoid

Instead of Softmax, a Sigmoid function maps the model's logit to a probability between 0 and 1:

\hat{y}_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}

The model outputs a single value: the probability that the label is 1.
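A minimal sketch of the Sigmoid step (the logit values are illustrative). Note how the output is 0.5 at a logit of zero and saturates toward 0 or 1 as the logit grows in either direction:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))   # 0.5 — maximal uncertainty
print(sigmoid(4.0))   # ≈ 0.982 — confident the label is 1
print(sigmoid(-4.0))  # ≈ 0.018 — confident the label is 0
```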

Step 3 — Computing the Loss: Binary Cross-Entropy (BCE)

Single Sample

L_i = -\left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]

This formula handles both cases with a single expression: when y_i = 1, the second term vanishes and the loss is -log(ŷ_i); when y_i = 0, the first term vanishes and the loss is -log(1 - ŷ_i). Either way, the model is penalized based on the probability it assigned to the observed outcome.
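Both branches can be checked with a short sketch (the function name `bce` and the example prediction 0.9 are illustrative). For a fixed confident prediction, the loss is small when the label agrees and large when it does not:

```python
import math

def bce(y, y_hat):
    # Single-sample binary cross-entropy.
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(bce(1, 0.9))  # -log(0.9) ≈ 0.105: correct and confident, small loss
print(bce(0, 0.9))  # -log(0.1) ≈ 2.303: wrong and confident, large loss
```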

Full Dataset — Mean BCE (Standard)

L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]

👉 Summary: BCE is a special case of CCE for K=2. Sigmoid replaces Softmax, and the one-hot summation collapses into the two-term expression above.


IV. Why Minimize Cross-Entropy? — The Negative Sign

The log of a probability (a number between 0 and 1) is always negative or zero. To turn log-likelihood into a proper loss (something to minimize), we multiply by -1. This flips the optimization landscape from a peak to a valley — and Gradient Descent, which is designed to find the bottom of a valley, lands at the exact same optimal parameters.

\arg\min_{\theta} \underbrace{\left[ -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i \mid x_i; \theta) \right]}_{\text{minimize this}} \;=\; \underbrace{\arg\max_{\theta} \sum_{i=1}^{N} \log P(y_i \mid x_i; \theta)}_{\text{same solution}}

This is why Cross-Entropy Loss = Negative Log-Likelihood. The names differ by convention (information theory vs. statistics), but the math is identical.
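The peak-to-valley flip can be visualized numerically. A sketch sweeping the probability a model assigns to the observed label across (0, 1): the value that maximizes the log-likelihood is exactly the value that minimizes the negative log-likelihood.

```python
import numpy as np

# Probability the model assigns to the observed label, swept over (0, 1).
p = np.linspace(0.01, 0.99, 99)
log_likelihood = np.log(p)   # a "peak": maximized as p -> 1
nll = -np.log(p)             # a "valley": minimized as p -> 1

best_for_max = p[np.argmax(log_likelihood)]
best_for_min = p[np.argmin(nll)]
print(best_for_max, best_for_min)  # identical optimum
```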


V. Quick Reference

|                   | Binary                   | Multiclass                 |
|-------------------|--------------------------|----------------------------|
| Classes           | 2                        | K ≥ 3                      |
| Output activation | Sigmoid                  | Softmax                    |
| Truth format      | Scalar y ∈ {0, 1}        | One-hot vector             |
| Loss function     | Binary Cross-Entropy     | Categorical Cross-Entropy  |
| Reduces to        | -log(ŷ) when y = 1       | -log(ŷ_correct) always     |