Cross-Entropy Loss

Cross-entropy loss is the standard loss function for classification problems in machine learning. It is not an arbitrary choice — it is a direct consequence of Maximum Likelihood Estimation (MLE). When you minimize cross-entropy, you are mathematically finding the model parameters that make your observed training data the most probable outcome.


I. Background — Where Does It Come From?

Entropy

Entropy (from information theory) measures the uncertainty of a probability distribution. A distribution that is perfectly certain (one outcome has probability 1.0) has zero entropy. A uniform distribution (all outcomes equally likely) has maximum entropy.

Cross-Entropy

Cross-entropy measures the difference between two probability distributions:

H(y, \hat{y}) = -\sum_{c} y_c \log(\hat{y}_c)

When the model's predicted distribution perfectly matches the truth, cross-entropy equals the entropy of the true distribution — its theoretical minimum. Any deviation from the truth increases the loss. This is exactly what we want: a function that is minimized only when predictions match reality.
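The definition above can be checked with a few lines of Python. This is a minimal sketch (the function name `cross_entropy` and the example distributions are illustrative, not from any library); terms where the true probability is zero are skipped, since they contribute nothing to the sum:

```python
import math

def cross_entropy(y, y_hat):
    # H(y, y_hat) = -sum_c y_c * log(y_hat_c); terms with y_c = 0 contribute nothing.
    return -sum(p * math.log(q) for p, q in zip(y, y_hat) if p > 0)

truth = [0.0, 1.0, 0.0]                       # a perfectly certain distribution
print(cross_entropy(truth, truth))            # zero: matches the entropy of truth
print(cross_entropy(truth, [0.1, 0.8, 0.1]))  # ≈ 0.223: deviation raises the loss
```

Note that the cross-entropy of the certain distribution with itself is zero, exactly its entropy, while any deviation pushes the value above that floor.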

Connection to MLE

From the MLE derivation, the negative log-likelihood for a classification model is:

J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i \mid x_i; \theta)

Cross-entropy loss is this formula, expressed in information-theoretic language. The two are equivalent. This is why cross-entropy is the principled, correct loss function for classification — not just a heuristic.


II. Multiclass Classification (3+ Classes)

Used when predicting one of three or more mutually exclusive outcomes (e.g., classifying an ETF as Growth, Value, or Dividend).

Step 1 — Representing the Truth: One-Hot Encoding

The ground truth label y is encoded as a vector of zeros with a single 1 at the correct class index. This is called one-hot encoding.

Example: For 3 classes and a true label of Value (class c = 2):

y=[0, 1, 0]

Step 2 — Generating Probabilities: Softmax

The model outputs raw scores called logits (zc) for each class. The Softmax function converts them into a valid probability distribution that sums to 1.0:

\hat{y}_c = \frac{e^{z_c}}{\sum_{j=1}^{K} e^{z_j}}

Example output: ŷ = [0.10, 0.85, 0.05] — the model is 85% confident it's a Value ETF.
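A minimal NumPy sketch of the Softmax step (the logit values are made up for illustration). Subtracting the maximum logit before exponentiating is a standard numerical-stability trick that leaves the result mathematically unchanged:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([1.0, 3.1, 0.3])  # hypothetical raw scores for 3 classes
probs = softmax(logits)
print(probs.sum())     # sums to 1 (up to float rounding)
print(probs.argmax())  # index 1: class 2 gets the highest probability
```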

Step 3 — Computing the Loss: Categorical Cross-Entropy (CCE)

Single Sample

L_i = -\sum_{c=1}^{K} y_{i,c} \log(\hat{y}_{i,c})

Why this formula is elegant:
Because y is one-hot encoded, it acts as a selector. Every term where y_{i,c} = 0 vanishes. Only the term for the correct class survives:

L_i = -\log(\hat{y}_{i,\text{correct}})

The model is evaluated solely on the probability it assigned to the right answer. If it assigned 0.85, the loss is -log(0.85) ≈ 0.16. If it assigned 0.05, the loss is -log(0.05) ≈ 3.0 — a severe penalty.
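The selector behavior is easy to verify numerically. A sketch using the running example (one-hot truth for class 2, the predicted distribution from the Softmax step above); the full summation and the single surviving term give the same number:

```python
import numpy as np

y = np.array([0.0, 1.0, 0.0])         # one-hot truth: class 2 ("Value")
y_hat = np.array([0.10, 0.85, 0.05])  # model's predicted distribution

loss_full = -np.sum(y * np.log(y_hat))  # the full CCE formula
loss_short = -np.log(y_hat[1])          # only the correct-class term survives
print(loss_full, loss_short)            # both ≈ 0.1625
```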

Full Dataset — Mean CCE (Standard)

L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{K} y_{i,c} \log(\hat{y}_{i,c})

Dividing by N normalizes the loss per sample. This is the default in virtually all ML frameworks because it keeps the loss scale — and therefore the gradient scale — independent of batch size. Without it, a batch of 256 samples would produce 8× larger gradients than a batch of 32, forcing the learning rate to be retuned for every batch size and making parameter updates erratic.
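The batch-size independence can be demonstrated directly. A sketch with a hypothetical batch where every sample shares the same truth and prediction (helper names `mean_cce` and `make_batch` are illustrative): the mean-reduced loss is identical whether the batch holds 32 or 256 samples.

```python
import numpy as np

def mean_cce(y_onehot, y_hat):
    # Per-sample CCE, averaged over the batch (the 1/N term).
    return -np.mean(np.sum(y_onehot * np.log(y_hat), axis=1))

def make_batch(n):
    # Hypothetical batch: every sample has the same truth and prediction.
    y = np.tile([0.0, 1.0, 0.0], (n, 1))
    y_hat = np.tile([0.10, 0.85, 0.05], (n, 1))
    return y, y_hat

small = mean_cce(*make_batch(32))
large = mean_cce(*make_batch(256))
print(small, large)  # same per-sample loss regardless of batch size
```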

Full Dataset — Sum CCE (Special Cases)

L_{\text{sum}} = -\sum_{i=1}^{N} \sum_{c=1}^{K} y_{i,c} \log(\hat{y}_{i,c})

Used in specific engineering contexts where the total loss over the dataset, rather than the per-sample average, is the quantity of interest.

👉 Summary: Softmax converts logits into a probability distribution. Categorical Cross-Entropy then evaluates the model by isolating and penalizing the probability assigned to the correct class.


III. Binary Classification (2 Classes)

Used when predicting one of two outcomes (e.g., market up or down, spam or not spam).

Step 1 — Representing the Truth

The label y_i ∈ {0, 1} is a single scalar. No one-hot encoding needed.

Step 2 — Generating Probabilities: Sigmoid

Instead of Softmax, a Sigmoid function maps the model's logit to a probability between 0 and 1:

\hat{y}_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}

The model outputs a single value: the probability that the label is 1.
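A minimal sketch of the Sigmoid step (the logit values are illustrative). Note how the output is 0.5 at a logit of zero and saturates toward 0 or 1 as the logit grows in either direction:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))   # 0.5 — maximal uncertainty
print(sigmoid(4.0))   # ≈ 0.982 — confident the label is 1
print(sigmoid(-4.0))  # ≈ 0.018 — confident the label is 0
```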

Step 3 — Computing the Loss: Binary Cross-Entropy (BCE)

Single Sample

L_i = -\left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]

This formula handles both cases with a single expression: when y_i = 1, the second term vanishes and the loss is -log(ŷ_i); when y_i = 0, the first term vanishes and the loss is -log(1 - ŷ_i). Either way, the model is penalized based on the probability it assigned to the observed outcome.
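Both branches can be checked with a short sketch (the function name `bce` and the example prediction 0.9 are illustrative). For a fixed confident prediction, the loss is small when the label agrees and large when it does not:

```python
import math

def bce(y, y_hat):
    # Single-sample binary cross-entropy.
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(bce(1, 0.9))  # -log(0.9) ≈ 0.105: correct and confident, small loss
print(bce(0, 0.9))  # -log(0.1) ≈ 2.303: wrong and confident, large loss
```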

Full Dataset — Mean BCE (Standard)

L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]

👉 Summary: BCE is a special case of CCE for K=2. Sigmoid replaces Softmax, and the one-hot summation collapses into the two-term expression above.


IV. Why Minimize Cross-Entropy? — The Negative Sign

The log of a probability (a number between 0 and 1) is always negative or zero. To turn log-likelihood into a proper loss (something to minimize), we multiply by -1. This flips the optimization landscape from a peak to a valley — and Gradient Descent, which is designed to find the bottom of a valley, lands at the exact same optimal parameters.

\arg\min_{\theta} \underbrace{\left[ -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i \mid x_i; \theta) \right]}_{\text{minimize this}} \;=\; \underbrace{\arg\max_{\theta} \sum_{i=1}^{N} \log P(y_i \mid x_i; \theta)}_{\text{same solution}}

This is why Cross-Entropy Loss = Negative Log-Likelihood. The names differ by convention (information theory vs. statistics), but the math is identical.
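The peak-to-valley flip can be visualized numerically. A sketch sweeping the probability a model assigns to the observed label across (0, 1): the value that maximizes the log-likelihood is exactly the value that minimizes the negative log-likelihood.

```python
import numpy as np

# Probability the model assigns to the observed label, swept over (0, 1).
p = np.linspace(0.01, 0.99, 99)
log_likelihood = np.log(p)   # a "peak": maximized as p -> 1
nll = -np.log(p)             # a "valley": minimized as p -> 1

best_for_max = p[np.argmax(log_likelihood)]
best_for_min = p[np.argmin(nll)]
print(best_for_max, best_for_min)  # identical optimum
```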


V. Quick Reference

|                   | Binary                   | Multiclass                 |
|-------------------|--------------------------|----------------------------|
| Classes           | 2                        | K ≥ 3                      |
| Output activation | Sigmoid                  | Softmax                    |
| Truth format      | Scalar y ∈ {0, 1}        | One-hot vector             |
| Loss function     | Binary Cross-Entropy     | Categorical Cross-Entropy  |
| Reduces to        | -log(ŷ) when y = 1       | -log(ŷ_correct) always     |