Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is the foundational principle behind training most machine learning models — including logistic regression and neural networks. At its core, it answers one question: given the data I observed, what model parameters best explain it?

I. Probability vs. Likelihood

In everyday English, these words are used interchangeably. In statistics, they point in opposite directions.

|           | Probability                | Likelihood                  |
|-----------|----------------------------|-----------------------------|
| Direction | Forward-looking            | Backward-looking            |
| Known     | Parameters (the rules)     | Data (the outcome)          |
| Unknown   | The outcome                | The parameters (the rules)  |
| Notation  | P(Data \| θ)               | L(θ \| Data)                |

Example:

Probability asks: "Given a fair coin (θ = 0.5), what is the chance of seeing 8 heads in 10 flips?" The rules are fixed; the outcome is uncertain.

Likelihood asks: "Having just seen 8 heads in 10 flips, how well does θ = 0.5 explain that outcome?" The outcome is fixed; the rules are in question.

Same event, opposite framing.

II. What Is Maximum Likelihood Estimation?

If Likelihood measures how well a set of parameters explains the observed data, Maximum Likelihood Estimation (MLE) finds the best possible parameters — the ones that make your observed dataset the most probable outcome.

Intuition: Imagine building a model to predict whether the S&P 500 will go up tomorrow. You have a year of daily data.

  1. Your model starts with random parameters.
  2. It asks: "If these parameters were the true rules of the market, how probable is it that we'd see exactly the history that actually happened?"
  3. MLE is the mathematical engine that continuously adjusts those parameters until that probability reaches its peak.

When the likelihood is maximized, your model's parameters are the best possible explanation of the historical data you gave it.
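A minimal sketch of that loop, with the simplest possible model: the only parameter is p, the probability of an up day, and we sweep candidates to find the peak of the likelihood (the history below is hypothetical, not real S&P 500 data):

```python
import math

# Toy history of daily outcomes: 1 = market up, 0 = market down
# (hypothetical data, not real returns).
history = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

def likelihood(p, data):
    """Probability of seeing exactly this sequence if P(up) = p."""
    return math.prod(p if y == 1 else 1 - p for y in data)

# Step 3 by brute force: sweep candidate parameters, keep the peak.
candidates = [i / 100 for i in range(1, 100)]
best_p = max(candidates, key=lambda p: likelihood(p, history))
print(best_p)  # 0.7, the observed frequency of up days
```

Real models replace the brute-force sweep with gradient-based optimization, but the objective is the same.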

III. The Likelihood Function

To measure how well a set of parameters θ explains an entire dataset, we multiply the individual probabilities of each observation (assuming the observations are independent):

L(θ) = ∏_{i=1}^{N} P(y_i | x_i; θ)

Goal: Find the θ that maximizes L(θ).
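Plugging in a few illustrative per-observation probabilities shows what the product computes (the numbers below are made up, not from a fitted model):

```python
import math

# Probabilities a hypothetical model assigns to the labels it actually saw.
probs = [0.91, 0.87, 0.94]

# L(θ) = ∏ P(y_i | x_i; θ)
L = math.prod(probs)
print(L)  # ≈ 0.744
```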

IV. The Problem — Computational Underflow

In practice, N is not 10 or 100. It's often millions of data points.

Each individual probability is a decimal between 0 and 1 (e.g., 0.91, 0.87, 0.94...). Multiplying millions of such decimals together produces a number so astronomically small (on the order of 10^-100000) that a computer's floating-point representation rounds it to exactly zero. This is called computational underflow, and it makes direct optimization of L(θ) numerically impossible.

V. The Fix — Log-Likelihood

The solution is to apply the natural logarithm to the likelihood function.

A fundamental property of logarithms converts products into sums:

log(A × B) = log(A) + log(B)

Applying this to the likelihood:

ℓ(θ) = log L(θ) = ∑_{i=1}^{N} log P(y_i | x_i; θ)

Why this works:

  1. Numerical stability — We are now adding numbers instead of multiplying decimals. Even though log of a probability (a value between 0 and 1) is negative, the sum stays in a range computers handle easily.
  2. Same optimal point — The logarithm is a strictly increasing function, so whatever θ maximizes L(θ) also maximizes ℓ(θ). We changed the scale, not the location of the peak.

argmax_θ L(θ) = argmax_θ ℓ(θ)
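Both properties can be checked directly on a toy dataset (hypothetical outcomes): the log-likelihood stays in a comfortable numeric range, and a grid search finds the same peak either way:

```python
import math

history = [1, 0, 1, 1, 0, 1, 1, 1]  # toy binary outcomes (hypothetical)

def likelihood(p):
    """L(θ): product of per-observation probabilities."""
    return math.prod(p if y == 1 else 1 - p for y in history)

def log_likelihood(p):
    """ℓ(θ): sum of log-probabilities (negative, but well-behaved)."""
    return sum(math.log(p if y == 1 else 1 - p) for y in history)

grid = [i / 1000 for i in range(1, 1000)]
best_L = max(grid, key=likelihood)
best_ll = max(grid, key=log_likelihood)
print(best_L, best_ll)  # same θ either way: 0.75 0.75
```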

VI. From Log-Likelihood to Loss — The Machine Learning Convention

There is one final gap to bridge. Machine learning optimizers — like Gradient Descent — are built to minimize a function, not maximize one. They are designed to roll downhill into a valley, not climb to a peak.

The fix is simple: multiply by −1. This flips the mountain into a valley. We also divide by N to get the average loss per sample, which keeps training stable regardless of dataset or batch size.

J(θ) = -(1/N) ∑_{i=1}^{N} log P(y_i | x_i; θ)

This is the Negative Log-Likelihood (NLL) loss. Minimizing it is mathematically equivalent to performing MLE.
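As a sketch, the NLL loss for a batch of per-sample probabilities (illustrative numbers, not from a real model) is just the averaged, negated sum of logs. A model that assigns higher probability to what actually happened gets a lower loss:

```python
import math

def nll(probs):
    """J(θ) = -(1/N) Σ log P_i : average negative log-likelihood."""
    return -sum(math.log(p) for p in probs) / len(probs)

print(nll([0.9, 0.8, 0.95]))   # confident, correct model: ≈ 0.127
print(nll([0.6, 0.5, 0.55]))   # uncertain model: ≈ 0.60 (higher loss)
```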

VII. The Binary Cross-Entropy Loss — A Concrete Derivation

For a binary classification problem (e.g., up or down, spam or not), the model outputs a probability ŷ_i = P(y_i = 1 | x_i; θ), typically via a sigmoid function.

For a single data point, the probability of the correct label is:

P(y_i | x_i; θ) = ŷ_i^{y_i} · (1 - ŷ_i)^{1 - y_i}

This is a clean trick: when y_i = 1, it evaluates to ŷ_i; when y_i = 0, it evaluates to 1 - ŷ_i. Taking the log:

log P(y_i | x_i; θ) = y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i)

Plugging into the NLL loss gives the Binary Cross-Entropy (Log Loss) formula:

J(θ) = -(1/N) ∑_{i=1}^{N} [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]

Breaking down the two terms:

  1. y_i log(ŷ_i): active only when y_i = 1. It penalizes the model for assigning a low probability to a positive example.
  2. (1 - y_i) log(1 - ŷ_i): active only when y_i = 0. It penalizes the model for assigning a high probability to a negative example.

Only one term fires per data point. Together they cover both cases.
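A direct translation of the formula into code (with hypothetical labels and predictions); note how multiplying by y_i and (1 - y_i) switches the two terms on and off:

```python
import math

def bce(y_true, y_pred):
    """Binary cross-entropy: only one log term 'fires' per sample."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 1]           # observed labels
y_pred = [0.9, 0.2, 0.8, 0.6]   # model probabilities (hypothetical)
print(round(bce(y_true, y_pred), 4))  # 0.2656
```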

VIII. The Full Picture

MLE → (take the log) → Log-Likelihood → (multiply by -(1/N)) → Binary Cross-Entropy Loss

Every time you train a logistic regression or a neural network with binary cross-entropy loss, you are — under the hood — performing Maximum Likelihood Estimation. The loss function is not an arbitrary choice; it is a direct mathematical consequence of the probabilistic model you assumed.
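To close the loop, here is a minimal sketch of that equivalence in action: gradient descent minimizing BCE for a one-feature logistic regression on toy, linearly separable data (not a real market model). Minimizing this loss is performing MLE:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Toy separable data (hypothetical): label is 1 when x is positive.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = 0.0, 0.0          # step 1: start from arbitrary parameters
lr, n = 0.5, len(xs)
for _ in range(2000):
    preds = [sigmoid(w * x + b) for x in xs]
    # BCE gradients: ∂J/∂w = (1/N) Σ (ŷ_i - y_i)·x_i, ∂J/∂b = (1/N) Σ (ŷ_i - y_i)
    gw = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n
    gb = sum(p - y for p, y in zip(preds, ys)) / n
    w, b = w - lr * gw, b - lr * gb

# The fitted model assigns high probability to the labels it saw.
preds = [sigmoid(w * x + b) for x in xs]
print(all((p > 0.5) == (y == 1) for p, y in zip(preds, ys)))  # True
```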