Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is the foundational principle behind training most machine learning models — including logistic regression and neural networks. At its core, it answers one question: given the data I observed, what model parameters best explain it?

I. Probability vs. Likelihood

In everyday English, these words are used interchangeably. In statistics, they point in opposite directions.

|           | Probability                | Likelihood                  |
|-----------|----------------------------|-----------------------------|
| Direction | Forward-looking            | Backward-looking            |
| Known     | Parameters (the rules)     | Data (the outcome)          |
| Unknown   | The outcome                | The parameters (the rules)  |
| Notation  | P(Data \| θ)               | L(θ \| Data)                |

Example:

Probability asks: "Given a fair coin (θ = 0.5), what is the chance of seeing 8 heads in 10 flips?" The rules are fixed; the outcome is uncertain.

Likelihood asks: "Having just seen 8 heads in 10 flips, how well does θ = 0.5 explain that outcome?" The outcome is fixed; the rules are in question.

Same event, opposite framing.

II. What Is Maximum Likelihood Estimation?

If Likelihood measures how well a set of parameters explains the observed data, Maximum Likelihood Estimation (MLE) finds the best possible parameters — the ones that make your observed dataset the most probable outcome.

Intuition: Imagine building a model to predict whether the S&P 500 will go up tomorrow. You have a year of daily data.

  1. Your model starts with random parameters.
  2. It asks: "If these parameters were the true rules of the market, how probable is it that we'd see exactly the history that actually happened?"
  3. MLE is the mathematical engine that continuously adjusts those parameters until that probability reaches its peak.

When the likelihood is maximized, your model's parameters are the best possible explanation of the historical data you gave it.
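A minimal sketch of that loop, with the simplest possible model: the only parameter is p, the probability of an up day, and we sweep candidates to find the peak of the likelihood (the history below is hypothetical, not real S&P 500 data):

```python
import math

# Toy history of daily outcomes: 1 = market up, 0 = market down
# (hypothetical data, not real returns).
history = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

def likelihood(p, data):
    """Probability of seeing exactly this sequence if P(up) = p."""
    return math.prod(p if y == 1 else 1 - p for y in data)

# Step 3 by brute force: sweep candidate parameters, keep the peak.
candidates = [i / 100 for i in range(1, 100)]
best_p = max(candidates, key=lambda p: likelihood(p, history))
print(best_p)  # 0.7, the observed frequency of up days
```

Real models replace the brute-force sweep with gradient-based optimization, but the objective is the same.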

III. The Likelihood Function

To measure how well a set of parameters θ explains an entire dataset, we multiply the individual probabilities of each observation (assuming the observations are independent):

L(θ) = ∏_{i=1}^{N} P(y_i | x_i; θ)

Goal: Find the θ that maximizes L(θ).
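Plugging in a few illustrative per-observation probabilities shows what the product computes (the numbers below are made up, not from a fitted model):

```python
import math

# Probabilities a hypothetical model assigns to the labels it actually saw.
probs = [0.91, 0.87, 0.94]

# L(θ) = ∏ P(y_i | x_i; θ)
L = math.prod(probs)
print(L)  # ≈ 0.744
```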

IV. The Problem — Computational Underflow

In practice, N is not 10 or 100. It's often millions of data points.

Each individual probability is a decimal between 0 and 1 (e.g., 0.91, 0.87, 0.94...). Multiplying millions of such decimals together produces a number so astronomically small (on the order of 10^-100000) that a computer's floating-point representation rounds it to exactly zero. This is called computational underflow, and it makes direct optimization of L(θ) numerically impossible.

V. The Fix — Log-Likelihood

The solution is to apply the natural logarithm to the likelihood function.

A fundamental property of logarithms converts products into sums:

log(A × B) = log(A) + log(B)

Applying this to the likelihood:

ℓ(θ) = log L(θ) = ∑_{i=1}^{N} log P(y_i | x_i; θ)

Why this works:

  1. Numerical stability — We are now adding numbers instead of multiplying decimals. Even though log of a probability (a value between 0 and 1) is negative, the sum stays in a range computers handle easily.
  2. Same optimal point — The logarithm is a strictly increasing function, so whatever θ maximizes L(θ) also maximizes ℓ(θ). We changed the scale, not the location of the peak.

argmax_θ L(θ) = argmax_θ ℓ(θ)
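Both properties can be checked directly on a toy dataset (hypothetical outcomes): the log-likelihood stays in a comfortable numeric range, and a grid search finds the same peak either way:

```python
import math

history = [1, 0, 1, 1, 0, 1, 1, 1]  # toy binary outcomes (hypothetical)

def likelihood(p):
    """L(θ): product of per-observation probabilities."""
    return math.prod(p if y == 1 else 1 - p for y in history)

def log_likelihood(p):
    """ℓ(θ): sum of log-probabilities (negative, but well-behaved)."""
    return sum(math.log(p if y == 1 else 1 - p) for y in history)

grid = [i / 1000 for i in range(1, 1000)]
best_L = max(grid, key=likelihood)
best_ll = max(grid, key=log_likelihood)
print(best_L, best_ll)  # same θ either way: 0.75 0.75
```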

VI. From Log-Likelihood to Loss — The Machine Learning Convention

There is one final gap to bridge. Machine learning optimizers — like Gradient Descent — are built to minimize a function, not maximize one. They are designed to roll downhill into a valley, not climb to a peak.

The fix is simple: multiply by −1. This flips the mountain into a valley. We also divide by N to get the average loss per sample, which keeps training stable regardless of dataset or batch size.

J(θ) = -(1/N) ∑_{i=1}^{N} log P(y_i | x_i; θ)

This is the Negative Log-Likelihood (NLL) loss. Minimizing it is mathematically equivalent to performing MLE.
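As a sketch, the NLL loss for a batch of per-sample probabilities (illustrative numbers, not from a real model) is just the averaged, negated sum of logs. A model that assigns higher probability to what actually happened gets a lower loss:

```python
import math

def nll(probs):
    """J(θ) = -(1/N) Σ log P_i : average negative log-likelihood."""
    return -sum(math.log(p) for p in probs) / len(probs)

print(nll([0.9, 0.8, 0.95]))   # confident, correct model: ≈ 0.127
print(nll([0.6, 0.5, 0.55]))   # uncertain model: ≈ 0.60 (higher loss)
```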

VII. The Binary Cross-Entropy Loss — A Concrete Derivation

For a binary classification problem (e.g., up or down, spam or not), the model outputs a probability ŷ_i = P(y_i = 1 | x_i; θ), typically via a sigmoid function.

For a single data point, the probability of the correct label is:

P(y_i | x_i; θ) = ŷ_i^{y_i} · (1 - ŷ_i)^{1 - y_i}

This is a clean trick: when y_i = 1, it evaluates to ŷ_i; when y_i = 0, it evaluates to 1 - ŷ_i. Taking the log:

log P(y_i | x_i; θ) = y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i)

Plugging into the NLL loss gives the Binary Cross-Entropy (Log Loss) formula:

J(θ) = -(1/N) ∑_{i=1}^{N} [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]

Breaking down the two terms:

  1. y_i log(ŷ_i): active only when y_i = 1. It penalizes the model for assigning a low probability to a positive example.
  2. (1 - y_i) log(1 - ŷ_i): active only when y_i = 0. It penalizes the model for assigning a high probability to a negative example.

Only one term fires per data point. Together they cover both cases.
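A direct translation of the formula into code (with hypothetical labels and predictions); note how multiplying by y_i and (1 - y_i) switches the two terms on and off:

```python
import math

def bce(y_true, y_pred):
    """Binary cross-entropy: only one log term 'fires' per sample."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 1]           # observed labels
y_pred = [0.9, 0.2, 0.8, 0.6]   # model probabilities (hypothetical)
print(round(bce(y_true, y_pred), 4))  # 0.2656
```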

VIII. The Full Picture

MLE → (take the log) → Log-Likelihood → (multiply by -(1/N)) → Binary Cross-Entropy Loss

Every time you train a logistic regression or a neural network with binary cross-entropy loss, you are — under the hood — performing Maximum Likelihood Estimation. The loss function is not an arbitrary choice; it is a direct mathematical consequence of the probabilistic model you assumed.
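To close the loop, here is a minimal sketch of that equivalence in action: gradient descent minimizing BCE for a one-feature logistic regression on toy, linearly separable data (not a real market model). Minimizing this loss is performing MLE:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Toy separable data (hypothetical): label is 1 when x is positive.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = 0.0, 0.0          # step 1: start from arbitrary parameters
lr, n = 0.5, len(xs)
for _ in range(2000):
    preds = [sigmoid(w * x + b) for x in xs]
    # BCE gradients: ∂J/∂w = (1/N) Σ (ŷ_i - y_i)·x_i, ∂J/∂b = (1/N) Σ (ŷ_i - y_i)
    gw = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n
    gb = sum(p - y for p, y in zip(preds, ys)) / n
    w, b = w - lr * gw, b - lr * gb

# The fitted model assigns high probability to the labels it saw.
preds = [sigmoid(w * x + b) for x in xs]
print(all((p > 0.5) == (y == 1) for p, y in zip(preds, ys)))  # True
```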