Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is the foundational principle behind training most machine learning models — including logistic regression and neural networks. At its core, it answers one question: given the data I observed, what model parameters best explain it?
I. Probability vs. Likelihood
In everyday English, these words are used interchangeably. In statistics, they point in opposite directions.
| | Probability | Likelihood |
|---|---|---|
| Direction | Forward-looking | Backward-looking |
| Known | Parameters (the rules) | Data (the outcome) |
| Unknown | The outcome | The parameters (the rules) |
| Notation | $P(\text{data} \mid \theta)$ | $L(\theta \mid \text{data})$ |
Example:
- Probability: You know a coin is fair. What is the probability it lands Heads 8 times in a row? (Answer: very low, ~0.4%)
- Likelihood: You find a coin on the street and flip it — it lands Heads 8 times in a row. How likely is it that this coin is fair? (Answer: also very low — the coin is probably rigged.)
Same event, opposite framing.
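A quick sketch of the coin example (the candidate biases below are arbitrary choices for illustration):

```python
# Probability: fix the parameter (a fair coin, p = 0.5) and ask how
# probable the outcome "8 heads in a row" is.
p_fair = 0.5 ** 8
print(f"P(8 heads | p=0.5) = {p_fair:.4%}")  # ~0.39%

# Likelihood: fix the outcome (8 heads observed) and ask how well each
# candidate parameter p explains it.
for p in (0.5, 0.7, 0.9):
    print(f"L(p={p} | 8 heads) = {p ** 8:.4f}")
```

A fair coin turns out to be a poor explanation: a bias of 0.9 makes the observed run roughly 110 times more probable than a bias of 0.5 does.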
II. What Is Maximum Likelihood Estimation?
If Likelihood measures how well a set of parameters explains the observed data, Maximum Likelihood Estimation (MLE) finds the best possible parameters — the ones that make your observed dataset the most probable outcome.
Intuition: Imagine building a model to predict whether the S&P 500 will go up tomorrow. You have a year of daily data.
- Your model starts with random parameters.
- It asks: "If these parameters were the true rules of the market, how probable is it that we'd see exactly the history that actually happened?"
- MLE is the mathematical engine that continuously adjusts those parameters until that probability reaches its peak.
When the likelihood is maximized, your model's parameters are the best possible explanation of the historical data you gave it.
III. The Likelihood Function
To measure how well a set of parameters $\theta$ explains the entire dataset, we assume the data points are independent and multiply everything together:

$$L(\theta) = \prod_{i=1}^{N} P(y_i \mid x_i; \theta)$$

Each factor $P(y_i \mid x_i; \theta)$ is the probability of the actual label $y_i$, given input $x_i$ and current parameters $\theta$.

Goal: Find the $\hat{\theta}$ that maximizes this function:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} L(\theta)$$
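Concretely, with a handful of illustrative per-example probabilities (not produced by any real model), the likelihood is just their product:

```python
import math

# P(y_i | x_i; theta) for five data points -- illustrative values only.
probs = [0.91, 0.87, 0.94, 0.62, 0.78]

likelihood = math.prod(probs)
print(likelihood)  # about 0.36 -- already small after only five points
```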
IV. The Problem — Computational Underflow
In practice, $N$ is often in the millions. Each individual probability is a decimal between 0 and 1 (e.g., 0.91, 0.87, 0.94...). Multiplying millions of such decimals together produces a number so astronomically small that it falls below the smallest positive value a 64-bit float can represent (about $10^{-308}$), so the computer rounds it to exactly zero. This is numerical underflow, and it makes the raw likelihood impossible to optimize directly.
V. The Fix — Log-Likelihood
The solution is to apply the natural logarithm to the likelihood function.
A fundamental property of logarithms converts products into sums:

$$\log(a \cdot b) = \log(a) + \log(b)$$

Applying this to the likelihood gives the log-likelihood:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{N} \log P(y_i \mid x_i; \theta)$$
Why this works:
- Numerical stability — We are now adding numbers instead of multiplying decimals. Even though the log of a probability (a value between 0 and 1) is negative, the sum stays in a range computers handle easily.
- Same optimal point — The logarithm is a strictly increasing function. Whatever $\theta$ maximizes $\ell(\theta)$ will also maximize $L(\theta)$. We changed the scale, not the location of the peak.
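In log space, a million per-example probabilities of 0.9 (a scale whose raw product underflows to zero) sum to a perfectly ordinary negative number:

```python
import math

# Log-likelihood of one million data points, each with probability 0.9.
log_likelihood = sum(math.log(0.9) for _ in range(1_000_000))
print(log_likelihood)  # about -105360.5 -- easily representable
```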
VI. From Log-Likelihood to Loss — The Machine Learning Convention
There is one final gap to bridge. Machine learning optimizers — like Gradient Descent — are built to minimize a function, not maximize one. They are designed to roll downhill into a valley, not climb to a peak.
The fix is simple: multiply the log-likelihood by $-1$:

$$\text{NLL}(\theta) = -\sum_{i=1}^{N} \log P(y_i \mid x_i; \theta)$$

Maximizing $\ell(\theta)$ and minimizing $\text{NLL}(\theta)$ pick out exactly the same $\theta$.
This is the Negative Log-Likelihood (NLL) loss. Minimizing it is mathematically equivalent to performing MLE.
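A grid-search sketch of that equivalence, on an assumed toy dataset of 8 heads and 2 tails from a coin with unknown bias $\theta$:

```python
import math

def loglik(p):
    # Log-likelihood of 8 heads and 2 tails under bias p.
    return 8 * math.log(p) + 2 * math.log(1 - p)

grid = [i / 100 for i in range(1, 100)]

theta_max_ll = max(grid, key=loglik)                  # maximize log-likelihood
theta_min_nll = min(grid, key=lambda p: -loglik(p))   # minimize NLL

print(theta_max_ll, theta_min_nll)  # both 0.8 -- the same peak
```

Both searches land on $\theta = 0.8$, which matches the analytic MLE for this data, $8/10$.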
VII. The Binary Cross-Entropy Loss — A Concrete Derivation
For a binary classification problem (e.g., up or down, spam or not), the model outputs a probability $\hat{y}_i = P(y_i = 1 \mid x_i; \theta)$, typically by passing its raw score through a sigmoid.
For a single data point, the probability of the correct label is:

$$P(y_i \mid x_i; \theta) = \hat{y}_i^{\,y_i} \, (1 - \hat{y}_i)^{1 - y_i}$$

This is a clean trick: when $y_i = 1$, the expression reduces to $\hat{y}_i$; when $y_i = 0$, it reduces to $1 - \hat{y}_i$.

Plugging this into the NLL loss (and averaging over the $N$ data points, which does not move the minimum) gives the Binary Cross-Entropy (Log Loss) formula:

$$\text{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
Breaking down the two terms:
- $y_i \log(\hat{y}_i)$ — active when the true label is 1. It penalizes the model for assigning a low probability to a positive outcome.
- $(1 - y_i) \log(1 - \hat{y}_i)$ — active when the true label is 0. It penalizes the model for assigning a high probability to a negative outcome.
Only one term fires per data point. Together they cover both cases.
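The formula is short enough to implement directly; the labels and predictions below are made-up values:

```python
import math

def bce(y_true, y_pred):
    """Binary cross-entropy: the average NLL over the dataset."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)  # only one term is nonzero
        for y, p in zip(y_true, y_pred)
    ) / len(y_true)

print(bce([1, 0, 1, 1], [0.9, 0.1, 0.8, 0.7]))  # about 0.198
```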
VIII. The Full Picture
Every time you train a logistic regression or a neural network with binary cross-entropy loss, you are — under the hood — performing Maximum Likelihood Estimation. The loss function is not an arbitrary choice; it is a direct mathematical consequence of the probabilistic model you assumed.
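As a closing sketch, here is the whole pipeline in one place: a one-variable logistic regression trained by gradient descent on the NLL. The dataset, learning rate, and iteration count are all arbitrary choices for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: negative x -> class 0, positive x -> class 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    preds = [sigmoid(w * x + b) for x in xs]
    # Gradient of the mean NLL for logistic regression:
    # dL/dw = mean((y_hat - y) * x), dL/db = mean(y_hat - y)
    w -= lr * sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    b -= lr * sum(p - y for p, y in zip(preds, ys)) / len(xs)

print(w, b)
print(sigmoid(w * 2.0 + b))  # high probability of class 1 for x = 2
```

Minimizing the NLL drives $w$ positive, so the fitted model assigns high probability to class 1 for positive inputs, exactly the parameters that best explain the observed labels.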