Mean Absolute Error (MAE) / L1 Loss

Definition

MAE is one of the simplest and most intuitive loss functions. It is the average of the absolute differences between predicted and actual values. Unlike MSE, it does not square the errors, so every error contributes in direct proportion to its size.

Individual Loss (L1):

L(y, \hat{y}) = |y - \hat{y}|

Mean Absolute Error:

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
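The formula translates almost directly into code. The following is a minimal sketch in plain Python; the function name mean_absolute_error and the house-price numbers are illustrative assumptions, not part of the original notes.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |y_i - y_hat_i| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one sample")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical house prices in thousands of dollars (made-up data).
y_true = [200.0, 310.0, 150.0, 420.0]
y_pred = [210.0, 300.0, 155.0, 400.0]

# Absolute errors are 10, 10, 5 and 20, so the average is 45 / 4.
mae = mean_absolute_error(y_true, y_pred)
print(mae)  # 11.25
```

The result is in the same units as the target (here, thousands of dollars), which is the "Real-World Units" advantage listed below.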

Advantages

1. Robust to Outliers
2. Real-World Units
3. Linear Fairness
4. Mathematical Simplicity

Disadvantages

1. Non-Differentiable at Zero
2. "Blind" to Large Failures
3. Slower "Last-Mile" Convergence
4. Less Stable Solutions (Non-Unique Minima)

Statistical Nuance: "Mean vs. Median"

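  • MSE targets the Mean: The constant prediction that minimizes MSE is the mean, so a model trained with MSE learns to predict the (conditional) average value.
  • MAE targets the Median: The constant prediction that minimizes MAE is the median, so a model trained with MAE learns to predict the (conditional) median value.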

Why this matters: If you are predicting salaries, the Mean is pulled up by billionaires (outliers), but the Median stays with the "normal" people. This explains why MAE is robust: the median depends on the ordering of the values, not on how far the extremes lie from the rest, so a handful of outliers barely moves it.
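The mean/median claim can be checked numerically. The sketch below searches over constant predictions for a small, made-up salary list (an assumption for illustration only) and confirms that the MSE-optimal constant is the mean while the MAE-optimal constant is the median.

```python
import statistics


def mae_of_constant(values, c):
    """MAE when every prediction is the constant c."""
    return sum(abs(v - c) for v in values) / len(values)


def mse_of_constant(values, c):
    """MSE when every prediction is the constant c."""
    return sum((v - c) ** 2 for v in values) / len(values)


# Hypothetical salaries in thousands; the last one is the outlier.
salaries = [40, 50, 60, 70, 4780]

mean_salary = statistics.mean(salaries)      # 1000, dragged up by the outlier
median_salary = statistics.median(salaries)  # 60, stays with the typical values

# Brute-force search over integer constants from 0 to 5000.
candidates = range(0, 5001)
best_for_mse = min(candidates, key=lambda c: mse_of_constant(salaries, c))
best_for_mae = min(candidates, key=lambda c: mae_of_constant(salaries, c))

print(best_for_mse, mean_salary)    # 1000 1000
print(best_for_mae, median_salary)  # 60 60
```

With an even number of samples, every value between the two middle points gives the same MAE, which is one source of the non-unique minima listed under Disadvantages.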

When to Use MAE

When to Avoid MAE

Scaling and Practical Considerations

1. Does MAE Need Scaled Data?

The short answer: Technically, No. MAE is less sensitive to scaling than MSE.
The real answer: Practically, Yes for most models. While MAE doesn't square errors, scaling still matters for optimization and fair feature treatment.

2. Key Insight: MAE is More Scale-Robust Than MSE
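One way to read this heading is in terms of units: if the target is rescaled by a factor k, MAE grows by k while MSE grows by k squared. The sketch below illustrates that reading with made-up prices; the helper names mae and mse and the data are assumptions.

```python
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical prices expressed in thousands of dollars.
y_true_k = [200.0, 310.0, 150.0, 420.0]
y_pred_k = [210.0, 300.0, 155.0, 405.0]

# The same prices expressed in dollars (scale factor k = 1000).
y_true_d = [v * 1000 for v in y_true_k]
y_pred_d = [v * 1000 for v in y_pred_k]

mae_ratio = mae(y_true_d, y_pred_d) / mae(y_true_k, y_pred_k)
mse_ratio = mse(y_true_d, y_pred_d) / mse(y_true_k, y_pred_k)

print(mae_ratio)  # grows linearly with the scale factor: 1000
print(mse_ratio)  # grows quadratically: 1000000
```

Because the loss value feeds into gradient magnitudes, a quadratic blow-up makes MSE more sensitive to the choice of units than MAE.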

3. When does scaling help?

★ Regularized Models (Lasso, Ridge with MAE)

Required for fair regularization

★ Neural Networks

Essential for stable training

★ Multi-Feature Models

Prevents scale-based feature dominance

★ Distance-Based Models (KNN, SVM)

Essential for fair distance calculations
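To make the distance argument concrete, the following sketch finds the nearest neighbour of one house before and after standardization. The four houses are made-up data, and standardize and nearest are hypothetical helpers written for this example rather than any library's API.

```python
import math
import statistics

# Hypothetical houses: (square footage, bedrooms).
houses = [(1500.0, 2.0), (1520.0, 5.0), (2400.0, 2.0), (3000.0, 4.0)]


def standardize(rows):
    """Rescale each column to mean 0 and (population) std 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, stds))
        for row in rows
    ]


def nearest(rows, query_index):
    """Index of the row closest (Euclidean) to rows[query_index]."""
    query = rows[query_index]
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[i], query))


raw_neighbor = nearest(houses, 0)
scaled_neighbor = nearest(standardize(houses), 0)

print(raw_neighbor)     # 1: only 20 sq ft away, the 3-bedroom gap is invisible
print(scaled_neighbor)  # 2: same bedroom count now outweighs the sq ft gap
```

On the raw data, square footage decides the neighbour almost by itself; after scaling, both features have a say.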

4. When is scaling not necessary?

5. Effect of Scaling on MAE

Without scaling:

# Feature: Square Footage (0-10000), Bedrooms (0-10)
# An error of 10 in square footage: |10| = 10
# An error of 1 in bedrooms: |1| = 1
# Square footage errors dominate 10:1, even though both might be equally important

With scaling (StandardScaler):

# Both features normalized to mean=0, std=1
# An error of 0.5 std in square footage: |0.5| = 0.5
# An error of 0.5 std in bedrooms: |0.5| = 0.5
# Features contribute equally based on predictive power, not numeric range

Gradient impact:
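For a linear model, the MAE (sub)gradient with respect to weight w_j is the average of sign(prediction - target) * x_j, so its size follows the numeric range of feature j. The sketch below is an illustration under stated assumptions: a linear model with zero weights and a fixed bias, made-up house data, and hypothetical helper names.

```python
import statistics


def sign(r):
    """Subgradient of |r|: -1, 0 or +1 (0 is chosen at the kink r = 0)."""
    return (r > 0) - (r < 0)


def mae_weight_gradient(X, y, weights, bias):
    """(Sub)gradient of MAE w.r.t. each weight of y_hat = w . x + b."""
    n = len(X)
    grads = [0.0] * len(weights)
    for row, target in zip(X, y):
        pred = sum(w * x for w, x in zip(weights, row)) + bias
        s = sign(pred - target)
        for j, x in enumerate(row):
            grads[j] += s * x / n
    return grads


def standardize(X):
    """Rescale each column to mean 0 and (population) std 1."""
    columns = list(zip(*X))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in X]


# Hypothetical data: (square footage, bedrooms) -> price in thousands.
X = [[1500.0, 2.0], [1520.0, 5.0], [2400.0, 2.0], [3000.0, 4.0]]
y = [200.0, 260.0, 300.0, 420.0]
weights, bias = [0.0, 0.0], 280.0  # assumed starting point

raw_grads = mae_weight_gradient(X, y, weights, bias)
scaled_grads = mae_weight_gradient(standardize(X), y, weights, bias)

print(raw_grads)     # [-595.0, 0.25]: sq ft gradient is ~2400x larger
print(scaled_grads)  # both components are now of comparable size
```

With unscaled features, a single learning rate is either too large for the square-footage weight or too small for the bedrooms weight; standardizing brings the two gradient components within the same order of magnitude.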

6. Key Difference from MSE

7. Best Practice for MAE