Mean Absolute Percentage Error (MAPE)
Definition
MAPE measures the average percentage error between predicted and actual values. It expresses the error as a percentage of the actual value, making it highly interpretable, especially for business stakeholders.
Formula:

MAPE = (100% / n) × Σ |A_t − F_t| / |A_t|, summed over t = 1 … n

where A_t is the actual value, F_t is the predicted (forecast) value, and n is the number of observations.
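The definition can be sketched as a small NumPy helper (a minimal illustration; it assumes every actual value is nonzero):

```python
import numpy as np

def mape(actual, predicted):
    # Mean Absolute Percentage Error, in percent.
    # Assumes all actual values are nonzero (see the zero-value caveat below).
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

# Each prediction misses its actual by 10%, so MAPE is 10%
print(f"{mape([100, 200, 400], [110, 220, 440]):.1f}%")  # 10.0%
```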
Advantages
1. Easy to Interpret (Percentage-Based Understanding)
- Expressed as a percentage (e.g., "model predictions are off by 15% on average").
- Universal language: Everyone understands "10% error" without needing context about units or scales.
- Immediate actionability: A 5% MAPE might be acceptable for sales forecasting, but terrible for financial modeling.
- No mental math required: Unlike MSE or MAE which require knowledge of the target variable's range.
- Analogy: Saying "You're 10% off" is like saying "You got 90% correct on a test"—instantly understandable regardless of the subject matter.
2. Scale-Independent (Cross-Dataset Comparison)
- Can compare model performance across datasets with different scales.
- Example: You can compare MAPE for predicting house prices ($100k-$1M) with MAPE for predicting car prices ($20k-$50k).
- Benchmarking: rules of thumb carry across problems; forecasting models with MAPE under 10% are commonly described as highly accurate.
- Portfolio comparison: Compare prediction accuracy across completely different products or markets.
- Why it matters: MSE of 1000 is meaningless without context. MAPE of 10% is always 10%.
3. Business-Friendly (Stakeholder Communication)
- Stakeholders understand percentages better than squared errors or absolute differences.
- C-suite ready: You can present to executives without explaining complex metrics.
- Decision-making: "We're 15% off" directly informs budget planning, inventory management, etc.
- Client reporting: Clients immediately grasp what "5% error" means for their business.
- No translation needed: Unlike RMSE which needs context, MAPE speaks for itself.
4. Intuitive Relative Magnitude
- Shows the relative magnitude of errors, not just absolute differences.
- Context-aware: Being $10 off on a $100 item (10%) is very different from being $10 off on a $10,000 item (0.1%).
- Proportional thinking: Aligns with how people naturally think about errors—"How much did I miss by, percentage-wise?"
- Fair comparison: Small and large values are compared on equal footing in relative terms.
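To make the cross-scale comparison concrete, here is a sketch with made-up house and car prices (illustrative numbers, not real benchmarks), using scikit-learn's built-in metric:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Hypothetical house prices (each prediction 5% high)
house_actual = np.array([300_000.0, 450_000.0, 700_000.0])
house_pred = house_actual * 1.05

# Hypothetical car prices (each prediction 10% high)
car_actual = np.array([20_000.0, 35_000.0, 50_000.0])
car_pred = car_actual * 1.10

house_mape = mean_absolute_percentage_error(house_actual, house_pred) * 100
car_mape = mean_absolute_percentage_error(car_actual, car_pred) * 100
print(f"House MAPE: {house_mape:.1f}%")  # 5.0%
print(f"Car MAPE: {car_mape:.1f}%")      # 10.0%
# The two numbers are directly comparable despite a ~10x difference in scale
```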
Disadvantages
1. Undefined for Zero Values (Division by Zero Problem)
- Division by zero when any actual value is zero or very close to zero.
- Fatal flaw: If A_t = 0, then |A_t − F_t| / A_t is undefined (division by zero).
- Near-zero problem: Even if A_t ≠ 0, a small prediction error on a near-zero actual creates a massive percentage error.
- Example: Actual = 0, Predicted = 1 → MAPE is undefined. Actual = 0.01, Predicted = 1 → MAPE = 9,900%!
- Impact: Cannot use MAPE for datasets with zeros (e.g., product demand that can be zero, click-through rates).
- The Fix: Use symmetric MAPE (sMAPE) or add a small constant (epsilon) to denominator, though both have their own issues.
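The blow-up is easy to reproduce. The sketch below uses a hand-rolled naive MAPE and one common sMAPE variant (both hypothetical helpers, not library calls):

```python
import numpy as np

def naive_mape(actual, predicted):
    # Naive MAPE: breaks down when actuals are zero or near zero
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

def smape(actual, predicted):
    # Symmetric MAPE: this variant is bounded at 200% and defined when actual = 0
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(2 * np.abs(predicted - actual) /
                   (np.abs(actual) + np.abs(predicted))) * 100

print(f"{naive_mape([0.01], [1]):.0f}%")  # 9900%: a near-zero actual explodes the metric
print(f"{smape([0.0], [1.0]):.0f}%")      # 200%: defined and bounded even at zero
```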
2. Asymmetric Penalty (Biased Toward Underprediction)
- Penalizes overprediction more heavily than underprediction, because the error is divided by the actual value.
- Mathematical asymmetry (same pair of numbers, different penalties):
- Actual = 100, Predicted = 150 → Error = 50%
- Actual = 150, Predicted = 100 → Error = 33%
- But! The absolute miss is 50 units in both cases; only the direction changed.
- Bounds: with nonnegative forecasts, underprediction error is capped at 100% (even predicting 0 for an actual of 100 gives 100%), while overprediction error is unbounded (Actual = 100, Predicted = 300 → 200%).
- Real example: Overestimating sales and underestimating them by the same amount can have comparable business consequences, but MAPE penalizes the overestimate more.
- Impact: Models optimized for MAPE tend to underpredict, since low forecasts face a bounded worst-case percentage error.
- Why it happens: The denominator (A_t) creates this bias: the numerator |A_t − F_t| is symmetric, but dividing by the actual value caps errors on the low side and not on the high side.
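A quick numeric check of the asymmetry, using a hypothetical per-point percentage error helper (the term averaged inside MAPE):

```python
def pct_error(actual, predicted):
    # Per-point percentage error, as averaged inside MAPE
    return abs(actual - predicted) / actual * 100

# Same two numbers, different penalty depending on which one is the actual
print(f"{pct_error(100, 150):.1f}")  # 50.0
print(f"{pct_error(150, 100):.1f}")  # 33.3

# Underprediction is capped at 100%; overprediction is unbounded
print(f"{pct_error(100, 0):.1f}")    # 100.0
print(f"{pct_error(100, 300):.1f}")  # 200.0
```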
3. Biased Toward Low Values (Small Denominator Problem)
- Small actual values can result in very large percentage errors, dominating the metric.
- Example:
- Actual = $1, Predicted = $2 → 100% error
- Actual = $1000, Predicted = $1001 → 0.1% error
- The first error (off by $1) contributes 1000x more to MAPE than the second (also off by $1)!
- Impact: The model will "obsess" over getting small values right and may ignore large values.
- Real-world problem: In sales forecasting, products with low sales dominate MAPE, even if high-selling products have larger absolute errors.
- Unfair weighting: A $1 error on a $10 item has 100x more impact on MAPE than a $1 error on a $1000 item.
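This weighting effect can be demonstrated with two points that are each off by exactly $1:

```python
import numpy as np

# Two predictions, each off by exactly $1
y_test = np.array([1.0, 1000.0])
y_pred = np.array([2.0, 1001.0])

per_point = np.abs((y_test - y_pred) / y_test) * 100
print(f"{per_point[0]:.1f}% vs {per_point[1]:.1f}%")  # 100.0% vs 0.1%
print(f"MAPE: {per_point.mean():.2f}%")  # 50.05%: the $1 item dominates the average
```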
4. Can Be Misleading (Equal Percentage ≠ Equal Importance)
- A 10% error on $10 is weighted the same as a 10% error on $1,000,000, which may not reflect real-world importance.
- Business reality: Being 10% off on a $1M contract is a $100k error. Being 10% off on $10 is a $1 error. MAPE treats these equally.
- Strategic blind spot: High-value predictions can have large absolute errors while MAPE looks "good."
- Example: Inventory costs—being 10% off on expensive items can bankrupt you, while 10% error on cheap items is negligible.
- Impact: MAPE can give false confidence when you're making catastrophic errors on high-value items.
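A sketch with a hypothetical two-item portfolio shows how MAPE can mask a six-figure miss:

```python
import numpy as np

# Hypothetical portfolio: a $1M contract and a $10 item, each predicted 10% low
y_test = np.array([1_000_000.0, 10.0])
y_pred = np.array([900_000.0, 9.0])

mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
abs_errors = np.abs(y_test - y_pred)
print(f"MAPE: {mape:.1f}%")  # 10.0% on both items, so the metric looks uniform
print(f"Absolute errors: ${abs_errors[0]:,.0f} vs ${abs_errors[1]:,.0f}")  # $100,000 vs $1
```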
When to Use MAPE
- You need a percentage-based metric that's easy to explain to non-technical stakeholders
- Your target values are all positive and away from zero
- You want a scale-independent metric to compare across datasets
- Business context requires percentage errors
When to Avoid MAPE
- Your data contains zeros or values very close to zero
- You want to treat over and underprediction symmetrically
- Errors on large values should be weighted more heavily
Scaling and Practical Considerations
1. Does MAPE Need Scaled Data?
The short answer: No—MAPE is inherently scale-invariant.
The real answer: Feature scaling helps model training, but MAPE should ALWAYS be calculated on original-scale data. Never scale the target when using MAPE.
2. Key Insight: MAPE is Scale-Invariant by Design
Why MAPE is different:
- MAPE divides by the actual value (A_t), creating a self-normalizing metric.
- Whether your target is in dollars, millions of dollars, or thousands of units, MAPE gives the same percentage result.
- Example:
- Prices in dollars: Actual = $100, Predicted = $110 → MAPE = 10%
- Prices in cents: Actual = 10,000¢, Predicted = 11,000¢ → MAPE = 10%
- Same error, same MAPE, regardless of units!
Why this matters:
- ✅ Advantage: You can compare models across completely different scales
- ❌ Disadvantage: Small actual values can explode the metric (dividing by small numbers creates huge percentages)
Analogy: MAPE is like grading on a curve—it automatically adjusts for the "difficulty" (scale) of each prediction. But just like curves can be unfair to edge cases, MAPE struggles with small values.
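A quick check that changing units leaves MAPE untouched (illustrative prices in dollars vs cents):

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Same prices expressed in dollars and in cents
dollars_actual = np.array([100.0, 250.0])
dollars_pred = np.array([110.0, 275.0])

cents_actual = dollars_actual * 100
cents_pred = dollars_pred * 100

m_dollars = mean_absolute_percentage_error(dollars_actual, dollars_pred)
m_cents = mean_absolute_percentage_error(cents_actual, cents_pred)
print(m_dollars == m_cents)  # True: the units cancel in the ratio
```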
3. When does scaling help?
★ Feature Scaling - Recommended for Model Training
Always scale features, but keep target unscaled
- Gradient-based models (Neural Networks, SGD): Feature scaling improves convergence and training stability.
- Regularized models (Lasso, Ridge): Essential for fair regularization across features.
- Distance-based models (KNN, SVM): Required for fair distance calculations.
- Key point: Feature scaling helps the MODEL, not the METRIC. Always compute MAPE on original scale.
★ Target Scaling - DANGEROUS with MAPE
This is where things go wrong
The critical warning: MAPE's formula divides by the actual value, so any transformation applied to the target changes the denominator and therefore changes, or outright breaks, the metric.
What happens with target scaling:
Problem 1: MinMax Scaling (0-1)

```python
# Original: Actual = $5, Predicted = $6
# MAPE = |5-6|/5 * 100 = 20%

# After MinMax scaling (range $1-$100):
# Scaled actual = 0.04, Scaled predicted = 0.05
# MAPE = |0.04-0.05|/0.04 * 100 = 25% ← DIFFERENT!
# The denominator changed, so MAPE changed!
```

Problem 2: Standardization (mean=0, std=1)

```python
# Actual = $50 (scaled to 0 after standardization, mean=$50)
# Predicted = $55 (scaled to 0.25)
# MAPE = |0 - 0.25|/0 → Division by zero!

# Even if not exactly zero, negative values create nonsense:
# MAPE = |-2.25 - (-2.20)|/(-2.25) * 100 = -2.22% ← Negative percentage!?
```

Problem 3: Log Transformation

```python
# MAPE on log scale has no interpretable meaning
# log($100) = 4.6, but MAPE = 10% on logs ≠ 10% on the original scale
```
4. Effect of Scaling on MAPE
| Scaling Type | Effect on MAPE | Recommendation |
|---|---|---|
| Feature Scaling Only | No direct effect on MAPE; improves model training | ✅ Recommended |
| Target MinMax (0-1) | Changes MAPE values; amplifies errors on small values | ❌ Don't use |
| Target Standardization | Creates negative denominators; MAPE becomes meaningless | ❌ Never use |
| Target Log Transform | MAPE percentages lose meaning; not on original scale | ❌ Avoid |
| No target scaling | MAPE works naturally as intended | ✅ Best for MAPE |
5. Why MAPE Breaks with Target Scaling
The mathematical reason:
- MAPE is a ratio metric—it compares error magnitude to actual value magnitude
- Scaling transforms the values but not the ratio uniformly
- Example:
- Original: 100 vs 110 → ratio = 1.1 → 10% error
- MinMax scaled: 0.1 vs 0.11 → ratio = 1.1 → 10% error ✅ Seems OK
- But: 5 vs 6 → ratio = 1.2 → 20% error
- MinMax scaled: 0.04 vs 0.05 → ratio = 1.25 → 25% error ❌ Different!
The intuition:
- MAPE already "normalizes" by dividing by A_t
- Additional scaling double-normalizes and breaks the percentage interpretation
- Analogy: MAPE is already wearing glasses (built-in normalization). Putting another pair of glasses on top (scaling) blurs the vision.
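The non-uniform effect on ratios can be checked by hand: min-max scaling x' = (x − min) / (max − min) shifts and rescales values, which preserves differences up to a constant but not ratios (illustrative values):

```python
import numpy as np

y = np.array([1.0, 5.0, 6.0, 100.0, 110.0])
y_scaled = (y - y.min()) / (y.max() - y.min())  # min-max scaling to [0, 1]

print(6 / 5)                      # 1.2: ratio on the original scale
print(y_scaled[2] / y_scaled[1])  # ~1.25: the small-value ratio changed
print(110 / 100)                  # 1.1: ratio on the original scale
print(y_scaled[4] / y_scaled[3])  # ~1.101: the large-value ratio barely moves
```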
6. Best Practice for MAPE
- ✅ Standardize FEATURES (StandardScaler) for better model training:

```python
from sklearn.preprocessing import StandardScaler

scaler_X = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)
```

- ❌ NEVER scale the TARGET when using MAPE as your metric.

- ✅ If you must train on scaled targets (e.g., for neural networks), ALWAYS inverse transform before calculating MAPE:

```python
# Train on scaled data
scaler_y = StandardScaler()
y_train_scaled = scaler_y.fit_transform(y_train.reshape(-1, 1)).ravel()
model.fit(X_train_scaled, y_train_scaled)

# Predict and inverse transform BEFORE calculating MAPE
y_pred_scaled = model.predict(X_test_scaled)
y_pred = scaler_y.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()

# Now calculate MAPE on original scale
from sklearn.metrics import mean_absolute_percentage_error
mape = mean_absolute_percentage_error(y_test, y_pred)  # y_test is original scale
print(f"MAPE: {mape * 100:.2f}%")
```

- ⚠️ Handle zeros before using MAPE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Option 1: Filter out zeros
mask = y_test != 0
mape = mean_absolute_percentage_error(y_test[mask], y_pred[mask])

# Option 2: Add a small epsilon to the denominator (less recommended)
epsilon = 1e-10
mape = np.mean(np.abs((y_test - y_pred) / (y_test + epsilon))) * 100

# Option 3: Use symmetric MAPE (sMAPE) instead
smape = np.mean(2 * np.abs(y_pred - y_test) / (np.abs(y_test) + np.abs(y_pred))) * 100
```

- 💡 Consider weighted MAPE if large values are more important:

```python
import numpy as np

# Weighted by actual value (gives more weight to large values)
weighted_mape = np.sum(np.abs(y_test - y_pred)) / np.sum(np.abs(y_test)) * 100
```
7. When Scaling Creates Issues: A Complete Example

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_absolute_percentage_error

# Original data
y_test = np.array([5, 10, 50, 100, 500])
y_pred = np.array([6, 11, 55, 110, 550])

# MAPE on original scale (CORRECT)
mape_original = mean_absolute_percentage_error(y_test, y_pred)
print(f"MAPE (original scale): {mape_original * 100:.2f}%")  # 12.00% (20% on the first point, 10% on the rest)

# WRONG: MAPE after standardization
scaler = StandardScaler()
y_test_scaled = scaler.fit_transform(y_test.reshape(-1, 1)).ravel()
y_pred_scaled = scaler.transform(y_pred.reshape(-1, 1)).ravel()
# A naive MAPE here divides by near-zero and negative values; sklearn's
# implementation clips the denominator, so it returns a number, but that
# number is meaningless

# WRONG: MAPE after MinMax scaling
scaler2 = MinMaxScaler()
y_test_minmax = scaler2.fit_transform(y_test.reshape(-1, 1)).ravel()
y_pred_minmax = scaler2.transform(y_pred.reshape(-1, 1)).ravel()
mape_minmax = mean_absolute_percentage_error(y_test_minmax, y_pred_minmax)
print(f"MAPE (MinMax scaled): {mape_minmax * 100:.2f}%")  # astronomically large: the scaled minimum is exactly 0

# CORRECT: Inverse transform before MAPE
y_pred_original = scaler2.inverse_transform(y_pred_minmax.reshape(-1, 1)).ravel()
mape_correct = mean_absolute_percentage_error(y_test, y_pred_original)
print(f"MAPE (inverse transformed): {mape_correct * 100:.2f}%")  # back to 12.00%
```