Root Mean Squared Error (RMSE)
Definition
RMSE is simply the square root of Mean Squared Error (MSE). It brings the error back to the original units of the target variable, making it much easier to interpret.
Formula:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} = \sqrt{\text{MSE}}$$

where $n$ is the number of samples, $y_i$ is the true value, and $\hat{y}_i$ is the predicted value.
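As a quick sanity check, the definition can be computed directly with NumPy: square the errors, take the mean, then the square root (the numbers below are made up purely for illustration):

```python
import numpy as np

# Toy targets and predictions (made-up values for illustration)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# RMSE = sqrt(mean((y - y_hat)^2))
errors = y_true - y_pred
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)

print(f"MSE:  {mse:.4f}")   # 0.8750
print(f"RMSE: {rmse:.4f}")  # 0.9354
```

Note that RMSE is always the square root of MSE, so the two metrics rank models identically; only the units differ.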
Advantages
RMSE inherits all the advantages of MSE, and it also fixes one of MSE's disadvantages: the units problem.
1. Same units as the target:
- Recall the MSE disadvantage ➛ ❌ MSE is not in the original units (if you're predicting dollars, MSE is in dollars²). RMSE fixes this.
- The loss is in the same units as your target variable: if you're predicting amounts in dollars, RMSE is also in dollars. This makes it very interpretable.
2. Smooth and differentiable
- Same as in MSE
3. Penalizes large errors heavily
- Same as in MSE
4. Efficient convergence
- Same as in MSE
5. Mathematical convenience
- Same as in MSE
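A small sketch of points 1 and 3 together: two sets of residuals can have the same mean absolute error, yet very different RMSE, because squaring weights the single large miss much more heavily (the residuals below are made-up values for illustration):

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

y_true = np.zeros(4)
even_errors = [2.0, 2.0, 2.0, 2.0]    # four moderate misses
one_big_error = [0.0, 0.0, 0.0, 8.0]  # one large miss, same total error

print("MAE  (even): ", mae(y_true, even_errors))     # 2.0
print("MAE  (spiky):", mae(y_true, one_big_error))   # 2.0 -- identical
print("RMSE (even): ", rmse(y_true, even_errors))    # 2.0
print("RMSE (spiky):", rmse(y_true, one_big_error))  # 4.0 -- doubled
```

Both metrics stay in the target's units, but only RMSE distinguishes the "one big miss" case, which is exactly the penalize-large-errors behaviour described above.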
Disadvantages
Because RMSE is derived directly from MSE, it inherits the same mathematical disadvantages.
1. High Sensitivity to Outliers
- Issue: Large errors are magnified before they are averaged. A single outlier with an error of 10 adds 100 to the sum, while an error of 1 adds only 1.
- Impact: The "Root" at the end doesn't undo the fact that the outlier dominated the calculation. RMSE will always be pulled toward the outliers, potentially leading to a model that overfits to noise rather than the general trend.
2. The Scale & Comparison Problem
- Same as in MSE
3. The Normal Distribution Assumption
- Same as in MSE
4. Non-Linearity of Error
- One of the nuances of RMSE is that it is not a linear mapping of error.
- The Implication: Doubling the RMSE value does not necessarily mean the average error magnitude has doubled. Because the errors are squared before being averaged, larger errors have a disproportionately large effect on the final value.
- Example: An error of 10 contributes 100 to the sum of squares, while an error of 2 contributes only 4. The final square root operation does not change this underlying weighting.
- Alternative: For a direct, linear interpretation of average error magnitude, Mean Absolute Error (MAE) is a better choice.
5. Gradient Issues Near Zero
- Issue: The derivative of RMSE with respect to a single error $e_i = y_i - \hat{y}_i$ is $\frac{e_i}{n \cdot \text{RMSE}}$. As the overall error (and thus MSE) approaches zero, the $\frac{1}{\text{RMSE}}$ factor can make this gradient very large or numerically unstable.
- Impact: While less of a problem in practice with modern deep learning frameworks that use techniques like gradient clipping, it's a mathematical property to be aware of, especially when compared to the constant gradient of MAE.
6. Sum of Errors vs. Square Root
- RMSE does not scale linearly with the size of errors. If one model has twice the RMSE of another, that does not mean its typical error is twice as large; a few big misses can account for most of the difference.
- This also makes RMSE less stable than MAE when comparing models across test sets of different sizes: a handful of extreme errors distorts RMSE more in a small sample. MAE is much more consistent when you change the number of rows in your test set.
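The outlier sensitivity from point 1 can be seen directly: add one wild residual to an otherwise well-behaved set and compare how RMSE and MAE react (the residuals are made-up values for illustration):

```python
import numpy as np

def rmse(errors):
    # Square, average, then root: the squaring step lets the outlier dominate
    return np.sqrt(np.mean(np.square(errors)))

def mae(errors):
    # Linear averaging: the outlier counts only in proportion to its size
    return np.mean(np.abs(errors))

clean = np.array([1.0, -1.0, 0.5, -0.5, 1.0])  # small residuals
with_outlier = np.append(clean, 20.0)          # one wild miss added

print(f"clean   -> RMSE {rmse(clean):.2f}, MAE {mae(clean):.2f}")
print(f"outlier -> RMSE {rmse(with_outlier):.2f}, MAE {mae(with_outlier):.2f}")
```

One extra point inflates RMSE by roughly a factor of ten here (0.84 → 8.20), while MAE grows by about five (0.80 → 4.00): the final square root does not undo the outlier's dominance of the squared sum.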
When to Use RMSE
- You need a loss metric that's easy to interpret in the same units as your target
- You want to penalize large errors more than small ones
- Your data is relatively clean with few outliers
- You're reporting model performance to non-technical stakeholders
When to Avoid RMSE
- Your data has significant outliers
- You're working with highly skewed distributions
- You need a loss function that's robust to extreme values
Scaling and Practical Considerations
- Since RMSE is just the square root of Mean Squared Error (MSE), it has the same scaling considerations as MSE: the metric is scale-dependent, so RMSE values are only comparable between models evaluated on the same target in the same units.
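The scale dependence is easy to verify: expressing the same target in different units (say, dollars instead of thousands of dollars) rescales RMSE by exactly the same factor (the numbers below are made up for illustration):

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical prices in thousands of dollars
y_true = np.array([1.2, 3.4, 2.1])
y_pred = np.array([1.0, 3.0, 2.5])

r_thousands = rmse(y_true, y_pred)                  # in thousands of dollars
r_dollars = rmse(y_true * 1000, y_pred * 1000)      # same data, in dollars

print(f"RMSE (thousands): {r_thousands:.4f}")
print(f"RMSE (dollars):   {r_dollars:.2f}")
# r_dollars is exactly 1000 * r_thousands: identical model, different number
```

So a "small" or "large" RMSE is meaningless without knowing the target's units and range, just as with MSE.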
Python Code Example
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# Load the mpg dataset
mpg = sns.load_dataset('mpg')
mpg = mpg.dropna() # Remove missing values
print("Dataset shape:", mpg.shape)
# Predict mpg based on horsepower and weight
X = mpg[['horsepower', 'weight']].values
y = mpg['mpg'].values
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"\nRoot Mean Squared Error (RMSE): {rmse:.4f} mpg")
print(f"This means, on average, our predictions are off by about {rmse:.2f} miles per gallon")
# Compare with MSE
mse = mean_squared_error(y_test, y_pred)
print(f"\nFor comparison:")
print(f"MSE: {mse:.4f} (squared mpg - hard to interpret)")
print(f"RMSE: {rmse:.4f} (mpg - easy to interpret)")
Output
Dataset shape: (392, 9)
Root Mean Squared Error (RMSE): 4.2180 mpg
This means, on average, our predictions are off by about 4.22 miles per gallon
For comparison:
MSE: 17.7918 (squared mpg - hard to interpret)
RMSE: 4.2180 (mpg - easy to interpret)
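Note: depending on your scikit-learn version, RMSE is also available without the manual square root. `root_mean_squared_error` was added in scikit-learn 1.4, and older releases accept `squared=False` in `mean_squared_error` (since deprecated). A version-tolerant sketch, using small made-up arrays:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_test = np.array([20.0, 30.0, 25.0])  # toy values for illustration
y_pred = np.array([22.0, 28.0, 25.0])

# Works on any version: take the square root yourself
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

try:
    # scikit-learn >= 1.4 ships a dedicated function
    from sklearn.metrics import root_mean_squared_error
    rmse_direct = root_mean_squared_error(y_test, y_pred)
except ImportError:
    # Older releases: squared=False makes mean_squared_error return RMSE
    rmse_direct = mean_squared_error(y_test, y_pred, squared=False)

assert np.isclose(rmse, rmse_direct)
print(f"RMSE: {rmse:.4f}")
```

Either route gives the same number; the `np.sqrt(mean_squared_error(...))` form used in the example above is simply the most portable.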