Root Mean Squared Error (RMSE)
Definition
RMSE is simply the square root of Mean Squared Error (MSE). It brings the error back to the original units of the target variable, making it much easier to interpret.
Formula:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} = \sqrt{\text{MSE}}$$

where $n$ is the number of samples, $y_i$ is the true value, and $\hat{y}_i$ is the predicted value.
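As a quick sanity check, the definition can be computed directly with NumPy: square the errors, take the mean, then the square root (the numbers below are made up purely for illustration):

```python
import numpy as np

# Toy targets and predictions (made-up values for illustration)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# RMSE = sqrt(mean((y - y_hat)^2))
errors = y_true - y_pred
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)

print(f"MSE:  {mse:.4f}")   # 0.8750
print(f"RMSE: {rmse:.4f}")  # 0.9354
```

Note that RMSE is always the square root of MSE, so the two metrics rank models identically; only the units differ.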
Advantages
RMSE inherits all the advantages of MSE, and it also fixes one of MSE's disadvantages: the units problem.
1. Same units as the target:
- Recall the MSE disadvantage ➛ ❌ MSE is not in the original units (if you're predicting dollars, MSE is in dollars²). RMSE fixes this.
- The loss is in the same units as your target variable: if you're predicting amounts in dollars, RMSE is also in dollars. This makes it very interpretable.
2. Smooth and differentiable
- Same as in MSE
3. Penalizes large errors heavily
- Same as in MSE
4. Efficient convergence
- Same as in MSE
5. Mathematical convenience
- Same as in MSE
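A small sketch of points 1 and 3 together: two sets of residuals can have the same mean absolute error, yet very different RMSE, because squaring weights the single large miss much more heavily (the residuals below are made-up values for illustration):

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

y_true = np.zeros(4)
even_errors = [2.0, 2.0, 2.0, 2.0]    # four moderate misses
one_big_error = [0.0, 0.0, 0.0, 8.0]  # one large miss, same total error

print("MAE  (even): ", mae(y_true, even_errors))     # 2.0
print("MAE  (spiky):", mae(y_true, one_big_error))   # 2.0 -- identical
print("RMSE (even): ", rmse(y_true, even_errors))    # 2.0
print("RMSE (spiky):", rmse(y_true, one_big_error))  # 4.0 -- doubled
```

Both metrics stay in the target's units, but only RMSE distinguishes the "one big miss" case, which is exactly the penalize-large-errors behaviour described above.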
Disadvantages
Because RMSE is derived directly from MSE, it inherits the same mathematical disadvantages.
1. High Sensitivity to Outliers
- Issue: Large errors are magnified before they are averaged. A single outlier with an error of 10 adds 100 to the sum, while an error of 1 adds only 1.
- Impact: The "Root" at the end doesn't undo the fact that the outlier dominated the calculation. RMSE will always be pulled toward the outliers, potentially leading to a model that overfits to noise rather than the general trend.
2. The Scale & Comparison Problem
- Same as in MSE
3. The Normal Distribution Assumption
- Same as in MSE
4. Non-Linearity of Error
- One of the nuances of RMSE is that it is not a linear mapping of error.
- The Implication: Doubling the RMSE value does not necessarily mean the average error magnitude has doubled. Because the errors are squared before being averaged, larger errors have a disproportionately large effect on the final value.
- Example: An error of 10 contributes 100 to the sum of squares, while an error of 2 contributes only 4. The final square root operation does not change this underlying weighting.
- Alternative: For a direct, linear interpretation of average error magnitude, Mean Absolute Error (MAE) is a better choice.
5. Gradient Issues Near Zero
- Issue: The derivative of RMSE with respect to a single error $e_i = y_i - \hat{y}_i$ is $\frac{e_i}{n \cdot \text{RMSE}}$. As the overall error (and thus MSE) approaches zero, the $\frac{1}{\text{RMSE}}$ factor can make this gradient very large or numerically unstable.
- Impact: While less of a problem in practice with modern deep learning frameworks that use techniques like gradient clipping, it's a mathematical property to be aware of, especially when compared to the constant gradient of MAE.
6. Sum of Errors vs. Square Root
- RMSE does not scale linearly with the size of errors. If one model has twice the RMSE of another, that does not mean its typical error is twice as large; a few big misses can account for most of the difference.
- This also makes RMSE less stable than MAE when comparing models across test sets of different sizes: a handful of extreme errors distorts RMSE more in a small sample. MAE is much more consistent when you change the number of rows in your test set.
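The outlier sensitivity from point 1 can be seen directly: add one wild residual to an otherwise well-behaved set and compare how RMSE and MAE react (the residuals are made-up values for illustration):

```python
import numpy as np

def rmse(errors):
    # Square, average, then root: the squaring step lets the outlier dominate
    return np.sqrt(np.mean(np.square(errors)))

def mae(errors):
    # Linear averaging: the outlier counts only in proportion to its size
    return np.mean(np.abs(errors))

clean = np.array([1.0, -1.0, 0.5, -0.5, 1.0])  # small residuals
with_outlier = np.append(clean, 20.0)          # one wild miss added

print(f"clean   -> RMSE {rmse(clean):.2f}, MAE {mae(clean):.2f}")
print(f"outlier -> RMSE {rmse(with_outlier):.2f}, MAE {mae(with_outlier):.2f}")
```

One extra point inflates RMSE by roughly a factor of ten here (0.84 → 8.20), while MAE grows by about five (0.80 → 4.00): the final square root does not undo the outlier's dominance of the squared sum.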
When to Use RMSE
- You need a loss metric that's easy to interpret in the same units as your target
- You want to penalize large errors more than small ones
- Your data is relatively clean with few outliers
- You're reporting model performance to non-technical stakeholders
When to Avoid RMSE
- Your data has significant outliers
- You're working with highly skewed distributions
- You need a loss function that's robust to extreme values
Scaling and Practical Considerations
- Since RMSE is just the square root of Mean Squared Error (MSE), it has the same scaling considerations as MSE: the metric is scale-dependent, so RMSE values are only comparable between models evaluated on the same target in the same units.
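The scale dependence is easy to verify: expressing the same target in different units (say, dollars instead of thousands of dollars) rescales RMSE by exactly the same factor (the numbers below are made up for illustration):

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical prices in thousands of dollars
y_true = np.array([1.2, 3.4, 2.1])
y_pred = np.array([1.0, 3.0, 2.5])

r_thousands = rmse(y_true, y_pred)                  # in thousands of dollars
r_dollars = rmse(y_true * 1000, y_pred * 1000)      # same data, in dollars

print(f"RMSE (thousands): {r_thousands:.4f}")
print(f"RMSE (dollars):   {r_dollars:.2f}")
# r_dollars is exactly 1000 * r_thousands: identical model, different number
```

So a "small" or "large" RMSE is meaningless without knowing the target's units and range, just as with MSE.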
Python Code Example
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# Load the mpg dataset
mpg = sns.load_dataset('mpg')
mpg = mpg.dropna() # Remove missing values
print("Dataset shape:", mpg.shape)
# Predict mpg based on horsepower and weight
X = mpg[['horsepower', 'weight']].values
y = mpg['mpg'].values
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"\nRoot Mean Squared Error (RMSE): {rmse:.4f} mpg")
print(f"This means, on average, our predictions are off by about {rmse:.2f} miles per gallon")
# Compare with MSE
mse = mean_squared_error(y_test, y_pred)
print(f"\nFor comparison:")
print(f"MSE: {mse:.4f} (squared mpg - hard to interpret)")
print(f"RMSE: {rmse:.4f} (mpg - easy to interpret)")
Output
Dataset shape: (392, 9)
Root Mean Squared Error (RMSE): 4.2180 mpg
This means, on average, our predictions are off by about 4.22 miles per gallon
For comparison:
MSE: 17.7918 (squared mpg - hard to interpret)
RMSE: 4.2180 (mpg - easy to interpret)
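Note: depending on your scikit-learn version, RMSE is also available without the manual square root. `root_mean_squared_error` was added in scikit-learn 1.4, and older releases accept `squared=False` in `mean_squared_error` (since deprecated). A version-tolerant sketch, using small made-up arrays:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_test = np.array([20.0, 30.0, 25.0])  # toy values for illustration
y_pred = np.array([22.0, 28.0, 25.0])

# Works on any version: take the square root yourself
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

try:
    # scikit-learn >= 1.4 ships a dedicated function
    from sklearn.metrics import root_mean_squared_error
    rmse_direct = root_mean_squared_error(y_test, y_pred)
except ImportError:
    # Older releases: squared=False makes mean_squared_error return RMSE
    rmse_direct = mean_squared_error(y_test, y_pred, squared=False)

assert np.isclose(rmse, rmse_direct)
print(f"RMSE: {rmse:.4f}")
```

Either route gives the same number; the `np.sqrt(mean_squared_error(...))` form used in the example above is simply the most portable.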