Log-Cosh Loss

Definition

Log-Cosh Loss is the logarithm of the hyperbolic cosine of the prediction error. It is a smooth approximation to the absolute error that combines the advantages of MSE and MAE: it behaves like MSE for small errors and like MAE for large errors. Unlike Huber loss, it is smooth everywhere (twice differentiable) and requires no transition hyperparameter.

Formula:

Log-Cosh = (1/n) Σ_{i=1}^{n} log(cosh(ŷ_i − y_i))

Where cosh(x) = (e^x + e^(−x)) / 2
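As a quick sanity check on the formula, the mean log-cosh can be computed directly with NumPy. The values below are a hypothetical toy example chosen for illustration.

```python
import numpy as np

def log_cosh_loss(y_true, y_pred):
    """Mean log-cosh of the prediction errors."""
    error = np.asarray(y_pred) - np.asarray(y_true)
    return np.mean(np.log(np.cosh(error)))

# Hypothetical predictions; errors are -0.5, 0.2, and 2.0
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.2, 4.0])
print(log_cosh_loss(y_true, y_pred))  # ~0.488
```

The small errors (0.2, 0.5) contribute roughly error²/2 each, while the large error (2.0) contributes roughly |error| − log 2, previewing the MSE-like/MAE-like split discussed below.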

Advantages

1. Smooth Everywhere – twice differentiable, with no kink at zero (unlike MAE) and no corner at a threshold (unlike Huber)
2. Robust to Outliers – grows roughly linearly for large errors, so outliers are penalized far less than under MSE
3. No Hyperparameter Needed – unlike Huber loss, there is no δ to tune
4. Symmetric and Fair – over- and under-predictions of the same magnitude receive the same penalty
5. Works Well with Gradient Descent – its gradient, tanh(error), is bounded and smooth
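The smoothness and bounded-gradient points can be seen directly: the derivative of log(cosh(e)) with respect to the error e is tanh(e). A minimal sketch:

```python
import numpy as np

# d/de log(cosh(e)) = tanh(e): smooth everywhere, approximately e near
# zero (MSE-like), and saturating toward +/-1 for large |e| (MAE-like),
# so no single outlier can dominate a gradient update.
errors = np.array([-10.0, -1.0, -0.1, 0.0, 0.1, 1.0, 10.0])
gradients = np.tanh(errors)
print(gradients)
```

Compare this with MSE, whose gradient 2e grows without bound, and MAE, whose gradient sign(e) is discontinuous at zero.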

Disadvantages

1. Computationally Expensive – evaluating cosh and log costs more than squaring or taking an absolute value
2. Less Interpretable – the loss value has no direct unit, unlike MSE (squared units) or MAE (original units)
3. Not Always Better – when the data contains few outliers, plain MSE often performs just as well
4. Potential for Numerical Instability – a naive cosh() implementation overflows for large errors

When to Use Log-Cosh Loss

Use it when the data contains outliers that should not dominate training, when you want MAE-like robustness but still need smooth gradients (e.g. in gradient boosting or neural networks), or when you prefer not to tune Huber's δ.

When to Avoid Log-Cosh Loss

Avoid it when an interpretable loss value matters, when compute is tight and MSE or MAE suffice, or when errors are very large and unscaled, which risks the numerical issues discussed below.

Scaling and Practical Considerations

1. Does Log-Cosh Loss Need Scaled Data?

The short answer: Technically, No. The math works on any scale.
The real answer: Practically, Yes. Scaling is highly recommended because Log-Cosh approximates different losses at different error scales.

2. Key Insight: Log-Cosh Behaves Differently at Different Error Magnitudes

The transition behavior: for small errors, log(cosh(x)) ≈ x²/2, so the loss is quadratic (MSE-like); for large errors, log(cosh(x)) ≈ |x| − log(2), so it grows linearly (MAE-like). The crossover happens around |error| ≈ 1.

Why this matters: unlike Huber's δ, this transition point is fixed and cannot be tuned. The only way to control which regime your errors fall into is to control their scale, which is exactly what feature and target scaling does.
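The two regimes can be checked numerically; the approximations log(cosh(x)) ≈ x²/2 for small x and log(cosh(x)) ≈ |x| − log 2 for large x follow directly from the definition of cosh:

```python
import numpy as np

log_cosh = lambda x: np.log(np.cosh(x))

# Small-error regime: quadratic, half of the squared error
small = 0.1
quadratic_approx = small**2 / 2          # 0.005
print(log_cosh(small), quadratic_approx)

# Large-error regime: linear, absolute error shifted by log(2)
large = 10.0
linear_approx = large - np.log(2.0)
print(log_cosh(large), linear_approx)
```

The agreement is already tight at |x| = 10 because the neglected term, log(1 + e^(−2|x|)), decays exponentially.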

3. When does scaling help?

★ Multi-Feature Models
★ Neural Networks
★ Regularized Models
★ Understanding Transition Behavior

4. Effect of Scaling on Log-Cosh Loss

Without scaling:

# Feature range: [0, 1000]
# Errors might be: [0, 100]
# Log-Cosh treats most errors as "large" → behaves mostly like MAE
# You lose the smooth optimization benefit for small errors
# Computational cost without MSE-like benefits

With standardization:

# Features scaled to mean=0, std=1
# Errors typically in range: [-3, 3]
# Log-Cosh has balanced MSE-like (small) and MAE-like (large) regions
# Better gradient behavior across the error range
# Gets the best of both worlds
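The effect of standardization on which regime the errors occupy can be sketched with synthetic data (all numbers below are hypothetical, and sklearn's StandardScaler is one convenient way to standardize):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A target on a large scale puts nearly every error past the |error| ~ 1
# transition; standardizing the target shrinks typical errors below it.
rng = np.random.default_rng(42)
y_true = rng.normal(500.0, 100.0, size=500)          # unscaled target
y_pred = y_true + rng.normal(0.0, 30.0, size=500)    # typical error ~30

scaler = StandardScaler()
y_true_s = scaler.fit_transform(y_true.reshape(-1, 1)).ravel()
y_pred_s = scaler.transform(y_pred.reshape(-1, 1)).ravel()

raw_frac_large = np.mean(np.abs(y_pred - y_true) > 1.0)
scaled_frac_large = np.mean(np.abs(y_pred_s - y_true_s) > 1.0)
print(raw_frac_large, scaled_frac_large)  # nearly all raw errors > 1; few scaled ones
```

Before scaling, Log-Cosh degenerates to MAE-plus-overhead; after scaling, most errors sit in the smooth quadratic region while outliers remain in the linear one.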

Numerical stability considerations:

# cosh(x) grows exponentially: cosh(100) = 1.3e43
# Without scaling, large errors can cause overflow in exp() calculation
# Scaling keeps errors in manageable range, typically [-5, 5]
# Prevents numerical errors and NaN values
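A numerically stable implementation avoids the overflow entirely by using the identity log(cosh(e)) = logaddexp(e, −e) − log(2), since np.logaddexp never materializes exp(e). A minimal sketch:

```python
import numpy as np

def log_cosh_naive(e):
    return np.log(np.cosh(e))           # cosh(1000) overflows float64 -> inf

def log_cosh_stable(e):
    # log(cosh(e)) = log(e^e + e^-e) - log(2), computed without overflow
    return np.logaddexp(e, -e) - np.log(2.0)

with np.errstate(over='ignore'):
    print(log_cosh_naive(np.array(1000.0)))   # inf
print(log_cosh_stable(np.array(1000.0)))      # ~999.307 (= 1000 - log 2)
```

Even with the stable form, scaling remains worthwhile for the gradient-behavior reasons above; the identity only removes the overflow failure mode.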

5. Visualization of Scaling Effect

| Error Range | Without Scaling | With Scaling |
| --- | --- | --- |
| Small (< transition point) | Rare if features unscaled | Common – benefits from MSE-like smoothness |
| Large (> transition point) | Most errors here | Outliers only – benefits from MAE-like robustness |
| Behavior | Mostly linear (like MAE) | Balanced quadratic + linear |
| Optimization | Good but not optimal | Excellent gradient properties |

6. Best Practice for Log-Cosh Loss

Standardize the features (and ideally the target) so that typical errors fall near the |error| ≈ 1 transition, and use a numerically stable implementation (e.g. via logaddexp) rather than calling cosh directly.

7. Comparison with Huber Loss

Both losses are quadratic near zero and linear for large errors. Huber makes the transition explicit through its δ hyperparameter and has a discontinuous second derivative at |error| = δ; Log-Cosh transitions smoothly around |error| ≈ 1 with no parameter to tune, at a slightly higher computational cost.
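The contrast with Huber can be sketched side by side; δ = 1.0 below is an arbitrary illustrative choice, not a recommended setting:

```python
import numpy as np

def huber_loss(e, delta=1.0):
    a = np.abs(e)
    return np.where(a <= delta, 0.5 * e**2, delta * (a - 0.5 * delta))

def log_cosh(e):
    # stable form of log(cosh(e))
    return np.logaddexp(e, -e) - np.log(2.0)

# Both are quadratic near zero and linear in the tails, but Huber's
# second derivative jumps from 1 to 0 at |e| = delta, while log-cosh's
# (sech^2(e)) decays smoothly with no threshold to tune.
print(huber_loss(np.array(0.5)), log_cosh(np.array(0.5)))
print(huber_loss(np.array(3.0)), log_cosh(np.array(3.0)))
```

At e = 0.5 both are close to e²/2; at e = 3.0 both grow essentially linearly, differing only by a constant offset.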

Python Code Example

import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Define Log-Cosh Loss function
def log_cosh_loss(y_true, y_pred):
    """
    Mean logarithm of the hyperbolic cosine of the prediction error.
    Uses the identity log(cosh(e)) = logaddexp(e, -e) - log(2)
    to avoid overflow in cosh() for large errors.
    """
    error = y_pred - y_true
    return np.mean(np.logaddexp(error, -error) - np.log(2.0))

# Load the tips dataset
tips = sns.load_dataset('tips')

# Add some outliers
X = tips[['total_bill']].values
y = tips['tip'].values
y_with_outliers = y.copy()
y_with_outliers[[5, 10, 15, 20]] = [25, 30, 28, 35]  # Add outliers

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y_with_outliers, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Calculate Log-Cosh Loss
log_cosh = log_cosh_loss(y_test, y_pred)
print(f"Log-Cosh Loss: {log_cosh:.4f}")

# Compare with other loss functions
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"\nComparison of loss functions:")
print(f"MSE:       {mse:.4f} (very sensitive to outliers)")
print(f"RMSE:      {rmse:.4f} (sensitive to outliers)")
print(f"MAE:       {mae:.4f} (robust to outliers)")
print(f"Log-Cosh:  {log_cosh:.4f} (balanced approach)")

# Visualize how different loss functions behave
errors = np.linspace(-5, 5, 100)

# Calculate loss for each error magnitude
mse_loss = errors ** 2
mae_loss = np.abs(errors)
log_cosh_loss_curve = np.log(np.cosh(errors))

# Plot
plt.figure(figsize=(12, 6))
plt.plot(errors, mse_loss, label='MSE (quadratic)', linewidth=2)
plt.plot(errors, mae_loss, label='MAE (linear)', linewidth=2)
plt.plot(errors, log_cosh_loss_curve, label='Log-Cosh (smooth)', linewidth=2, linestyle='--')
plt.xlabel('Prediction Error')
plt.ylabel('Loss')
plt.title('Comparison of Loss Functions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='-', alpha=0.3)
plt.tight_layout()
plt.show()

Output

Log-Cosh Loss: 1.3404

Comparison of loss functions:
MSE:       28.6915 (very sensitive to outliers)
RMSE:      5.3564 (sensitive to outliers)
MAE:       1.7742 (robust to outliers)
Log-Cosh:  1.3404 (balanced approach)

Figure (ML_AI/images/lc-1.png): comparison of the MSE, MAE, and Log-Cosh loss curves.