Mean normalization (mean-centering)
- Mean Normalization is a variation of Min-Max scaling that centers the data around the mean.
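- Range: The resulting data will have a mean of 0, and every value will fall between -1 and 1, because no value can be farther from the mean than the full range (max − min). New data scaled with the same statistics can still land outside this interval.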
I. Features:
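- Formula: $x' = \dfrac{x - \operatorname{mean}(x)}{\max(x) - \min(x)}$
Where $x$ is an original value of the feature, $x'$ is its normalized value, and $\operatorname{mean}(x)$, $\max(x)$, and $\min(x)$ are the mean, maximum, and minimum of that feature.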
II. Pros:
- Perfect Centering: The mean of your new dataset will be exactly 0.
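- Bounded Range: The values used to compute the statistics always fall in the range [−1,1], and the gap between the smallest and largest scaled value is exactly 1.
- Intuitive: It’s very easy to explain to non-scientists: "Positive is above average, negative is below average."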
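The centering and bounded-range claims above can be checked with a minimal sketch. The helper name `mean_normalize` and the handling of a constant feature are assumptions made for illustration, not part of any library.

```python
import numpy as np

def mean_normalize(x):
    """Mean-normalize a 1-D array: (x - mean) / (max - min)."""
    x = np.asarray(x, dtype=float)
    value_range = x.max() - x.min()
    if value_range == 0:
        # Assumption: a constant feature has no spread, so map it to zeros
        # instead of dividing by zero.
        return np.zeros_like(x)
    return (x - x.mean()) / value_range

values = np.array([10.0, 15.0, 30.0, 50.0])
scaled = mean_normalize(values)

print(scaled.mean())                # 0, up to floating-point error
print(scaled.min(), scaled.max())   # both inside [-1, 1]
print(scaled.max() - scaled.min())  # spread of the scaled values is 1
```

Note that the bounds only hold for the data the statistics were computed from; a new value larger than the original maximum would scale to something above the original top of the range.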
III. Cons:
- Sensitive to outliers: Just like Min-Max and MaxAbs scaling, one "crazy" high or low number stretches the range and squeezes every other value into a narrow band.
- Destroys Sparsity: If you had a lot of zeros (sparse data), they all become the same non-zero number (negative when the mean is positive) once you subtract the mean. This is bad for memory!
- Less Popular than Standardization: In the real world, most people use StandardScaler, partly because dividing by the standard deviation handles outliers slightly better than dividing by the range.
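To make the sparsity point concrete, here is a small sketch on a made-up count feature (the array `counts` is hypothetical data, chosen only for illustration). It counts how many entries are non-zero before and after mean normalization.

```python
import numpy as np

# Hypothetical sparse feature: mostly zeros, with two non-zero counts.
counts = np.array([0, 0, 0, 0, 0, 0, 0, 3, 0, 5], dtype=float)

# Mean normalization: subtract the mean, divide by the range.
scaled = (counts - counts.mean()) / (counts.max() - counts.min())

print(np.count_nonzero(counts))  # 2 non-zero entries before scaling
print(np.count_nonzero(scaled))  # 10 non-zero entries after scaling
print(scaled[0])                 # every former zero becomes -mean / range
```

Here the mean is 0.8 and the range is 5, so each former zero becomes roughly -0.16, and a sparse storage format no longer saves anything.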
IV. Best Use Cases:
You’ll want to reach for Mean Normalization when your machine learning model is a bit "picky" about where the data starts.
- Algorithms that love Zero: Some models (like Logistic Regression or Neural Networks) learn much faster when the data is "zero-centered." It helps the math (specifically Gradient Descent) stay stable.
- Simple Comparisons: When you want to see if a value is "above average" (positive) or "below average" (negative) at a single glance.
- Feature Scaling for Gradient Descent: If one feature ranges from 0 to 1,000,000 and another from 0 to 1, the large feature dominates the gradient, so no single learning rate suits both and training converges slowly. Mean normalization gives every feature a spread of exactly 1, which puts them on the same playing field.
- Centered and bounded: When you need your data centered at zero but still want a strict, fixed range such as [−1,1] (unlike Standardization, which has no upper/lower bound).
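As a rough sketch of the "same playing field" idea, the snippet below builds two made-up features on very different scales and mean-normalizes them column by column. The feature names `income` and `ratio` and the random data are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales.
income = rng.uniform(0, 1_000_000, size=100)
ratio = rng.uniform(0, 1, size=100)
X = np.column_stack([income, ratio])

# Per-column statistics, then mean normalization.
mean = X.mean(axis=0)
value_range = X.max(axis=0) - X.min(axis=0)
X_scaled = (X - mean) / value_range

print(X.max(axis=0) - X.min(axis=0))                # wildly different spreads
print(X_scaled.max(axis=0) - X_scaled.min(axis=0))  # both spreads equal 1
print(X_scaled.mean(axis=0))                        # both means are about 0
```

After scaling, both columns are centered at zero with the same spread, so a gradient step of a given size means roughly the same thing for each feature.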
V. When NOT to Use
- You have Sparse Data: If your data is 90% zeros (like a word-count matrix), Mean Normalization will turn every one of those zeros into the same non-zero number (−mean/range). Suddenly, your computer has to store millions of tiny numbers instead of just skipping the zeros, which can exhaust memory on large datasets.
- You have Extreme Outliers: If you're measuring the wealth of people in a room and Bill Gates walks in, the "Range" becomes so huge that everyone else’s normalized wealth will look almost exactly the same, squeezed into a tiny interval, while Bill Gates sits alone at the top of the scale.
- The Algorithm doesn't care about centering: Some models, like Decision Trees or Random Forests, don't care about the scale or the mean at all. Using this would just be extra work for no gain!
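The outlier problem can be sketched with made-up numbers. The net worths below are hypothetical values chosen for illustration; the last entry plays the role of the extreme outlier.

```python
import numpy as np

# Hypothetical net worths in dollars; the last entry is an extreme outlier.
wealth = np.array([40_000, 55_000, 60_000, 75_000, 100_000_000_000], dtype=float)

scaled = (wealth - wealth.mean()) / (wealth.max() - wealth.min())

# How far apart are the four "ordinary" people after scaling?
ordinary_spread = scaled[:4].max() - scaled[:4].min()

print(scaled)           # four nearly identical values, plus the outlier
print(ordinary_spread)  # a tiny number: the differences are almost erased
```

The four ordinary values end up at about -0.2 and differ only around the seventh decimal place, so a model can barely tell them apart.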
VI. Code Snippet
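import numpy as np
import pandas as pd
# Example dataset: 4 samples (rows), 2 features (columns)
data = np.array([[10, 20], [15, 25], [30, 35], [50, 45]])
# Per-feature statistics (axis=0 computes them down each column)
mean = np.mean(data, axis=0)
min_val = np.min(data, axis=0)
max_val = np.max(data, axis=0)
# Mean normalization: subtract the mean, then divide by the range
mean_normalized_data = (data - mean) / (max_val - min_val)
# Show the result as a table
pd.DataFrame(mean_normalized_data)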
| | 0 | 1 |
|---|---|---|
| 0 | -0.40625 | -0.45 |
| 1 | -0.28125 | -0.25 |
| 2 | 0.09375 | 0.15 |
| 3 | 0.59375 | 0.55 |
Difference from Standardization: While both center the data at zero, Mean Normalization divides by the Range, whereas Standardization divides by the Standard Deviation.
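A minimal side-by-side sketch of the two methods, reusing the example dataset from the code snippet. It assumes the population standard deviation (`ddof=0`, NumPy's default), which is also what scikit-learn's StandardScaler uses.

```python
import numpy as np

data = np.array([[10, 20], [15, 25], [30, 35], [50, 45]], dtype=float)

# Both methods start by centering each column at zero.
centered = data - data.mean(axis=0)

# Mean normalization divides by the range (max - min).
mean_normalized = centered / (data.max(axis=0) - data.min(axis=0))

# Standardization divides by the standard deviation (ddof=0 assumed here).
standardized = centered / data.std(axis=0)

print(mean_normalized.max(axis=0))  # stays within [-1, 1]
print(standardized.max(axis=0))     # can exceed 1: no fixed bound
print(standardized.std(axis=0))     # unit standard deviation per column
```

Both results have a mean of zero in every column; the difference is that mean normalization fixes the spread at 1, while standardization fixes the standard deviation at 1 and leaves the range unbounded.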