Feature Scaling
Scaling is the process of transforming the numerical values of your features so they fall within a specific range.
- The Problem: When features are on vastly different scales, such as one feature ranging from 1 to 10 and another from 1,000 to 10,000, models like KNN or SVM can prioritize the larger values, leading to biased predictions, poor model performance, and slower convergence during training.
- The Goal: To ensure every feature contributes equally to the model's decision-making process.
- Feature scaling addresses these issues by adjusting the range of the data without distorting differences in the values.
- The two most common methods:
- Normalization
- Standardization
Normalization and Standardization
- Both involve transforming the values of features in a dataset to a similar scale, ensuring that all features contribute equally to the model's learning process.
I. Normalization
- Normalization refers to the process of adjusting values measured on different scales to a common scale.
★ Normalization is most effective in the following scenarios
- Unknown or Non-Gaussian Distribution: When the distribution of the data is not known or does not follow a normal (Gaussian) pattern. For example, in linear regression, we may want to normalize the dependent variable so it looks more like a bell curve, which allows for better confidence in our estimates.
- Distance-Based Algorithms: Normalization is needed when using machine learning algorithms that rely on distances between data points, such as k-Nearest Neighbors (kNN), to prevent features with larger scales from dominating the distance calculations.
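As a concrete sketch of min-max normalization for the distance-based case, using scikit-learn (the feature values here are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales: age vs. annual income
X = np.array([[25,  50_000.0],
              [32,  64_000.0],
              [47, 120_000.0],
              [51,  98_000.0]])

scaler = MinMaxScaler()           # default feature_range=(0, 1)
X_norm = scaler.fit_transform(X)  # per column: (x - min) / (max - min)

print(X_norm.min(axis=0))  # [0. 0.] -- each column now starts at 0
print(X_norm.max(axis=0))  # [1. 1.] -- and ends at 1
```

After this transform, both features contribute on the same [0, 1] scale, so neither dominates a Euclidean distance in kNN.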
II. Standardization
- Standardization, which is also called z-score scaling, transforms data to have a mean of 0 and a standard deviation of 1.
★ When should you standardize data?
- Gradient-based Algorithms: Support Vector Machines (SVM) require standardized data for optimal performance. While models like linear regression and logistic regression do not assume standardized inputs, they may still benefit from standardization, particularly when features vary widely in magnitude, since it ensures balanced contributions from each feature and improves optimization.
- Dimensionality Reduction: Standardization is essential in dimensionality reduction techniques like PCA, because PCA identifies the directions in which the variance of the data is maximized. Mean normalization alone is not sufficient: PCA considers both the mean and the variance, so features on different scales would distort the analysis.
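A minimal sketch of standardization with scikit-learn's `StandardScaler` (synthetic data; the scales are chosen to mimic features with very different magnitudes):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# One feature in the thousands, one near zero
X = np.column_stack([rng.normal(5_000, 800, size=200),
                     rng.normal(0.5, 0.1, size=200)])

X_std = StandardScaler().fit_transform(X)  # z = (x - mean) / std

print(X_std.mean(axis=0))  # effectively [0, 0] -- zero mean per feature
print(X_std.std(axis=0))   # effectively [1, 1] -- unit variance per feature
```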
Why Normal Distribution Matters in Machine Learning
When is Normality Important?
Normal distribution (Gaussian distribution) is a fundamental assumption in many statistical and machine learning methods. Understanding whether your data follows a normal distribution helps you:
- Choose appropriate models: Some algorithms assume normally distributed features
- Apply correct transformations: Non-normal data may need transformation (log, Box-Cox, etc.)
- Validate statistical tests: Many hypothesis tests require normality
- Interpret results correctly: Normality affects confidence intervals and predictions
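As a sketch of the "apply correct transformations" point, here are the log and Box-Cox transforms from SciPy applied to synthetic right-skewed data (the log-normal sample is made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # positive, right-skewed

log_t = np.log(skewed)                # simple log transform
boxcox_t, lam = stats.boxcox(skewed)  # Box-Cox estimates lambda by max likelihood

print(f"skew before:   {stats.skew(skewed):.2f}")    # strongly positive
print(f"skew (log):    {stats.skew(log_t):.2f}")     # near 0
print(f"skew (boxcox): {stats.skew(boxcox_t):.2f}")  # near 0; lambda ~ 0 here
```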
Models That Benefit from Normally Distributed Data
Models that ASSUME normality
- Linear Regression: Assumes residuals (errors) are normally distributed
- Logistic Regression: Works better with normally distributed features
- Linear Discriminant Analysis (LDA): Assumes features are normally distributed within each class
- Quadratic Discriminant Analysis (QDA): Similar to LDA but allows different covariance matrices
- Naive Bayes (Gaussian): Explicitly assumes features follow a Gaussian distribution
- T-tests and ANOVA: Statistical tests that require normality
Models that DON'T require normality
- Tree-based models: Decision Trees, Random Forests, Gradient Boosting (XGBoost, LightGBM)
- K-Nearest Neighbors (KNN): Distance-based, distribution-agnostic
- Support Vector Machines (SVM): Kernel-based methods don't assume normality
- Neural Networks: Can learn complex non-linear patterns without normality assumption
Note: Even when models don't strictly require normality, normalizing/standardizing features often improves convergence and performance, especially for gradient-based optimization.
Determining if a numeric feature is Gaussian (normally) distributed involves a combination of visual inspection and statistical tests. Here's a comprehensive guide:
Normal Distribution Test
I. Visual Inspection Methods
- Histograms with KDE
- Q-Q Plot
- Box Plot
- Probability Density Function (PDF) Overlay
1. Histograms with KDE (Kernel Density Estimate)
- Plot a histogram of the feature with a KDE overlay
- A Gaussian distribution will appear as a bell-shaped curve, symmetrical around the mean
- Look for skewness (asymmetry) or multiple peaks (bimodal/multimodal), which indicate non-normality
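A minimal histogram-plus-KDE sketch with Matplotlib and SciPy (synthetic normal data; the file name `hist_kde.png` is arbitrary):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=10, scale=2, size=1000)

fig, ax = plt.subplots()
ax.hist(data, bins=30, density=True, alpha=0.5, label="histogram")

kde = stats.gaussian_kde(data)  # smooth density estimate
xs = np.linspace(data.min(), data.max(), 200)
ax.plot(xs, kde(xs), label="KDE")
ax.legend()
fig.savefig("hist_kde.png")

# A single central peak near the mean is one visual hint of normality
peak_x = xs[np.argmax(kde(xs))]
print(f"KDE peak at x = {peak_x:.2f}")  # close to the true mean of 10
```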
2. Q-Q Plots (Quantile-Quantile Plots)
- Compares the quantiles of your data to the quantiles of a theoretical normal distribution
- Most reliable visual test for normality
- If the data is normally distributed, points fall approximately along a straight diagonal line
- Deviations from the line indicate non-normality
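`scipy.stats.probplot` draws the Q-Q plot and also returns the correlation `r` of the points against the reference line, which gives a rough numeric read on how straight the plot is (synthetic samples for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(7)
normal_data = rng.normal(size=500)
skewed_data = rng.exponential(size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# probplot returns (osm, osr), (slope, intercept, r); r measures line fit
_, (_, _, r_normal) = stats.probplot(normal_data, dist="norm", plot=ax1)
_, (_, _, r_skewed) = stats.probplot(skewed_data, dist="norm", plot=ax2)
fig.savefig("qq_plots.png")

print(f"r (normal data): {r_normal:.4f}")  # very close to 1
print(f"r (skewed data): {r_skewed:.4f}")  # lower: the tails bend off the line
```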
3. Box Plots
- Shows the median, quartiles, and potential outliers
- While not a direct test for normality, reveals skewness and outliers
- Symmetric box with median in center suggests normality
4. Probability Density Function (PDF) Overlay
- Overlay actual data distribution with theoretical normal distribution
- Visual comparison shows how closely data matches normal curve
★ PDF's Interpretation
- Close overlap = data is approximately normal
- Visible differences = data deviates from normality
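One way to sketch the PDF overlay (synthetic standard-normal data; the gap check at the end is just an informal numeric companion to the visual comparison):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(loc=0, scale=1, size=2000)

mu, sigma = data.mean(), data.std()
xs = np.linspace(data.min(), data.max(), 200)

fig, ax = plt.subplots()
ax.hist(data, bins=40, density=True, alpha=0.5, label="data")
ax.plot(xs, stats.norm.pdf(xs, mu, sigma), label="fitted normal PDF")
ax.legend()
fig.savefig("pdf_overlay.png")

# Rough numeric check of the overlap: compare histogram heights to the PDF
hist, edges = np.histogram(data, bins=40, density=True)
centers = (edges[:-1] + edges[1:]) / 2
max_gap = np.abs(hist - stats.norm.pdf(centers, mu, sigma)).max()
print(f"largest histogram-vs-PDF gap: {max_gap:.3f}")  # small = close overlap
```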
II. Statistical Tests
- Skewness and Kurtosis
- Kolmogorov-Smirnov Test
- D'Agostino-Pearson Test
- Jarque-Bera Test
- Shapiro-Wilk Test
- Anderson-Darling Test
All statistical tests for normality use the same general interpretation for p-values:
- p-value > 0.05: Fail to reject normality; the data is consistent with a normal distribution ✓
- p-value ≤ 0.05: Reject normality; the data is likely NOT normally distributed ✗
1. Skewness and Kurtosis
- Skewness measures the asymmetry of the distribution.
- Kurtosis measures the "tailedness" of the distribution.
- For a perfect normal distribution, skewness is 0 and kurtosis is 3 (equivalently, excess kurtosis is 0, which is what many libraries report).
- Significant deviations from these values indicate non-normality.
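With SciPy (synthetic samples for illustration), note that `scipy.stats.kurtosis` reports excess kurtosis by default, so pass `fisher=False` to get the convention where a normal distribution scores 3:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
normal_data = rng.normal(size=5000)
skewed_data = rng.exponential(size=5000)

print(stats.skew(normal_data))                    # ~0
print(stats.kurtosis(normal_data, fisher=False))  # ~3 ("plain" kurtosis)
print(stats.kurtosis(normal_data))                # ~0 (excess kurtosis, the default)
print(stats.skew(skewed_data))                    # clearly positive: right-skewed
```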
2. Kolmogorov-Smirnov Test (K-S Test)
- Compares the cumulative distribution function of your data to the cumulative distribution function of a normal distribution.
- It's more general than the Shapiro-Wilk test and can be used for larger sample sizes.
- It also returns a p-value, and the null hypothesis of normality is rejected if the p-value is less than the significance level.
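A sketch with `scipy.stats.kstest`. One practical gotcha: the K-S test compares against a fully specified distribution, so either pass the parameters explicitly or standardize the data first; estimating the parameters from the same sample makes the test only approximate (the Lilliefors correction addresses this):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
data = rng.normal(loc=50, scale=5, size=1000)

# Standardize so the sample is comparable to the standard normal "norm"
z = (data - data.mean()) / data.std()
stat, p = stats.kstest(z, "norm")

print(f"K-S statistic = {stat:.4f}, p-value = {p:.4f}")
```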
3. D'Agostino-Pearson Test (Omnibus Test)
- Combines skewness and kurtosis to assess normality.
- It's a good general-purpose test for normality.
- It also returns a p-value.
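In SciPy this test is exposed as `scipy.stats.normaltest` (synthetic samples for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
normal_data = rng.normal(size=2000)
skewed_data = rng.exponential(size=2000)

# normaltest combines skewness and kurtosis (D'Agostino-Pearson omnibus test)
stat_n, p_n = stats.normaltest(normal_data)
stat_s, p_s = stats.normaltest(skewed_data)

print(f"normal sample: p = {p_n:.4f}")
print(f"skewed sample: p = {p_s:.2e}")  # tiny: normality clearly rejected
```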
4. Jarque-Bera Test
- Similar to D'Agostino-Pearson, tests based on skewness and kurtosis
- Commonly used in econometrics and time series analysis
- Works well for large samples (n > 2000)
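A sketch with `scipy.stats.jarque_bera` on a heavy-tailed sample, the kind of fat-tailed data (e.g. financial returns) this test is often applied to in econometrics:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(17)
# Student's t with 3 degrees of freedom: symmetric but heavy-tailed
returns = rng.standard_t(df=3, size=5000)

stat, p = stats.jarque_bera(returns)
print(f"JB statistic = {stat:.1f}, p-value = {p:.2e}")  # heavy tails -> rejected
```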
5. Shapiro-Wilk Test
- A powerful test for normality, especially for smaller sample sizes (typically < 5000).
- It calculates a test statistic (W) and a p-value.
- If the p-value is less than the significance level (e.g., 0.05), you reject the null hypothesis of normality.
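With `scipy.stats.shapiro` on small samples (synthetic data; 0.05 is the conventional significance level):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(19)
small_normal = rng.normal(size=80)
small_skewed = rng.exponential(size=80)

w1, p1 = stats.shapiro(small_normal)
w2, p2 = stats.shapiro(small_skewed)

print(f"normal sample: W = {w1:.4f}, p = {p1:.4f}")
print(f"skewed sample: W = {w2:.4f}, p = {p2:.2e}")  # rejects normality
```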
6. Anderson-Darling Test
- More sensitive than K-S test, especially in the tails of the distribution
- Gives more weight to extreme values
- Provides critical values for different significance levels
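`scipy.stats.anderson` returns the A² statistic plus a table of critical values rather than a p-value, so the decision is made per significance level (synthetic data for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(23)
data = rng.normal(size=1000)

result = stats.anderson(data, dist="norm")
print(f"A^2 statistic: {result.statistic:.3f}")
for cv, sig in zip(result.critical_values, result.significance_level):
    verdict = "reject" if result.statistic > cv else "fail to reject"
    print(f"  at {sig:>4}% significance: critical value {cv:.3f} -> {verdict}")
```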
Summary
I. Normalization vs Standardization
| Category | Normalization | Standardization |
|---|---|---|
| Rescaling Method | Scales data to a fixed range (such as 0 to 1 or -1 to 1) based on the minimum and maximum values. | Centers data around the mean (0) and scales it by the standard deviation (1). |
| Sensitivity to outliers | Highly sensitive to outliers since min and max are affected by extreme values. | Less sensitive to outliers. |
| Common Algorithms | Often applied in algorithms like k-NN and neural networks that require data to be on a consistent scale. | Best suited for algorithms that require features to have a common scale, such as LinearRegression, LogisticRegression, SVM, and PCA. |
| Suitability for Data | Suitable for data that does not follow a Gaussian distribution and when a bounded range is necessary. | More suitable for data with a Gaussian distribution or when maintaining zero-centered data is important. |
| Impact on Shape of Data | It may alter the shape of the data distribution, especially if there are significant outliers. | Maintains the shape of the original data distribution but aligns it to a standard scale. |
| Dependency on Distribution | Does not assume any distribution of the data. | Assumes the distribution of data is normal. |
II. Quick Summary of all Transformation and Scaling techniques
| Techniques | What it does? | What data looks like? | Data Distribution Pattern | Direction | Range | What do we want? | Works Best with | Sensitive to Outliers | Effects of Outliers | Alternates Suggested/Cons |
|---|---|---|---|---|---|---|---|---|---|---|
| Log Transformation | Applies log(x) to each value. | ➛ Positive data only ➛ Highly skewed "long tail" data ➛ Data exhibiting exponential growth patterns | Right-skewed, exponential | Non Linear | (-∞, +∞) | ➛ Changes the unit of measurement to a logarithmic scale ➛ Turns highly skewed data closer to a Gaussian (Normal) distribution ➛ Compresses large values | ➛ Algorithms sensitive to absolute scale: Linear Regression, KNN and Gradient Boosting Models (GBMs) | No | Compresses them toward the mean. | ❌ Avoid if data contains negative values ❌ Avoid if data is already normally distributed ⚠️ May introduce bias if the data contains zeros or near-zero values (log(1 + x) is a common workaround) |
| Logit | Applies the log-odds: logit(p) = ln(p / (1 - p)). | ➛ Proportions/Probabilities (0 to 1) ➛ Conversion rates, market share ➛ Model probability outputs ➛ Beta-distributed data | Bounded [0,1], S-shaped | Non Linear | (-∞, +∞) | ➛ Unbound [0,1] data to the full real line ➛ Linearize sigmoid relationships ➛ Variance stabilization for proportions | ➛ Model stacking: Using probabilities as features ➛ Linear models with proportion inputs ➛ Beta regression | Extreme at boundaries | Cannot handle exact 0 or 1 (undefined); needs epsilon clipping | ❌ Undefined at 0 and 1: Requires epsilon adjustment ❌ Only for [0,1] bounded data: Wrong for continuous/count data ⚠️ Interpretation: Log-odds units unintuitive |
| Probit (Φ⁻¹) | Applies the inverse normal CDF: probit(p) = Φ⁻¹(p). | ➛ Proportions/Probabilities (0 to 1) ➛ Similar to Logit but assumes normal distribution ➛ Dose-response data | Bounded [0,1], Normal-based | Non Linear | (-∞, +∞) | ➛ Alternative to Logit using Gaussian quantiles ➛ Unbound [0,1] data assuming a normal latent variable ➛ Symmetric transformation | ➛ Probit regression models ➛ Biostatistics: Dose-response, toxicology ➛ When the normal assumption is justified | Extreme at boundaries | Same as Logit: undefined at exact 0 and 1 | ❌ Undefined at 0 and 1: Needs epsilon clipping ⚠️ Similar to Logit: Practically interchangeable in most cases ⚠️ Use Logit for interpretability (odds ratios), Probit for the normal assumption |
| Mean normalization | Subtracts the mean, divides by the range. | ➛ No outliers ➛ No sparsity | Any distribution | Linear | [-1, 1] (approx) | ➛ Data centered at zero (mean = 0) within a strict range | ➛ Algorithms that prefer zero-centered data: Logistic Regression, Neural Networks | Extreme | Since it divides by the range, extreme outliers squeeze the other values into a narrow band | ❌ Destroys sparsity |
| Power Transformer | Maps data to a Gaussian distribution (Box-Cox, Yeo-Johnson). | ➛ Skewed data (not normally distributed) ➛ Bimodal data (multiple peaks) ➛ Handles both positive & negative values (Yeo-Johnson) | Heavily skewed or bimodal | Non Linear | Not fixed | ➛ Force skewed data into a bell curve ➛ Fix skewness & stabilize variance ➛ Automatic tuning of the transformation parameter (λ) | ➛ Linear models: Linear/Logistic Regression, LDA ➛ Models requiring normally distributed inputs | Yes | Corrects skew caused by outliers; automatically finds the optimal λ | ❌ Computational cost: Slower than StandardScaler ❌ Interpretability: Transformed units harder to explain ⚠️ Not needed for tree-based models ❌ Destroys sparsity ⚠️ Use a log transform if the data is a simple exponential skew |
| Quantile Transformer | Maps data to a uniform or normal distribution. | ➛ Extreme outliers ➛ Complex/multimodal distributions ➛ Non-linear features ➛ High-dimensional data | Multimodal, complex | Non Linear | [0, 1] or Normal | ➛ Flatten the distribution using quantiles (percentiles) ➛ Collapse outliers into the distribution edges ➛ Force any data into a specific shape (uniform/normal) | ➛ Neural Networks with wildly different feature distributions ➛ When extreme outliers dominate the dataset | Highly Robust | Collapses them into the distribution edges; the most outlier-immune scaler | ❌ Linearity destruction: Distorts linear relationships between features ❌ Information loss: Ranking-based, loses small differences ⚠️ Sample size: Needs >1000 samples for stable estimates ❌ Not for: Linear Regression, small datasets |
| Square | Squares each value: x² | ➛ Left-skewed data (clustered at high values) ➛ Data without extreme values ➛ All real numbers (positive/negative) | Left-skewed | Non Linear | [0, +∞) | ➛ Amplify differences at the upper range ➛ Correct left skewness ➛ Create polynomial features for interaction effects | ➛ Feature engineering for linear models ➛ When left skew needs correction ➛ Reinforcement learning reward functions | Extreme | Magnifies them dramatically (squared effect) | ❌ Worsens right skew: Catastrophic if applied to the wrong distribution ❌ Risk of overflow: Large values become computationally problematic ⚠️ Interpretation difficulty: Squared units lose intuitive meaning |
| Square Root | Takes the square root: √x | ➛ Count data (Poisson distributed) ➛ Moderate right-skewed data ➛ Positive values only (≥ 0) | Poisson, moderate right-skew | Non Linear | [0, +∞) | ➛ Variance stabilization for count data ➛ Moderate compression of right skew ➛ Preserves zero values (√0 = 0) | ➛ Count data: Click counts, frequencies, transactions ➛ Moderate skew correction without over-transforming | No | Compresses them moderately (gentler than log) | ❌ Requires non-negative values: Cannot handle negative numbers ⚠️ Partial correction: May not fully normalize heavily skewed data ⚠️ Use log or reciprocal for extreme skew |
| Exponential (eˣ) | Raises e to the power of x: eˣ | ➛ Left-skewed data ➛ Negative values ➛ Log-transformed data needing reversal | Left-skewed or log-scaled | Non Linear | (0, +∞) | ➛ Reverse a log transformation ➛ Amplify positive values exponentially ➛ Convert an additive to a multiplicative scale | ➛ Inverse of log: When reversing log-transformed predictions ➛ Time series: Exponential growth modeling | Extreme | Amplifies them exponentially (creates massive outliers) | ❌ Output only positive: Cannot produce negative results ❌ Extreme amplification: Small input changes create huge output changes ⚠️ Use sparingly: Typically for reversing log transformations |
| MinMaxScaler | Subtracts the min, divides by the range. | ➛ Features bound to a fixed range ➛ Sparse and positive-only data | Any distribution | Linear | [0, 1] | ➛ Preservation: Preserve relative distances between points ➛ Uniformity: All features get the exact same scale | ➛ Algorithms highly dependent on distance: KNN, Neural Networks, SVM ➛ Algorithms that don't assume a distribution: KNN, Neural Networks | Extreme | Because we divide by the range, extreme outliers squeeze the other values into a very narrow band | ➛ If data has outliers use RobustScaler or StandardScaler ➛ If you have negative numbers use MaxAbsScaler |
| Max Abs Scaling | Scales by the absolute maximum value. | ➛ Sparse data | Any distribution | Linear | [-1, 1] | ➛ Preserves sparsity ➛ Maintains (+/-) signs | ➛ Algorithms that work well with scaled but unshifted data: SVM | Extreme | They dictate the scaling range | ➛ If you need a normal distribution use standardization ➛ ❌ Does not center the data: avoid when you need zero-mean data |
| StandardScaler | Subtracts the mean, divides by the std dev. | ➛ Gaussian (Normal) distributed data ➛ Dense data (no sparsity) ➛ Features with similar variance | Gaussian/Normal preferred | Linear | Not fixed | ➛ Center data at mean = 0 with unit variance (std = 1) ➛ Make variance comparable among features ➛ Fair comparison across all features | ➛ Gradient-based algorithms: SVM, Logistic Regression, Linear Regression ➛ Dimensionality reduction: PCA ➛ Distance-based: KNN | Resilient (Moderate) | Preserves them but centers the rest; the outlier simply gets a very high z-score | ❌ Assumes normality: Works best with bell-curve data ❌ Not bounded: No set min/max values ❌ Destroys sparsity ⚠️ Not needed for: Decision Trees, Random Forests (scale-invariant) |
| RobustScaler | Subtracts the median, divides by the IQR. | ➛ Data with significant outliers ➛ Non-normal distribution ➛ Dense data | Any (outlier-heavy) | Linear | Not fixed | ➛ Scale data using statistics unaffected by extremes (median, IQR) ➛ Reduce the influence of outliers while keeping the shape | ➛ Any model where outliers are expected but should not dominate ➛ Works well with most ML algorithms | Highly Robust | Effectively ignores their "pull" by using median & IQR instead of mean & std | ⚠️ Doesn't normalize variance (unlike StandardScaler) ❌ May perform poorly on normally distributed data ❌ Inefficient for sparse data |
| Reciprocal (1/x) | Inverts each value: 1/x | ➛ Extreme right-skewed data ➛ Rates/ratios with inverse meaning ➛ Time-to-event data ➛ No zeros! | Extreme right-skew | Non Linear | (0, +∞) for positive inputs | ➛ Strongest compression for extreme skew ➛ Convert rates (e.g., mpg → gpm) ➛ Inverse relationships (distance/force) | ➛ Extreme outliers needing maximum compression ➛ When the inverse has physical meaning ➛ Survival analysis | No | Inverts the scale: large outliers become tiny values | ❌ Cannot handle zeros: Division by zero is undefined ❌ Reverses order: Largest becomes smallest (use -1/x to preserve order) ⚠️ Interpretation complexity: Reciprocal units confusing to stakeholders |
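To make the outlier-sensitivity comparison concrete, here is a small sketch contrasting MinMaxScaler and RobustScaler on a feature with one extreme outlier (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Four inliers and one extreme outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax = MinMaxScaler().fit_transform(X).ravel()
robust = RobustScaler().fit_transform(X).ravel()  # (x - median) / IQR

print(minmax)  # inliers squeezed into [0, ~0.03]; the outlier sits at 1.0
print(robust)  # inliers stay spread out; only the outlier maps to an extreme value
```

Because MinMaxScaler divides by the full range, the single outlier compresses all the inliers into a sliver of [0, 1], while RobustScaler's median/IQR statistics leave the inliers well spread.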