Feature Scaling
Scaling is the process of transforming the numerical values of your features so they fall within a specific range.
- The Problem: When features are on vastly different scales, such as one feature ranging from 1 to 10 and another from 1,000 to 10,000, models like KNN or SVM can prioritize the larger values, leading to biased predictions, poor model performance, and slower convergence during training.
- The Goal: To ensure every feature contributes equally to the model's decision-making process.
- Feature scaling addresses these issues by adjusting the range of the data without distorting differences in the values.
- The two most common methods:
- Normalization
- Standardization
Normalization and Standardization
- Both involve transforming the values of features in a dataset to a similar scale, ensuring that all features contribute equally to the model's learning process.
I. Normalization
- Normalization refers to the process of adjusting values measured on different scales to a common scale.
★ Normalization is most effective in the following scenarios
- Unknown or Non-Gaussian Distribution: When the distribution of the data is not known or does not follow a normal (Gaussian) pattern. For example, in linear regression, we may want to normalize the dependent variable so it looks more like a bell curve, which allows for better confidence in our estimates.
- Distance-Based Algorithms: Normalization is needed when using machine learning algorithms that rely on distances between data points, such as k-Nearest Neighbors (kNN), to prevent features with larger scales from dominating the distance calculations.
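As a concrete sketch of min-max normalization for the distance-based case, using scikit-learn (the feature values here are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales: age vs. annual income
X = np.array([[25,  50_000.0],
              [32,  64_000.0],
              [47, 120_000.0],
              [51,  98_000.0]])

scaler = MinMaxScaler()           # default feature_range=(0, 1)
X_norm = scaler.fit_transform(X)  # per column: (x - min) / (max - min)

print(X_norm.min(axis=0))  # [0. 0.] -- each column now starts at 0
print(X_norm.max(axis=0))  # [1. 1.] -- and ends at 1
```

After this transform, both features contribute on the same [0, 1] scale, so neither dominates a Euclidean distance in kNN.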
II. Standardization
- Standardization, which is also called z-score scaling, transforms data to have a mean of 0 and a standard deviation of 1.
★ When should you standardize data?
- Gradient-based Algorithms: Support Vector Machines (SVM) require standardized data for optimal performance. While models like linear regression and logistic regression do not assume standardized inputs, they may still benefit from standardization, particularly when features vary widely in magnitude, since it ensures balanced contributions from each feature and improves optimization.
- Dimensionality Reduction: Standardization is essential in dimensionality reduction techniques like PCA, because PCA identifies the directions in which the variance of the data is maximized. Mean normalization alone is not sufficient: PCA considers both the mean and the variance, so features on different scales would distort the analysis.
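A minimal sketch of standardization with scikit-learn's `StandardScaler` (synthetic data; the scales are chosen to mimic features with very different magnitudes):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# One feature in the thousands, one near zero
X = np.column_stack([rng.normal(5_000, 800, size=200),
                     rng.normal(0.5, 0.1, size=200)])

X_std = StandardScaler().fit_transform(X)  # z = (x - mean) / std

print(X_std.mean(axis=0))  # effectively [0, 0] -- zero mean per feature
print(X_std.std(axis=0))   # effectively [1, 1] -- unit variance per feature
```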
Why Normal Distribution Matters in Machine Learning
When is Normality Important?
Normal distribution (Gaussian distribution) is a fundamental assumption in many statistical and machine learning methods. Understanding whether your data follows a normal distribution helps you:
- Choose appropriate models: Some algorithms assume normally distributed features
- Apply correct transformations: Non-normal data may need transformation (log, Box-Cox, etc.)
- Validate statistical tests: Many hypothesis tests require normality
- Interpret results correctly: Normality affects confidence intervals and predictions
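As a sketch of the "apply correct transformations" point, here are the log and Box-Cox transforms from SciPy applied to synthetic right-skewed data (the log-normal sample is made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # positive, right-skewed

log_t = np.log(skewed)                # simple log transform
boxcox_t, lam = stats.boxcox(skewed)  # Box-Cox estimates lambda by max likelihood

print(f"skew before:   {stats.skew(skewed):.2f}")    # strongly positive
print(f"skew (log):    {stats.skew(log_t):.2f}")     # near 0
print(f"skew (boxcox): {stats.skew(boxcox_t):.2f}")  # near 0; lambda ~ 0 here
```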
Models That Benefit from Normally Distributed Data
Models that ASSUME normality
- Linear Regression: Assumes residuals (errors) are normally distributed
- Logistic Regression: Works better with normally distributed features
- Linear Discriminant Analysis (LDA): Assumes features are normally distributed within each class
- Quadratic Discriminant Analysis (QDA): Similar to LDA but allows different covariance matrices
- Naive Bayes (Gaussian): Explicitly assumes features follow a Gaussian distribution
- T-tests and ANOVA: Statistical tests that require normality
Models that DON'T require normality
- Tree-based models: Decision Trees, Random Forests, Gradient Boosting (XGBoost, LightGBM)
- K-Nearest Neighbors (KNN): Distance-based, distribution-agnostic
- Support Vector Machines (SVM): Kernel-based methods don't assume normality
- Neural Networks: Can learn complex non-linear patterns without normality assumption
Note: Even when models don't strictly require normality, normalizing/standardizing features often improves convergence and performance, especially for gradient-based optimization.
Determining if a numeric feature is Gaussian (normally) distributed involves a combination of visual inspection and statistical tests. Here's a comprehensive guide:
Normal Distribution Test
I. Visual Inspection Methods
- Histograms with KDE
- Q-Q Plot
- Box Plot
- Probability Density Function (PDF) Overlay
1. Histograms with KDE (Kernel Density Estimate)
- Plot a histogram of the feature with a KDE overlay
- A Gaussian distribution will appear as a bell-shaped curve, symmetrical around the mean
- Look for skewness (asymmetry) or multiple peaks (bimodal/multimodal), which indicate non-normality
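A minimal histogram-plus-KDE sketch with Matplotlib and SciPy (synthetic normal data; the file name `hist_kde.png` is arbitrary):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=10, scale=2, size=1000)

fig, ax = plt.subplots()
ax.hist(data, bins=30, density=True, alpha=0.5, label="histogram")

kde = stats.gaussian_kde(data)  # smooth density estimate
xs = np.linspace(data.min(), data.max(), 200)
ax.plot(xs, kde(xs), label="KDE")
ax.legend()
fig.savefig("hist_kde.png")

# A single central peak near the mean is one visual hint of normality
peak_x = xs[np.argmax(kde(xs))]
print(f"KDE peak at x = {peak_x:.2f}")  # close to the true mean of 10
```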
2. Q-Q Plots (Quantile-Quantile Plots)
- Compares the quantiles of your data to the quantiles of a theoretical normal distribution
- Most reliable visual test for normality
- If the data is normally distributed, points fall approximately along a straight diagonal line
- Deviations from the line indicate non-normality
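`scipy.stats.probplot` draws the Q-Q plot and also returns the correlation `r` of the points against the reference line, which gives a rough numeric read on how straight the plot is (synthetic samples for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(7)
normal_data = rng.normal(size=500)
skewed_data = rng.exponential(size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# probplot returns (osm, osr), (slope, intercept, r); r measures line fit
_, (_, _, r_normal) = stats.probplot(normal_data, dist="norm", plot=ax1)
_, (_, _, r_skewed) = stats.probplot(skewed_data, dist="norm", plot=ax2)
fig.savefig("qq_plots.png")

print(f"r (normal data): {r_normal:.4f}")  # very close to 1
print(f"r (skewed data): {r_skewed:.4f}")  # lower: the tails bend off the line
```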
3. Box Plots
- Shows the median, quartiles, and potential outliers
- While not a direct test for normality, reveals skewness and outliers
- Symmetric box with median in center suggests normality
4. Probability Density Function (PDF) Overlay
- Overlay actual data distribution with theoretical normal distribution
- Visual comparison shows how closely data matches normal curve
★ PDF's Interpretation
- Close overlap = data is approximately normal
- Visible differences = data deviates from normality
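One way to sketch the PDF overlay (synthetic standard-normal data; the gap check at the end is just an informal numeric companion to the visual comparison):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(loc=0, scale=1, size=2000)

mu, sigma = data.mean(), data.std()
xs = np.linspace(data.min(), data.max(), 200)

fig, ax = plt.subplots()
ax.hist(data, bins=40, density=True, alpha=0.5, label="data")
ax.plot(xs, stats.norm.pdf(xs, mu, sigma), label="fitted normal PDF")
ax.legend()
fig.savefig("pdf_overlay.png")

# Rough numeric check of the overlap: compare histogram heights to the PDF
hist, edges = np.histogram(data, bins=40, density=True)
centers = (edges[:-1] + edges[1:]) / 2
max_gap = np.abs(hist - stats.norm.pdf(centers, mu, sigma)).max()
print(f"largest histogram-vs-PDF gap: {max_gap:.3f}")  # small = close overlap
```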
II. Statistical Tests
- Skewness and Kurtosis
- Kolmogorov-Smirnov Test
- D'Agostino-Pearson Test
- Jarque-Bera Test
- Shapiro-Wilk Test
- Anderson-Darling Test
All statistical tests for normality use the same general interpretation for p-values:
- p-value > 0.05: Fail to reject normality; the data is consistent with a normal distribution ✓
- p-value ≤ 0.05: Reject normality; the data is likely NOT normally distributed ✗
1. Skewness and Kurtosis
- Skewness measures the asymmetry of the distribution.
- Kurtosis measures the "tailedness" of the distribution.
- For a perfect normal distribution, skewness is 0 and kurtosis is 3 (equivalently, excess kurtosis is 0, which is what many libraries report).
- Significant deviations from these values indicate non-normality.
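With SciPy (synthetic samples for illustration), note that `scipy.stats.kurtosis` reports excess kurtosis by default, so pass `fisher=False` to get the convention where a normal distribution scores 3:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
normal_data = rng.normal(size=5000)
skewed_data = rng.exponential(size=5000)

print(stats.skew(normal_data))                    # ~0
print(stats.kurtosis(normal_data, fisher=False))  # ~3 ("plain" kurtosis)
print(stats.kurtosis(normal_data))                # ~0 (excess kurtosis, the default)
print(stats.skew(skewed_data))                    # clearly positive: right-skewed
```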
2. Kolmogorov-Smirnov Test (K-S Test)
- Compares the cumulative distribution function of your data to the cumulative distribution function of a normal distribution.
- It's more general than the Shapiro-Wilk test and can be used for larger sample sizes.
- It also returns a p-value, and the null hypothesis of normality is rejected if the p-value is less than the significance level.
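A sketch with `scipy.stats.kstest`. One practical gotcha: the K-S test compares against a fully specified distribution, so either pass the parameters explicitly or standardize the data first; estimating the parameters from the same sample makes the test only approximate (the Lilliefors correction addresses this):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
data = rng.normal(loc=50, scale=5, size=1000)

# Standardize so the sample is comparable to the standard normal "norm"
z = (data - data.mean()) / data.std()
stat, p = stats.kstest(z, "norm")

print(f"K-S statistic = {stat:.4f}, p-value = {p:.4f}")
```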
3. D'Agostino-Pearson Test (Omnibus Test)
- Combines skewness and kurtosis to assess normality.
- It's a good general-purpose test for normality.
- It also returns a p-value.
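In SciPy this test is exposed as `scipy.stats.normaltest` (synthetic samples for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
normal_data = rng.normal(size=2000)
skewed_data = rng.exponential(size=2000)

# normaltest combines skewness and kurtosis (D'Agostino-Pearson omnibus test)
stat_n, p_n = stats.normaltest(normal_data)
stat_s, p_s = stats.normaltest(skewed_data)

print(f"normal sample: p = {p_n:.4f}")
print(f"skewed sample: p = {p_s:.2e}")  # tiny: normality clearly rejected
```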
4. Jarque-Bera Test
- Similar to D'Agostino-Pearson, tests based on skewness and kurtosis
- Commonly used in econometrics and time series analysis
- Works well for large samples (n > 2000)
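A sketch with `scipy.stats.jarque_bera` on a heavy-tailed sample, the kind of fat-tailed data (e.g. financial returns) this test is often applied to in econometrics:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(17)
# Student's t with 3 degrees of freedom: symmetric but heavy-tailed
returns = rng.standard_t(df=3, size=5000)

stat, p = stats.jarque_bera(returns)
print(f"JB statistic = {stat:.1f}, p-value = {p:.2e}")  # heavy tails -> rejected
```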
5. Shapiro-Wilk Test
- A powerful test for normality, especially for smaller sample sizes (typically < 5000).
- It calculates a test statistic (W) and a p-value.
- If the p-value is less than the significance level (e.g., 0.05), you reject the null hypothesis of normality.
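With `scipy.stats.shapiro` on small samples (synthetic data; 0.05 is the conventional significance level):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(19)
small_normal = rng.normal(size=80)
small_skewed = rng.exponential(size=80)

w1, p1 = stats.shapiro(small_normal)
w2, p2 = stats.shapiro(small_skewed)

print(f"normal sample: W = {w1:.4f}, p = {p1:.4f}")
print(f"skewed sample: W = {w2:.4f}, p = {p2:.2e}")  # rejects normality
```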
6. Anderson-Darling Test
- More sensitive than K-S test, especially in the tails of the distribution
- Gives more weight to extreme values
- Provides critical values for different significance levels
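`scipy.stats.anderson` returns the A² statistic plus a table of critical values rather than a p-value, so the decision is made per significance level (synthetic data for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(23)
data = rng.normal(size=1000)

result = stats.anderson(data, dist="norm")
print(f"A^2 statistic: {result.statistic:.3f}")
for cv, sig in zip(result.critical_values, result.significance_level):
    verdict = "reject" if result.statistic > cv else "fail to reject"
    print(f"  at {sig:>4}% significance: critical value {cv:.3f} -> {verdict}")
```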
Summary
I. Normalization vs Standardization
| Category | Normalization | Standardization |
|---|---|---|
| Rescaling Method | Scales data to a fixed range (such as 0 to 1 or -1 to 1) based on the minimum and maximum values. | Centers data around the mean (0) and scales it by the standard deviation (1). |
| Sensitivity to outliers | Highly sensitive to outliers since min and max are affected by extreme values. | Less sensitive to outliers. |
| Common Algorithms | Often applied in algorithms like k-NN and neural networks that require data to be on a consistent scale. | Best suited for algorithms that require features to have a common scale, such as LinearRegression, LogisticRegression, SVM, and PCA. |
| Suitability for Data | Suitable for data that does not follow a Gaussian distribution and when a bounded range is necessary. | More suitable for data with a Gaussian distribution or when maintaining zero-centered data is important. |
| Impact on Shape of Data | It may alter the shape of the data distribution, especially if there are significant outliers. | Maintains the shape of the original data distribution but aligns it to a standard scale. |
| Dependency on Distribution | Does not assume any distribution of the data. | Assumes the distribution of data is normal. |
II. Quick Summary of all Transformation and Scaling techniques
| Techniques | What it does? | What data looks like? | Data Distribution Pattern | Direction | Range | What do we want? | Works Best with | Sensitive to Outliers | Effects of Outliers | Alternates Suggested/Cons |
|---|---|---|---|---|---|---|---|---|---|---|
| Log Transformation | Applies log(x) to each value. | ➛ Positive data only ➛ Highly skewed "long tail" data ➛ Data exhibiting exponential growth patterns | Right-skewed, exponential | Non Linear | (-∞, +∞) | ➛ Changes the unit of measurement to a logarithmic scale ➛ Turns highly skewed data closer to a Gaussian (Normal) distribution ➛ Compresses large values | ➛ Algorithms sensitive to absolute scale: Linear Regression, KNN and Gradient Boosting Models (GBMs) | No | Compresses them toward the mean. | ❌ Avoid if data contains negative values ❌ Avoid if data is already normally distributed ⚠️ May introduce bias if the data contains zeros or near-zero values (log(1 + x) is a common workaround) |
| Logit | Applies the log-odds: logit(p) = ln(p / (1 - p)). | ➛ Proportions/Probabilities (0 to 1) ➛ Conversion rates, market share ➛ Model probability outputs ➛ Beta-distributed data | Bounded [0,1], S-shaped | Non Linear | (-∞, +∞) | ➛ Unbound [0,1] data to the full real line ➛ Linearize sigmoid relationships ➛ Variance stabilization for proportions | ➛ Model stacking: Using probabilities as features ➛ Linear models with proportion inputs ➛ Beta regression | Extreme at boundaries | Cannot handle exact 0 or 1 (undefined); needs epsilon clipping | ❌ Undefined at 0 and 1: Requires epsilon adjustment ❌ Only for [0,1] bounded data: Wrong for continuous/count data ⚠️ Interpretation: Log-odds units unintuitive |
| Probit (Φ⁻¹) | Applies the inverse normal CDF: probit(p) = Φ⁻¹(p). | ➛ Proportions/Probabilities (0 to 1) ➛ Similar to Logit but assumes normal distribution ➛ Dose-response data | Bounded [0,1], Normal-based | Non Linear | (-∞, +∞) | ➛ Alternative to Logit using Gaussian quantiles ➛ Unbound [0,1] data assuming a normal latent variable ➛ Symmetric transformation | ➛ Probit regression models ➛ Biostatistics: Dose-response, toxicology ➛ When the normal assumption is justified | Extreme at boundaries | Same as Logit: undefined at exact 0 and 1 | ❌ Undefined at 0 and 1: Needs epsilon clipping ⚠️ Similar to Logit: Practically interchangeable in most cases ⚠️ Use Logit for interpretability (odds ratios), Probit for the normal assumption |
| Mean normalization | Subtracts the mean, divides by the range. | ➛ No outliers ➛ No sparsity | Any distribution | Linear | [-1, 1] (approx) | ➛ Data centered at zero (mean = 0) within a strict range | ➛ Algorithms that prefer zero-centered data: Logistic Regression, Neural Networks | Extreme | Since it divides by the range, extreme outliers squeeze the other values into a narrow band | ❌ Destroys sparsity |
| Power Transformer | Maps data to a Gaussian distribution (Box-Cox, Yeo-Johnson). | ➛ Skewed data (not normally distributed) ➛ Bimodal data (multiple peaks) ➛ Handles both positive & negative values (Yeo-Johnson) | Heavily skewed or bimodal | Non Linear | Not fixed | ➛ Force skewed data into a bell curve ➛ Fix skewness & stabilize variance ➛ Automatic tuning of the transformation parameter (λ) | ➛ Linear models: Linear/Logistic Regression, LDA ➛ Models requiring normally distributed inputs | Yes | Corrects skew caused by outliers; automatically finds the optimal λ | ❌ Computational cost: Slower than StandardScaler ❌ Interpretability: Transformed units harder to explain ⚠️ Not needed for tree-based models ❌ Destroys sparsity ⚠️ Use a log transform if the data is a simple exponential skew |
| Quantile Transformer | Maps data to a uniform or normal distribution. | ➛ Extreme outliers ➛ Complex/multimodal distributions ➛ Non-linear features ➛ High-dimensional data | Multimodal, complex | Non Linear | [0, 1] or Normal | ➛ Flatten the distribution using quantiles (percentiles) ➛ Collapse outliers into the distribution edges ➛ Force any data into a specific shape (uniform/normal) | ➛ Neural Networks with wildly different feature distributions ➛ When extreme outliers dominate the dataset | Highly Robust | Collapses them into the distribution edges; the most outlier-immune scaler | ❌ Linearity destruction: Distorts linear relationships between features ❌ Information loss: Ranking-based, loses small differences ⚠️ Sample size: Needs >1000 samples for stable estimates ❌ Not for: Linear Regression, small datasets |
| Square | Squares each value: x² | ➛ Left-skewed data (clustered at high values) ➛ Data without extreme values ➛ All real numbers (positive/negative) | Left-skewed | Non Linear | [0, +∞) | ➛ Amplify differences at the upper range ➛ Correct left skewness ➛ Create polynomial features for interaction effects | ➛ Feature engineering for linear models ➛ When left skew needs correction ➛ Reinforcement learning reward functions | Extreme | Magnifies them dramatically (squared effect) | ❌ Worsens right skew: Catastrophic if applied to the wrong distribution ❌ Risk of overflow: Large values become computationally problematic ⚠️ Interpretation difficulty: Squared units lose intuitive meaning |
| Square Root | Takes the square root: √x | ➛ Count data (Poisson distributed) ➛ Moderate right-skewed data ➛ Positive values only (≥ 0) | Poisson, moderate right-skew | Non Linear | [0, +∞) | ➛ Variance stabilization for count data ➛ Moderate compression of right skew ➛ Preserves zero values (√0 = 0) | ➛ Count data: Click counts, frequencies, transactions ➛ Moderate skew correction without over-transforming | No | Compresses them moderately (gentler than log) | ❌ Requires non-negative values: Cannot handle negative numbers ⚠️ Partial correction: May not fully normalize heavily skewed data ⚠️ Use log or reciprocal for extreme skew |
| Exponential (eˣ) | Raises e to the power of x: eˣ | ➛ Left-skewed data ➛ Negative values ➛ Log-transformed data needing reversal | Left-skewed or log-scaled | Non Linear | (0, +∞) | ➛ Reverse a log transformation ➛ Amplify positive values exponentially ➛ Convert an additive to a multiplicative scale | ➛ Inverse of log: When reversing log-transformed predictions ➛ Time series: Exponential growth modeling | Extreme | Amplifies them exponentially (creates massive outliers) | ❌ Output only positive: Cannot produce negative results ❌ Extreme amplification: Small input changes create huge output changes ⚠️ Use sparingly: Typically for reversing log transformations |
| MinMaxScaler | Subtracts the min, divides by the range. | ➛ Features bound to a fixed range ➛ Sparse and positive-only data | Any distribution | Linear | [0, 1] | ➛ Preservation: Preserve relative distances between points ➛ Uniformity: All features get the exact same scale | ➛ Algorithms highly dependent on distance: KNN, Neural Networks, SVM ➛ Algorithms that don't assume a distribution: KNN, Neural Networks | Extreme | Because we divide by the range, extreme outliers squeeze the other values into a very narrow band | ➛ If data has outliers use RobustScaler or StandardScaler ➛ If you have negative numbers use MaxAbsScaler |
| Max Abs Scaling | Scales by the absolute maximum value. | ➛ Sparse data | Any distribution | Linear | [-1, 1] | ➛ Preserves sparsity ➛ Maintains (+/-) signs | ➛ Algorithms that work well with scaled but unshifted data: SVM | Extreme | They dictate the scaling range | ➛ If you need a normal distribution use standardization ➛ ❌ Does not center the data: avoid when you need zero-mean data |
| StandardScaler | Subtracts the mean, divides by the std dev. | ➛ Gaussian (Normal) distributed data ➛ Dense data (no sparsity) ➛ Features with similar variance | Gaussian/Normal preferred | Linear | Not fixed | ➛ Center data at mean = 0 with unit variance (std = 1) ➛ Make variance comparable among features ➛ Fair comparison across all features | ➛ Gradient-based algorithms: SVM, Logistic Regression, Linear Regression ➛ Dimensionality reduction: PCA ➛ Distance-based: KNN | Resilient (Moderate) | Preserves them but centers the rest; the outlier simply gets a very high z-score | ❌ Assumes normality: Works best with bell-curve data ❌ Not bounded: No set min/max values ❌ Destroys sparsity ⚠️ Not needed for: Decision Trees, Random Forests (scale-invariant) |
| RobustScaler | Subtracts the median, divides by the IQR. | ➛ Data with significant outliers ➛ Non-normal distribution ➛ Dense data | Any (outlier-heavy) | Linear | Not fixed | ➛ Scale data using statistics unaffected by extremes (median, IQR) ➛ Reduce the influence of outliers while keeping the shape | ➛ Any model where outliers are expected but should not dominate ➛ Works well with most ML algorithms | Highly Robust | Effectively ignores their "pull" by using median & IQR instead of mean & std | ⚠️ Doesn't normalize variance (unlike StandardScaler) ❌ May perform poorly on normally distributed data ❌ Inefficient for sparse data |
| Reciprocal (1/x) | Inverts each value: 1/x | ➛ Extreme right-skewed data ➛ Rates/ratios with inverse meaning ➛ Time-to-event data ➛ No zeros! | Extreme right-skew | Non Linear | (0, +∞) for positive inputs | ➛ Strongest compression for extreme skew ➛ Convert rates (e.g., mpg → gpm) ➛ Inverse relationships (distance/force) | ➛ Extreme outliers needing maximum compression ➛ When the inverse has physical meaning ➛ Survival analysis | No | Inverts the scale: large outliers become tiny values | ❌ Cannot handle zeros: Division by zero is undefined ❌ Reverses order: Largest becomes smallest (use -1/x to preserve order) ⚠️ Interpretation complexity: Reciprocal units confusing to stakeholders |
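To make the outlier-sensitivity comparison concrete, here is a small sketch contrasting MinMaxScaler and RobustScaler on a feature with one extreme outlier (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Four inliers and one extreme outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax = MinMaxScaler().fit_transform(X).ravel()
robust = RobustScaler().fit_transform(X).ravel()  # (x - median) / IQR

print(minmax)  # inliers squeezed into [0, ~0.03]; the outlier sits at 1.0
print(robust)  # inliers stay spread out; only the outlier maps to an extreme value
```

Because MinMaxScaler divides by the full range, the single outlier compresses all the inliers into a sliver of [0, 1], while RobustScaler's median/IQR statistics leave the inliers well spread.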