I. Feature Transformation Summary

Each technique below is summarized by the same fields: What it does · What the data looks like · Distribution pattern · Direction · Range · What do we want · Works best with · Sensitive to outliers · Effect of outliers · Alternates suggested / Cons.
Log Transformation
What it does: Applies log(X + c) to the data.
What the data looks like:
➛ Positive data only
➛ Highly skewed "long tail" data
➛ Data exhibiting exponential growth patterns
Distribution pattern: Right-skewed, exponential
Direction: Non-linear
Range: [0, +∞)
What do we want:
➛ Change the unit of measurement to a logarithmic scale
➛ Bring highly skewed data closer to a Gaussian (normal) distribution
➛ Compress large values
Works best with: Algorithms sensitive to absolute scale: Linear Regression, KNN, Gradient Boosting Models (GBMs)
Sensitive to outliers: No
Effect of outliers: Compresses them toward the mean.
Alternates suggested / Cons:
❌ Avoid if the data contains negative values
❌ Unnecessary if the data is already normally distributed
⚠️ May introduce bias if the data contains zeros or near-zero values
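A minimal sketch of the log transform with NumPy; the sample array is illustrative, and `log1p` fixes the shift constant at c = 1 so that zeros are handled safely:

```python
import numpy as np

# Right-skewed "long tail" sample: mostly small values plus a few huge ones.
x = np.array([0.0, 1.0, 2.0, 5.0, 8.0, 13.0, 210.0, 1500.0])

# log1p(x) = log(x + 1), i.e. log(X + c) with c = 1; zero stays at zero.
x_log = np.log1p(x)

print(x_log.round(2))  # large values compressed far more than small ones
```

Note that the transform is monotonic, so the ordering of values is preserved even though the gaps between large values shrink dramatically.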
Logit
What it does: Applies log(X / (1 − X)) (log-odds).
What the data looks like:
➛ Proportions/probabilities in (0, 1)
➛ Conversion rates, market share
➛ Model probability outputs
➛ Beta-distributed data
Distribution pattern: Bounded [0, 1], S-shaped
Direction: Non-linear
Range: (−∞, +∞)
What do we want:
➛ Unbound [0, 1] data onto the full real line
➛ Linearize sigmoid relationships
➛ Stabilize variance for proportions
Works best with:
➛ Model stacking: using probabilities as features
➛ Linear models with proportion inputs
➛ Beta regression
Sensitive to outliers: Extreme at the boundaries
Effect of outliers: Cannot handle exact 0 or 1 (undefined); needs epsilon clipping.
Alternates suggested / Cons:
❌ Undefined at 0 and 1: requires epsilon adjustment
❌ Only for [0, 1]-bounded data: wrong for continuous/count data
⚠️ Interpretation: log-odds units are unintuitive
Probit (Φ⁻¹)
What it does: Applies the inverse normal CDF: Φ⁻¹(X).
What the data looks like:
➛ Proportions/probabilities in (0, 1)
➛ Similar to Logit but assumes a normal distribution
➛ Dose-response data
Distribution pattern: Bounded [0, 1], normal-based
Direction: Non-linear
Range: (−∞, +∞)
What do we want:
➛ Alternative to Logit using Gaussian quantiles
➛ Unbound [0, 1] data assuming a normal latent variable
➛ Symmetric transformation
Works best with:
➛ Probit regression models
➛ Biostatistics: dose-response, toxicology
➛ Cases where the normality assumption is justified
Sensitive to outliers: Extreme at the boundaries
Effect of outliers: Same as Logit: undefined at exact 0 and 1; needs epsilon clipping.
Alternates suggested / Cons:
❌ Undefined at 0 and 1: requires epsilon adjustment
⚠️ Similar to Logit: practically interchangeable in most cases
⚠️ Use Logit for interpretability (odds ratios), Probit when the normality assumption holds
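A minimal probit sketch via `scipy.stats.norm.ppf`, which is Φ⁻¹; the sample probabilities are illustrative, and the same epsilon clipping as for the logit guards the boundaries:

```python
import numpy as np
from scipy.stats import norm

p = np.array([0.001, 0.25, 0.5, 0.75, 0.999])

eps = 1e-6
p_probit = norm.ppf(np.clip(p, eps, 1 - eps))  # inverse normal CDF, Φ⁻¹(p)

print(p_probit.round(3))  # symmetric around 0.5 -> 0
```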
Mean Normalization
What it does: Subtracts the mean, divides by the range (max − min).
What the data looks like:
➛ No outliers
➛ No sparsity
Distribution pattern: Any distribution
Direction: Linear
Range: [−1, 1] (approximately)
What do we want: Data centered at zero (mean = 0) within a strict range.
Works best with: Algorithms that prefer zero-centered data: Logistic Regression, Neural Networks
Sensitive to outliers: Extreme
Effect of outliers: Because we divide by the range, extreme outliers squeeze the remaining values into a very narrow band.
Alternates suggested / Cons:
❌ Destroys sparsity
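Mean normalization has no dedicated scikit-learn class, so a minimal NumPy sketch (the sample array is illustrative):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# (x - mean) / (max - min): centered at zero, values roughly within [-1, 1].
x_norm = (x - x.mean()) / (x.max() - x.min())

print(x_norm)  # mean of the result is 0
```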
Power Transformer
What it does: Maps data to a Gaussian distribution (Box-Cox or Yeo-Johnson).
What the data looks like:
➛ Skewed data (not normally distributed)
➛ Bimodal data (multiple peaks)
➛ Both positive and negative values (Yeo-Johnson)
Distribution pattern: Heavily skewed or bimodal
Direction: Non-linear
Range: Not fixed
What do we want:
➛ Force skewed data into a bell curve
➛ Fix skewness and stabilize variance
➛ Automatic tuning of the transformation parameter (λ)
Works best with:
➛ Linear models: Linear/Logistic Regression, LDA
➛ Models requiring normally distributed inputs
Sensitive to outliers: Yes
Effect of outliers: Corrects skew caused by outliers; automatically finds the optimal λ.
Alternates suggested / Cons:
❌ Computational cost: slower than StandardScaler
❌ Interpretability: transformed units are harder to explain
❌ Destroys sparsity
⚠️ Not needed for tree-based models
⚠️ Use a log transformation if the data has simple exponential skew
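A minimal sketch with scikit-learn's `PowerTransformer`; the synthetic skewed sample (shifted so it contains negative values, which forces Yeo-Johnson over Box-Cox) is illustrative:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Skewed data with negative values -> Yeo-Johnson (Box-Cox needs strictly positive input).
x = rng.exponential(scale=2.0, size=(500, 1)) - 0.5

pt = PowerTransformer(method="yeo-johnson")  # standardize=True by default
x_t = pt.fit_transform(x)

print("fitted lambda:", pt.lambdas_[0])  # the automatically tuned λ
```

With the default `standardize=True`, the output is also centered to mean 0 and unit variance.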
Quantile Transformer
What it does: Maps data to a uniform or normal distribution.
What the data looks like:
➛ Extreme outliers
➛ Complex/multimodal distributions
➛ Non-linear features
➛ High-dimensional data
Distribution pattern: Multimodal, complex
Direction: Non-linear
Range: [0, 1] (uniform output) or unbounded (normal output)
What do we want:
➛ Flatten the distribution using quantiles (percentiles)
➛ Collapse outliers into the distribution edges
➛ Force any data into a specific shape (uniform/normal)
Works best with:
➛ Neural networks with wildly different feature distributions
➛ Datasets dominated by extreme outliers
Sensitive to outliers: Highly robust
Effect of outliers: Collapses them into the distribution edges; the most outlier-immune scaler.
Alternates suggested / Cons:
❌ Linearity destruction: distorts linear relationships between features
❌ Information loss: ranking-based, so small differences are lost
⚠️ Sample size: needs >1000 samples for stable quantile estimates
❌ Not for linear regression or small datasets
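A minimal sketch with scikit-learn's `QuantileTransformer`; the heavy-tailed lognormal sample is illustrative:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=2.0, size=(2000, 1))  # heavy right tail

# Map to a normal output; extreme outliers are pushed into the distribution edges.
qt = QuantileTransformer(output_distribution="normal", n_quantiles=1000,
                         random_state=0)
x_t = qt.fit_transform(x)

print(x_t.min(), x_t.max())  # even huge outliers land at bounded edge values
```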
Square
What it does: Squares each value: X².
What the data looks like:
➛ Left-skewed data (clustered at high values)
➛ Data without extreme values
➛ Any real numbers (positive/negative)
Distribution pattern: Left-skewed
Direction: Non-linear
Range: [0, +∞)
What do we want:
➛ Amplify differences at the upper range
➛ Correct left skewness
➛ Create polynomial features for interaction effects
Works best with:
➛ Feature engineering for linear models
➛ Left skew that needs correction
➛ Reinforcement learning reward functions
Sensitive to outliers: Extreme
Effect of outliers: Magnifies them dramatically (squared effect).
Alternates suggested / Cons:
❌ Worsens right skew: catastrophic if applied to the wrong distribution
❌ Risk of overflow: large values become computationally problematic
⚠️ Interpretation difficulty: squared units lose intuitive meaning
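A minimal sketch of the square transform; the exam-score-like array (clustered at high values, i.e. left-skewed) is illustrative:

```python
import numpy as np

# Left-skewed sample: most values near the top of the scale.
x = np.array([55.0, 80.0, 88.0, 92.0, 95.0, 97.0, 99.0])

x_sq = x ** 2  # amplifies differences at the upper range

# The gap between the two largest values grows from 2 to 392 after squaring.
print(x_sq[-1] - x_sq[-2], "vs", x[-1] - x[-2])
```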
Square Root
What it does: Takes the square root: √X.
What the data looks like:
➛ Count data (Poisson-distributed)
➛ Moderately right-skewed data
➛ Non-negative values only (≥ 0)
Distribution pattern: Poisson, moderate right skew
Direction: Non-linear
Range: [0, +∞)
What do we want:
➛ Stabilize variance for count data
➛ Moderately compress right skew
➛ Preserve zero values (√0 = 0)
Works best with:
➛ Count data: click counts, frequencies, transactions
➛ Moderate skew correction without over-transforming
Sensitive to outliers: No
Effect of outliers: Compresses them moderately (gentler than log).
Alternates suggested / Cons:
❌ Requires non-negative values: cannot handle negative numbers
⚠️ Partial correction: may not fully normalize heavily skewed data
⚠️ Use Log or Reciprocal for extreme skew
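A minimal sketch of the square-root transform on count-like data (the array of click counts is illustrative):

```python
import numpy as np

# Poisson-like count data, e.g. clicks per session, with one large count.
counts = np.array([0.0, 1.0, 2.0, 4.0, 9.0, 16.0, 100.0])

counts_sqrt = np.sqrt(counts)  # zero stays zero; large counts compressed gently

print(counts_sqrt)  # 100 shrinks to 10, a milder compression than log
```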
Exponential (eˣ)
What it does: Raises e to the power of X: eˣ.
What the data looks like:
➛ Left-skewed data
➛ Negative values
➛ Log-transformed data needing reversal
Distribution pattern: Left-skewed or log-scaled
Direction: Non-linear
Range: (0, +∞)
What do we want:
➛ Reverse a log transformation
➛ Amplify positive values exponentially
➛ Convert an additive scale to a multiplicative one
Works best with:
➛ Inverse of log: reversing log-transformed predictions
➛ Time series: exponential growth modeling
Sensitive to outliers: Extreme
Effect of outliers: Amplifies them exponentially (creates massive outliers).
Alternates suggested / Cons:
❌ Output is only positive: cannot produce negative results
❌ Extreme amplification: small input changes create huge output changes
⚠️ Use sparingly: typically only for reversing log transformations
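A minimal sketch of the most common use, reversing a log transform; the predicted values are illustrative (if the forward transform was `log1p`, the inverse would be `np.expm1` instead):

```python
import numpy as np

# A model fitted on log-transformed targets predicts on the log scale...
y_log_pred = np.array([0.0, 1.0, 2.3])

# ...so exponentiate to return the predictions to the original scale.
y_pred = np.exp(y_log_pred)

print(y_pred)  # log(0) scale -> 1.0 on the original scale, etc.
```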
Reciprocal (1/X)
What it does: Inverts values: 1/X.
What the data looks like:
➛ Extremely right-skewed data
➛ Rates/ratios with an inverse meaning
➛ Time-to-event data
➛ No zeros!
Distribution pattern: Extreme right skew
Direction: Non-linear
Range: (0, +∞)
What do we want:
➛ Strongest compression for extreme skew
➛ Convert rates (e.g., mpg → gpm)
➛ Model inverse relationships (distance/force)
Works best with:
➛ Extreme outliers needing maximum compression
➛ Cases where the inverse has physical meaning
➛ Survival analysis
Sensitive to outliers: No
Effect of outliers: Inverts the scale: large outliers become tiny values.
Alternates suggested / Cons:
❌ Cannot handle zeros: division by zero is undefined
❌ Reverses order: the largest value becomes the smallest (use −1/X to preserve ordering)
⚠️ Interpretation complexity: reciprocal units can confuse stakeholders
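A minimal sketch of the reciprocal transform using the mpg → gpm example from the notes (the values are illustrative), showing both the plain 1/X and the order-preserving −1/X variant:

```python
import numpy as np

# Fuel efficiency in miles per gallon -> gallons per mile.
mpg = np.array([12.0, 20.0, 25.0, 40.0, 50.0])

gpm = 1.0 / mpg           # reverses order: the largest mpg becomes the smallest gpm
gpm_ordered = -1.0 / mpg  # -1/X keeps the original ordering intact

print(gpm.round(3))
```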
Polynomial
What it does: Creates powers (X², X³) and interactions (X₁X₂).
What the data looks like:
➛ Data with clear non-linear trends (U-shapes, S-curves)
➛ Features whose effect changes with magnitude
➛ Any real numbers (positive/negative)
Distribution pattern: Any (used to model curves)
Direction: Non-linear
Range: Depends on the degree
What do we want:
➛ Model non-linear relationships with linear models
➛ Capture interaction effects between features
➛ Correct underfitting from overly simple models
Works best with:
➛ Linear models: Linear/Logistic Regression, SVMs
➛ Cases where model interpretability is important
Sensitive to outliers: Extreme
Effect of outliers: Magnifies them polynomially (e.g., outlier²).
Alternates suggested / Cons:
❌ High risk of overfitting, especially with high degrees
❌ Creates multicollinearity: X and X² are correlated
⚠️ Requires scaling for regularized or distance-based models
❌ Feature explosion: the number of features grows rapidly with degree

II. Feature Scaling Summary

Each technique below is summarized by the same fields: What it does · What the data looks like · Distribution pattern · Direction · Range · What do we want · Works best with · Sensitive to outliers · Effect of outliers · Alternates suggested / Cons.
StandardScaler
What it does: Subtracts the mean, divides by the standard deviation.
What the data looks like:
➛ Gaussian (normally) distributed data
➛ Dense data (no sparsity)
➛ Features with similar variance
Distribution pattern: Gaussian/normal preferred
Direction: Linear
Range: Not fixed
What do we want:
➛ Center data at mean = 0 with unit variance (std = 1)
➛ Make variance comparable across features
➛ Fair comparison across all features
Works best with:
➛ Gradient-based algorithms: SVM, Logistic Regression, Linear Regression
➛ Dimensionality reduction: PCA
➛ Distance-based: KNN
Sensitive to outliers: Resilient (moderate)
Effect of outliers: Preserves them but centers the rest; an outlier simply ends up with a very high z-score.
Alternates suggested / Cons:
❌ Assumes normality: works best with bell-curve data
❌ Not bounded: no set min/max values
❌ Destroys sparsity
⚠️ Not needed for Decision Trees or Random Forests (scale-invariant)
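A minimal sketch with scikit-learn's `StandardScaler` (the single-column input is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

scaler = StandardScaler()          # z = (x - mean) / std
X_scaled = scaler.fit_transform(X)

print(X_scaled.ravel())  # centered at 0 with unit variance
```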
RobustScaler
What it does: Subtracts the median, divides by the IQR.
What the data looks like:
➛ Data with significant outliers
➛ Non-normal distributions
➛ Dense data
Distribution pattern: Any (outlier-heavy)
Direction: Linear
Range: Not fixed
What do we want:
➛ Scale data using statistics unaffected by extremes (median, IQR)
➛ Reduce the influence of outliers while keeping the shape
Works best with:
➛ Any model where outliers are expected but should not dominate
➛ Most ML algorithms in general
Sensitive to outliers: Highly robust
Effect of outliers: Effectively ignores their "pull" by using the median and IQR instead of the mean and std.
Alternates suggested / Cons:
⚠️ Does not normalize variance (unlike StandardScaler)
❌ May perform poorly on normally distributed data
❌ Inefficient for sparse data
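A minimal sketch with scikit-learn's `RobustScaler`; the sample deliberately includes one extreme outlier to show that the median and IQR ignore its pull:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one extreme outlier

scaler = RobustScaler()  # (x - median) / IQR
X_scaled = scaler.fit_transform(X)

# The median (3.0) maps to 0; the inliers land near 0 regardless of the outlier.
print(X_scaled.ravel())
```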
MinMaxScaler
What it does: Subtracts the min, divides by the range.
What the data looks like:
➛ Features bound to a fixed range
➛ Sparse data and positive data only
Distribution pattern: Any distribution
Direction: Linear
Range: [0, 1]
What do we want:
➛ Preservation: preserve relative distances between points
➛ Uniformity: all features on exactly the same scale
Works best with:
➛ Algorithms highly dependent on distance: KNN, Neural Networks, SVM
➛ Algorithms that don't assume a distribution: KNN, Neural Networks
Sensitive to outliers: Extreme
Effect of outliers: Because we divide by the range, extreme outliers squeeze the other values into a very narrow band.
Alternates suggested / Cons:
➛ If the data has outliers, use RobustScaler or StandardScaler
➛ If you have negative numbers, use MaxAbsScaler
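A minimal sketch with scikit-learn's `MinMaxScaler` (the single-column input is illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])

scaler = MinMaxScaler()  # (x - min) / (max - min) -> [0, 1]
X_scaled = scaler.fit_transform(X)

print(X_scaled.ravel())  # relative spacing between points is preserved
```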
MaxAbsScaler
What it does: Scales by the absolute maximum value.
What the data looks like:
➛ Sparse data
Distribution pattern: Any distribution
Direction: Linear
Range: [−1, 1]
What do we want:
➛ Preserve sparsity
➛ Maintain (+/−) signs
Works best with: Algorithms that work well when data is centered: SVM
Sensitive to outliers: Extreme
Effect of outliers: They dictate the scaling range.
Alternates suggested / Cons:
➛ If you need a normal distribution, use standardization (StandardScaler)
❌ Does not center the data: avoid when you need zero-mean data
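A minimal sketch with scikit-learn's `MaxAbsScaler`; the signed sample is illustrative and shows that zeros and signs survive the transform:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[-4.0], [0.0], [2.0], [8.0]])  # signed, sparse-friendly data

scaler = MaxAbsScaler()  # x / max(|x|) -> [-1, 1]
X_scaled = scaler.fit_transform(X)

print(X_scaled.ravel())  # zeros stay zero, signs are preserved
```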