I. Feature Transformation Summary

Each technique below is summarized by the same fields: What it does · What the data looks like · Distribution pattern · Direction · Range · What do we want · Works best with · Sensitive to outliers · Effect of outliers · Alternates suggested / Cons.
Log Transformation
What it does: Applies log(X + c) to the data.
What the data looks like:
➛ Positive data only
➛ Highly skewed "long tail" data
➛ Data exhibiting exponential growth patterns
Distribution pattern: Right-skewed, exponential
Direction: Non-linear
Range: [0, +∞)
What do we want:
➛ Change the unit of measurement to a logarithmic scale
➛ Bring highly skewed data closer to a Gaussian (normal) distribution
➛ Compress large values
Works best with: Algorithms sensitive to absolute scale: Linear Regression, KNN, Gradient Boosting Models (GBMs)
Sensitive to outliers: No
Effect of outliers: Compresses them toward the mean.
Alternates suggested / Cons:
❌ Avoid if the data contains negative values
❌ Unnecessary if the data is already normally distributed
⚠️ May introduce bias if the data contains zeros or near-zero values
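A minimal sketch of the log transform with NumPy; the sample array is illustrative, and `log1p` fixes the shift constant at c = 1 so that zeros are handled safely:

```python
import numpy as np

# Right-skewed "long tail" sample: mostly small values plus a few huge ones.
x = np.array([0.0, 1.0, 2.0, 5.0, 8.0, 13.0, 210.0, 1500.0])

# log1p(x) = log(x + 1), i.e. log(X + c) with c = 1; zero stays at zero.
x_log = np.log1p(x)

print(x_log.round(2))  # large values compressed far more than small ones
```

Note that the transform is monotonic, so the ordering of values is preserved even though the gaps between large values shrink dramatically.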
Logit
What it does: Applies log(X / (1 − X)) (log-odds).
What the data looks like:
➛ Proportions/probabilities in (0, 1)
➛ Conversion rates, market share
➛ Model probability outputs
➛ Beta-distributed data
Distribution pattern: Bounded [0, 1], S-shaped
Direction: Non-linear
Range: (−∞, +∞)
What do we want:
➛ Unbound [0, 1] data onto the full real line
➛ Linearize sigmoid relationships
➛ Stabilize variance for proportions
Works best with:
➛ Model stacking: using probabilities as features
➛ Linear models with proportion inputs
➛ Beta regression
Sensitive to outliers: Extreme at the boundaries
Effect of outliers: Cannot handle exact 0 or 1 (undefined); needs epsilon clipping.
Alternates suggested / Cons:
❌ Undefined at 0 and 1: requires epsilon adjustment
❌ Only for [0, 1]-bounded data: wrong for continuous/count data
⚠️ Interpretation: log-odds units are unintuitive
Probit (Φ⁻¹)
What it does: Applies the inverse normal CDF: Φ⁻¹(X).
What the data looks like:
➛ Proportions/probabilities in (0, 1)
➛ Similar to Logit but assumes a normal distribution
➛ Dose-response data
Distribution pattern: Bounded [0, 1], normal-based
Direction: Non-linear
Range: (−∞, +∞)
What do we want:
➛ Alternative to Logit using Gaussian quantiles
➛ Unbound [0, 1] data assuming a normal latent variable
➛ Symmetric transformation
Works best with:
➛ Probit regression models
➛ Biostatistics: dose-response, toxicology
➛ Cases where the normality assumption is justified
Sensitive to outliers: Extreme at the boundaries
Effect of outliers: Same as Logit: undefined at exact 0 and 1; needs epsilon clipping.
Alternates suggested / Cons:
❌ Undefined at 0 and 1: requires epsilon adjustment
⚠️ Similar to Logit: practically interchangeable in most cases
⚠️ Use Logit for interpretability (odds ratios), Probit when the normality assumption holds
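A minimal probit sketch via `scipy.stats.norm.ppf`, which is Φ⁻¹; the sample probabilities are illustrative, and the same epsilon clipping as for the logit guards the boundaries:

```python
import numpy as np
from scipy.stats import norm

p = np.array([0.001, 0.25, 0.5, 0.75, 0.999])

eps = 1e-6
p_probit = norm.ppf(np.clip(p, eps, 1 - eps))  # inverse normal CDF, Φ⁻¹(p)

print(p_probit.round(3))  # symmetric around 0.5 -> 0
```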
Mean Normalization
What it does: Subtracts the mean, divides by the range (max − min).
What the data looks like:
➛ No outliers
➛ No sparsity
Distribution pattern: Any distribution
Direction: Linear
Range: [−1, 1] (approximately)
What do we want: Data centered at zero (mean = 0) within a strict range.
Works best with: Algorithms that prefer zero-centered data: Logistic Regression, Neural Networks
Sensitive to outliers: Extreme
Effect of outliers: Because we divide by the range, extreme outliers squeeze the remaining values into a very narrow band.
Alternates suggested / Cons:
❌ Destroys sparsity
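Mean normalization has no dedicated scikit-learn class, so a minimal NumPy sketch (the sample array is illustrative):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# (x - mean) / (max - min): centered at zero, values roughly within [-1, 1].
x_norm = (x - x.mean()) / (x.max() - x.min())

print(x_norm)  # mean of the result is 0
```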
Power Transformer
What it does: Maps data to a Gaussian distribution (Box-Cox or Yeo-Johnson).
What the data looks like:
➛ Skewed data (not normally distributed)
➛ Bimodal data (multiple peaks)
➛ Both positive and negative values (Yeo-Johnson)
Distribution pattern: Heavily skewed or bimodal
Direction: Non-linear
Range: Not fixed
What do we want:
➛ Force skewed data into a bell curve
➛ Fix skewness and stabilize variance
➛ Automatic tuning of the transformation parameter (λ)
Works best with:
➛ Linear models: Linear/Logistic Regression, LDA
➛ Models requiring normally distributed inputs
Sensitive to outliers: Yes
Effect of outliers: Corrects skew caused by outliers; automatically finds the optimal λ.
Alternates suggested / Cons:
❌ Computational cost: slower than StandardScaler
❌ Interpretability: transformed units are harder to explain
❌ Destroys sparsity
⚠️ Not needed for tree-based models
⚠️ Use a log transformation if the data has simple exponential skew
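A minimal sketch with scikit-learn's `PowerTransformer`; the synthetic skewed sample (shifted so it contains negative values, which forces Yeo-Johnson over Box-Cox) is illustrative:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Skewed data with negative values -> Yeo-Johnson (Box-Cox needs strictly positive input).
x = rng.exponential(scale=2.0, size=(500, 1)) - 0.5

pt = PowerTransformer(method="yeo-johnson")  # standardize=True by default
x_t = pt.fit_transform(x)

print("fitted lambda:", pt.lambdas_[0])  # the automatically tuned λ
```

With the default `standardize=True`, the output is also centered to mean 0 and unit variance.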
Quantile Transformer
What it does: Maps data to a uniform or normal distribution.
What the data looks like:
➛ Extreme outliers
➛ Complex/multimodal distributions
➛ Non-linear features
➛ High-dimensional data
Distribution pattern: Multimodal, complex
Direction: Non-linear
Range: [0, 1] (uniform output) or unbounded (normal output)
What do we want:
➛ Flatten the distribution using quantiles (percentiles)
➛ Collapse outliers into the distribution edges
➛ Force any data into a specific shape (uniform/normal)
Works best with:
➛ Neural networks with wildly different feature distributions
➛ Datasets dominated by extreme outliers
Sensitive to outliers: Highly robust
Effect of outliers: Collapses them into the distribution edges; the most outlier-immune scaler.
Alternates suggested / Cons:
❌ Linearity destruction: distorts linear relationships between features
❌ Information loss: ranking-based, so small differences are lost
⚠️ Sample size: needs >1000 samples for stable quantile estimates
❌ Not for linear regression or small datasets
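A minimal sketch with scikit-learn's `QuantileTransformer`; the heavy-tailed lognormal sample is illustrative:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=2.0, size=(2000, 1))  # heavy right tail

# Map to a normal output; extreme outliers are pushed into the distribution edges.
qt = QuantileTransformer(output_distribution="normal", n_quantiles=1000,
                         random_state=0)
x_t = qt.fit_transform(x)

print(x_t.min(), x_t.max())  # even huge outliers land at bounded edge values
```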
Square
What it does: Squares each value: X².
What the data looks like:
➛ Left-skewed data (clustered at high values)
➛ Data without extreme values
➛ Any real numbers (positive/negative)
Distribution pattern: Left-skewed
Direction: Non-linear
Range: [0, +∞)
What do we want:
➛ Amplify differences at the upper range
➛ Correct left skewness
➛ Create polynomial features for interaction effects
Works best with:
➛ Feature engineering for linear models
➛ Left skew that needs correction
➛ Reinforcement learning reward functions
Sensitive to outliers: Extreme
Effect of outliers: Magnifies them dramatically (squared effect).
Alternates suggested / Cons:
❌ Worsens right skew: catastrophic if applied to the wrong distribution
❌ Risk of overflow: large values become computationally problematic
⚠️ Interpretation difficulty: squared units lose intuitive meaning
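A minimal sketch of the square transform; the exam-score-like array (clustered at high values, i.e. left-skewed) is illustrative:

```python
import numpy as np

# Left-skewed sample: most values near the top of the scale.
x = np.array([55.0, 80.0, 88.0, 92.0, 95.0, 97.0, 99.0])

x_sq = x ** 2  # amplifies differences at the upper range

# The gap between the two largest values grows from 2 to 392 after squaring.
print(x_sq[-1] - x_sq[-2], "vs", x[-1] - x[-2])
```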
Square Root
What it does: Takes the square root: √X.
What the data looks like:
➛ Count data (Poisson-distributed)
➛ Moderately right-skewed data
➛ Non-negative values only (≥ 0)
Distribution pattern: Poisson, moderate right skew
Direction: Non-linear
Range: [0, +∞)
What do we want:
➛ Stabilize variance for count data
➛ Moderately compress right skew
➛ Preserve zero values (√0 = 0)
Works best with:
➛ Count data: click counts, frequencies, transactions
➛ Moderate skew correction without over-transforming
Sensitive to outliers: No
Effect of outliers: Compresses them moderately (gentler than log).
Alternates suggested / Cons:
❌ Requires non-negative values: cannot handle negative numbers
⚠️ Partial correction: may not fully normalize heavily skewed data
⚠️ Use Log or Reciprocal for extreme skew
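A minimal sketch of the square-root transform on count-like data (the array of click counts is illustrative):

```python
import numpy as np

# Poisson-like count data, e.g. clicks per session, with one large count.
counts = np.array([0.0, 1.0, 2.0, 4.0, 9.0, 16.0, 100.0])

counts_sqrt = np.sqrt(counts)  # zero stays zero; large counts compressed gently

print(counts_sqrt)  # 100 shrinks to 10, a milder compression than log
```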
Exponential (eˣ)
What it does: Raises e to the power of X: eˣ.
What the data looks like:
➛ Left-skewed data
➛ Negative values
➛ Log-transformed data needing reversal
Distribution pattern: Left-skewed or log-scaled
Direction: Non-linear
Range: (0, +∞)
What do we want:
➛ Reverse a log transformation
➛ Amplify positive values exponentially
➛ Convert an additive scale to a multiplicative one
Works best with:
➛ Inverse of log: reversing log-transformed predictions
➛ Time series: exponential growth modeling
Sensitive to outliers: Extreme
Effect of outliers: Amplifies them exponentially (creates massive outliers).
Alternates suggested / Cons:
❌ Output is only positive: cannot produce negative results
❌ Extreme amplification: small input changes create huge output changes
⚠️ Use sparingly: typically only for reversing log transformations
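A minimal sketch of the most common use, reversing a log transform; the predicted values are illustrative (if the forward transform was `log1p`, the inverse would be `np.expm1` instead):

```python
import numpy as np

# A model fitted on log-transformed targets predicts on the log scale...
y_log_pred = np.array([0.0, 1.0, 2.3])

# ...so exponentiate to return the predictions to the original scale.
y_pred = np.exp(y_log_pred)

print(y_pred)  # log(0) scale -> 1.0 on the original scale, etc.
```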
Reciprocal (1/X)
What it does: Inverts values: 1/X.
What the data looks like:
➛ Extremely right-skewed data
➛ Rates/ratios with an inverse meaning
➛ Time-to-event data
➛ No zeros!
Distribution pattern: Extreme right skew
Direction: Non-linear
Range: (0, +∞)
What do we want:
➛ Strongest compression for extreme skew
➛ Convert rates (e.g., mpg → gpm)
➛ Model inverse relationships (distance/force)
Works best with:
➛ Extreme outliers needing maximum compression
➛ Cases where the inverse has physical meaning
➛ Survival analysis
Sensitive to outliers: No
Effect of outliers: Inverts the scale: large outliers become tiny values.
Alternates suggested / Cons:
❌ Cannot handle zeros: division by zero is undefined
❌ Reverses order: the largest value becomes the smallest (use −1/X to preserve ordering)
⚠️ Interpretation complexity: reciprocal units can confuse stakeholders
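A minimal sketch of the reciprocal transform using the mpg → gpm example from the notes (the values are illustrative), showing both the plain 1/X and the order-preserving −1/X variant:

```python
import numpy as np

# Fuel efficiency in miles per gallon -> gallons per mile.
mpg = np.array([12.0, 20.0, 25.0, 40.0, 50.0])

gpm = 1.0 / mpg           # reverses order: the largest mpg becomes the smallest gpm
gpm_ordered = -1.0 / mpg  # -1/X keeps the original ordering intact

print(gpm.round(3))
```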
Polynomial
What it does: Creates powers (X², X³) and interactions (X₁X₂).
What the data looks like:
➛ Data with clear non-linear trends (U-shapes, S-curves)
➛ Features whose effect changes with magnitude
➛ Any real numbers (positive/negative)
Distribution pattern: Any (used to model curves)
Direction: Non-linear
Range: Depends on the degree
What do we want:
➛ Model non-linear relationships with linear models
➛ Capture interaction effects between features
➛ Correct underfitting from overly simple models
Works best with:
➛ Linear models: Linear/Logistic Regression, SVMs
➛ Cases where model interpretability is important
Sensitive to outliers: Extreme
Effect of outliers: Magnifies them polynomially (e.g., outlier²).
Alternates suggested / Cons:
❌ High risk of overfitting, especially with high degrees
❌ Creates multicollinearity: X and X² are correlated
⚠️ Requires scaling for regularized or distance-based models
❌ Feature explosion: the number of features grows rapidly with degree

II. Feature Scaling Summary

Each technique below is summarized by the same fields: What it does · What the data looks like · Distribution pattern · Direction · Range · What do we want · Works best with · Sensitive to outliers · Effect of outliers · Alternates suggested / Cons.
StandardScaler
What it does: Subtracts the mean, divides by the standard deviation.
What the data looks like:
➛ Gaussian (normally) distributed data
➛ Dense data (no sparsity)
➛ Features with similar variance
Distribution pattern: Gaussian/normal preferred
Direction: Linear
Range: Not fixed
What do we want:
➛ Center data at mean = 0 with unit variance (std = 1)
➛ Make variance comparable across features
➛ Fair comparison across all features
Works best with:
➛ Gradient-based algorithms: SVM, Logistic Regression, Linear Regression
➛ Dimensionality reduction: PCA
➛ Distance-based: KNN
Sensitive to outliers: Resilient (moderate)
Effect of outliers: Preserves them but centers the rest; an outlier simply ends up with a very high z-score.
Alternates suggested / Cons:
❌ Assumes normality: works best with bell-curve data
❌ Not bounded: no set min/max values
❌ Destroys sparsity
⚠️ Not needed for Decision Trees or Random Forests (scale-invariant)
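A minimal sketch with scikit-learn's `StandardScaler` (the single-column input is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

scaler = StandardScaler()          # z = (x - mean) / std
X_scaled = scaler.fit_transform(X)

print(X_scaled.ravel())  # centered at 0 with unit variance
```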
RobustScaler
What it does: Subtracts the median, divides by the IQR.
What the data looks like:
➛ Data with significant outliers
➛ Non-normal distributions
➛ Dense data
Distribution pattern: Any (outlier-heavy)
Direction: Linear
Range: Not fixed
What do we want:
➛ Scale data using statistics unaffected by extremes (median, IQR)
➛ Reduce the influence of outliers while keeping the shape
Works best with:
➛ Any model where outliers are expected but should not dominate
➛ Most ML algorithms in general
Sensitive to outliers: Highly robust
Effect of outliers: Effectively ignores their "pull" by using the median and IQR instead of the mean and std.
Alternates suggested / Cons:
⚠️ Does not normalize variance (unlike StandardScaler)
❌ May perform poorly on normally distributed data
❌ Inefficient for sparse data
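A minimal sketch with scikit-learn's `RobustScaler`; the sample deliberately includes one extreme outlier to show that the median and IQR ignore its pull:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one extreme outlier

scaler = RobustScaler()  # (x - median) / IQR
X_scaled = scaler.fit_transform(X)

# The median (3.0) maps to 0; the inliers land near 0 regardless of the outlier.
print(X_scaled.ravel())
```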
MinMaxScaler
What it does: Subtracts the min, divides by the range.
What the data looks like:
➛ Features bound to a fixed range
➛ Sparse data and positive data only
Distribution pattern: Any distribution
Direction: Linear
Range: [0, 1]
What do we want:
➛ Preservation: preserve relative distances between points
➛ Uniformity: all features on exactly the same scale
Works best with:
➛ Algorithms highly dependent on distance: KNN, Neural Networks, SVM
➛ Algorithms that don't assume a distribution: KNN, Neural Networks
Sensitive to outliers: Extreme
Effect of outliers: Because we divide by the range, extreme outliers squeeze the other values into a very narrow band.
Alternates suggested / Cons:
➛ If the data has outliers, use RobustScaler or StandardScaler
➛ If you have negative numbers, use MaxAbsScaler
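A minimal sketch with scikit-learn's `MinMaxScaler` (the single-column input is illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])

scaler = MinMaxScaler()  # (x - min) / (max - min) -> [0, 1]
X_scaled = scaler.fit_transform(X)

print(X_scaled.ravel())  # relative spacing between points is preserved
```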
MaxAbsScaler
What it does: Scales by the absolute maximum value.
What the data looks like:
➛ Sparse data
Distribution pattern: Any distribution
Direction: Linear
Range: [−1, 1]
What do we want:
➛ Preserve sparsity
➛ Maintain (+/−) signs
Works best with: Algorithms that work well when data is centered: SVM
Sensitive to outliers: Extreme
Effect of outliers: They dictate the scaling range.
Alternates suggested / Cons:
➛ If you need a normal distribution, use standardization (StandardScaler)
❌ Does not center the data: avoid when you need zero-mean data
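A minimal sketch with scikit-learn's `MaxAbsScaler`; the signed sample is illustrative and shows that zeros and signs survive the transform:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[-4.0], [0.0], [2.0], [8.0]])  # signed, sparse-friendly data

scaler = MaxAbsScaler()  # x / max(|x|) -> [-1, 1]
X_scaled = scaler.fit_transform(X)

print(X_scaled.ravel())  # zeros stay zero, signs are preserved
```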