I. Feature Transformation Summary
| Techniques | What it does? | What data looks like? | Data Distribution Pattern | Direction | Range | What do we want? | Works Best with | Sensitive to Outliers | Effects of Outliers | Alternates Suggested/Cons |
|---|---|---|---|---|---|---|---|---|---|---|
| Log Transformation | Applies the natural logarithm: log(x) (or log1p(x) to tolerate zeros). | ➛ Positive data only ➛ Highly skewed "long tail" data ➛ Data exhibiting exponential growth patterns | Right-skewed, exponential | Non Linear | (-∞, +∞) | ➛ Change the unit of measurement to a logarithmic scale ➛ Turn highly skewed data closer to a Gaussian (Normal) distribution ➛ Compress large values | ➛ Algorithms sensitive to absolute scale: Linear Regression, KNN, and Gradient Boosting Models (GBMs) | No | Compresses them toward the mean. | ❌ Avoid if data contains negative values ❌ Avoid if data is already normally distributed ⚠️ May introduce bias if the data contains zeros or near-zero values |
| Logit | Applies the log-odds: log(x / (1 - x)). | ➛ Proportions/probabilities (0 to 1) ➛ Conversion rates, market share ➛ Model probability outputs ➛ Beta-distributed data | Bounded [0,1], S-shaped | Non Linear | (-∞, +∞) | ➛ Unbound [0,1] data to the full real line ➛ Linearize sigmoid relationships ➛ Variance stabilization for proportions | ➛ Model stacking: using probabilities as features ➛ Linear models with proportion inputs ➛ Beta regression | Extreme at boundaries | Cannot handle exact 0 or 1 (undefined); needs epsilon clipping | ❌ Undefined at 0 and 1: requires epsilon adjustment ❌ Only for [0,1] bounded data: wrong for continuous/count data ⚠️ Interpretation: log-odds units are unintuitive |
| Probit (Φ⁻¹) | Applies the inverse normal CDF: Φ⁻¹(x). | ➛ Proportions/probabilities (0 to 1) ➛ Similar to Logit but assumes a normal distribution ➛ Dose-response data | Bounded [0,1], Normal-based | Non Linear | (-∞, +∞) | ➛ Alternative to Logit using Gaussian quantiles ➛ Unbound [0,1] data assuming a normal latent variable ➛ Symmetric transformation | ➛ Probit regression models ➛ Biostatistics: dose-response, toxicology ➛ When the normal assumption is justified | Extreme at boundaries | Same as Logit: undefined at exact 0 and 1 | ❌ Undefined at 0 and 1: needs epsilon clipping ⚠️ Similar to Logit: practically interchangeable in most cases ⚠️ Use Logit for interpretability (odds ratios), Probit for the normal assumption |
| Mean Normalization | Subtracts the mean, divides by the range: (x - mean) / (max - min). | ➛ No outliers ➛ No sparsity | Any distribution | Linear | [-1, 1] (approx.) | ➛ Data centered at zero (mean = 0) within a strict range | ➛ Algorithms that prefer zero-centered data: Logistic Regression, Neural Networks | Extreme | Extreme outliers inflate the range, squeezing the remaining values into a narrow band. | ❌ Destroys sparsity |
| Power Transformer | Maps data toward a Gaussian distribution (Box-Cox or Yeo-Johnson). | ➛ Skewed data (not normally distributed) ➛ Bimodal data (multiple peaks) ➛ Handles both positive & negative values (Yeo-Johnson) | Heavily skewed or bimodal | Non Linear | Not fixed | ➛ Force skewed data into a bell curve ➛ Fix skewness & stabilize variance ➛ Automatic tuning of the transformation parameter (λ) | ➛ Linear models: Linear/Logistic Regression, LDA ➛ Models requiring normally distributed inputs | Yes | Corrects skew caused by outliers; automatically finds the optimal λ. | ❌ Computational cost: slower than StandardScaler ❌ Interpretability: transformed units are harder to explain ⚠️ Not needed for tree-based models ❌ Destroys sparsity ⚠️ Use a log transformation if the data is a simple exponential skew |
| Quantile Transformer | Maps data to a uniform or normal distribution. | ➛ Extreme outliers ➛ Complex/multimodal distributions ➛ Non-linear features ➛ High-dimensional data | Multimodal, complex | Non Linear | [0, 1] or Normal | ➛ Flatten the distribution using quantiles (percentiles) ➛ Collapse outliers into the distribution edges ➛ Force any data into a specific shape (uniform/normal) | ➛ Neural networks with wildly different feature distributions ➛ When extreme outliers dominate the dataset | Highly Robust | Collapses them into the distribution edges; the most outlier-immune scaler. | ❌ Linearity destruction: distorts linear relationships between features ❌ Information loss: rank-based, loses small differences ⚠️ Sample size: needs >1000 samples for stable estimates ❌ Not for linear regression or small datasets |
| Square | Squares each value: x². | ➛ Left-skewed data (clustered at high values) ➛ Data without extreme values ➛ All real numbers (positive/negative) | Left-skewed | Non Linear | [0, +∞) | ➛ Amplify differences at the upper range ➛ Correct left skewness ➛ Create polynomial features for interaction effects | ➛ Feature engineering for linear models ➛ When left skew needs correction ➛ Reinforcement learning reward functions | Extreme | Magnifies them dramatically (squared effect). | ❌ Worsens right skew: catastrophic if applied to the wrong distribution ❌ Risk of overflow: large values become computationally problematic ⚠️ Interpretation difficulty: squared units lose intuitive meaning |
| Square Root | Takes the square root: √x. | ➛ Count data (Poisson distributed) ➛ Moderately right-skewed data ➛ Positive values only (≥ 0) | Poisson, moderate right-skew | Non Linear | [0, +∞) | ➛ Variance stabilization for count data ➛ Moderate compression of right skew ➛ Preserves zero values (√0 = 0) | ➛ Count data: click counts, frequencies, transactions ➛ Moderate skew correction without over-transforming | No | Compresses them moderately (gentler than log). | ❌ Requires non-negative values: cannot handle negative numbers ⚠️ Partial correction: may not fully normalize heavily skewed data ⚠️ Use log or reciprocal for extreme skew |
| Exponential (eˣ) | Raises e to the power of x: eˣ. | ➛ Left-skewed data ➛ Negative values ➛ Log-transformed data needing reversal | Left-skewed or log-scaled | Non Linear | (0, +∞) | ➛ Reverse a log transformation ➛ Amplify positive values exponentially ➛ Convert an additive scale to a multiplicative one | ➛ Inverse of log: when reversing log-transformed predictions ➛ Time series: exponential growth modeling | Extreme | Amplifies them exponentially (creates massive outliers). | ❌ Output only positive: cannot produce negative results ❌ Extreme amplification: small input changes create huge output changes ⚠️ Use sparingly: typically for reversing log transformations |
| Reciprocal (1/x) | Inverts each value: 1/x. | ➛ Extremely right-skewed data ➛ Rates/ratios with inverse meaning ➛ Time-to-event data ➛ No zeros! | Extreme right-skew | Non Linear | Not fixed (excludes 0) | ➛ Strongest compression for extreme skew ➛ Convert rates (e.g., mpg → gpm) ➛ Inverse relationships (distance/force) | ➛ Extreme outliers needing maximum compression ➛ When the inverse has physical meaning ➛ Survival analysis | No | Inverts the scale: large outliers become tiny values. | ❌ Cannot handle zeros: division by zero is undefined ❌ Reverses order: largest becomes smallest (use -1/x to preserve order) ⚠️ Interpretation complexity: reciprocal units confuse stakeholders |
| Polynomial | Creates powers (x², x³, …) and interaction terms. | ➛ Data with clear non-linear trends (U-shapes, S-curves) ➛ Features whose effect changes with magnitude ➛ Any real numbers (positive/negative) | Any (used to model curves) | Non Linear | Depends on degree | ➛ Model non-linear relationships with linear models ➛ Capture interaction effects between features ➛ Correct underfitting from simple models | ➛ Linear models: Linear/Logistic Regression, SVMs ➛ When model interpretability is important | Extreme | Magnifies them polynomially (e.g., outlier²). | ❌ High risk of overfitting, especially with high degrees ❌ Creates multicollinearity ⚠️ Requires scaling for regularized or distance-based models ❌ Feature explosion: the number of features grows rapidly |
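As a sketch of how several of the transforms above behave in practice, the snippet below applies log, square root, logit (with epsilon clipping), PowerTransformer, and QuantileTransformer to synthetic skewed data and reports the skewness of each result. It assumes NumPy, SciPy, and scikit-learn are available; the data and parameters are illustrative, not prescriptive.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(seed=42)

# Heavily right-skewed, strictly positive "long tail" data (log-normal).
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

# Log and square root compress large values; log1p also tolerates zeros.
x_log = np.log1p(x)
x_sqrt = np.sqrt(x)

# Logit needs [0, 1] inputs and epsilon clipping to avoid log(0).
p = rng.beta(2, 5, size=5000)
eps = 1e-6
p_clipped = np.clip(p, eps, 1 - eps)
p_logit = np.log(p_clipped / (1 - p_clipped))

# PowerTransformer (Yeo-Johnson) learns lambda automatically;
# QuantileTransformer forces the data toward a normal shape via ranks.
x_col = x.reshape(-1, 1)
x_pow = PowerTransformer(method="yeo-johnson").fit_transform(x_col)
x_qt = QuantileTransformer(output_distribution="normal",
                           n_quantiles=1000,
                           random_state=0).fit_transform(x_col)

for name, arr in [("raw", x), ("log1p", x_log), ("sqrt", x_sqrt),
                  ("yeo-johnson", x_pow.ravel()), ("quantile", x_qt.ravel())]:
    print(f"{name:12s} skew = {skew(arr):+.2f}")
```

Running this shows the pattern the table describes: log and square root reduce the raw skew, while the Yeo-Johnson and quantile transforms drive it close to zero automatically.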
II. Feature Scaling Summary
| Techniques | What it does? | What data looks like? | Data Distribution Pattern | Direction | Range | What do we want? | Works Best with | Sensitive to Outliers | Effects of Outliers | Alternates Suggested/Cons |
|---|---|---|---|---|---|---|---|---|---|---|
| StandardScaler | Subtracts the mean, divides by the standard deviation: (x - mean) / std. | ➛ Gaussian (Normal) distributed data ➛ Dense data (no sparsity) ➛ Features with similar variance | Gaussian/Normal preferred | Linear | Not fixed | ➛ Center data at mean = 0 with unit variance (std = 1) ➛ Make variance comparable among features ➛ Fair comparison across all features | ➛ Gradient-based algorithms: SVM, Logistic Regression, Linear Regression ➛ Dimensionality reduction: PCA ➛ Distance-based: KNN | Moderately resilient | Preserves them but centers the rest; an outlier simply gets a very high Z-score. | ❌ Assumes normality: works best with bell-curve data ❌ Not bounded: no set min/max values ❌ Destroys sparsity ⚠️ Not needed for Decision Trees or Random Forests (scale-invariant) |
| RobustScaler | Subtracts the median, divides by the IQR: (x - median) / IQR. | ➛ Data with significant outliers ➛ Non-normal distribution ➛ Dense data | Any (outlier-heavy) | Linear | Not fixed | ➛ Scale data using statistics unaffected by extremes (median, IQR) ➛ Reduce the influence of outliers while keeping the shape | ➛ Any model where outliers are expected but should not dominate ➛ Works well with most ML algorithms | Highly Robust | Effectively ignores their "pull" by using median & IQR instead of mean & std. | ⚠️ Doesn't normalize variance (unlike StandardScaler) ❌ May perform poorly on normally distributed data ❌ Inefficient for sparse data |
| MinMaxScaler | Subtracts the min, divides by the range: (x - min) / (max - min). | ➛ Features bound to a fixed range ➛ Sparse data and positive data only | Any distribution | Linear | [0, 1] | ➛ Preservation: preserve relative distances between points ➛ Uniformity: all features share the exact same scale | ➛ Distance-dependent algorithms: KNN, Neural Networks, SVM ➛ Algorithms that don't assume a distribution | Extreme | Because we divide by the range, extreme outliers squeeze the other values into a very narrow band. | ➛ If data has outliers, use RobustScaler or StandardScaler ➛ If you have negative numbers, use MaxAbsScaler |
| MaxAbsScaler | Divides each value by the maximum absolute value: x / max(abs(x)). | ➛ Sparse data | Any distribution | Linear | [-1, 1] | ➛ Preserves sparsity ➛ Maintains (+/-) signs | ➛ Algorithms that work well when data is centered: SVM | Extreme | They dictate the scaling range. | ➛ If you need a normal distribution, use standardization ❌ Do not use when you need zero-mean data |
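The outlier behavior summarized in this table can be seen side by side with a minimal sketch, assuming scikit-learn is available and using a small hypothetical feature column with one extreme outlier:

```python
import numpy as np
from sklearn.preprocessing import (StandardScaler, RobustScaler,
                                   MinMaxScaler, MaxAbsScaler)

# One feature, five samples; 100.0 is the extreme outlier (illustrative values).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scalers = {
    "StandardScaler": StandardScaler(),  # (x - mean) / std
    "RobustScaler":   RobustScaler(),    # (x - median) / IQR
    "MinMaxScaler":   MinMaxScaler(),    # (x - min) / (max - min) -> [0, 1]
    "MaxAbsScaler":   MaxAbsScaler(),    # x / max(abs(x)) -> [-1, 1]
}

for name, scaler in scalers.items():
    scaled = scaler.fit_transform(x).ravel()
    print(f"{name:14s} {np.round(scaled, 3)}")

# MinMaxScaler squeezes the four inliers below ~0.04 because the outlier
# dictates the range; RobustScaler keeps them spread out around the median.
mm = MinMaxScaler().fit_transform(x).ravel()
rb = RobustScaler().fit_transform(x).ravel()
```

Printing the four results makes the trade-off concrete: the range-based scalers (MinMax, MaxAbs) let the single outlier compress every other value, while RobustScaler, built on the median and IQR, leaves the inliers well separated.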