Feature Transformation vs Feature Scaling
Feature transformation and feature scaling are both data preprocessing techniques used in machine learning to improve model performance, but they serve different purposes.
I. Feature Transformation
- Purpose: Feature transformation is the process of modifying the distribution or structure of features to make them more suitable for a machine learning model.
- Why?: To make data more normal-like, e.g. for Linear Regression, which performs better when features and residuals are roughly Gaussian.
- Used when:
- Feature transformation is used when your data is Skewed, has Outliers, or follows a Non-Gaussian distribution (S.O.N).
- Data needs a new representation.
- Effect on shape of data: Alters the shape of the data distribution.
Common Feature Transformation Techniques
- Log Transformation
- Logit Transformation
- Power Transformer
- Yeo-Johnson Transformation
- Box-Cox Transformation
- Quantile Transformer
- Square Root Transformation (√x)
- Square Transformation (x²)
- Reciprocal Transformation (1/x)
- Polynomial Transformation
- Exponential
- Probit
- Column Transformation (new representation)
- OneHotEncoding
- DummyEncoding
- EffectEncoding
- LabelEncoding
- OrdinalEncoding
- CountEncoding
- BinaryEncoding
Refer to the Decision Tree for Transformation Selection (in the Quick Reference Guide below) to decide which transformation is most applicable in your case.
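As a concrete sketch of the techniques above (using scikit-learn and NumPy; the sample data and column values here are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, OneHotEncoder

rng = np.random.default_rng(0)

# Right-skewed positive feature (e.g. incomes): log compresses the long tail.
x = rng.lognormal(mean=3.0, sigma=1.0, size=(1000, 1))
x_log = np.log(x)  # log transformation

# Yeo-Johnson also handles zeros/negatives and auto-fits its lambda;
# by default it standardizes the output (mean 0, std 1).
pt = PowerTransformer(method="yeo-johnson")
x_yj = pt.fit_transform(x)

# Column transformation: a new representation for a categorical feature.
colors = np.array([["red"], ["green"], ["blue"], ["red"]])
ohe = OneHotEncoder(handle_unknown="ignore")
colors_ohe = ohe.fit_transform(colors).toarray()  # one column per category
```

Note that `PowerTransformer` chooses between Box-Cox and Yeo-Johnson via the `method` argument; Box-Cox requires strictly positive inputs.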
II. Feature Scaling
- Purpose: Feature scaling ensures that all features have the same scale or range, preventing models from being biased toward features with large values.
- Why?: Ensuring equal feature importance in SVM, KNN, Neural Networks
- Used when: Different features have varying ranges and need to be brought onto a uniform scale.
- Effect on shape of data: Does not alter the shape of the data distribution.
Common Feature Scaling Techniques
1. Normalization
2. Standardization
- StandardScaler (Standardization / Z-score Normalization)
- RobustScaler (Median & IQR-based Scaling)
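A minimal comparison of these scalers (scikit-learn; the toy column, with one deliberate outlier, is invented):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

# Standardization (z-score): mean 0, std 1; the outlier still dominates.
X_std = StandardScaler().fit_transform(X)

# Normalization (min-max): squashes everything into [0, 1].
X_mm = MinMaxScaler().fit_transform(X)

# Robust scaling: centers on the median and divides by the IQR,
# so the outlier barely distorts the bulk of the data.
X_rob = RobustScaler().fit_transform(X)
```

Because all three are linear maps, the relative shape of the distribution is preserved, which is exactly the "does not alter the shape" property noted above.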
➢ Quick Reference Guide
I. When to Transform vs. When to Scale
| Scenario | Action |
|---|---|
| Data is skewed | Transform first (Log, Box-Cox) then scale |
| Data is Gaussian but different scales | Scale only (StandardScaler) |
| Data has outliers | Transform (robust methods) or use RobustScaler |
| Tree-based models (RF, XGBoost) | Neither needed (optional) |
| Neural Networks | Transform if skewed + MinMaxScaler |
| Linear/Logistic Regression | Transform if skewed + StandardScaler |
| SVM, KNN | Must scale (StandardScaler or RobustScaler) |
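For example, the "transform first, then scale" rows of the table can be wired as a single scikit-learn pipeline (the regression data below is synthetic, generated purely for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.lognormal(size=(200, 3))  # skewed positive features
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=200)

model = make_pipeline(
    FunctionTransformer(np.log1p),  # transform first (log1p handles zeros)
    StandardScaler(),               # then scale
    LinearRegression(),
)
model.fit(X, y)
score = model.score(X, y)  # R^2 on the training data
```

Keeping both steps inside the pipeline ensures the transformation and scaling parameters are fit on training data only, avoiding leakage under cross-validation.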
II. Decision Tree for Transformation Selection
```mermaid
flowchart TD
    A["Is your data bounded between 0 and 1<br/>(proportions / probabilities)?"]
    A -- Yes --> B["Does data contain exact 0 or 1?"]
    B -- No --> B1["Need odds ratios interpretation?"]
    B1 -- Yes --> B1A["Use LOGIT<br/>(log-odds transformation)"]
    B1 -- No --> B1B["Use PROBIT<br/>(inverse normal CDF)<br/>Assumes normal latent variable"]
    B -- Yes --> B2["Use PowerTransformer<br/>(handles boundaries)"]
    A -- No --> A1["Is this log-transformed data<br/>needing reversal?"]
    A1 -- Yes --> A1A["Use EXPONENTIAL (e^X)<br/>(reverse log transformation)"]
    A1 -- No --> C["Is your data COUNT data?<br/>(discrete: 0,1,2,3...)"]
    C -- Yes --> D["Use SQUARE ROOT (√x)<br/>(stabilizes Poisson variance)"]
    C -- No --> E["Is your data POSITIVE and spans<br/>multiple orders of magnitude?"]
    E -- Yes --> F["Use LOG<br/>(compresses exponential growth)"]
    E -- No --> G["Is your data LEFT-SKEWED<br/>(clustered at high values)?"]
    G -- Yes --> G1["Is data negative or<br/>needs exponential amplification?"]
    G1 -- Yes --> G1A["Use EXPONENTIAL (e^X)<br/>(amplifies positive values)"]
    G1 -- No --> G1B["Use SQUARE (x²)<br/>(corrects left skew)"]
    G -- No --> I["Is your data EXTREME right-skew<br/>with meaningful inverse?"]
    I -- Yes --> J["Use RECIPROCAL (1/x)<br/>(strongest compression)"]
    I -- No --> K["Is distribution COMPLEX,<br/>MULTIMODAL, or UNKNOWN?"]
    K -- Yes --> L["Use QUANTILE TRANSFORMER<br/>(forces any shape to normal/uniform)"]
    K -- No --> M["Use POWER TRANSFORMER<br/>(auto-finds best λ)"]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style B fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style B1 fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style A1 fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style G1 fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style B1A fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
    style B1B fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
    style B2 fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
    style A1A fill:#ffccbc,stroke:#d84315,stroke-width:2px
    style D fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
    style F fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
    style G1A fill:#ffccbc,stroke:#d84315,stroke-width:2px
    style G1B fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
    style J fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
    style L fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
    style M fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
```
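Following the final branch of the tree, a complex multimodal feature (which no single log/sqrt/power fix handles cleanly) can be forced to a normal shape with QuantileTransformer; the bimodal toy data below is invented:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(7)
# Bimodal mixture: two well-separated Gaussian clusters.
x = np.concatenate(
    [rng.normal(-5, 1, 500), rng.normal(5, 1, 500)]
).reshape(-1, 1)

# Rank-based map of the empirical CDF onto a standard normal.
qt = QuantileTransformer(output_distribution="normal", n_quantiles=500)
x_norm = qt.fit_transform(x)
```

The trade-off: the mapping is monotonic but nonlinear and data-dependent, so distances between points are distorted and values outside the training range are clipped.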
III. Feature Transformation & Scaling: A Step-by-Step Guide
Feature transformation and scaling can be covered in seven overall steps:
Step 1: Understand Your Data
Step 2: Check Data Distribution
Step 3: Identify Problems
Step 4: Choose Transformation
Step 5: Apply Transformation
Step 6: Validate Results
Step 7: Apply Scaling
```mermaid
flowchart LR
    Start([Start:<br/>Raw Dataset]) --> Step1[Step 1:<br/>Understand Your Data]
    Step1 --> Step2[Step 2:<br/>Check Distribution]
    Step2 --> Step3[Step 3:<br/>Identify Problems]
    Step3 --> Step4[Step 4:<br/>Choose Transformation]
    Step4 --> Step5[Step 5:<br/>Apply Transformation]
    Step5 --> Step6[Step 6:<br/>Validate Results]
    Step6 --> Decision{Is Distribution<br/>Acceptable?}
    Decision -- No --> Step4
    Decision -- Yes --> Step7[Step 7:<br/>Apply Scaling]
    Step7 --> End([Ready for<br/>Modeling])
    style Start fill:#e3f2fd,stroke:#1976d2,stroke-width:3px,color:#0d47a1
    style End fill:#c8e6c9,stroke:#388e3c,stroke-width:3px,color:#1b5e20
    style Decision fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#e65100
    style Step1 fill:#e1f5fe,stroke:#0288d1,stroke-width:2px
    style Step2 fill:#e1f5fe,stroke:#0288d1,stroke-width:2px
    style Step3 fill:#e1f5fe,stroke:#0288d1,stroke-width:2px
    style Step4 fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    style Step5 fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    style Step6 fill:#f3e5f5,stroke:#8e24aa,stroke-width:2px
    style Step7 fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
```
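The seven steps above can be sketched end-to-end as follows (the skewness threshold of 0.75 is an arbitrary choice for illustration, and the data is synthetic):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.default_rng(0)
# Steps 1-2: understand the data and check its distribution.
X = rng.lognormal(sigma=1.5, size=(1000, 1))

# Step 3: identify the problem -- strong right skew.
assert skew(X.ravel()) > 0.75

# Steps 4-5: choose and apply a transformation (Yeo-Johnson here).
X_t = PowerTransformer().fit_transform(X)

# Step 6: validate; if still too skewed, loop back to Step 4
# and try a different transformation.
if abs(skew(X_t.ravel())) > 0.75:
    raise ValueError("distribution still skewed; try another transform")

# Step 7: apply scaling before modeling.
X_ready = StandardScaler().fit_transform(X_t)
```

Note that `PowerTransformer` already standardizes its output by default, so Step 7 is effectively a no-op in this particular sketch; it is kept only to mirror the flowchart's ordering.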