Power Transformer

A Power Transformer is a data preprocessing tool that applies a parametric power function to each feature so that a non-Gaussian distribution (typically a skewed one) becomes approximately Gaussian (Normal). The transformation stabilizes the variance of the features and makes the data easier to use with machine learning models that assume normality.
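For reference, the two power functions listed under Strategies in this note can be written out as below. This is a sketch of the standard textbook definitions; the symbols $x$ (a single feature value) and $\lambda$ (the power parameter, usually estimated per feature by maximum likelihood) are notation chosen here, not taken from the original note.

```latex
% Box-Cox: defined only for strictly positive x
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\ln x, & \lambda = 0
\end{cases}
\qquad (x > 0)

% Yeo-Johnson: defined for any real x
\psi(x, \lambda) =
\begin{cases}
\dfrac{(x + 1)^{\lambda} - 1}{\lambda}, & x \ge 0,\ \lambda \neq 0 \\[6pt]
\ln(x + 1), & x \ge 0,\ \lambda = 0 \\[6pt]
-\dfrac{(1 - x)^{2 - \lambda} - 1}{2 - \lambda}, & x < 0,\ \lambda \neq 2 \\[6pt]
-\ln(1 - x), & x < 0,\ \lambda = 2
\end{cases}
```

With $\lambda = 1$ both functions reduce to a shift or the identity, so a fitted $\lambda$ close to 1 suggests the feature needed little correction.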


---
config:
  theme: 'base'
  layout: 'tidy-tree'
  fontSize: 5
  font-family: '"Gill Sans", sans-serif'
---
mindmap
	root(PowerTransformer)
		Do Not Use When
			TreeBased Models
			Sparse Data
			Mild Skewness
			Sensitive to Outliers
			Strict Range requirements
		Use when
			Heavily Skewed
			Bimodal
			Need Gaussian distribution
		Strategies
			Yeo-Johnson 
			Box-Cox

I. Features

II. Best Use Case

III. When NOT to Use It

IV. Pros

V. Cons

Flash Cards

★ Feature Transformation
✈ Skewed data ✈ Bimodal distributions ✈ Heteroscedastic data
✅ Standardization ✅ Gaussian Distribution ✅ Linear Models
🚫 Destroys Sparsity 🚫 Sensitive to Outliers
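
The "Destroys Sparsity" card can be illustrated with a small experiment. The sketch below assumes a made-up feature that is about 90% zeros (the data, the seed, and the variable names are all hypothetical) and counts how many exact zeros survive the transformation with and without standardization.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Hypothetical sparse-like feature: 900 zeros and 100 skewed positive values.
x = np.zeros((1000, 1))
nonzero_rows = rng.choice(1000, size=100, replace=False)
x[nonzero_rows, 0] = rng.exponential(scale=5.0, size=100)

# Default behaviour: Yeo-Johnson followed by zero-mean, unit-variance scaling.
pt_std = PowerTransformer(method="yeo-johnson", standardize=True)
x_std = pt_std.fit_transform(x)

# Same transform without the standardization step.
pt_raw = PowerTransformer(method="yeo-johnson", standardize=False)
x_raw = pt_raw.fit_transform(x)

zeros_before = int(np.sum(x == 0))
zeros_after_std = int(np.sum(x_std == 0))
zeros_after_raw = int(np.sum(x_raw == 0))

print(f"exact zeros before:              {zeros_before}")
print(f"exact zeros, standardize=True:   {zeros_after_std}")
print(f"exact zeros, standardize=False:  {zeros_after_raw}")
```

Yeo-Johnson itself maps 0 to 0, so in this sketch it is the mean-centering step that shifts every zero to a non-zero value. Passing standardize=False is one way to keep the zeros if sparsity matters downstream.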

VI. Code Snippet

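★ Box-Cox Transformation
from scipy.stats import boxcox

# Box-Cox requires strictly positive values; boxcox returns the transformed data and the fitted lambda
df['boxcox_feature'], lambda_param = boxcox(df['feature'])

★ Yeo-Johnson Transformation
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson also handles zero and negative values
pt = PowerTransformer(method='yeo-johnson')
# fit_transform returns a 2-D array of shape (n_samples, 1); flatten it before assigning to a column
df['yeo_johnson_feature'] = pt.fit_transform(df[['feature']]).ravel()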
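
The difference between the two methods is easiest to see on data that contains negative values. The following is a minimal sketch on a synthetic feature (the data and variable names are assumptions made for illustration): Box-Cox refuses the input, while Yeo-Johnson fits, exposes the estimated lambda through `lambdas_`, and can be undone with `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)

# Hypothetical right-skewed feature shifted so that it contains negative values.
x = rng.exponential(scale=2.0, size=(500, 1)) - 1.0

# Box-Cox is only defined for strictly positive data, so fitting should fail.
box_cox_failed = False
try:
    PowerTransformer(method="box-cox").fit(x)
except ValueError as err:
    box_cox_failed = True
    print("Box-Cox rejected the data:", err)

# Yeo-Johnson accepts zero and negative values.
pt = PowerTransformer(method="yeo-johnson")
x_t = pt.fit_transform(x)
print("Fitted lambda per feature:", pt.lambdas_)

# The transform is invertible, which helps when results must be reported
# in the original units.
x_back = pt.inverse_transform(x_t)
round_trip_ok = bool(np.allclose(x_back, x))
print("Round trip recovers the input:", round_trip_ok)
```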

Practical Implementation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.preprocessing import PowerTransformer

# Loading Dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

# Convert to Dataframe and target
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

data = np.array(X['Population'])

# Applying Power Transformation
pt = PowerTransformer(method='yeo-johnson', standardize=True)
log_data = pt.fit_transform(data.reshape(-1,1))

# Flatten the transformed (n_samples, 1) array to 1-D for plotting
log_data_flat = np.array(log_data).flatten()

# Create subplots
fig, axes = plt.subplots(1, 4, figsize=(18, 4))

# Original Data KDE Plot
sns.kdeplot(data, ax=axes[0])
axes[0].set_title('Original Data PDF')

# Original Data QQ Plot
stats.probplot(data, dist='norm', plot=axes[1])
axes[1].set_title('QQ Plot: Original Data')

# PowerTransformed Data KDE Plot
sns.kdeplot(log_data_flat, ax=axes[2])
axes[2].set_title('PowerTransformed Data PDF')

# Power-Transformed Data QQ Plot
stats.probplot(log_data_flat, dist='norm', plot=axes[3])
axes[3].set_title('QQ Plot: PowerTransformed Data')

# Adjust layout and display
plt.tight_layout()
plt.show()

Figure (log-2.png): KDE and QQ plots of the Population feature before and after the power transformation.
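
The visual check from the KDE and QQ plots can be complemented by a numeric one. The sketch below compares sample skewness before and after the transformation. To stay self-contained it uses a synthetic log-normal sample as a stand-in for a skewed column such as Population; the distribution parameters and variable names are assumptions, not values from the housing dataset.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Synthetic right-skewed sample (assumed stand-in for a real skewed feature).
sample = rng.lognormal(mean=7.0, sigma=0.8, size=5000)

pt = PowerTransformer(method="yeo-johnson", standardize=True)
transformed = pt.fit_transform(sample.reshape(-1, 1)).ravel()

# Skewness near 0 indicates a roughly symmetric distribution.
skew_before = float(skew(sample))
skew_after = float(skew(transformed))

print(f"skewness before: {skew_before:.3f}")
print(f"skewness after:  {skew_after:.3f}")
print(f"mean after: {transformed.mean():.3f}, std after: {transformed.std():.3f}")
```

The same comparison could be run on `data` and `log_data_flat` from the implementation above. A skewness close to zero does not prove normality, so it works best alongside the QQ plot rather than as a replacement for it.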