Mean Absolute Difference (MAD) for Feature Selection
The Mean Absolute Difference (MAD) is a simple yet effective unsupervised filter method for identifying and removing low-variance features. It measures how much values in a feature deviate from their mean, helping you quickly spot features that are nearly constant and therefore uninformative.
I. The Intuitive Analogy — The Classroom Test Scores
Imagine two classrooms that just received math test results:
Classroom A: Every student scored exactly 80/100
- No variation → No information about individual differences
- Everyone is identical in this metric
Classroom B: Scores range from 40 to 95
- High variation → Clear differences between students
- Some are struggling, some are excelling
Which classroom gives you more useful information about student performance? Obviously Classroom B.
This is exactly what MAD measures: How much do the values in a feature spread out from their average?
- Low MAD → Feature is nearly constant (like Classroom A) → Low information content
- High MAD → Feature varies significantly (like Classroom B) → Potentially useful for prediction
II. What is Mean Absolute Difference (MAD)?
MAD quantifies the average absolute distance of each value from the feature's mean.
Mathematical Definition
MAD = (1/n) · Σᵢ |xᵢ − x̄|

Where:
- n = number of samples
- xᵢ = individual feature value
- x̄ = mean of the feature
- |·| = absolute value (distance, always positive)
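As a quick sanity check, the definition can be worked by hand on a tiny feature (values chosen purely for illustration):

```python
import numpy as np

x = np.array([2, 4, 6, 8])          # mean = 5
deviations = np.abs(x - x.mean())   # [3, 1, 1, 3]
mad = deviations.mean()             # (3 + 1 + 1 + 3) / 4 = 2.0
print(mad)  # 2.0
```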
Intuitive Interpretation
| MAD Value | Meaning | Action |
|---|---|---|
| ≈ 0 | Feature is nearly constant (all values are similar) | Remove — no predictive value |
| Small | Low variation — may be uninformative | Consider removing |
| Large | High variation — feature contains diverse information | Keep — potentially useful |
MAD vs. Other Dispersion Measures
| Measure | Formula | Sensitivity to Outliers | Use Case |
|---|---|---|---|
| MAD | Mean of absolute deviations | Low | General-purpose, robust |
| Variance | Mean of squared deviations | High (squares magnify outliers) | When outliers are meaningful |
| Standard Deviation | √Variance | High | Common default in ML |
| Median Absolute Deviation | Median of absolute deviations from the median | Very low | Extremely robust to outliers |
⚠️ Terminology Note: "MAD" can also refer to Median Absolute Deviation in some statistics contexts. This article discusses Mean Absolute Difference — always verify which definition is being used in your library or paper.
III. Implementation in Python
1. Numeric Features — Direct Calculation
import numpy as np
import pandas as pd
# Sample numeric data
data = pd.DataFrame({
'age': [25, 26, 25, 25, 26, 25, 25], # Low variance
'income': [30, 45, 60, 80, 100, 110, 120], # High variance
'constant_feature': [100, 100, 100, 100, 100, 100, 100] # Zero variance
})
# Compute MAD for each feature
mad_values = data.apply(lambda x: np.mean(np.abs(x - np.mean(x))))
print("--- MAD Values ---")
print(mad_values.sort_values(ascending=False))
print("\nFeatures to consider removing (MAD < 1.0):")
print(mad_values[mad_values < 1.0].index.tolist())
Output:
--- MAD Values ---
income 28.163265
age 0.408163
constant_feature 0.000000
dtype: float64
Features to consider removing (MAD < 1.0):
['age', 'constant_feature']
Interpretation:
- constant_feature has MAD = 0 → Perfectly constant → Remove immediately
- age has MAD ≈ 0.41 → Very low variation → Likely uninformative
- income has MAD ≈ 28.16 → High variation → Keep
2. Categorical Features — Frequency Encoding Approach
# Sample categorical data
df_cat = pd.DataFrame({
'city': ['NY', 'NY', 'NY', 'LA', 'LA', 'SF', 'SF'],
'constant_category': ['A', 'A', 'A', 'A', 'A', 'A', 'A']
})
def mad_for_categorical(series):
"""
Compute MAD for categorical feature using frequency encoding.
Strategy: Convert categories to their frequencies, then compute MAD.
"""
# Frequency encode: replace each category with its count
freq_encoding = series.map(series.value_counts())
# Compute MAD on the encoded values
mad = np.mean(np.abs(freq_encoding - np.mean(freq_encoding)))
return mad
# Apply to each categorical column
mad_categorical = df_cat.apply(mad_for_categorical)
print("--- MAD for Categorical Features ---")
print(mad_categorical)
Output:
--- MAD for Categorical Features ---
city 0.489796
constant_category 0.000000
dtype: float64
Interpretation:
- constant_category has MAD = 0 → All rows are 'A' → Remove
- city has MAD ≈ 0.49 → Multiple cities with varying frequencies → Keep
IV. Practical Workflow
Step-by-Step MAD-Based Feature Selection
flowchart LR
A([Raw Features]) --> B[Compute MAD for each feature]
B --> C{MAD > threshold?}
C -->|Yes| D[Keep feature]
C -->|No| E[Remove feature]
D --> F([Selected Features])
E --> F
style A fill:#e1f5ff,stroke:#0288d1
style F fill:#e8f5e9,stroke:#388e3c
style E fill:#ffebee,stroke:#d32f2f
    style C fill:#fff3e0,stroke:#f57c00

Recommended Pipeline:
- Calculate MAD for all numeric features
- Set threshold (common values: 0.01, 0.1, or domain-specific)
- Remove low-MAD features
- Validate with domain knowledge
- Proceed to supervised feature selection (Chi-square, RFE, etc.)
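The final supervised step might look like the following sketch (illustrative only: the random data, k=2, and the chi-square scorer are assumptions; note that chi2 requires non-negative feature values):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(100, 5)).astype(float)  # non-negative, as chi2 requires
y = rng.integers(0, 2, size=100)

# Keep the 2 features with the highest chi-square score against the target
selector = SelectKBest(chi2, k=2)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (100, 2)
```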
Choosing the Right Threshold
| Threshold | Effect | Use Case |
|---|---|---|
| 0.0 | Removes only perfectly constant features | Minimal filtering |
| 0.01 - 0.1 | Removes near-constant features | Common default |
| > 0.1 | Aggressive filtering | High-dimensional data (e.g., genomics, text) |
| Percentile-based (e.g., bottom 10%) | Relative filtering | When unsure of absolute scale |
# Example: Percentile-based threshold
mad_values = data.apply(lambda x: np.mean(np.abs(x - np.mean(x))))
threshold = mad_values.quantile(0.1) # Remove bottom 10%
selected = mad_values[mad_values > threshold].index
print(f"10th percentile threshold: {threshold:.4f}")
print(f"Features kept: {selected.tolist()}")
V. Pros and Cons
✅ Advantages
| Advantage | Explanation |
|---|---|
| Simplicity | One-line calculation in pandas/numpy |
| Speed | O(n) per feature — extremely fast even on large datasets |
| Unsupervised | No need for target labels — works on raw data |
| Outlier robustness | More robust than variance (no squaring of deviations) |
| Interpretability | Easy to explain to non-technical stakeholders |
| Scalability | Handles high-dimensional data efficiently |
❌ Limitations
| Limitation | Explanation | Mitigation |
|---|---|---|
| Ignores target | Doesn't consider predictive power | Combine with supervised methods (Chi-square, RFE) |
| Scale-dependent | Features with larger units have higher MAD | Normalize/standardize before calculating MAD |
| Univariate | Misses feature interactions | Use after MAD: correlation analysis or wrapper methods |
| Threshold selection | No universal rule for threshold value | Use domain knowledge or cross-validation |
| Binary features | MAD may not be meaningful for 0/1 features | Use variance threshold instead |
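For the binary-feature case in the last row, scikit-learn's VarianceThreshold is a ready-made alternative; a minimal sketch (the 80% cutoff is an assumption):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Two binary features: one balanced, one nearly constant
X = np.array([
    [0, 0],
    [1, 0],
    [0, 0],
    [1, 0],
    [0, 1],
    [1, 0],
])

# Drop Bernoulli features that take the same value in > 80% of samples:
# Var(Bernoulli) = p(1 - p), so the cutoff is 0.8 * (1 - 0.8) = 0.16
selector = VarianceThreshold(threshold=0.8 * (1 - 0.8))
X_selected = selector.fit_transform(X)
print(X_selected.shape)  # (6, 1) — the near-constant column is removed
```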
VI. Common Pitfalls and Best Practices
🚫 Pitfall 1: Not Normalizing Features First
Problem: Comparing MAD across features with different scales is meaningless.
# Wrong: age (years) vs. income (dollars)
data = pd.DataFrame({
'age': [25, 30, 35, 40], # Range: ~15
'income': [30000, 45000, 60000, 80000] # Range: ~50,000
})
mad_raw = data.apply(lambda x: np.mean(np.abs(x - np.mean(x))))
# Income will always have higher MAD due to scale
Solution: Standardize first
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = pd.DataFrame(
scaler.fit_transform(data),
columns=data.columns
)
mad_scaled = data_scaled.apply(lambda x: np.mean(np.abs(x - np.mean(x))))
print(mad_scaled) # Now comparable!
🚫 Pitfall 2: Using MAD as the Only Filter
Problem: A feature can have high MAD but be completely irrelevant to the target.
Example:
data = pd.DataFrame({
'random_noise': np.random.randn(100), # High MAD, zero predictive power
'target': [0, 1] * 50
})
Solution: Use MAD as a first pass, then apply supervised methods
from sklearn.feature_selection import SelectKBest, f_classif
# Step 1: Remove near-constant features with MAD
# (MADSelector is a custom transformer, not a scikit-learn built-in)
mad_selector = MADSelector(threshold=0.01)
X_mad_filtered = mad_selector.fit_transform(X)
# Step 2: Apply supervised filter (e.g., ANOVA F-test)
supervised_selector = SelectKBest(f_classif, k=10)
X_final = supervised_selector.fit_transform(X_mad_filtered, y)
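MADSelector is not a scikit-learn class; a minimal sketch of such a custom transformer (the name, `threshold` parameter, and API mirror the snippet above and are assumptions) might look like:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MADSelector(BaseEstimator, TransformerMixin):
    """Drop features whose mean absolute deviation falls below a threshold."""

    def __init__(self, threshold=0.01):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # MAD per column: mean of |x - column mean|
        self.mad_ = np.mean(np.abs(X - X.mean(axis=0)), axis=0)
        self.support_ = self.mad_ > self.threshold
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.support_]

# The constant first column (MAD = 0) is dropped; the varying one is kept
X_demo = np.array([[1.0, 5.0], [1.0, 7.0], [1.0, 9.0]])
print(MADSelector(threshold=0.01).fit_transform(X_demo))
```

Because it subclasses BaseEstimator/TransformerMixin, this drops straight into a scikit-learn Pipeline.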
🚫 Pitfall 3: Confusing MAD Variants
Problem: "MAD" can mean different things in different contexts.
| Term | Formula | Robustness | Common Use |
|---|---|---|---|
| Mean Absolute Difference | Mean of absolute deviations from the mean | Moderate | Feature selection (this article) |
| Median Absolute Deviation | Median of absolute deviations from the median | High | Outlier detection |
Solution: Always verify which definition your library uses
# Explicitly calculate both
from scipy.stats import median_abs_deviation
mean_ad = np.mean(np.abs(data['age'] - np.mean(data['age'])))
median_ad = median_abs_deviation(data['age'])
print(f"Mean Absolute Difference: {mean_ad:.4f}")
print(f"Median Absolute Deviation: {median_ad:.4f}")
✅ Best Practices
- Always standardize features before comparing MAD across different scales
- Use MAD for initial screening (remove constants/near-constants), not final selection
- Combine with supervised methods to ensure predictive relevance
- Set threshold based on domain knowledge or via cross-validation
- Document your threshold choice and rationale for reproducibility
- Visualize MAD distribution to spot natural cutoff points
import matplotlib.pyplot as plt
mad_values.sort_values().plot(kind='barh', figsize=(10, 6))
plt.axvline(x=threshold, color='r', linestyle='--', label=f'Threshold: {threshold}')
plt.xlabel('MAD Value')
plt.title('Feature MAD Distribution')
plt.legend()
plt.tight_layout()
plt.show()
VII. When to Use MAD
★ Decision Matrix
| Scenario | Use MAD? | Alternative |
|---|---|---|
| Initial data cleaning | ✅ Excellent | — |
| High-dimensional data (1000+ features) | ✅ Very fast first pass | Variance Threshold |
| Need to preserve variance | ❌ Use variance instead | Standard Deviation, Variance Threshold |
| Supervised feature selection | ❌ Not designed for this | Chi-square, ANOVA F-test, RFE |
| Features already normalized | ✅ Safe to use | — |
| Features on different scales | ⚠️ Normalize first! | StandardScaler + MAD |
| Heavy outliers present | ✅ More robust than variance | Median Absolute Deviation (even better) |
| Binary features (0/1) | ❌ Not meaningful | Variance Threshold |
VIII. Summary
★ Quick Reference Card
MAD Feature Selection Checklist:
✓ Fast unsupervised filter for removing low-variance features
✓ More robust to outliers than variance
✓ Standardize features first if comparing across scales
✓ Use as first pass before supervised selection
✓ Common thresholds: 0.01 (strict) to 0.1 (moderate)
✓ Not a replacement for supervised methods
✓ Verify whether library uses mean or median deviation
★ Typical Workflow
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
# Complete feature selection pipeline
# (MADSelector is a custom transformer, not a scikit-learn built-in;
#  assumes X_train, y_train, X_test, y_test are already defined)
pipeline = Pipeline([
('scaler', StandardScaler()), # 1. Normalize
('mad_filter', MADSelector(threshold=0.01)), # 2. Remove constants (MAD)
('supervised', SelectKBest(f_classif, k=20)), # 3. Supervised filter
('model', RandomForestClassifier()) # 4. Final model
])
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)