Mean Absolute Difference (MAD) for Feature Selection

The Mean Absolute Difference (MAD) is a simple yet effective unsupervised filter method for identifying and removing low-variance features. It measures how much values in a feature deviate from their mean, helping you quickly spot features that are nearly constant and therefore uninformative.

I. The Intuitive Analogy — The Classroom Test Scores

Imagine two classrooms that just received math test results:

Classroom A: Every student scored exactly 80/100

Classroom B: Scores range from 40 to 95

Which classroom gives you more useful information about student performance? Obviously Classroom B.

This is exactly what MAD measures: How much do the values in a feature spread out from their average?
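
To make the analogy concrete, here is a quick sketch (the individual scores for Classroom B are made up for illustration):

```python
import numpy as np

def mad(values):
    """Mean absolute deviation of values from their mean."""
    x = np.asarray(values, dtype=float)
    return np.mean(np.abs(x - x.mean()))

classroom_a = [80, 80, 80, 80, 80]   # everyone scored exactly 80
classroom_b = [40, 55, 70, 85, 95]   # scores spread from 40 to 95

print(mad(classroom_a))  # 0.0  -- no spread, no information
print(mad(classroom_b))  # 17.2 -- substantial spread
```

A feature behaving like Classroom A tells a model nothing; one behaving like Classroom B at least has variation worth examining.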

II. What is Mean Absolute Difference (MAD)?

MAD quantifies the average absolute distance of each value from the feature's mean.

Mathematical Definition

MAD = (1/N) · Σᵢ₌₁ᴺ |xᵢ − x̄|

Where:

  - N is the number of samples
  - xᵢ is the i-th value of the feature
  - x̄ is the mean of the feature's values

Intuitive Interpretation

| MAD Value | Meaning | Action |
|---|---|---|
| ≈ 0 | Feature is nearly constant (all values are similar) | Remove — no predictive value |
| Small | Low variation — may be uninformative | Consider removing |
| Large | High variation — feature contains diverse information | Keep — potentially useful |

MAD vs. Other Dispersion Measures

| Measure | Formula | Sensitivity to Outliers | Use Case |
|---|---|---|---|
| MAD | Mean of \|deviations\| | Low | General-purpose, robust |
| Variance | Mean of squared deviations | High (squares magnify outliers) | When outliers are meaningful |
| Standard Deviation | √Variance | High | Common default in ML |
| Median Absolute Deviation | Median of \|deviations from median\| | Very low | Extremely robust to outliers |

⚠️ Terminology Note: "MAD" is overloaded. This article uses Mean Absolute Difference, mean(|x − x̄|), also known as the mean absolute deviation about the mean; in many statistics contexts "MAD" instead refers to the Median Absolute Deviation. Always verify which definition your library or paper uses.
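
The difference in outlier sensitivity is easy to see on a small sample with one extreme value (the numbers below are purely illustrative):

```python
import numpy as np
from scipy.stats import median_abs_deviation

x = np.array([1, 2, 3, 4, 100])  # one extreme outlier

mean_ad = np.mean(np.abs(x - x.mean()))   # 31.2
variance = np.var(x)                      # 1522.0
std_dev = np.std(x)                       # ~39.0
median_ad = median_abs_deviation(x)       # 1.0

print(f"Mean AD:   {mean_ad:.1f}")
print(f"Variance:  {variance:.1f}")
print(f"Std dev:   {std_dev:.1f}")
print(f"Median AD: {median_ad:.1f}")
```

The squared deviations let the single outlier dominate the variance, the median-based measure barely notices it, and the mean absolute deviation sits in between.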

III. Implementation in Python

1. Numeric Features — Direct Calculation

import numpy as np
import pandas as pd

# Sample numeric data
data = pd.DataFrame({
    'age': [25, 26, 25, 25, 26, 25, 25],           # Low variance
    'income': [30, 45, 60, 80, 100, 110, 120],     # High variance
    'constant_feature': [100, 100, 100, 100, 100, 100, 100]  # Zero variance
})

# Compute MAD for each feature
mad_values = data.apply(lambda x: np.mean(np.abs(x - np.mean(x))))

print("--- MAD Values ---")
print(mad_values.sort_values(ascending=False))
print("\nFeatures to consider removing (MAD < 1.0):")
print(mad_values[mad_values < 1.0].index.tolist())

Output:

--- MAD Values ---
income              28.163265
age                  0.408163
constant_feature     0.000000
dtype: float64

Features to consider removing (MAD < 1.0):
['age', 'constant_feature']

Interpretation:

  - income (MAD ≈ 28.16): high variation, keep it
  - age (MAD ≈ 0.41): nearly constant, a candidate for removal
  - constant_feature (MAD = 0.00): carries no information, remove it

2. Categorical Features — Frequency Encoding Approach

# Sample categorical data
df_cat = pd.DataFrame({
    'city': ['NY', 'NY', 'NY', 'LA', 'LA', 'SF', 'SF'],
    'constant_category': ['A', 'A', 'A', 'A', 'A', 'A', 'A']
})

def mad_for_categorical(series):
    """
    Compute MAD for categorical feature using frequency encoding.
    
    Strategy: Convert categories to their frequencies, then compute MAD.
    """
    # Frequency encode: replace each category with its count
    freq_encoding = series.map(series.value_counts())
    
    # Compute MAD on the encoded values
    mad = np.mean(np.abs(freq_encoding - np.mean(freq_encoding)))
    return mad

# Apply to each categorical column
mad_categorical = df_cat.apply(mad_for_categorical)
print("--- MAD for Categorical Features ---")
print(mad_categorical)

Output:

--- MAD for Categorical Features ---
city                 0.489796
constant_category    0.000000
dtype: float64

Interpretation:

  - city (MAD ≈ 0.49): three categories with uneven counts produce a non-zero MAD, so the feature carries some information
  - constant_category (MAD = 0.00): a single repeated value, remove it

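One caveat with count-based encoding: the MAD value grows with the number of rows. A variant (my own tweak, not part of the recipe above) encodes each category with its relative frequency instead, keeping the result on a fixed 0-1 scale:

```python
import numpy as np
import pandas as pd

def mad_for_categorical_normalized(series):
    """MAD of a categorical feature, using relative-frequency encoding."""
    # Replace each category with its share of the rows (a value in 0..1)
    freq = series.map(series.value_counts(normalize=True))
    return np.mean(np.abs(freq - np.mean(freq)))

df_cat = pd.DataFrame({
    'city': ['NY', 'NY', 'NY', 'LA', 'LA', 'SF', 'SF'],
    'constant_category': ['A'] * 7
})

print(df_cat.apply(mad_for_categorical_normalized))
```

The ranking of features is unchanged; only the scale of the scores differs.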

IV. Practical Workflow

Step-by-Step MAD-Based Feature Selection

flowchart LR
    A([Raw Features]) --> B[Compute MAD for each feature]
    B --> C{MAD > threshold?}
    C -->|Yes| D[Keep feature]
    C -->|No| E[Remove feature]
    D --> F([Selected Features])
    E --> F
    
    style A fill:#e1f5ff,stroke:#0288d1
    style F fill:#e8f5e9,stroke:#388e3c
    style E fill:#ffebee,stroke:#d32f2f
    style C fill:#fff3e0,stroke:#f57c00

Recommended Pipeline:

  1. Calculate MAD for all numeric features
  2. Set threshold (common values: 0.01, 0.1, or domain-specific)
  3. Remove low-MAD features
  4. Validate with domain knowledge
  5. Proceed to supervised feature selection (Chi-square, RFE, etc.)
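
Steps 1-3 can be wrapped in a small helper (the function name and defaults below are my own, not a library API):

```python
import numpy as np
import pandas as pd

def mad_filter(df, threshold=0.01):
    """Compute MAD per column and drop columns below the threshold.

    Returns the filtered frame plus the MAD series for inspection.
    """
    mad = df.apply(lambda col: np.mean(np.abs(col - col.mean())))
    keep = mad[mad >= threshold].index
    return df[keep], mad

df = pd.DataFrame({
    'income': [30, 45, 60, 80, 100, 110, 120],
    'constant_feature': [100] * 7
})
filtered, mad = mad_filter(df, threshold=0.01)
print(filtered.columns.tolist())  # ['income']
```

Returning the MAD series alongside the filtered frame makes step 4 (validation with domain knowledge) easier, since you can inspect exactly why each column was dropped.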

Choosing the Right Threshold

| Threshold | Strictness | Use Case |
|---|---|---|
| 0.0 | Removes only perfectly constant features | Minimal filtering |
| 0.01 - 0.1 | Removes near-constant features | Common default |
| > 0.1 | Aggressive filtering | High-dimensional data (e.g., genomics, text) |
| Percentile-based (e.g., bottom 10%) | Relative filtering | When unsure of absolute scale |

# Example: Percentile-based threshold
mad_values = data.apply(lambda x: np.mean(np.abs(x - np.mean(x))))
threshold = mad_values.quantile(0.1)  # Remove bottom 10%
selected = mad_values[mad_values > threshold].index
print(f"10th percentile threshold: {threshold:.4f}")
print(f"Features kept: {selected.tolist()}")

V. Pros and Cons

✅ Advantages

| Advantage | Explanation |
|---|---|
| Simplicity | One-line calculation in pandas/numpy |
| Speed | O(n) complexity — extremely fast even on large datasets |
| Unsupervised | No need for target labels — works on raw data |
| Outlier robustness | More robust than variance (no squaring of deviations) |
| Interpretability | Easy to explain to non-technical stakeholders |
| Scalability | Handles high-dimensional data efficiently |
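
The speed claim is easy to check empirically: computing MAD for every column is a single vectorized pass over the data (the array dimensions below are arbitrary):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50_000, 100))   # 5 million values

t0 = time.perf_counter()
# One vectorized pass: subtract column means, take abs, average per column
mad = np.mean(np.abs(X - X.mean(axis=0)), axis=0)
elapsed = time.perf_counter() - t0

print(f"MAD for {X.shape[1]} features over {X.shape[0]:,} rows: {elapsed:.3f}s")
```

On typical hardware this completes in a fraction of a second, which is why MAD works well as a first pass on wide datasets.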

❌ Limitations

| Limitation | Explanation | Mitigation |
|---|---|---|
| Ignores target | Doesn't consider predictive power | Combine with supervised methods (Chi-square, RFE) |
| Scale-dependent | Features with larger units have higher MAD | Normalize/standardize before calculating MAD |
| Univariate | Misses feature interactions | Follow MAD with correlation analysis or wrapper methods |
| Threshold selection | No universal rule for threshold value | Use domain knowledge or cross-validation |
| Binary features | MAD may not be meaningful for 0/1 features | Use variance threshold instead |

VI. Common Pitfalls and Best Practices

🚫 Pitfall 1: Not Normalizing Features First

Problem: Comparing MAD across features with different scales is meaningless.

# Wrong: age (years) vs. income (dollars)
data = pd.DataFrame({
    'age': [25, 30, 35, 40],       # Range: ~15
    'income': [30000, 45000, 60000, 80000]  # Range: ~50,000
})
mad_raw = data.apply(lambda x: np.mean(np.abs(x - np.mean(x))))
# Income will always have higher MAD due to scale

Solution: Standardize first

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = pd.DataFrame(
    scaler.fit_transform(data),
    columns=data.columns
)
mad_scaled = data_scaled.apply(lambda x: np.mean(np.abs(x - np.mean(x))))
print(mad_scaled)  # Now comparable!

🚫 Pitfall 2: Using MAD as the Only Filter

Problem: A feature can have high MAD but be completely irrelevant to the target.

Example:

data = pd.DataFrame({
    'random_noise': np.random.randn(100),     # High MAD, zero predictive power
    'target': [0, 1] * 50
})

Solution: Use MAD as a first pass, then apply supervised methods

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Step 1: Remove near-constant features with MAD
# (scikit-learn has no built-in MAD selector, so filter the DataFrame directly)
mad_values = X.apply(lambda col: np.mean(np.abs(col - col.mean())))
X_mad_filtered = X.loc[:, mad_values > 0.01]

# Step 2: Apply supervised filter (e.g., ANOVA F-test)
supervised_selector = SelectKBest(f_classif, k=10)
X_final = supervised_selector.fit_transform(X_mad_filtered, y)

🚫 Pitfall 3: Confusing MAD Variants

Problem: "MAD" can mean different things in different contexts.

| Term | Formula | Robustness | Common Use |
|---|---|---|---|
| Mean Absolute Difference | Mean(\|x − mean(x)\|) | Moderate | Feature selection (this article) |
| Median Absolute Deviation | Median(\|x − median(x)\|) | High | Outlier detection |

Solution: Always verify which definition your library uses

# Explicitly calculate both
from scipy.stats import median_abs_deviation

mean_ad = np.mean(np.abs(data['age'] - np.mean(data['age'])))
median_ad = median_abs_deviation(data['age'])

print(f"Mean Absolute Difference: {mean_ad:.4f}")
print(f"Median Absolute Deviation: {median_ad:.4f}")

✅ Best Practices

  1. Always standardize features before comparing MAD across different scales
  2. Use MAD for initial screening (remove constants/near-constants), not final selection
  3. Combine with supervised methods to ensure predictive relevance
  4. Set threshold based on domain knowledge or via cross-validation
  5. Document your threshold choice and rationale for reproducibility
  6. Visualize MAD distribution to spot natural cutoff points
import matplotlib.pyplot as plt

mad_values.sort_values().plot(kind='barh', figsize=(10, 6))
plt.axvline(x=threshold, color='r', linestyle='--', label=f'Threshold: {threshold}')
plt.xlabel('MAD Value')
plt.title('Feature MAD Distribution')
plt.legend()
plt.tight_layout()
plt.show()

VII. When to Use MAD

★ Decision Matrix

| Scenario | Use MAD? | Alternative |
|---|---|---|
| Initial data cleaning | ✅ Excellent | |
| High-dimensional data (1000+ features) | ✅ Very fast first pass | Variance Threshold |
| Need to preserve variance | ❌ Use variance instead | Standard Deviation, Variance Threshold |
| Supervised feature selection | ❌ Not designed for this | Chi-square, ANOVA F-test, RFE |
| Features already normalized | ✅ Safe to use | |
| Features on different scales | ⚠️ Normalize first! | StandardScaler + MAD |
| Heavy outliers present | ✅ More robust than variance | Median Absolute Deviation (even better) |
| Binary features (0/1) | ❌ Not meaningful | Variance Threshold |

VIII. Summary

★ Quick Reference Card

MAD Feature Selection Checklist:
✓ Fast unsupervised filter for removing low-variance features
✓ More robust to outliers than variance
✓ Standardize features first if comparing across scales
✓ Use as first pass before supervised selection
✓ Common thresholds: 0.01 (mild filtering) to 0.1 (stronger filtering)
✓ Not a replacement for supervised methods
✓ Verify whether library uses mean or median deviation

★ Typical Workflow

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Minimal MAD-based transformer (scikit-learn has no built-in equivalent)
class MADSelector(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.01):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X)
        self.mad_ = np.mean(np.abs(X - X.mean(axis=0)), axis=0)
        self.support_ = self.mad_ > self.threshold
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.support_]

# Complete feature selection pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),                   # 1. Normalize
    ('mad_filter', MADSelector(threshold=0.01)),    # 2. Remove constants (MAD)
    ('supervised', SelectKBest(f_classif, k=20)),   # 3. Supervised filter
    ('model', RandomForestClassifier())             # 4. Final model
])

pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)