Mean Absolute Difference (MAD) for Feature Selection

The Mean Absolute Difference (MAD) is a simple yet effective unsupervised filter method for identifying and removing low-variance features. It measures how much values in a feature deviate from their mean, helping you quickly spot features that are nearly constant and therefore uninformative.

I. The Intuitive Analogy — The Classroom Test Scores

Imagine two classrooms that just received math test results:

Classroom A: Every student scored exactly 80/100

Classroom B: Scores range from 40 to 95

Which classroom gives you more useful information about student performance? Obviously Classroom B.

This is exactly what MAD measures: How much do the values in a feature spread out from their average?
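
To make the analogy concrete, here is a quick sketch (the individual scores for Classroom B are made up for illustration):

```python
import numpy as np

def mad(values):
    """Mean absolute deviation of values from their mean."""
    x = np.asarray(values, dtype=float)
    return np.mean(np.abs(x - x.mean()))

classroom_a = [80, 80, 80, 80, 80]   # everyone scored exactly 80
classroom_b = [40, 55, 70, 85, 95]   # scores spread from 40 to 95

print(mad(classroom_a))  # 0.0  -- no spread, no information
print(mad(classroom_b))  # 17.2 -- substantial spread
```

A feature behaving like Classroom A tells a model nothing; one behaving like Classroom B at least has variation worth examining.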

II. What is Mean Absolute Difference (MAD)?

MAD quantifies the average absolute distance of each value from the feature's mean.

Mathematical Definition

MAD = (1/N) · Σᵢ₌₁ᴺ |xᵢ − x̄|

Where:

  - N is the number of samples
  - xᵢ is the i-th value of the feature
  - x̄ is the mean of the feature's values

Intuitive Interpretation

| MAD Value | Meaning | Action |
|---|---|---|
| ≈ 0 | Feature is nearly constant (all values are similar) | Remove — no predictive value |
| Small | Low variation — may be uninformative | Consider removing |
| Large | High variation — feature contains diverse information | Keep — potentially useful |

MAD vs. Other Dispersion Measures

| Measure | Formula | Sensitivity to Outliers | Use Case |
|---|---|---|---|
| MAD | Mean of \|deviations\| | Low | General-purpose, robust |
| Variance | Mean of squared deviations | High (squares magnify outliers) | When outliers are meaningful |
| Standard Deviation | √Variance | High | Common default in ML |
| Median Absolute Deviation | Median of \|deviations from median\| | Very low | Extremely robust to outliers |

⚠️ Terminology Note: "MAD" is overloaded. This article uses Mean Absolute Difference, mean(|x − x̄|), also known as the mean absolute deviation about the mean; in many statistics contexts "MAD" instead refers to the Median Absolute Deviation. Always verify which definition your library or paper uses.
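
The difference in outlier sensitivity is easy to see on a small sample with one extreme value (the numbers below are purely illustrative):

```python
import numpy as np
from scipy.stats import median_abs_deviation

x = np.array([1, 2, 3, 4, 100])  # one extreme outlier

mean_ad = np.mean(np.abs(x - x.mean()))   # 31.2
variance = np.var(x)                      # 1522.0
std_dev = np.std(x)                       # ~39.0
median_ad = median_abs_deviation(x)       # 1.0

print(f"Mean AD:   {mean_ad:.1f}")
print(f"Variance:  {variance:.1f}")
print(f"Std dev:   {std_dev:.1f}")
print(f"Median AD: {median_ad:.1f}")
```

The squared deviations let the single outlier dominate the variance, the median-based measure barely notices it, and the mean absolute deviation sits in between.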

III. Implementation in Python

1. Numeric Features — Direct Calculation

import numpy as np
import pandas as pd

# Sample numeric data
data = pd.DataFrame({
    'age': [25, 26, 25, 25, 26, 25, 25],           # Low variance
    'income': [30, 45, 60, 80, 100, 110, 120],     # High variance
    'constant_feature': [100, 100, 100, 100, 100, 100, 100]  # Zero variance
})

# Compute MAD for each feature
mad_values = data.apply(lambda x: np.mean(np.abs(x - np.mean(x))))

print("--- MAD Values ---")
print(mad_values.sort_values(ascending=False))
print("\nFeatures to consider removing (MAD < 1.0):")
print(mad_values[mad_values < 1.0].index.tolist())

Output:

--- MAD Values ---
income              28.163265
age                  0.408163
constant_feature     0.000000
dtype: float64

Features to consider removing (MAD < 1.0):
['age', 'constant_feature']

Interpretation:

  - income (MAD ≈ 28.16): high variation, keep it
  - age (MAD ≈ 0.41): nearly constant, a candidate for removal
  - constant_feature (MAD = 0.00): carries no information, remove it

2. Categorical Features — Frequency Encoding Approach

# Sample categorical data
df_cat = pd.DataFrame({
    'city': ['NY', 'NY', 'NY', 'LA', 'LA', 'SF', 'SF'],
    'constant_category': ['A', 'A', 'A', 'A', 'A', 'A', 'A']
})

def mad_for_categorical(series):
    """
    Compute MAD for categorical feature using frequency encoding.
    
    Strategy: Convert categories to their frequencies, then compute MAD.
    """
    # Frequency encode: replace each category with its count
    freq_encoding = series.map(series.value_counts())
    
    # Compute MAD on the encoded values
    mad = np.mean(np.abs(freq_encoding - np.mean(freq_encoding)))
    return mad

# Apply to each categorical column
mad_categorical = df_cat.apply(mad_for_categorical)
print("--- MAD for Categorical Features ---")
print(mad_categorical)

Output:

--- MAD for Categorical Features ---
city                 0.489796
constant_category    0.000000
dtype: float64

Interpretation:

  - city (MAD ≈ 0.49): three categories with uneven counts produce a non-zero MAD, so the feature carries some information
  - constant_category (MAD = 0.00): a single repeated value, remove it

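One caveat with count-based encoding: the MAD value grows with the number of rows. A variant (my own tweak, not part of the recipe above) encodes each category with its relative frequency instead, keeping the result on a fixed 0-1 scale:

```python
import numpy as np
import pandas as pd

def mad_for_categorical_normalized(series):
    """MAD of a categorical feature, using relative-frequency encoding."""
    # Replace each category with its share of the rows (a value in 0..1)
    freq = series.map(series.value_counts(normalize=True))
    return np.mean(np.abs(freq - np.mean(freq)))

df_cat = pd.DataFrame({
    'city': ['NY', 'NY', 'NY', 'LA', 'LA', 'SF', 'SF'],
    'constant_category': ['A'] * 7
})

print(df_cat.apply(mad_for_categorical_normalized))
```

The ranking of features is unchanged; only the scale of the scores differs.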

IV. Practical Workflow

Step-by-Step MAD-Based Feature Selection

flowchart LR
    A([Raw Features]) --> B[Compute MAD for each feature]
    B --> C{MAD > threshold?}
    C -->|Yes| D[Keep feature]
    C -->|No| E[Remove feature]
    D --> F([Selected Features])
    E --> F
    
    style A fill:#e1f5ff,stroke:#0288d1
    style F fill:#e8f5e9,stroke:#388e3c
    style E fill:#ffebee,stroke:#d32f2f
    style C fill:#fff3e0,stroke:#f57c00

Recommended Pipeline:

  1. Calculate MAD for all numeric features
  2. Set threshold (common values: 0.01, 0.1, or domain-specific)
  3. Remove low-MAD features
  4. Validate with domain knowledge
  5. Proceed to supervised feature selection (Chi-square, RFE, etc.)
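
Steps 1-3 can be wrapped in a small helper (the function name and defaults below are my own, not a library API):

```python
import numpy as np
import pandas as pd

def mad_filter(df, threshold=0.01):
    """Compute MAD per column and drop columns below the threshold.

    Returns the filtered frame plus the MAD series for inspection.
    """
    mad = df.apply(lambda col: np.mean(np.abs(col - col.mean())))
    keep = mad[mad >= threshold].index
    return df[keep], mad

df = pd.DataFrame({
    'income': [30, 45, 60, 80, 100, 110, 120],
    'constant_feature': [100] * 7
})
filtered, mad = mad_filter(df, threshold=0.01)
print(filtered.columns.tolist())  # ['income']
```

Returning the MAD series alongside the filtered frame makes step 4 (validation with domain knowledge) easier, since you can inspect exactly why each column was dropped.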

Choosing the Right Threshold

| Threshold | Strictness | Use Case |
|---|---|---|
| 0.0 | Removes only perfectly constant features | Minimal filtering |
| 0.01 - 0.1 | Removes near-constant features | Common default |
| > 0.1 | Aggressive filtering | High-dimensional data (e.g., genomics, text) |
| Percentile-based (e.g., bottom 10%) | Relative filtering | When unsure of absolute scale |

# Example: Percentile-based threshold
mad_values = data.apply(lambda x: np.mean(np.abs(x - np.mean(x))))
threshold = mad_values.quantile(0.1)  # Remove bottom 10%
selected = mad_values[mad_values > threshold].index
print(f"10th percentile threshold: {threshold:.4f}")
print(f"Features kept: {selected.tolist()}")

V. Pros and Cons

✅ Advantages

| Advantage | Explanation |
|---|---|
| Simplicity | One-line calculation in pandas/numpy |
| Speed | O(n) complexity — extremely fast even on large datasets |
| Unsupervised | No need for target labels — works on raw data |
| Outlier robustness | More robust than variance (no squaring of deviations) |
| Interpretability | Easy to explain to non-technical stakeholders |
| Scalability | Handles high-dimensional data efficiently |
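
The speed claim is easy to check empirically: computing MAD for every column is a single vectorized pass over the data (the array dimensions below are arbitrary):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50_000, 100))   # 5 million values

t0 = time.perf_counter()
# One vectorized pass: subtract column means, take abs, average per column
mad = np.mean(np.abs(X - X.mean(axis=0)), axis=0)
elapsed = time.perf_counter() - t0

print(f"MAD for {X.shape[1]} features over {X.shape[0]:,} rows: {elapsed:.3f}s")
```

On typical hardware this completes in a fraction of a second, which is why MAD works well as a first pass on wide datasets.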

❌ Limitations

| Limitation | Explanation | Mitigation |
|---|---|---|
| Ignores target | Doesn't consider predictive power | Combine with supervised methods (Chi-square, RFE) |
| Scale-dependent | Features with larger units have higher MAD | Normalize/standardize before calculating MAD |
| Univariate | Misses feature interactions | Follow MAD with correlation analysis or wrapper methods |
| Threshold selection | No universal rule for threshold value | Use domain knowledge or cross-validation |
| Binary features | MAD may not be meaningful for 0/1 features | Use variance threshold instead |

VI. Common Pitfalls and Best Practices

🚫 Pitfall 1: Not Normalizing Features First

Problem: Comparing MAD across features with different scales is meaningless.

# Wrong: age (years) vs. income (dollars)
data = pd.DataFrame({
    'age': [25, 30, 35, 40],       # Range: ~15
    'income': [30000, 45000, 60000, 80000]  # Range: ~50,000
})
mad_raw = data.apply(lambda x: np.mean(np.abs(x - np.mean(x))))
# Income will always have higher MAD due to scale

Solution: Standardize first

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = pd.DataFrame(
    scaler.fit_transform(data),
    columns=data.columns
)
mad_scaled = data_scaled.apply(lambda x: np.mean(np.abs(x - np.mean(x))))
print(mad_scaled)  # Now comparable!

🚫 Pitfall 2: Using MAD as the Only Filter

Problem: A feature can have high MAD but be completely irrelevant to the target.

Example:

data = pd.DataFrame({
    'random_noise': np.random.randn(100),     # High MAD, zero predictive power
    'target': [0, 1] * 50
})

Solution: Use MAD as a first pass, then apply supervised methods

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Step 1: Remove near-constant features with MAD
# (scikit-learn has no built-in MAD selector, so filter the DataFrame directly)
mad_values = X.apply(lambda col: np.mean(np.abs(col - col.mean())))
X_mad_filtered = X.loc[:, mad_values > 0.01]

# Step 2: Apply supervised filter (e.g., ANOVA F-test)
supervised_selector = SelectKBest(f_classif, k=10)
X_final = supervised_selector.fit_transform(X_mad_filtered, y)

🚫 Pitfall 3: Confusing MAD Variants

Problem: "MAD" can mean different things in different contexts.

| Term | Formula | Robustness | Common Use |
|---|---|---|---|
| Mean Absolute Difference | Mean(\|x − mean(x)\|) | Moderate | Feature selection (this article) |
| Median Absolute Deviation | Median(\|x − median(x)\|) | High | Outlier detection |

Solution: Always verify which definition your library uses

# Explicitly calculate both
from scipy.stats import median_abs_deviation

mean_ad = np.mean(np.abs(data['age'] - np.mean(data['age'])))
median_ad = median_abs_deviation(data['age'])

print(f"Mean Absolute Difference: {mean_ad:.4f}")
print(f"Median Absolute Deviation: {median_ad:.4f}")

✅ Best Practices

  1. Always standardize features before comparing MAD across different scales
  2. Use MAD for initial screening (remove constants/near-constants), not final selection
  3. Combine with supervised methods to ensure predictive relevance
  4. Set threshold based on domain knowledge or via cross-validation
  5. Document your threshold choice and rationale for reproducibility
  6. Visualize MAD distribution to spot natural cutoff points
import matplotlib.pyplot as plt

mad_values.sort_values().plot(kind='barh', figsize=(10, 6))
plt.axvline(x=threshold, color='r', linestyle='--', label=f'Threshold: {threshold}')
plt.xlabel('MAD Value')
plt.title('Feature MAD Distribution')
plt.legend()
plt.tight_layout()
plt.show()

VII. When to Use MAD

★ Decision Matrix

| Scenario | Use MAD? | Alternative |
|---|---|---|
| Initial data cleaning | ✅ Excellent | |
| High-dimensional data (1000+ features) | ✅ Very fast first pass | Variance Threshold |
| Need to preserve variance | ❌ Use variance instead | Standard Deviation, Variance Threshold |
| Supervised feature selection | ❌ Not designed for this | Chi-square, ANOVA F-test, RFE |
| Features already normalized | ✅ Safe to use | |
| Features on different scales | ⚠️ Normalize first! | StandardScaler + MAD |
| Heavy outliers present | ✅ More robust than variance | Median Absolute Deviation (even better) |
| Binary features (0/1) | ❌ Not meaningful | Variance Threshold |

VIII. Summary

★ Quick Reference Card

MAD Feature Selection Checklist:
✓ Fast unsupervised filter for removing low-variance features
✓ More robust to outliers than variance
✓ Standardize features first if comparing across scales
✓ Use as first pass before supervised selection
✓ Common thresholds: 0.01 (mild filtering) to 0.1 (stronger filtering)
✓ Not a replacement for supervised methods
✓ Verify whether library uses mean or median deviation

★ Typical Workflow

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Minimal MAD-based transformer (scikit-learn has no built-in equivalent)
class MADSelector(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.01):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X)
        self.mad_ = np.mean(np.abs(X - X.mean(axis=0)), axis=0)
        self.support_ = self.mad_ > self.threshold
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.support_]

# Complete feature selection pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),                   # 1. Normalize
    ('mad_filter', MADSelector(threshold=0.01)),    # 2. Remove constants (MAD)
    ('supervised', SelectKBest(f_classif, k=20)),   # 3. Supervised filter
    ('model', RandomForestClassifier())             # 4. Final model
])

pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)