Chi-Square Test for Feature Selection

The Chi-square test is a fundamental statistical tool for evaluating relationships between categorical features and target variables in classification problems. It answers a simple but critical question: Is this feature truly associated with the outcome, or is any observed pattern just random noise?

I. The Intuitive Analogy — The Party Host

Imagine you're at a party where guests are sorted into rooms based on their favorite music: rock, jazz, classical, and hip-hop. Someone claims that shoe type (sneakers, heels, boots, sandals) is strongly connected to music preference.

The Chi-square test is like a curious host walking around with a clipboard, counting shoe types in each music room. The host asks:

"Are these shoe choices and music preferences truly linked, or is this pattern just random?"

Key observations:

- If every room shows roughly the same mix of shoe types, shoes and music are probably independent.
- If the rock room is full of boots while the classical room is full of heels, the two are likely linked.
- A lopsided count in one room could still be coincidence, so the host needs a way to quantify how surprising the overall pattern is.

Translation to machine learning:

- Rooms = target classes; shoe types = categories of a feature.
- The host's clipboard = the contingency table of observed counts.
- The "how surprising is this?" judgment = the Chi-square statistic and its p-value.

II. What is the Chi-Square Test?

The Chi-square test measures the dependence between a categorical feature and the target class.

Key Characteristics

| Property | Description |
|---|---|
| Type | Univariate filter method (evaluates one feature at a time) |
| Null Hypothesis | Feature and target are independent (no relationship) |
| Test Statistic | Chi-square (χ²) score — higher means stronger dependence |
| P-value | Probability of seeing this relationship by chance — lower is better |
| Data Requirements | Non-negative categorical features (counts or one-hot encoded) |
| Task Type | Classification only (not for regression) |

When to Use Chi-Square

Use when:

- Features are categorical (or have been binned / one-hot encoded)
- The target is categorical (a classification task)
- You need a fast, model-agnostic first pass over many features

Avoid when:

- Features are continuous and you do not want to bin them
- The task is regression (continuous target)
- Expected cell counts are very small (< 5)

III. Step-by-Step: How Chi-Square Works

Step 1: Build the Contingency Table

Count how feature categories are distributed across target classes.

Example: Color vs. Purchase Decision

| Color | Yes (Purchased) | No (Not Purchased) | Row Total |
|---|---|---|---|
| Red | 3 | 0 | 3 |
| Blue | 0 | 2 | 2 |
| Green | 1 | 1 | 2 |
| **Column Total** | 4 | 3 | 7 |
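pandas can build this table in one line. A minimal sketch that re-creates the seven observations above (the variable names are mine):

```python
import pandas as pd

# The seven observations from the example above
colors = ["Red", "Red", "Red", "Blue", "Blue", "Green", "Green"]
bought = ["Yes", "Yes", "Yes", "No", "No", "Yes", "No"]

# Contingency table: feature categories (rows) vs. target classes (columns)
table = pd.crosstab(pd.Series(colors, name="Color"),
                    pd.Series(bought, name="Purchased"))
print(table)
```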

Step 2: Calculate Expected Counts

If feature and target were independent, what counts would we expect?

Formula:

$$E_{ij} = \frac{\text{Row Total}_i \times \text{Column Total}_j}{\text{Grand Total}}$$

Expected counts:

| Color | Yes (Expected) | No (Expected) |
|---|---|---|
| Red | 3 × 4 / 7 = 1.714 | 3 × 3 / 7 = 1.286 |
| Blue | 2 × 4 / 7 = 1.143 | 2 × 3 / 7 = 0.857 |
| Green | 2 × 4 / 7 = 1.143 | 2 × 3 / 7 = 0.857 |
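The whole expected-count matrix falls out of a single outer product of the row and column totals. A quick numpy check of the table above:

```python
import numpy as np

# Observed counts (rows: Red, Blue, Green; columns: Yes, No)
observed = np.array([[3, 0],
                     [0, 2],
                     [1, 1]])

row_totals = observed.sum(axis=1)   # [3, 2, 2]
col_totals = observed.sum(axis=0)   # [4, 3]
grand_total = observed.sum()        # 7

# E_ij = (row total * column total) / grand total, for every cell at once
expected = np.outer(row_totals, col_totals) / grand_total
print(expected.round(3))
```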

Step 3: Compute the Chi-Square Statistic

Measure how much observed counts deviate from expected counts.

Formula:

$$\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Where:

- $O_{ij}$ = observed count in row $i$, column $j$
- $E_{ij}$ = expected count for the same cell
- $R$ = number of feature categories (rows), $C$ = number of target classes (columns)

Calculations:

| Color | Yes Contribution | No Contribution |
|---|---|---|
| Red | (3 − 1.714)² / 1.714 = 0.964 | (0 − 1.286)² / 1.286 = 1.286 |
| Blue | (0 − 1.143)² / 1.143 = 1.143 | (2 − 0.857)² / 0.857 = 1.524 |
| Green | (1 − 1.143)² / 1.143 = 0.018 | (1 − 0.857)² / 0.857 = 0.024 |

$$\chi^2 = 0.964 + 1.286 + 1.143 + 1.524 + 0.018 + 0.024 \approx 4.958$$
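The same sum can be checked numerically. This sketch hard-codes the observed and expected matrices from the tables above:

```python
import numpy as np

observed = np.array([[3, 0], [0, 2], [1, 1]], dtype=float)
expected = np.array([[12 / 7, 9 / 7],
                     [8 / 7, 6 / 7],
                     [8 / 7, 6 / 7]])

# Sum of (O - E)^2 / E over every cell of the table
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(round(chi2_stat, 3))  # 4.958
```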

Step 4: Determine Degrees of Freedom

$$df = (R - 1) \times (C - 1) = (3 - 1) \times (2 - 1) = 2$$

Step 5: Find the Critical Value

From the Chi-square distribution table at α = 0.05 significance level with df = 2:

$$\chi^2_{\text{critical}} = 5.991$$
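Rather than reading a printed table, the critical value can be pulled from scipy's chi-square distribution (here `chi2` is aliased to avoid clashing with scikit-learn's scorer of the same name):

```python
from scipy.stats import chi2 as chi2_dist

alpha = 0.05
dof = 2  # degrees of freedom from Step 4

# Inverse CDF: the value the statistic must exceed at this significance level
critical_value = chi2_dist.ppf(1 - alpha, dof)
print(round(critical_value, 3))  # 5.991
```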

Step 6: Interpret the Result

Compare the statistic to the critical value: χ² ≈ 4.958 < 5.991, so we fail to reject the null hypothesis at α = 0.05. The evidence for a Color–Purchase relationship falls just short of significance (p ≈ 0.084). With only 7 observations, every expected count is below 5, so this worked example is illustrative rather than reliable (see Pitfall 3 below).

💡 P-value interpretation: A p-value < 0.05 would indicate a statistically significant relationship. The further χ² exceeds the critical value, the stronger the evidence of dependence.
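scipy can reproduce Steps 1–6 in one call: `chi2_contingency` takes the observed table and returns the statistic, p-value, degrees of freedom, and expected counts (no Yates correction is applied here because the table is larger than 2×2):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed Color vs. Purchase table from Step 1
observed = np.array([[3, 0],
                     [0, 2],
                     [1, 1]])

stat, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={stat:.3f}, p={p_value:.3f}, dof={dof}")  # chi2=4.958, p=0.084, dof=2
```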

IV. Implementation in Python

Complete Working Example

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder

# 1. Create sample categorical data
data = {
    'Feature_Color': ['Red', 'Blue', 'Red', 'Green', 'Blue', 'Red', 'Green'],
    'Feature_Size': ['L', 'S', 'L', 'S', 'S', 'L', 'L'],
    'Target_Class': ['Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes']
}
df = pd.DataFrame(data)

# 2. Encode categorical variables to numeric
# Chi-square requires non-negative numeric inputs
le = LabelEncoder()
df['Target_Encoded'] = le.fit_transform(df['Target_Class'])

# One-hot encode features (drop_first=True to avoid redundant dummy columns)
X = pd.get_dummies(df[['Feature_Color', 'Feature_Size']], drop_first=True)
y = df['Target_Encoded']

print("Encoded Features:")
print(X.head())

# 3. Apply Chi-Square Test
selector = SelectKBest(score_func=chi2, k=2)  # Select top 2 features
selector.fit(X, y)

# 4. Display results
feature_scores = pd.DataFrame({
    'Feature': X.columns,
    'Chi2_Score': selector.scores_,
    'P_Value': selector.pvalues_
}).sort_values(by='Chi2_Score', ascending=False)

print("\n--- Chi-Square Feature Scores ---")
print(feature_scores)

# 5. Select top features
selected_features = feature_scores.head(2)['Feature'].tolist()
print(f"\nTop 2 Selected Features: {selected_features}")

# 6. Transform data to keep only selected features
X_selected = selector.transform(X)
print(f"\nOriginal shape: {X.shape}")
print(f"Selected shape: {X_selected.shape}")
```

Sample Output:

```
Encoded Features:
   Feature_Color_Green  Feature_Color_Red  Feature_Size_S
0                False               True           False
1                False              False            True
2                False               True           False
3                 True              False            True
4                False              False            True

--- Chi-Square Feature Scores ---
               Feature  Chi2_Score   P_Value
2       Feature_Size_S    4.000000  0.045500
1    Feature_Color_Red    2.250000  0.133614
0  Feature_Color_Green    0.041667  0.838256

Top 2 Selected Features: ['Feature_Size_S', 'Feature_Color_Red']

Original shape: (7, 3)
Selected shape: (7, 2)
```

Interpretation:

- `Feature_Size_S`: χ² = 4.0, p ≈ 0.046 < 0.05, a statistically significant association with the target.
- `Feature_Color_Red`: χ² = 2.25, p ≈ 0.134, not significant at α = 0.05.
- `Feature_Color_Green`: χ² ≈ 0.04, p ≈ 0.838, essentially no evidence of association.
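The surviving columns can also be recovered programmatically via `get_support`. This sketch hard-codes the 0/1 dummies from the example above so it runs standalone:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# One-hot encoded features and target from the worked example (Yes=1, No=0)
X = pd.DataFrame({
    "Feature_Color_Green": [0, 0, 0, 1, 0, 0, 1],
    "Feature_Color_Red":   [1, 0, 1, 0, 0, 1, 0],
    "Feature_Size_S":      [0, 1, 0, 1, 1, 0, 0],
})
y = [1, 0, 1, 0, 0, 1, 1]

selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

# get_support returns a boolean mask over the input columns
print(list(X.columns[selector.get_support()]))
```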

V. Pros and Cons

✅ Advantages

| Advantage | Explanation |
|---|---|
| Fast and simple | Computationally lightweight — ideal for quick initial screening |
| Model-agnostic | Works independently of any ML algorithm (filter method) |
| Statistical rigor | Provides p-values for hypothesis testing |
| Interpretable | Easy to explain to non-technical stakeholders |
| Scalable | Handles large datasets efficiently |

❌ Limitations

| Limitation | Explanation |
|---|---|
| Categorical only | Cannot handle continuous features without binning |
| Univariate | Ignores feature interactions — evaluates features in isolation |
| Independence assumption | Assumes observations are independent |
| Small sample issues | Unreliable when expected counts < 5 per cell |
| No direction | Tells you there's a relationship but not its nature |
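The "categorical only" limitation can often be worked around by discretizing continuous features before scoring them. A minimal sketch using scikit-learn's `KBinsDiscretizer` (the feature, target, and bin count here are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
age = rng.uniform(18, 70, size=100).reshape(-1, 1)  # continuous feature
y = (age.ravel() > 40).astype(int)                  # toy target derived from age

# One-hot encoded bins are non-negative 0/1 columns, which chi2 accepts
binner = KBinsDiscretizer(n_bins=4, encode="onehot-dense", strategy="quantile")
age_binned = binner.fit_transform(age)

scores, pvalues = chi2(age_binned, y)
print(scores.round(2), pvalues.round(4))
```

Note that the choice of bin count and strategy affects the resulting scores, so it is worth trying a few.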

VI. Common Pitfalls and Best Practices

🚫 Pitfall 1: Using Chi-Square for Regression

Problem: Chi-square is only for classification. It cannot handle continuous target variables.

Solution: For regression, use:

- `f_regression` (ANOVA F-test) from `sklearn.feature_selection` for linear relationships
- `mutual_info_regression` for non-linear relationships
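As a sketch of the regression alternative, `f_regression` plugs into `SelectKBest` exactly where `chi2` would go (the synthetic data below is invented; the target depends only on column 0):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)  # continuous target

# Same API as chi2, but valid for a continuous target
selector = SelectKBest(score_func=f_regression, k=1).fit(X, y)
print(selector.get_support())  # column 0 should be the one kept
```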

🚫 Pitfall 2: Forgetting to Encode Features

Problem: Feeding raw strings (or negative values) directly to chi2() causes errors.

Solution:

```python
# Wrong
X = df[['Color', 'Size']]  # Raw strings will raise an error

# Right
X = pd.get_dummies(df[['Color', 'Size']])  # Numeric 0/1 encoding
```

🚫 Pitfall 3: Low Expected Counts

Problem: When expected frequency in any cell < 5, Chi-square becomes unreliable.

Solution:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Check expected counts before trusting the test
# (chi2_stat avoids shadowing sklearn's chi2 from earlier)
table = pd.crosstab(df['Feature'], df['Target'])
chi2_stat, p, dof, expected = chi2_contingency(table)
print("Expected counts:\n", expected)
# All values should be ≥ 5
```

🚫 Pitfall 4: Ignoring P-Values

Problem: High Chi-square score doesn't always mean practical significance.

Solution:

```python
from scipy.stats.contingency import association

# Calculate Cramér's V (effect size) on the contingency table from above
v = association(table, method='cramer')
print(f"Cramér's V: {v:.3f}")
# Rough interpretation: 0.1 = small, 0.3 = medium, 0.5 = large effect
```

VII. Summary

Quick Reference Card

Chi-Square Test Checklist:
✓ Categorical features + categorical target
✓ Non-negative integer inputs (use encoding)
✓ Classification task only
✓ Check expected counts ≥ 5
✓ Interpret both score AND p-value
✓ Use as initial filter, not final selection
✓ Combine with domain knowledge

When to Use Chi-Square

| Scenario | Recommendation |
|---|---|
| Initial feature screening | ✅ Excellent choice |
| High-dimensional data | ✅ Very fast |
| Need statistical validation | ✅ Provides p-values |
| Continuous features | ❌ Use correlation or F-test |
| Regression tasks | ❌ Use F-test or mutual information |
| Feature interactions matter | ❌ Use wrapper methods (RFE) |
| Final feature selection | ⚠️ Combine with other methods |