Chi-Square Test for Feature Selection

The Chi-square test is a fundamental statistical tool for evaluating relationships between categorical features and target variables in classification problems. It answers a simple but critical question: Is this feature truly associated with the outcome, or is any observed pattern just random noise?

I. The Intuitive Analogy — The Party Host

Imagine you're at a party where guests are sorted into rooms based on their favorite music: rock, jazz, classical, and hip-hop. Someone claims that shoe type (sneakers, heels, boots, sandals) is strongly connected to music preference.

The Chi-square test is like a curious host walking around with a clipboard, counting shoe types in each music room. The host asks:

"Are these shoe choices and music preferences truly linked, or is this pattern just random?"

Key observations:

- If every room shows roughly the same mix of shoe types, shoes and music are probably independent.
- If the rock room is full of boots while the classical room is full of heels, the two are likely linked.
- A lopsided count in one room could still be coincidence, so the host needs a way to quantify how surprising the overall pattern is.

Translation to machine learning:

- Rooms = target classes; shoe types = categories of a feature.
- The host's clipboard = the contingency table of observed counts.
- The "how surprising is this?" judgment = the Chi-square statistic and its p-value.

II. What is the Chi-Square Test?

The Chi-square test measures the dependence between a categorical feature and the target class.

Key Characteristics

| Property | Description |
|---|---|
| Type | Univariate filter method (evaluates one feature at a time) |
| Null Hypothesis | Feature and target are independent (no relationship) |
| Test Statistic | Chi-square (χ²) score — higher means stronger dependence |
| P-value | Probability of seeing this relationship by chance — lower is better |
| Data Requirements | Non-negative categorical features (counts or one-hot encoded) |
| Task Type | Classification only (not for regression) |

When to Use Chi-Square

Use when:

- Features are categorical (or have been binned / one-hot encoded)
- The target is categorical (a classification task)
- You need a fast, model-agnostic first pass over many features

Avoid when:

- Features are continuous and you do not want to bin them
- The task is regression (continuous target)
- Expected cell counts are very small (< 5)

III. Step-by-Step: How Chi-Square Works

Step 1: Build the Contingency Table

Count how feature categories are distributed across target classes.

Example: Color vs. Purchase Decision

| Color | Yes (Purchased) | No (Not Purchased) | Row Total |
|---|---|---|---|
| Red | 3 | 0 | 3 |
| Blue | 0 | 2 | 2 |
| Green | 1 | 1 | 2 |
| **Column Total** | 4 | 3 | 7 |
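pandas can build this table in one line. A minimal sketch that re-creates the seven observations above (the variable names are mine):

```python
import pandas as pd

# The seven observations from the example above
colors = ["Red", "Red", "Red", "Blue", "Blue", "Green", "Green"]
bought = ["Yes", "Yes", "Yes", "No", "No", "Yes", "No"]

# Contingency table: feature categories (rows) vs. target classes (columns)
table = pd.crosstab(pd.Series(colors, name="Color"),
                    pd.Series(bought, name="Purchased"))
print(table)
```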

Step 2: Calculate Expected Counts

If feature and target were independent, what counts would we expect?

Formula:

$$E_{ij} = \frac{\text{Row Total}_i \times \text{Column Total}_j}{\text{Grand Total}}$$

Expected counts:

| Color | Yes (Expected) | No (Expected) |
|---|---|---|
| Red | 3 × 4 / 7 = 1.714 | 3 × 3 / 7 = 1.286 |
| Blue | 2 × 4 / 7 = 1.143 | 2 × 3 / 7 = 0.857 |
| Green | 2 × 4 / 7 = 1.143 | 2 × 3 / 7 = 0.857 |
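The whole expected-count matrix falls out of a single outer product of the row and column totals. A quick numpy check of the table above:

```python
import numpy as np

# Observed counts (rows: Red, Blue, Green; columns: Yes, No)
observed = np.array([[3, 0],
                     [0, 2],
                     [1, 1]])

row_totals = observed.sum(axis=1)   # [3, 2, 2]
col_totals = observed.sum(axis=0)   # [4, 3]
grand_total = observed.sum()        # 7

# E_ij = (row total * column total) / grand total, for every cell at once
expected = np.outer(row_totals, col_totals) / grand_total
print(expected.round(3))
```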

Step 3: Compute the Chi-Square Statistic

Measure how much observed counts deviate from expected counts.

Formula:

$$\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Where:

- $O_{ij}$ = observed count in row $i$, column $j$
- $E_{ij}$ = expected count for the same cell
- $R$ = number of feature categories (rows), $C$ = number of target classes (columns)

Calculations:

| Color | Yes Contribution | No Contribution |
|---|---|---|
| Red | (3 − 1.714)² / 1.714 = 0.964 | (0 − 1.286)² / 1.286 = 1.286 |
| Blue | (0 − 1.143)² / 1.143 = 1.143 | (2 − 0.857)² / 0.857 = 1.524 |
| Green | (1 − 1.143)² / 1.143 = 0.018 | (1 − 0.857)² / 0.857 = 0.024 |

$$\chi^2 = 0.964 + 1.286 + 1.143 + 1.524 + 0.018 + 0.024 \approx 4.958$$
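The same sum can be checked numerically. This sketch hard-codes the observed and expected matrices from the tables above:

```python
import numpy as np

observed = np.array([[3, 0], [0, 2], [1, 1]], dtype=float)
expected = np.array([[12 / 7, 9 / 7],
                     [8 / 7, 6 / 7],
                     [8 / 7, 6 / 7]])

# Sum of (O - E)^2 / E over every cell of the table
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(round(chi2_stat, 3))  # 4.958
```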

Step 4: Determine Degrees of Freedom

$$df = (R - 1) \times (C - 1) = (3 - 1) \times (2 - 1) = 2$$

Step 5: Find the Critical Value

From the Chi-square distribution table at α = 0.05 significance level with df = 2:

$$\chi^2_{\text{critical}} = 5.991$$
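Rather than reading a printed table, the critical value can be pulled from scipy's chi-square distribution (here `chi2` is aliased to avoid clashing with scikit-learn's scorer of the same name):

```python
from scipy.stats import chi2 as chi2_dist

alpha = 0.05
dof = 2  # degrees of freedom from Step 4

# Inverse CDF: the value the statistic must exceed at this significance level
critical_value = chi2_dist.ppf(1 - alpha, dof)
print(round(critical_value, 3))  # 5.991
```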

Step 6: Interpret the Result

Compare the statistic to the critical value: χ² ≈ 4.958 < 5.991, so we fail to reject the null hypothesis at α = 0.05. The evidence for a Color–Purchase relationship falls just short of significance (p ≈ 0.084). With only 7 observations, every expected count is below 5, so this worked example is illustrative rather than reliable (see Pitfall 3 below).

💡 P-value interpretation: A p-value < 0.05 would indicate a statistically significant relationship. The further χ² exceeds the critical value, the stronger the evidence of dependence.
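scipy can reproduce Steps 1–6 in one call: `chi2_contingency` takes the observed table and returns the statistic, p-value, degrees of freedom, and expected counts (no Yates correction is applied here because the table is larger than 2×2):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed Color vs. Purchase table from Step 1
observed = np.array([[3, 0],
                     [0, 2],
                     [1, 1]])

stat, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={stat:.3f}, p={p_value:.3f}, dof={dof}")  # chi2=4.958, p=0.084, dof=2
```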

IV. Implementation in Python

Complete Working Example

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder

# 1. Create sample categorical data
data = {
    'Feature_Color': ['Red', 'Blue', 'Red', 'Green', 'Blue', 'Red', 'Green'],
    'Feature_Size': ['L', 'S', 'L', 'S', 'S', 'L', 'L'],
    'Target_Class': ['Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes']
}
df = pd.DataFrame(data)

# 2. Encode categorical variables to numeric
# Chi-square requires non-negative numeric inputs
le = LabelEncoder()
df['Target_Encoded'] = le.fit_transform(df['Target_Class'])

# One-hot encode features (drop_first=True to avoid redundant dummy columns)
X = pd.get_dummies(df[['Feature_Color', 'Feature_Size']], drop_first=True)
y = df['Target_Encoded']

print("Encoded Features:")
print(X.head())

# 3. Apply Chi-Square Test
selector = SelectKBest(score_func=chi2, k=2)  # Select top 2 features
selector.fit(X, y)

# 4. Display results
feature_scores = pd.DataFrame({
    'Feature': X.columns,
    'Chi2_Score': selector.scores_,
    'P_Value': selector.pvalues_
}).sort_values(by='Chi2_Score', ascending=False)

print("\n--- Chi-Square Feature Scores ---")
print(feature_scores)

# 5. Select top features
selected_features = feature_scores.head(2)['Feature'].tolist()
print(f"\nTop 2 Selected Features: {selected_features}")

# 6. Transform data to keep only selected features
X_selected = selector.transform(X)
print(f"\nOriginal shape: {X.shape}")
print(f"Selected shape: {X_selected.shape}")
```

Sample Output:

```
Encoded Features:
   Feature_Color_Green  Feature_Color_Red  Feature_Size_S
0                False               True           False
1                False              False            True
2                False               True           False
3                 True              False            True
4                False              False            True

--- Chi-Square Feature Scores ---
               Feature  Chi2_Score   P_Value
2       Feature_Size_S    4.000000  0.045500
1    Feature_Color_Red    2.250000  0.133614
0  Feature_Color_Green    0.041667  0.838256

Top 2 Selected Features: ['Feature_Size_S', 'Feature_Color_Red']

Original shape: (7, 3)
Selected shape: (7, 2)
```

Interpretation:

- `Feature_Size_S`: χ² = 4.0, p ≈ 0.046 < 0.05, a statistically significant association with the target.
- `Feature_Color_Red`: χ² = 2.25, p ≈ 0.134, not significant at α = 0.05.
- `Feature_Color_Green`: χ² ≈ 0.04, p ≈ 0.838, essentially no evidence of association.
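The surviving columns can also be recovered programmatically via `get_support`. This sketch hard-codes the 0/1 dummies from the example above so it runs standalone:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# One-hot encoded features and target from the worked example (Yes=1, No=0)
X = pd.DataFrame({
    "Feature_Color_Green": [0, 0, 0, 1, 0, 0, 1],
    "Feature_Color_Red":   [1, 0, 1, 0, 0, 1, 0],
    "Feature_Size_S":      [0, 1, 0, 1, 1, 0, 0],
})
y = [1, 0, 1, 0, 0, 1, 1]

selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

# get_support returns a boolean mask over the input columns
print(list(X.columns[selector.get_support()]))
```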

V. Pros and Cons

✅ Advantages

| Advantage | Explanation |
|---|---|
| Fast and simple | Computationally lightweight — ideal for quick initial screening |
| Model-agnostic | Works independently of any ML algorithm (filter method) |
| Statistical rigor | Provides p-values for hypothesis testing |
| Interpretable | Easy to explain to non-technical stakeholders |
| Scalable | Handles large datasets efficiently |

❌ Limitations

| Limitation | Explanation |
|---|---|
| Categorical only | Cannot handle continuous features without binning |
| Univariate | Ignores feature interactions — evaluates features in isolation |
| Independence assumption | Assumes observations are independent |
| Small sample issues | Unreliable when expected counts < 5 per cell |
| No direction | Tells you there's a relationship but not its nature |
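The "categorical only" limitation can often be worked around by discretizing continuous features before scoring them. A minimal sketch using scikit-learn's `KBinsDiscretizer` (the feature, target, and bin count here are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
age = rng.uniform(18, 70, size=100).reshape(-1, 1)  # continuous feature
y = (age.ravel() > 40).astype(int)                  # toy target derived from age

# One-hot encoded bins are non-negative 0/1 columns, which chi2 accepts
binner = KBinsDiscretizer(n_bins=4, encode="onehot-dense", strategy="quantile")
age_binned = binner.fit_transform(age)

scores, pvalues = chi2(age_binned, y)
print(scores.round(2), pvalues.round(4))
```

Note that the choice of bin count and strategy affects the resulting scores, so it is worth trying a few.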

VI. Common Pitfalls and Best Practices

🚫 Pitfall 1: Using Chi-Square for Regression

Problem: Chi-square is only for classification. It cannot handle continuous target variables.

Solution: For regression, use:

- `f_regression` (ANOVA F-test) from `sklearn.feature_selection` for linear relationships
- `mutual_info_regression` for non-linear relationships
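As a sketch of the regression alternative, `f_regression` plugs into `SelectKBest` exactly where `chi2` would go (the synthetic data below is invented; the target depends only on column 0):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)  # continuous target

# Same API as chi2, but valid for a continuous target
selector = SelectKBest(score_func=f_regression, k=1).fit(X, y)
print(selector.get_support())  # column 0 should be the one kept
```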

🚫 Pitfall 2: Forgetting to Encode Features

Problem: Feeding raw strings (or negative values) directly to chi2() causes errors.

Solution:

```python
# Wrong
X = df[['Color', 'Size']]  # Raw strings will raise an error

# Right
X = pd.get_dummies(df[['Color', 'Size']])  # Numeric 0/1 encoding
```

🚫 Pitfall 3: Low Expected Counts

Problem: When expected frequency in any cell < 5, Chi-square becomes unreliable.

Solution:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Check expected counts before trusting the test
# (chi2_stat avoids shadowing sklearn's chi2 from earlier)
table = pd.crosstab(df['Feature'], df['Target'])
chi2_stat, p, dof, expected = chi2_contingency(table)
print("Expected counts:\n", expected)
# All values should be ≥ 5
```

🚫 Pitfall 4: Ignoring P-Values

Problem: High Chi-square score doesn't always mean practical significance.

Solution:

```python
from scipy.stats.contingency import association

# Calculate Cramér's V (effect size) on the contingency table from above
v = association(table, method='cramer')
print(f"Cramér's V: {v:.3f}")
# Rough interpretation: 0.1 = small, 0.3 = medium, 0.5 = large effect
```

VII. Summary

Quick Reference Card

Chi-Square Test Checklist:
✓ Categorical features + categorical target
✓ Non-negative integer inputs (use encoding)
✓ Classification task only
✓ Check expected counts ≥ 5
✓ Interpret both score AND p-value
✓ Use as initial filter, not final selection
✓ Combine with domain knowledge

When to Use Chi-Square

| Scenario | Recommendation |
|---|---|
| Initial feature screening | ✅ Excellent choice |
| High-dimensional data | ✅ Very fast |
| Need statistical validation | ✅ Provides p-values |
| Continuous features | ❌ Use correlation or F-test |
| Regression tasks | ❌ Use F-test or mutual information |
| Feature interactions matter | ❌ Use wrapper methods (RFE) |
| Final feature selection | ⚠️ Combine with other methods |