Chi-Square Test for Feature Selection
The Chi-square test is a fundamental statistical tool for evaluating relationships between categorical features and target variables in classification problems. It answers a simple but critical question: Is this feature truly associated with the outcome, or is any observed pattern just random noise?
I. The Intuitive Analogy — The Party Host
Imagine you're at a party where guests are sorted into rooms based on their favorite music: rock, jazz, classical, and hip-hop. Someone claims that shoe type (sneakers, heels, boots, sandals) is strongly connected to music preference.
The Chi-square test is like a curious host walking around with a clipboard, counting shoe types in each music room. The host asks:
"Are these shoe choices and music preferences truly linked, or is this pattern just random?"
Key observations:
- If 90% of sneaker-wearers end up in the hip-hop room → Strong relationship detected
- If shoe types are evenly distributed across all rooms → No real connection (likely random)
Translation to machine learning:
- Music rooms → Target classes (labels)
- Shoe types → Feature categories
- Host's statistical check → Chi-square test statistic
II. What is the Chi-Square Test?
The Chi-square test measures the dependence between a categorical feature and the target class.
Key Characteristics
| Property | Description |
|---|---|
| Type | Univariate filter method (evaluates one feature at a time) |
| Null Hypothesis | Feature and target are independent (no relationship) |
| Test Statistic | Chi-square (χ²): sum of squared deviations between observed and expected counts |
| P-value | Probability of seeing this relationship by chance — lower is better |
| Data Requirements | Non-negative categorical features (counts or one-hot encoded) |
| Task Type | Classification only (not for regression) |
When to Use Chi-Square
✅ Use when:
- You have categorical features (colors, sizes, regions, etc.)
- Your target is categorical (binary or multi-class classification)
- You need a fast, model-agnostic initial screening
- You want statistical evidence of feature-target relationships
❌ Avoid when:
- Features are continuous (use correlation or mutual information instead)
- Target is continuous (use F-test or correlation for regression)
- You have very small sample sizes (expected counts < 5 per cell)
III. Step-by-Step: How Chi-Square Works
Step 1: Build the Contingency Table
Count how feature categories are distributed across target classes.
Example: Color vs. Purchase Decision
| Color | Yes (Purchased) | No (Not Purchased) | Row Total |
|---|---|---|---|
| Red | 3 | 0 | 3 |
| Blue | 0 | 2 | 2 |
| Green | 1 | 1 | 2 |
| Column Total | 4 | 3 | 7 |
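The table above can be reproduced with pandas. A minimal sketch, using hypothetical column names `Color` and `Purchased` for the worked example:

```python
import pandas as pd

# Data matching the worked example: 3 Red/Yes, 2 Blue/No, 1 Green/Yes, 1 Green/No
df = pd.DataFrame({
    'Color': ['Red', 'Red', 'Red', 'Blue', 'Blue', 'Green', 'Green'],
    'Purchased': ['Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No'],
})

# Build the observed contingency table, with row/column totals as margins
table = pd.crosstab(df['Color'], df['Purchased'],
                    margins=True, margins_name='Total')
print(table)
```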
Step 2: Calculate Expected Counts
If feature and target were independent, what counts would we expect?
Formula:
Eᵢⱼ = (Row Totalᵢ × Column Totalⱼ) / Grand Total
Expected counts:
| Color | Yes (Expected) | No (Expected) |
|---|---|---|
| Red | 3 × 4 / 7 = 1.714 | 3 × 3 / 7 = 1.286 |
| Blue | 2 × 4 / 7 = 1.143 | 2 × 3 / 7 = 0.857 |
| Green | 2 × 4 / 7 = 1.143 | 2 × 3 / 7 = 0.857 |
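The expected counts follow directly from the row and column totals as an outer product. A small NumPy sketch:

```python
import numpy as np

# Observed counts (rows: Red, Blue, Green; columns: Yes, No)
observed = np.array([[3, 0], [0, 2], [1, 1]])

row_totals = observed.sum(axis=1)    # [3, 2, 2]
col_totals = observed.sum(axis=0)    # [4, 3]
grand_total = observed.sum()         # 7

# E_ij = (row total i * column total j) / grand total
expected = np.outer(row_totals, col_totals) / grand_total
print(np.round(expected, 3))
```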
Step 3: Compute the Chi-Square Statistic
Measure how much observed counts deviate from expected counts.
Formula:
χ² = Σᵢ Σⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ
Where:
- Oᵢⱼ = Observed count in cell (i, j)
- Eᵢⱼ = Expected count in cell (i, j)
- i = 1 … r rows (feature categories), j = 1 … c columns (target classes)
Calculations:
| Color | Yes Contribution | No Contribution |
|---|---|---|
| Red | (3 − 1.714)² / 1.714 = 0.964 | (0 − 1.286)² / 1.286 = 1.286 |
| Blue | (0 − 1.143)² / 1.143 = 1.143 | (2 − 0.857)² / 0.857 = 1.524 |
| Green | (1 − 1.143)² / 1.143 = 0.018 | (1 − 0.857)² / 0.857 = 0.024 |
Total (using unrounded terms): χ² ≈ 4.958
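The per-cell contributions and the final statistic can be checked numerically:

```python
import numpy as np

# Observed counts (rows: Red, Blue, Green; columns: Yes, No)
observed = np.array([[3, 0], [0, 2], [1, 1]], dtype=float)
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# (O - E)^2 / E for every cell, then sum over the whole table
contributions = (observed - expected) ** 2 / expected
chi2_stat = contributions.sum()
print(round(chi2_stat, 3))  # 4.958
```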
Step 4: Determine Degrees of Freedom
df = (r − 1) × (c − 1) = (3 − 1) × (2 − 1) = 2
Step 5: Find the Critical Value
From the Chi-square distribution table at α = 0.05 significance level with df = 2:
Critical value = 5.991
Step 6: Interpret the Result
- Our χ² ≈ 4.958
- Critical value = 5.991
- Conclusion: Since 4.958 < 5.991, we fail to reject the null hypothesis at the 5% significance level
- Interpretation: The relationship between color and purchase decision is not statistically significant, though it's close to the threshold
💡 P-value interpretation: A p-value < 0.05 would indicate a statistically significant relationship. The larger χ² is relative to the critical value, the stronger the evidence of dependence.
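The whole hand calculation can be verified with SciPy. A sketch (for tables larger than 2×2, `chi2_contingency` applies no Yates correction, so it matches the plain Pearson statistic above):

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts from the worked example
observed = np.array([[3, 0], [0, 2], [1, 1]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
critical = chi2.ppf(0.95, df=dof)  # critical value at alpha = 0.05

print(round(chi2_stat, 3), dof, round(critical, 3))
```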
IV. Implementation in Python
Complete Working Example
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder
# 1. Create sample categorical data
data = {
'Feature_Color': ['Red', 'Blue', 'Red', 'Green', 'Blue', 'Red', 'Green'],
'Feature_Size': ['L', 'S', 'L', 'S', 'S', 'L', 'L'],
'Target_Class': ['Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes']
}
df = pd.DataFrame(data)
# 2. Encode categorical variables to numeric
# Chi-square requires non-negative integer inputs
le = LabelEncoder()
df['Target_Encoded'] = le.fit_transform(df['Target_Class'])
# One-hot encode features (drop_first=True to avoid multicollinearity)
X = pd.get_dummies(df[['Feature_Color', 'Feature_Size']], drop_first=True)
y = df['Target_Encoded']
print("Encoded Features:")
print(X.head())
# 3. Apply Chi-Square Test
selector = SelectKBest(score_func=chi2, k=2) # Select top 2 features
selector.fit(X, y)
# 4. Display results
feature_scores = pd.DataFrame({
'Feature': X.columns,
'Chi2_Score': selector.scores_,
'P_Value': selector.pvalues_
}).sort_values(by='Chi2_Score', ascending=False)
print("\n--- Chi-Square Feature Scores ---")
print(feature_scores)
# 5. Select top features
selected_features = feature_scores.head(2)['Feature'].tolist()
print(f"\nTop 2 Selected Features: {selected_features}")
# 6. Transform data to keep only selected features
X_selected = selector.transform(X)
print(f"\nOriginal shape: {X.shape}")
print(f"Selected shape: {X_selected.shape}")
Sample Output:
Encoded Features:
Feature_Color_Green Feature_Color_Red Feature_Size_S
0 False True False
1 False False True
2 False True False
3 True False True
4 False False True
--- Chi-Square Feature Scores ---
Feature Chi2_Score P_Value
2 Feature_Size_S 4.000000 0.045500
1 Feature_Color_Red 2.250000 0.133614
0 Feature_Color_Green 0.041667 0.838256
Top 2 Selected Features: ['Feature_Size_S', 'Feature_Color_Red']
Original shape: (7, 3)
Selected shape: (7, 2)
Interpretation:
- Feature_Size_S has the highest score (4.0) and lowest p-value (0.046) → statistically significant at α = 0.05
- Feature_Color_Red has moderate score (2.25) but p-value (0.134) > 0.05 → not statistically significant
- Feature_Color_Green has very low score → weak relationship
V. Pros and Cons
✅ Advantages
| Advantage | Explanation |
|---|---|
| Fast and simple | Computationally lightweight — ideal for quick initial screening |
| Model-agnostic | Works independently of any ML algorithm (filter method) |
| Statistical rigor | Provides p-values for hypothesis testing |
| Interpretable | Easy to explain to non-technical stakeholders |
| Scalable | Handles large datasets efficiently |
❌ Limitations
| Limitation | Explanation |
|---|---|
| Categorical only | Cannot handle continuous features without binning |
| Univariate | Ignores feature interactions — evaluates features in isolation |
| Independence assumption | Assumes observations are independent |
| Small sample issues | Unreliable when expected counts < 5 per cell |
| No direction | Tells you there's a relationship but not its nature |
VI. Common Pitfalls and Best Practices
🚫 Pitfall 1: Using Chi-Square for Regression
Problem: Chi-square is only for classification. It cannot handle continuous target variables.
Solution: For regression, use:
- ANOVA F-test (for linear relationships)
- Mutual Information (for non-linear relationships)
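A sketch of both regression alternatives on synthetic data (the data and feature layout here are illustrative, not from the example above):

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

# Synthetic continuous data: y depends on the first feature, the second is noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=200)

f_scores, p_values = f_regression(X, y)                    # linear relationships
mi_scores = mutual_info_regression(X, y, random_state=0)   # non-linear as well
print(f_scores, mi_scores)
```

Both score the informative feature far above the noise feature; mutual information additionally picks up non-linear dependence that the F-test would miss.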
🚫 Pitfall 2: Forgetting to Encode Features
Problem: Feeding raw strings or floats directly to chi2() causes errors.
Solution:
# Wrong
X = df[['Color', 'Size']] # Strings — will fail
# Right
X = pd.get_dummies(df[['Color', 'Size']]) # Numeric encoding
🚫 Pitfall 3: Low Expected Counts
Problem: When expected frequency in any cell < 5, Chi-square becomes unreliable.
Solution:
- Combine rare categories
- Use Fisher's Exact Test for small samples
- Check expected counts before trusting results:
from scipy.stats import chi2_contingency
# Check expected counts (chi2_stat avoids shadowing sklearn's chi2)
table = pd.crosstab(df['Feature'], df['Target'])
chi2_stat, p, dof, expected = chi2_contingency(table)
print("Expected counts:\n", expected)
# All values should be ≥ 5
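For the small-sample case, a minimal Fisher's exact sketch (it applies to 2×2 tables only; the counts here are hypothetical):

```python
from scipy.stats import fisher_exact

# Tiny 2x2 table where chi-square's expected counts would fall below 5
table = [[3, 0],
         [0, 2]]

odds_ratio, p_value = fisher_exact(table)
print(p_value)
```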
🚫 Pitfall 4: Ignoring P-Values
Problem: High Chi-square score doesn't always mean practical significance.
Solution:
- Always check p-value (should be < 0.05 for significance)
- Consider effect size (Cramér's V) for practical importance
from scipy.stats.contingency import association
# Calculate Cramér's V (effect size)
v = association(table, method='cramer')
print(f"Cramér's V: {v:.3f}")
# Interpretation: 0.1=small, 0.3=medium, 0.5=large effect
VII. Summary
Quick Reference Card
Chi-Square Test Checklist:
✓ Categorical features + categorical target
✓ Non-negative integer inputs (use encoding)
✓ Classification task only
✓ Check expected counts ≥ 5
✓ Interpret both score AND p-value
✓ Use as initial filter, not final selection
✓ Combine with domain knowledge
When to Use Chi-Square
| Scenario | Recommendation |
|---|---|
| Initial feature screening | ✅ Excellent choice |
| High-dimensional data | ✅ Very fast |
| Need statistical validation | ✅ Provides p-values |
| Continuous features | ❌ Use correlation or F-test |
| Regression tasks | ❌ Use F-test or mutual information |
| Feature interactions matter | ❌ Use wrapper methods (RFE) |
| Final feature selection | ⚠️ Combine with other methods |
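The wrapper-method alternative mentioned for feature interactions (RFE) can be sketched on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification data; RFE scores features through a model,
# so it can account for interactions that univariate chi-square ignores
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the retained features
```

Because the estimator is refit once per eliminated feature, RFE is far more expensive than the chi-square filter, which is why chi-square remains the better first-pass screen.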