Fisher's Score for Feature Selection
Fisher's Score is a powerful statistical method for identifying features that best separate different classes or predict continuous outcomes. It answers a fundamental question: Which features create the clearest distinction between groups?
I. The Intuitive Analogy — The Sports Coach
Imagine you're a coach selecting players for a championship team. You have data on hundreds of athletes — height, speed, stamina, accuracy, reaction time, and more. But which attributes truly matter?
You notice:
- Speed and accuracy clearly separate elite players from average ones
- Height and age show little difference between performance levels
A good coach focuses on the metrics that distinguish winners from the rest. Fisher's Score does exactly this in machine learning — it identifies features that best separate classes or predict outcomes.
The key insight: Just as a coach looks for skills that differentiate great players (high between-group difference, low within-group variation), Fisher's Score finds features where:
- Class means are far apart (high between-class variance)
- Data within each class is tightly clustered (low within-class variance)
II. What is Fisher's Score?
Fisher's Score is a supervised, univariate filter method used primarily for classification, though it extends to regression through the F-statistic framework.
Core Concept
It measures how well a feature discriminates between classes by comparing:
- Between-class variance — How far apart are the class means?
- Within-class variance — How spread out are samples within each class?
Intuitive Interpretation
Think of two teams warming up before a game:
| Scenario | Between-Team Distance | Within-Team Organization | Fisher's Score | Can You Tell Them Apart? |
|---|---|---|---|---|
| High Score | Teams far apart | Each team tightly grouped | High | ✅ Easy — clear distinction |
| Low Score | Teams mixed together | Players scattered everywhere | Low | ❌ Difficult — lots of overlap |
A high Fisher's Score means the feature creates clear, well-separated clusters — exactly what we want for classification.
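The two scenarios in the table can be checked numerically. A minimal sketch using scikit-learn's `f_classif` (the synthetic distributions here are illustrative):

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(42)
y = np.repeat([0, 1], 50)

# High score: class means far apart, each class tightly clustered
well_separated = np.concatenate([rng.normal(0, 0.5, 50), rng.normal(5, 0.5, 50)])
# Low score: class means close together, samples widely scattered
overlapping = np.concatenate([rng.normal(0, 2.0, 50), rng.normal(0.5, 2.0, 50)])

X = np.column_stack([well_separated, overlapping])
scores, _ = f_classif(X, y)
print(scores)  # first feature scores far higher than the second
```

The well-separated feature produces a score orders of magnitude larger, mirroring the "easy to tell apart" row of the table.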
III. Mathematical Foundation
For Binary/Multi-Class Classification
For a feature $j$, Fisher's Score compares the spread of the class means to the spread within each class:

$$F_j = \frac{\sum_{k=1}^{K} n_k\,(\mu_{j,k} - \mu_j)^2}{\sum_{k=1}^{K} n_k\,\sigma_{j,k}^2}$$

Where:
- $n_k$ — Number of samples in class $k$
- $\mu_{j,k}$ — Mean of feature $j$ in class $k$
- $\mu_j$ — Overall mean of feature $j$
- $\sigma_{j,k}^2$ — Variance of feature $j$ within class $k$

Numerator (Between-class variance): Weighted sum of squared distances between class means and the global mean
Denominator (Within-class variance): Weighted sum of variances within each class
💡 Higher $F_j$ values → Feature separates classes better → More useful for prediction
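The formula translates directly into a few lines of NumPy — a minimal sketch of the definition above (using the population variance within each class), not the scikit-learn implementation:

```python
import numpy as np

def fisher_score(x, y):
    """Fisher's Score of one feature: between-class over within-class variance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    mu = x.mean()                                     # overall mean
    between = within = 0.0
    for k in np.unique(y):
        xk = x[y == k]
        between += len(xk) * (xk.mean() - mu) ** 2    # n_k * (mu_k - mu)^2
        within += len(xk) * xk.var()                  # n_k * sigma_k^2
    return between / within

x = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])
y = np.array([0, 0, 0, 1, 1, 1])
print(fisher_score(x, y))  # ≈ 240 — well-separated classes score high
```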
Connection to ANOVA F-Test
Fisher's Score is closely related to the F-statistic from one-way ANOVA (Analysis of Variance). scikit-learn's `f_classif` computes this F-statistic, which rescales the between/within ratio by its degrees of freedom ($K$ classes, $N$ samples):

$$F = \frac{\text{between-class variance}\,/\,(K-1)}{\text{within-class variance}\,/\,(N-K)}$$

Because the rescaling is identical for every feature, the feature ranking is unchanged.
This gives us two outputs:
- F-statistic (Fisher's Score) — Effect size (magnitude of separation)
- p-value — Statistical significance (is this separation real or random?)
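This equivalence is easy to verify: `scipy.stats.f_oneway` and `f_classif` return the same statistic (a small check, assuming SciPy is available):

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

x = np.array([1.0, 1.2, 0.8, 3.0, 3.3, 2.9])
y = np.array([0, 0, 0, 1, 1, 1])

# One-way ANOVA on the per-class samples
F_scipy, p_scipy = f_oneway(x[y == 0], x[y == 1])
# Same statistic, computed column-wise over a feature matrix
F_skl, p_skl = f_classif(x.reshape(-1, 1), y)
print(F_scipy, F_skl[0])  # identical values
```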
IV. Implementation in Python
1. Classification Example — Player Performance
Classifying athletes as "Amateur" or "Pro" based on speed and accuracy.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
# Sample data: athlete performance metrics
data = {
'speed': [6.1, 6.3, 6.8, 7.5, 8.2, 8.6, 9.1, 9.3],
'accuracy': [60, 65, 63, 70, 85, 87, 90, 92],
'stamina': [50, 55, 58, 60, 65, 68, 70, 72],
'label': ['Amateur', 'Amateur', 'Amateur', 'Amateur',
'Pro', 'Pro', 'Pro', 'Pro']
}
df = pd.DataFrame(data)
# Encode target
df['label_encoded'] = df['label'].map({'Amateur': 0, 'Pro': 1})
# Separate features and target
X = df[['speed', 'accuracy', 'stamina']]
y = df['label_encoded']
# Compute Fisher's Score via ANOVA F-test
f_scores, p_values = f_classif(X, y)
# Create results dataframe
results = pd.DataFrame({
'Feature': X.columns,
'Fisher_Score': f_scores,
'P_Value': p_values
}).sort_values('Fisher_Score', ascending=False)
print("--- Fisher's Score Ranking ---")
print(results)
print("\nInterpretation:")
for idx, row in results.iterrows():
sig = "✅ Significant" if row['P_Value'] < 0.05 else "❌ Not significant"
print(f"{row['Feature']:12s}: F={row['Fisher_Score']:6.2f}, p={row['P_Value']:.4f} {sig}")
Sample Output:
--- Fisher's Score Ranking ---
Feature Fisher_Score P_Value
1 accuracy 84.292683 0.000094
0 speed 28.407602 0.001778
2 stamina 24.287425 0.002635
Interpretation:
accuracy    : F= 84.29, p=0.0001 ✅ Significant
speed       : F= 28.41, p=0.0018 ✅ Significant
stamina     : F= 24.29, p=0.0026 ✅ Significant
Interpretation:
- accuracy has the highest Fisher's Score (84.29) → Best discriminator between Amateur/Pro
- speed is second (28.41) → Also useful but less distinctive
- stamina is weakest (24.29) → Still significant but lowest separation power
2. Feature Selection Pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Build pipeline with Fisher's Score selection
pipeline = Pipeline([
('scaler', StandardScaler()), # 1. Normalize features
('fisher_select', SelectKBest(f_classif, k=2)), # 2. Keep top 2 by Fisher's Score
('classifier', LogisticRegression()) # 3. Train model
])
# Train
pipeline.fit(X_train, y_train)
# Evaluate
score = pipeline.score(X_test, y_test)
print(f"\nModel Accuracy: {score:.2%}")
# Get selected features
selector = pipeline.named_steps['fisher_select']
selected_features = X.columns[selector.get_support()].tolist()
print(f"Selected Features: {selected_features}")
3. Regression Example — Marketing Campaign
Predicting sales based on marketing spend across channels.
from sklearn.feature_selection import f_regression
# Sample data: marketing channels vs sales
data_reg = {
'social_media_spend': [20, 25, 30, 40, 50, 60, 70, 80],
'tv_spend': [30, 35, 45, 55, 65, 70, 80, 90],
'email_spend': [5, 10, 12, 15, 18, 20, 22, 25],
'sales': [100, 120, 150, 200, 250, 280, 330, 400]
}
df_reg = pd.DataFrame(data_reg)
X_reg = df_reg[['social_media_spend', 'tv_spend', 'email_spend']]
y_reg = df_reg['sales']
# Compute F-statistic for regression
f_scores_reg, p_values_reg = f_regression(X_reg, y_reg)
results_reg = pd.DataFrame({
'Feature': X_reg.columns,
'F_Score': f_scores_reg,
'P_Value': p_values_reg
}).sort_values('F_Score', ascending=False)
print("--- Fisher's Score for Regression ---")
print(results_reg)
Sample Output:
--- Fisher's Score for Regression ---
Feature F_Score P_Value
0 social_media_spend 1015.560006 0.000000
1 tv_spend 526.177617 0.000001
2 email_spend 180.234521 0.000045
Interpretation:
- social_media_spend has the strongest linear relationship with sales
- tv_spend is second
- email_spend shows weaker (but still significant) correlation
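For a single feature, the regression F-statistic reduces to a function of the Pearson correlation: $F = \frac{r^2}{1-r^2}(n-2)$. A quick sanity check (a sketch; the data here is illustrative):

```python
import numpy as np
from sklearn.feature_selection import f_regression

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 6.2])

r = np.corrcoef(x, y)[0, 1]                    # Pearson correlation
F_manual = r**2 / (1 - r**2) * (len(x) - 2)    # degrees of freedom: n - 2
F_skl, _ = f_regression(x.reshape(-1, 1), y)
print(F_manual, F_skl[0])  # identical values
```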
V. When to Use Fisher's Score
✅ Use Fisher's Score When:
| Scenario | Why It Works |
|---|---|
| Classification tasks | Designed to measure class separation |
| Linear relationships | Assumes linear separability between classes |
| Fast initial screening | Computationally cheap filter method |
| Statistical validation needed | Provides p-values for significance testing |
| Continuous features | Works natively with numeric data |
| Need feature ranking | Gives interpretable scores for comparison |
❌ Avoid Fisher's Score When:
| Scenario | Use Instead |
|---|---|
| Non-linear relationships | Mutual Information, tree-based importance |
| Categorical features | Chi-square test |
| Feature interactions matter | Wrapper methods (RFE), embedded methods |
| High multicollinearity | Lasso, Ridge, or correlation analysis first |
| Very imbalanced classes | Consider resampling or weighted metrics |
VI. Practical Considerations
1. Feature Scaling
The F-ratio itself is scale-invariant: rescaling a feature multiplies its between-class and within-class variances by the same factor, so the score is unchanged. Standardization is therefore not required for the scores themselves, but it remains good practice inside a pipeline because downstream models (logistic regression, k-NN, SVM) are scale-sensitive.
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import f_classif
import pandas as pd
# Two features on very different scales
X_demo = pd.DataFrame({
    'height_cm': [150, 160, 170, 180],            # Range: ~30
    'income_usd': [30000, 45000, 60000, 80000]    # Range: ~50,000
})
y_demo = [0, 0, 1, 1]
f_raw, _ = f_classif(X_demo, y_demo)
f_scaled, _ = f_classif(StandardScaler().fit_transform(X_demo), y_demo)
# f_raw and f_scaled are numerically identical — scaling does not change the F-ratio
2. Linearity Assumption
Fisher's Score assumes linear separability. For non-linear patterns:
from sklearn.feature_selection import mutual_info_classif
# Example: Non-linear relationship
X_nonlinear = pd.DataFrame({
'linear_feature': [1, 2, 3, 4, 5, 6, 7, 8],
'quadratic_feature': [1, 4, 9, 16, 25, 36, 49, 64] # x²
})
y_nonlinear = [0, 0, 0, 0, 1, 1, 1, 1]
# Fisher's Score (may miss non-linearity)
f_scores, _ = f_classif(X_nonlinear, y_nonlinear)
# Mutual Information (captures non-linearity)
mi_scores = mutual_info_classif(X_nonlinear, y_nonlinear)
print(pd.DataFrame({
'Feature': X_nonlinear.columns,
'Fisher_Score': f_scores,
'Mutual_Info': mi_scores
}))
Output (mutual information estimates vary slightly between runs):
Feature Fisher_Score Mutual_Info
0 linear_feature 19.20000 0.422024
1 quadratic_feature 15.90184 0.494940
3. Feature Type Compatibility
| Feature Type | Works? | Notes |
|---|---|---|
| Continuous | ✅ Yes | Native support |
| Ordinal | ✅ Yes | Treat as numeric after encoding |
| Binary | ✅ Yes | Encode as 0/1 |
| Categorical (unordered) | ❌ No | Use Chi-square or one-hot encode first |
| Discretized numeric | ⚠️ Depends | Works if encoding preserves order |
VII. Pros and Cons
✅ Advantages
| Advantage | Explanation |
|---|---|
| Fast computation | Univariate statistics — linear in samples and features |
| Statistically sound | Based on ANOVA — well-understood theory |
| Provides p-values | Enables hypothesis testing (significance) |
| Interpretable | Easy to explain to stakeholders |
| Versatile | Works for classification (f_classif) and regression (f_regression) |
| Model-agnostic | Filter method — independent of ML algorithm |
❌ Limitations
| Limitation | Explanation | Mitigation |
|---|---|---|
| Linear assumption | Misses non-linear patterns | Use Mutual Information or tree-based methods |
| Univariate | Ignores feature interactions | Combine with wrapper methods (RFE) |
| Sensitive to outliers | Class means and variances are distorted by extreme values | Inspect distributions; winsorize or transform |
| Assumes normality | Performs best with normally distributed features | Check distributions; consider transformations |
| Multicollinearity | Redundant features both score high | Check correlation matrix; use VIF |
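The last row mentions VIF; a convenient identity is that each feature's VIF equals the corresponding diagonal entry of the inverse correlation matrix. A NumPy-only sketch (the synthetic data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = 0.95 * a + rng.normal(scale=0.1, size=500)  # nearly collinear with a
c = rng.normal(size=500)                        # independent feature
X = np.column_stack([a, b, c])

# VIF_i = i-th diagonal entry of the inverse correlation matrix
R = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(R))
print(np.round(vif, 1))  # a and b: large VIF; c: close to 1
```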
VIII. Common Pitfalls and Best Practices
🚫 Pitfall 1: Expecting Scaling to Change the Scores
Problem: It is tempting to standardize features so their F-scores become "comparable" — but the F-ratio is already scale-invariant, so scaling leaves the scores untouched. What scaling does affect is the model that runs after the filter.
Solution: Standardize inside the pipeline for the downstream estimator's benefit
# F-scores are unchanged by scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# f_classif(X, y) and f_classif(X_scaled, y) return identical scores;
# scaling matters for the classifier that follows, not for the filter
🚫 Pitfall 2: Treating All High Scores as Important
Problem: High F-score doesn't guarantee practical importance.
Solution: Check p-values and effect size
# Filter by both F-score AND p-value
significant_features = results[
(results['Fisher_Score'] > 10) & # High score
(results['P_Value'] < 0.05) # Statistically significant
]
🚫 Pitfall 3: Ignoring Multicollinearity
Problem: Correlated features both score high but provide redundant information.
Solution: Check correlation before selection
# Check correlation matrix
correlation_matrix = X.corr()
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
for j in range(i):
if abs(correlation_matrix.iloc[i, j]) > 0.9:
high_corr_pairs.append((
correlation_matrix.columns[i],
correlation_matrix.columns[j],
correlation_matrix.iloc[i, j]
))
if high_corr_pairs:
print("⚠️ Highly correlated features detected:")
for feat1, feat2, corr in high_corr_pairs:
print(f" {feat1} <-> {feat2}: {corr:.3f}")
🚫 Pitfall 4: Using with Imbalanced Classes
Problem: Rare classes can distort variance calculations.
Solution: Consider class weighting or resampling
from collections import Counter
class_counts = Counter(y)
print(f"Class distribution: {class_counts}")
if min(class_counts.values()) / max(class_counts.values()) < 0.1:
print("⚠️ Severe class imbalance detected!")
print("Consider: SMOTE, class_weight='balanced', or stratified sampling")
✅ Best Practices
- Standardize features in the pipeline — the F-ratio is scale-invariant, but downstream models are not
- Check p-values in addition to F-scores (statistical significance ≠ practical importance)
- Visualize distributions to verify linear separability assumption
- Combine with correlation analysis to remove redundant high-scoring features
- Use as first filter in a multi-stage selection pipeline
- Validate with cross-validation to ensure selected features generalize
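The last point deserves emphasis: performing selection inside each CV fold avoids selection leakage. A sketch on synthetic data (the dataset parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(f_classif, k=5)),   # refit on each training fold
    ('clf', LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)     # selection happens per fold
print(scores.mean())
```

Because the pipeline is refit within each fold, the filter never sees the fold's test data, giving an honest estimate of generalization.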
IX. Fisher's Score vs. Other Methods
| Method | Type | Captures Non-linearity | Feature Interactions | Speed | Best For |
|---|---|---|---|---|---|
| Fisher's Score | Filter | ❌ No | ❌ No | ⚡⚡⚡ Fast | Linear classification |
| Chi-Square | Filter | ❌ No | ❌ No | ⚡⚡⚡ Fast | Categorical features |
| Mutual Information | Filter | ✅ Yes | ❌ No | ⚡⚡ Medium | Non-linear relationships |
| RFE | Wrapper | ✅ Yes | ✅ Yes | 🐌 Slow | Model-specific optimization |
| Lasso | Embedded | ❌ No | ⚠️ Limited | ⚡⚡ Medium | High-dimensional linear models |
| Tree-based Importance | Embedded | ✅ Yes | ✅ Yes | ⚡⚡ Medium | Tree ensemble models |
X. Relationship to sklearn Functions
In scikit-learn, Fisher's Score is implemented through ANOVA F-test functions:
from sklearn.feature_selection import f_classif, f_regression
# For classification
f_scores_class, p_values_class = f_classif(X, y)
# Returns: (Fisher's Score, p-value from ANOVA F-test)
# For regression
f_scores_reg, p_values_reg = f_regression(X, y)
# Returns: (F-statistic, p-value from F-test)
What you get:
- f_scores → Fisher's Score (magnitude of separation/correlation)
- p_values → Statistical significance (probability of observing this separation by chance)
Practical usage:
- For feature selection → Rank by F-scores (higher = better)
- For statistical validation → Filter by p-values (< 0.05 = significant)
# Combined approach
selector = SelectKBest(f_classif, k=10) # Keep top 10 by F-score
X_selected = selector.fit_transform(X, y)
# Get both scores and p-values
scores = selector.scores_
pvalues = selector.pvalues_
# Filter by both criteria
selected_idx = (scores > np.median(scores)) & (pvalues < 0.05)
XI. Summary
Quick Reference Card
Fisher's Score Feature Selection:
✓ Supervised filter method (requires target labels)
✓ Measures between-class vs within-class variance
✓ Higher score = better class separation
✓ Scale-invariant score — standardize for the downstream model, not the filter
✓ Check p-values for statistical significance
✓ Works for classification (f_classif) and regression (f_regression)
✓ Assumes linear relationships and independence
✓ Use as first filter, then apply wrapper/embedded methods
Recommended Workflow
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
# Complete pipeline
pipeline = Pipeline([
('scaler', StandardScaler()), # 1. Normalize
('fisher', SelectKBest(f_classif, k=20)), # 2. Fisher's Score filter
('model', RandomForestClassifier()) # 3. Final model
])
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)