Fisher's Score for Feature Selection

Fisher's Score is a powerful statistical method for identifying features that best separate different classes or predict continuous outcomes. It answers a fundamental question: Which features create the clearest distinction between groups?

I. The Intuitive Analogy — The Sports Coach

Imagine you're a coach selecting players for a championship team. You have data on hundreds of athletes — height, speed, stamina, accuracy, reaction time, and more. But which attributes truly matter?

You notice that some attributes barely differ between elite and average players, while others separate them sharply. A good coach focuses on the metrics that distinguish winners from the rest. Fisher's Score does exactly this in machine learning — it identifies features that best separate classes or predict outcomes.

The key insight: Just as a coach looks for skills that differentiate great players (high between-group difference, low within-group variation), Fisher's Score finds features where:

  1. Class means are far apart (high between-class variance)
  2. Samples within each class are tightly clustered (low within-class variance)

II. What is Fisher's Score?

Fisher's Score is a supervised, univariate filter method used primarily for classification, though it extends to regression through the F-statistic framework.

Core Concept

It measures how well a feature discriminates between classes by comparing:

  1. Between-class variance — How far apart are the class means?
  2. Within-class variance — How spread out are samples within each class?

$$\text{Fisher's Score} = \frac{\text{Between-class variance}}{\text{Within-class variance}}$$

Intuitive Interpretation

Think of two teams warming up before a game:

| Scenario | Between-Team Distance | Within-Team Organization | Fisher's Score | Can You Tell Them Apart? |
| --- | --- | --- | --- | --- |
| High Score | Teams far apart | Each team tightly grouped | High | ✅ Easy — clear distinction |
| Low Score | Teams mixed together | Players scattered everywhere | Low | ❌ Difficult — lots of overlap |

A high Fisher's Score means the feature creates clear, well-separated clusters — exactly what we want for classification.

III. Mathematical Foundation

For Binary/Multi-Class Classification

For a feature X across c classes:

$$F(X) = \frac{\sum_{i=1}^{c} n_i (\mu_i - \mu)^2}{\sum_{i=1}^{c} n_i \sigma_i^2}$$

Where:

- $n_i$ is the number of samples in class $i$, $\mu_i$ is the mean of the feature within class $i$, $\mu$ is the global mean, and $\sigma_i^2$ is the variance within class $i$
- Numerator (between-class variance): weighted sum of squared distances between class means and the global mean
- Denominator (within-class variance): weighted sum of variances within each class

💡 Higher F(X) values → Feature separates classes better → More useful for prediction

Connection to ANOVA F-Test

Fisher's Score is directly tied to the F-statistic from ANOVA (Analysis of Variance); the two differ only by a constant degrees-of-freedom factor, so they rank features identically:

$$F = \frac{\text{Mean Square Between Groups}}{\text{Mean Square Within Groups}}$$

This gives us two outputs:

  1. F-statistic (Fisher's Score) — Effect size (magnitude of separation)
  2. p-value — Statistical significance (is this separation real or random?)
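The connection can be checked numerically. Below is a minimal sketch that computes Fisher's Score from the formula above on a small made-up feature and confirms that scikit-learn's `f_classif` returns the same number once the degrees-of-freedom factor $(N - c)/(c - 1)$ is applied:

```python
import numpy as np
from sklearn.feature_selection import f_classif

def fisher_score(x, y):
    """Weighted between-class variance divided by weighted within-class variance."""
    classes = np.unique(y)
    mu = x.mean()
    between = sum(len(x[y == c]) * (x[y == c].mean() - mu) ** 2 for c in classes)
    within = sum(len(x[y == c]) * x[y == c].var() for c in classes)  # var with ddof=0
    return between / within

x = np.array([6.1, 6.3, 6.8, 7.5, 8.2, 8.6, 9.1, 9.3])  # e.g. player speed
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

score = fisher_score(x, y)
N, c = len(x), len(np.unique(y))
anova_f, _ = f_classif(x.reshape(-1, 1), y)
print(score * (N - c) / (c - 1))  # ≈ 28.41, equal to anova_f[0]
```

Because the factor $(N - c)/(c - 1)$ is the same for every feature in a dataset, ranking by Fisher's Score and ranking by the ANOVA F-statistic always agree.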

IV. Implementation in Python

1. Classification Example — Player Performance

Classifying athletes as "Amateur" or "Pro" based on speed and accuracy.

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Sample data: athlete performance metrics
data = {
    'speed': [6.1, 6.3, 6.8, 7.5, 8.2, 8.6, 9.1, 9.3],
    'accuracy': [60, 65, 63, 70, 85, 87, 90, 92],
    'stamina': [50, 55, 58, 60, 65, 68, 70, 72],
    'label': ['Amateur', 'Amateur', 'Amateur', 'Amateur',
              'Pro', 'Pro', 'Pro', 'Pro']
}
df = pd.DataFrame(data)

# Encode target
df['label_encoded'] = df['label'].map({'Amateur': 0, 'Pro': 1})

# Separate features and target
X = df[['speed', 'accuracy', 'stamina']]
y = df['label_encoded']

# Compute Fisher's Score via ANOVA F-test
f_scores, p_values = f_classif(X, y)

# Create results dataframe
results = pd.DataFrame({
    'Feature': X.columns,
    'Fisher_Score': f_scores,
    'P_Value': p_values
}).sort_values('Fisher_Score', ascending=False)

print("--- Fisher's Score Ranking ---")
print(results)
print("\nInterpretation:")
for idx, row in results.iterrows():
    sig = "✅ Significant" if row['P_Value'] < 0.05 else "❌ Not significant"
    print(f"{row['Feature']:12s}: F={row['Fisher_Score']:6.2f}, p={row['P_Value']:.4f} {sig}")

Sample Output:

--- Fisher's Score Ranking ---
    Feature  Fisher_Score   P_Value
1  accuracy     84.292683  0.000051
0     speed     28.407602  0.001575
2   stamina     24.287425  0.002635

Interpretation:
accuracy    : F= 84.29, p=0.0001 ✅ Significant
speed       : F= 28.41, p=0.0016 ✅ Significant
stamina     : F= 24.29, p=0.0026 ✅ Significant

Interpretation:

Accuracy separates amateurs from pros most sharply (F ≈ 84), followed by speed and stamina. All three p-values fall below 0.05, so each separation is statistically significant, but accuracy would be the first feature to keep.

2. Feature Selection Pipeline

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Build pipeline with Fisher's Score selection
pipeline = Pipeline([
    ('scaler', StandardScaler()),                    # 1. Normalize features
    ('fisher_select', SelectKBest(f_classif, k=2)),  # 2. Keep top 2 by Fisher's Score
    ('classifier', LogisticRegression())              # 3. Train model
])

# Train
pipeline.fit(X_train, y_train)

# Evaluate
score = pipeline.score(X_test, y_test)
print(f"\nModel Accuracy: {score:.2%}")

# Get selected features
selector = pipeline.named_steps['fisher_select']
selected_features = X.columns[selector.get_support()].tolist()
print(f"Selected Features: {selected_features}")
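Rather than hard-coding `k=2`, the number of retained features can itself be tuned by cross-validation. A sketch on synthetic data (`make_classification` stands in for a real dataset; names like `X_demo` are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 8 features, only 3 truly informative
X_demo, y_demo = make_classification(n_samples=150, n_features=8,
                                     n_informative=3, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(f_classif)),
    ('clf', LogisticRegression())
])

# Let cross-validation choose how many features to keep
search = GridSearchCV(pipe, {'select__k': [2, 4, 6, 8]}, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Tuning `k` inside the pipeline avoids leaking test information into the selection step, since the F-scores are recomputed on each training fold.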

3. Regression Example — Marketing Campaign

Predicting sales based on marketing spend across channels.

from sklearn.feature_selection import f_regression

# Sample data: marketing channels vs sales
data_reg = {
    'social_media_spend': [20, 25, 30, 40, 50, 60, 70, 80],
    'tv_spend': [30, 35, 45, 55, 65, 70, 80, 90],
    'email_spend': [5, 10, 12, 15, 18, 20, 22, 25],
    'sales': [100, 120, 150, 200, 250, 280, 330, 400]
}

df_reg = pd.DataFrame(data_reg)
X_reg = df_reg[['social_media_spend', 'tv_spend', 'email_spend']]
y_reg = df_reg['sales']

# Compute F-statistic for regression
f_scores_reg, p_values_reg = f_regression(X_reg, y_reg)

results_reg = pd.DataFrame({
    'Feature': X_reg.columns,
    'F_Score': f_scores_reg,
    'P_Value': p_values_reg
}).sort_values('F_Score', ascending=False)

print("--- Fisher's Score for Regression ---")
print(results_reg)

Sample Output:

--- Fisher's Score for Regression ---
              Feature      F_Score   P_Value
0  social_media_spend  1015.560006  0.000000
1            tv_spend   526.177617  0.000001
2         email_spend   180.234521  0.000045

Interpretation:

Social media spend shows the strongest linear relationship with sales (highest F-score), with TV and email spend also highly significant. All three channels would survive an F-test filter, but social media spend is the clearest single predictor.
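For regression, `f_regression`'s F-statistic is a monotone function of the Pearson correlation $r$ between feature and target, $F = \frac{r^2}{1 - r^2}(n - 2)$, which is why its ranking matches a simple correlation ranking. A quick check using the social-media column above:

```python
import numpy as np
from sklearn.feature_selection import f_regression

spend = np.array([20, 25, 30, 40, 50, 60, 70, 80], dtype=float)
sales = np.array([100, 120, 150, 200, 250, 280, 330, 400], dtype=float)

# F-statistic derived from the Pearson correlation coefficient
r = np.corrcoef(spend, sales)[0, 1]
f_manual = r**2 / (1 - r**2) * (len(spend) - 2)

f_sklearn, _ = f_regression(spend.reshape(-1, 1), sales)
print(np.isclose(f_manual, f_sklearn[0]))  # True
```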

V. When to Use Fisher's Score

✅ Use Fisher's Score When:

| Scenario | Why It Works |
| --- | --- |
| Classification tasks | Designed to measure class separation |
| Linear relationships | Assumes linear separability between classes |
| Fast initial screening | Computationally cheap filter method |
| Statistical validation needed | Provides p-values for significance testing |
| Continuous features | Works natively with numeric data |
| Need feature ranking | Gives interpretable scores for comparison |

❌ Avoid Fisher's Score When:

| Scenario | Use Instead |
| --- | --- |
| Non-linear relationships | Mutual Information, tree-based importance |
| Categorical features | Chi-square test |
| Feature interactions matter | Wrapper methods (RFE), embedded methods |
| High multicollinearity | Lasso, Ridge, or correlation analysis first |
| Very imbalanced classes | Consider resampling or weighted metrics |

VI. Practical Considerations

1. Feature Scaling

Fisher's Score itself is scale-invariant: it is a ratio of variances, so multiplying or shifting a feature cancels out of numerator and denominator alike, and f_classif returns identical scores before and after standardization. Standardization still belongs in the pipeline, because the downstream model (logistic regression, SVM, k-NN) typically is scale-sensitive.

from sklearn.preprocessing import StandardScaler

# Features on very different scales
X_raw = pd.DataFrame({
    'height_cm': [150, 160, 170, 180],           # range: ~30
    'income_usd': [30000, 45000, 60000, 80000]   # range: ~50,000
})
y_demo = [0, 0, 1, 1]  # labels for these four samples

# Per-feature F-scores are unaffected by the differing ranges
f_scores, _ = f_classif(X_raw, y_demo)

# Standardize anyway for the model that consumes the selected features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)
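Because the score is a ratio of variances, the invariance is easy to verify directly: `f_classif` returns the same per-feature scores before and after standardization. A small check on synthetic data (illustrative only):

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 2))
X_demo[:, 1] *= 50_000                 # wildly different scales
y_demo = np.repeat([0, 1], 20)
X_demo[y_demo == 1, 0] += 1.0          # give feature 0 some class signal

f_raw, _ = f_classif(X_demo, y_demo)
f_std, _ = f_classif(StandardScaler().fit_transform(X_demo), y_demo)
print(np.allclose(f_raw, f_std))  # True: identical scores either way
```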

2. Linearity Assumption

Fisher's Score assumes linear separability. For non-linear patterns:

from sklearn.feature_selection import mutual_info_classif

# Example: Non-linear relationship
X_nonlinear = pd.DataFrame({
    'linear_feature': [1, 2, 3, 4, 5, 6, 7, 8],
    'quadratic_feature': [1, 4, 9, 16, 25, 36, 49, 64]  # x²
})
y_nonlinear = [0, 0, 0, 0, 1, 1, 1, 1]

# Fisher's Score (may miss non-linearity)
f_scores, _ = f_classif(X_nonlinear, y_nonlinear)

# Mutual Information (captures non-linearity)
mi_scores = mutual_info_classif(X_nonlinear, y_nonlinear)

print(pd.DataFrame({
    'Feature': X_nonlinear.columns,
    'Fisher_Score': f_scores,
    'Mutual_Info': mi_scores
}))

Output

Feature  Fisher_Score  Mutual_Info
0     linear_feature      19.20000     0.422024
1  quadratic_feature      15.90184     0.494940
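Note that the quadratic feature above is still monotone in x, so Fisher's Score partially detects it (and MI estimates vary a little with the random seed). The failure mode is starkest when the pattern is symmetric and the class means coincide; a hypothetical toy case:

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

# Class depends on |x|, so both class means are exactly 0
x = np.array([-4.0, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0])
y = (np.abs(x) > 2).astype(int)

f, _ = f_classif(x.reshape(-1, 1), y)
mi = mutual_info_classif(x.reshape(-1, 1), y, random_state=0)
print(f[0])   # 0.0: identical class means, so no separation detected
print(mi[0])  # MI can still pick up the dependence
```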

3. Feature Type Compatibility

| Feature Type | Works? | Notes |
| --- | --- | --- |
| Continuous | ✅ Yes | Native support |
| Ordinal | ✅ Yes | Treat as numeric after encoding |
| Binary | ✅ Yes | Encode as 0/1 |
| Categorical (unordered) | ❌ No | Use Chi-square or one-hot encode first |
| Discretized numeric | ⚠️ Depends | Works if encoding preserves order |

VII. Pros and Cons

✅ Advantages

| Advantage | Explanation |
| --- | --- |
| Fast computation | O(n) per feature; scales to large datasets |
| Statistically sound | Based on ANOVA, a well-understood theory |
| Provides p-values | Enables hypothesis testing (significance) |
| Interpretable | Easy to explain to stakeholders |
| Versatile | Works for classification (f_classif) and regression (f_regression) |
| Model-agnostic | Filter method, independent of the ML algorithm |

❌ Limitations

| Limitation | Explanation | Mitigation |
| --- | --- | --- |
| Linear assumption | Misses non-linear patterns | Use Mutual Information or tree-based methods |
| Univariate | Ignores feature interactions | Combine with wrapper methods (RFE) |
| Assumes normality | Performs best with normally distributed features | Check distributions; consider transformations |
| Multicollinearity | Redundant features both score high | Check correlation matrix; use VIF |

VIII. Common Pitfalls and Best Practices

🚫 Pitfall 1: Skipping Normalization in the Pipeline

Problem: The F-scores themselves are scale-invariant, but the downstream model usually is not. Feeding raw features with wildly different ranges into a regularized linear model, SVM, or k-NN degrades performance even when selection worked well.

# Features on very different scales
X_raw = pd.DataFrame({
    'age': [25, 30, 35, 40],                 # range: 15
    'salary': [30000, 50000, 70000, 90000]   # range: 60,000
})

# Standardize before fitting a scale-sensitive model on the selected features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)

🚫 Pitfall 2: Treating All High Scores as Important

Problem: High F-score doesn't guarantee practical importance.

Solution: Check p-values and effect size

# Filter by both F-score AND p-value
significant_features = results[
    (results['Fisher_Score'] > 10) &  # High score
    (results['P_Value'] < 0.05)        # Statistically significant
]

🚫 Pitfall 3: Ignoring Multicollinearity

Problem: Correlated features both score high but provide redundant information.

Solution: Check correlation before selection

# Check correlation matrix
correlation_matrix = X.corr()
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > 0.9:
            high_corr_pairs.append((
                correlation_matrix.columns[i],
                correlation_matrix.columns[j],
                correlation_matrix.iloc[i, j]
            ))

if high_corr_pairs:
    print("⚠️ Highly correlated features detected:")
    for feat1, feat2, corr in high_corr_pairs:
        print(f"  {feat1} <-> {feat2}: {corr:.3f}")
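The limitations table mentions VIF (variance inflation factor) as a complementary check. It can be computed without extra dependencies by regressing each feature on the others, $\text{VIF}_j = 1/(1 - R_j^2)$; the helper below is a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    feature j on all the other features."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        model = LinearRegression().fit(others, X[:, j])
        r2 = model.score(others, X[:, j])
        scores.append(1.0 / (1.0 - r2))
    return scores

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = a + 0.1 * rng.normal(size=100)   # nearly a copy of a
c = rng.normal(size=100)             # independent
vifs = vif(np.column_stack([a, b, c]))
print([round(v, 1) for v in vifs])   # a and b inflate; c stays near 1
```

A common rule of thumb treats VIF above 5 to 10 as a sign of problematic multicollinearity.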

🚫 Pitfall 4: Using with Imbalanced Classes

Problem: Rare classes can distort variance calculations.

Solution: Consider class weighting or resampling

from collections import Counter

class_counts = Counter(y)
print(f"Class distribution: {class_counts}")

if min(class_counts.values()) / max(class_counts.values()) < 0.1:
    print("⚠️ Severe class imbalance detected!")
    print("Consider: SMOTE, class_weight='balanced', or stratified sampling")

✅ Best Practices

  1. Standardize features within the pipeline for the downstream model (the F-scores themselves are scale-invariant)
  2. Check p-values in addition to F-scores (statistical significance ≠ practical importance)
  3. Visualize distributions to verify linear separability assumption
  4. Combine with correlation analysis to remove redundant high-scoring features
  5. Use as first filter in a multi-stage selection pipeline
  6. Validate with cross-validation to ensure selected features generalize

IX. Fisher's Score vs. Other Methods

| Method | Type | Captures Non-linearity | Feature Interactions | Speed | Best For |
| --- | --- | --- | --- | --- | --- |
| Fisher's Score | Filter | ❌ No | ❌ No | ⚡⚡⚡ Fast | Linear classification |
| Chi-Square | Filter | ❌ No | ❌ No | ⚡⚡⚡ Fast | Categorical features |
| Mutual Information | Filter | ✅ Yes | ❌ No | ⚡⚡ Medium | Non-linear relationships |
| RFE | Wrapper | ✅ Yes | ✅ Yes | 🐌 Slow | Model-specific optimization |
| Lasso | Embedded | ✅ Yes | ⚠️ Limited | ⚡⚡ Medium | High-dimensional linear models |
| Tree-based Importance | Intrinsic | ✅ Yes | ✅ Yes | ⚡⚡ Medium | Tree ensemble models |

X. Relationship to sklearn Functions

In scikit-learn, Fisher's Score is implemented through ANOVA F-test functions:

from sklearn.feature_selection import f_classif, f_regression

# For classification
f_scores_class, p_values_class = f_classif(X, y)
# Returns: (Fisher's Score, p-value from ANOVA F-test)

# For regression
f_scores_reg, p_values_reg = f_regression(X, y)
# Returns: (F-statistic, p-value from F-test)

What you get:

  1. An array of F-statistics, one per feature (higher means better class or target separation)
  2. An array of p-values, one per feature (lower means more statistically significant)

Practical usage:

# Combined approach
selector = SelectKBest(f_classif, k=10)  # Keep top 10 by F-score
X_selected = selector.fit_transform(X, y)

# Get both scores and p-values
scores = selector.scores_
pvalues = selector.pvalues_

# Filter by both criteria
selected_idx = (scores > np.median(scores)) & (pvalues < 0.05)
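When selecting by significance rather than by a fixed k, scikit-learn also provides SelectFpr, which keeps every feature whose F-test p-value falls below alpha (synthetic data is used here so the snippet runs standalone):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFpr, f_classif

X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# Keep all features with ANOVA p-value below 0.05
selector = SelectFpr(f_classif, alpha=0.05)
X_kept = selector.fit_transform(X_demo, y_demo)
print(X_kept.shape)  # informative columns pass; most noise columns are dropped
```

Related variants include SelectFdr (false discovery rate) and SelectFwe (family-wise error), which apply stricter multiple-testing corrections.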

XI. Summary

Quick Reference Card

Fisher's Score Feature Selection:
✓ Supervised filter method (requires target labels)
✓ Measures between-class vs within-class variance
✓ Higher score = better class separation
✓ Scores are scale-invariant, but standardize for the downstream model
✓ Check p-values for statistical significance
✓ Works for classification (f_classif) and regression (f_regression)
✓ Assumes linear relationships and independence
✓ Use as first filter, then apply wrapper/embedded methods
Complete pipeline template:

from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Complete pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),              # 1. Normalize
    ('fisher', SelectKBest(f_classif, k=20)),  # 2. Fisher's Score filter (set k for your data)
    ('model', RandomForestClassifier())        # 3. Final model
])

pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)