Fisher's Score for Feature Selection
Fisher's Score is a powerful statistical method for identifying features that best separate different classes or predict continuous outcomes. It answers a fundamental question: Which features create the clearest distinction between groups?
I. The Intuitive Analogy — The Sports Coach
Imagine you're a coach selecting players for a championship team. You have data on hundreds of athletes — height, speed, stamina, accuracy, reaction time, and more. But which attributes truly matter?
You notice:
- Speed and accuracy clearly separate elite players from average ones
- Height and age show little difference between performance levels
A good coach focuses on the metrics that distinguish winners from the rest. Fisher's Score does exactly this in machine learning — it identifies features that best separate classes or predict outcomes.
The key insight: Just as a coach looks for skills that differentiate great players (high between-group difference, low within-group variation), Fisher's Score finds features where:
- Class means are far apart (high between-class variance)
- Data within each class is tightly clustered (low within-class variance)
II. What is Fisher's Score?
Fisher's Score is a supervised, univariate filter method used primarily for classification, though it extends to regression through the F-statistic framework.
Core Concept
It measures how well a feature discriminates between classes by comparing:
- Between-class variance — How far apart are the class means?
- Within-class variance — How spread out are samples within each class?
Intuitive Interpretation
Think of two teams warming up before a game:
| Scenario | Between-Team Distance | Within-Team Organization | Fisher's Score | Can You Tell Them Apart? |
|---|---|---|---|---|
| High Score | Teams far apart | Each team tightly grouped | High | ✅ Easy — clear distinction |
| Low Score | Teams mixed together | Players scattered everywhere | Low | ❌ Difficult — lots of overlap |
A high Fisher's Score means the feature creates clear, well-separated clusters — exactly what we want for classification.
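The two scenarios in the table can be checked numerically. A minimal sketch using scikit-learn's `f_classif` (the synthetic distributions here are illustrative):

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(42)
y = np.repeat([0, 1], 50)

# High score: class means far apart, each class tightly clustered
well_separated = np.concatenate([rng.normal(0, 0.5, 50), rng.normal(5, 0.5, 50)])
# Low score: class means close together, samples widely scattered
overlapping = np.concatenate([rng.normal(0, 2.0, 50), rng.normal(0.5, 2.0, 50)])

X = np.column_stack([well_separated, overlapping])
scores, _ = f_classif(X, y)
print(scores)  # first feature scores far higher than the second
```

The well-separated feature produces a score orders of magnitude larger, mirroring the "easy to tell apart" row of the table.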
III. Mathematical Foundation
For Binary/Multi-Class Classification
For a feature $j$, Fisher's Score compares the spread of the class means to the spread within each class:

$$F_j = \frac{\sum_{k=1}^{K} n_k\,(\mu_{j,k} - \mu_j)^2}{\sum_{k=1}^{K} n_k\,\sigma_{j,k}^2}$$

Where:
- $n_k$ — Number of samples in class $k$
- $\mu_{j,k}$ — Mean of feature $j$ in class $k$
- $\mu_j$ — Overall mean of feature $j$
- $\sigma_{j,k}^2$ — Variance of feature $j$ within class $k$

Numerator (Between-class variance): Weighted sum of squared distances between class means and the global mean
Denominator (Within-class variance): Weighted sum of variances within each class
💡 Higher $F_j$ values → Feature separates classes better → More useful for prediction
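The formula translates directly into a few lines of NumPy — a minimal sketch of the definition above (using the population variance within each class), not the scikit-learn implementation:

```python
import numpy as np

def fisher_score(x, y):
    """Fisher's Score of one feature: between-class over within-class variance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    mu = x.mean()                                     # overall mean
    between = within = 0.0
    for k in np.unique(y):
        xk = x[y == k]
        between += len(xk) * (xk.mean() - mu) ** 2    # n_k * (mu_k - mu)^2
        within += len(xk) * xk.var()                  # n_k * sigma_k^2
    return between / within

x = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])
y = np.array([0, 0, 0, 1, 1, 1])
print(fisher_score(x, y))  # ≈ 240 — well-separated classes score high
```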
Connection to ANOVA F-Test
Fisher's Score is closely related to the F-statistic from one-way ANOVA (Analysis of Variance). scikit-learn's `f_classif` computes this F-statistic, which rescales the between/within ratio by its degrees of freedom ($K$ classes, $N$ samples):

$$F = \frac{\text{between-class variance}\,/\,(K-1)}{\text{within-class variance}\,/\,(N-K)}$$

Because the rescaling is identical for every feature, the feature ranking is unchanged.
This gives us two outputs:
- F-statistic (Fisher's Score) — Effect size (magnitude of separation)
- p-value — Statistical significance (is this separation real or random?)
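This equivalence is easy to verify: `scipy.stats.f_oneway` and `f_classif` return the same statistic (a small check, assuming SciPy is available):

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

x = np.array([1.0, 1.2, 0.8, 3.0, 3.3, 2.9])
y = np.array([0, 0, 0, 1, 1, 1])

# One-way ANOVA on the per-class samples
F_scipy, p_scipy = f_oneway(x[y == 0], x[y == 1])
# Same statistic, computed column-wise over a feature matrix
F_skl, p_skl = f_classif(x.reshape(-1, 1), y)
print(F_scipy, F_skl[0])  # identical values
```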
IV. Implementation in Python
1. Classification Example — Player Performance
Classifying athletes as "Amateur" or "Pro" based on speed and accuracy.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
# Sample data: athlete performance metrics
data = {
'speed': [6.1, 6.3, 6.8, 7.5, 8.2, 8.6, 9.1, 9.3],
'accuracy': [60, 65, 63, 70, 85, 87, 90, 92],
'stamina': [50, 55, 58, 60, 65, 68, 70, 72],
'label': ['Amateur', 'Amateur', 'Amateur', 'Amateur',
'Pro', 'Pro', 'Pro', 'Pro']
}
df = pd.DataFrame(data)
# Encode target
df['label_encoded'] = df['label'].map({'Amateur': 0, 'Pro': 1})
# Separate features and target
X = df[['speed', 'accuracy', 'stamina']]
y = df['label_encoded']
# Compute Fisher's Score via ANOVA F-test
f_scores, p_values = f_classif(X, y)
# Create results dataframe
results = pd.DataFrame({
'Feature': X.columns,
'Fisher_Score': f_scores,
'P_Value': p_values
}).sort_values('Fisher_Score', ascending=False)
print("--- Fisher's Score Ranking ---")
print(results)
print("\nInterpretation:")
for idx, row in results.iterrows():
sig = "✅ Significant" if row['P_Value'] < 0.05 else "❌ Not significant"
print(f"{row['Feature']:12s}: F={row['Fisher_Score']:6.2f}, p={row['P_Value']:.4f} {sig}")
Sample Output:
--- Fisher's Score Ranking ---
Feature Fisher_Score P_Value
1 accuracy 84.292683 0.000094
0 speed 28.407602 0.001778
2 stamina 24.287425 0.002635
Interpretation:
accuracy    : F= 84.29, p=0.0001 ✅ Significant
speed       : F= 28.41, p=0.0018 ✅ Significant
stamina     : F= 24.29, p=0.0026 ✅ Significant
Interpretation:
- accuracy has the highest Fisher's Score (84.29) → Best discriminator between Amateur/Pro
- speed is second (28.41) → Also useful but less distinctive
- stamina is weakest (24.29) → Still significant but lowest separation power
2. Feature Selection Pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Build pipeline with Fisher's Score selection
pipeline = Pipeline([
('scaler', StandardScaler()), # 1. Normalize features
('fisher_select', SelectKBest(f_classif, k=2)), # 2. Keep top 2 by Fisher's Score
('classifier', LogisticRegression()) # 3. Train model
])
# Train
pipeline.fit(X_train, y_train)
# Evaluate
score = pipeline.score(X_test, y_test)
print(f"\nModel Accuracy: {score:.2%}")
# Get selected features
selector = pipeline.named_steps['fisher_select']
selected_features = X.columns[selector.get_support()].tolist()
print(f"Selected Features: {selected_features}")
3. Regression Example — Marketing Campaign
Predicting sales based on marketing spend across channels.
from sklearn.feature_selection import f_regression
# Sample data: marketing channels vs sales
data_reg = {
'social_media_spend': [20, 25, 30, 40, 50, 60, 70, 80],
'tv_spend': [30, 35, 45, 55, 65, 70, 80, 90],
'email_spend': [5, 10, 12, 15, 18, 20, 22, 25],
'sales': [100, 120, 150, 200, 250, 280, 330, 400]
}
df_reg = pd.DataFrame(data_reg)
X_reg = df_reg[['social_media_spend', 'tv_spend', 'email_spend']]
y_reg = df_reg['sales']
# Compute F-statistic for regression
f_scores_reg, p_values_reg = f_regression(X_reg, y_reg)
results_reg = pd.DataFrame({
'Feature': X_reg.columns,
'F_Score': f_scores_reg,
'P_Value': p_values_reg
}).sort_values('F_Score', ascending=False)
print("--- Fisher's Score for Regression ---")
print(results_reg)
Sample Output:
--- Fisher's Score for Regression ---
Feature F_Score P_Value
0 social_media_spend 1015.560006 0.000000
1 tv_spend 526.177617 0.000001
2 email_spend 180.234521 0.000045
Interpretation:
- social_media_spend has the strongest linear relationship with sales
- tv_spend is second
- email_spend shows weaker (but still significant) correlation
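For a single feature, the regression F-statistic reduces to a function of the Pearson correlation: $F = \frac{r^2}{1-r^2}(n-2)$. A quick sanity check (a sketch; the data here is illustrative):

```python
import numpy as np
from sklearn.feature_selection import f_regression

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 6.2])

r = np.corrcoef(x, y)[0, 1]                    # Pearson correlation
F_manual = r**2 / (1 - r**2) * (len(x) - 2)    # degrees of freedom: n - 2
F_skl, _ = f_regression(x.reshape(-1, 1), y)
print(F_manual, F_skl[0])  # identical values
```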
V. When to Use Fisher's Score
✅ Use Fisher's Score When:
| Scenario | Why It Works |
|---|---|
| Classification tasks | Designed to measure class separation |
| Linear relationships | Assumes linear separability between classes |
| Fast initial screening | Computationally cheap filter method |
| Statistical validation needed | Provides p-values for significance testing |
| Continuous features | Works natively with numeric data |
| Need feature ranking | Gives interpretable scores for comparison |
❌ Avoid Fisher's Score When:
| Scenario | Use Instead |
|---|---|
| Non-linear relationships | Mutual Information, tree-based importance |
| Categorical features | Chi-square test |
| Feature interactions matter | Wrapper methods (RFE), embedded methods |
| High multicollinearity | Lasso, Ridge, or correlation analysis first |
| Very imbalanced classes | Consider resampling or weighted metrics |
VI. Practical Considerations
1. Feature Scaling
The F-ratio itself is scale-invariant: rescaling a feature multiplies its between-class and within-class variances by the same factor, so the score is unchanged. Standardization is therefore not required for the scores themselves, but it remains good practice inside a pipeline because downstream models (logistic regression, k-NN, SVM) are scale-sensitive.
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import f_classif
import pandas as pd
# Two features on very different scales
X_demo = pd.DataFrame({
    'height_cm': [150, 160, 170, 180],            # Range: ~30
    'income_usd': [30000, 45000, 60000, 80000]    # Range: ~50,000
})
y_demo = [0, 0, 1, 1]
f_raw, _ = f_classif(X_demo, y_demo)
f_scaled, _ = f_classif(StandardScaler().fit_transform(X_demo), y_demo)
# f_raw and f_scaled are numerically identical — scaling does not change the F-ratio
2. Linearity Assumption
Fisher's Score assumes linear separability. For non-linear patterns:
from sklearn.feature_selection import mutual_info_classif
# Example: Non-linear relationship
X_nonlinear = pd.DataFrame({
'linear_feature': [1, 2, 3, 4, 5, 6, 7, 8],
'quadratic_feature': [1, 4, 9, 16, 25, 36, 49, 64] # x²
})
y_nonlinear = [0, 0, 0, 0, 1, 1, 1, 1]
# Fisher's Score (may miss non-linearity)
f_scores, _ = f_classif(X_nonlinear, y_nonlinear)
# Mutual Information (captures non-linearity)
mi_scores = mutual_info_classif(X_nonlinear, y_nonlinear)
print(pd.DataFrame({
'Feature': X_nonlinear.columns,
'Fisher_Score': f_scores,
'Mutual_Info': mi_scores
}))
Output (mutual information estimates vary slightly between runs):
Feature Fisher_Score Mutual_Info
0 linear_feature 19.20000 0.422024
1 quadratic_feature 15.90184 0.494940
3. Feature Type Compatibility
| Feature Type | Works? | Notes |
|---|---|---|
| Continuous | ✅ Yes | Native support |
| Ordinal | ✅ Yes | Treat as numeric after encoding |
| Binary | ✅ Yes | Encode as 0/1 |
| Categorical (unordered) | ❌ No | Use Chi-square or one-hot encode first |
| Discretized numeric | ⚠️ Depends | Works if encoding preserves order |
VII. Pros and Cons
✅ Advantages
| Advantage | Explanation |
|---|---|
| Fast computation | Univariate statistics — linear in samples and features |
| Statistically sound | Based on ANOVA — well-understood theory |
| Provides p-values | Enables hypothesis testing (significance) |
| Interpretable | Easy to explain to stakeholders |
| Versatile | Works for classification (f_classif) and regression (f_regression) |
| Model-agnostic | Filter method — independent of ML algorithm |
❌ Limitations
| Limitation | Explanation | Mitigation |
|---|---|---|
| Linear assumption | Misses non-linear patterns | Use Mutual Information or tree-based methods |
| Univariate | Ignores feature interactions | Combine with wrapper methods (RFE) |
| Sensitive to outliers | Class means and variances are distorted by extreme values | Inspect distributions; winsorize or transform |
| Assumes normality | Performs best with normally distributed features | Check distributions; consider transformations |
| Multicollinearity | Redundant features both score high | Check correlation matrix; use VIF |
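The last row mentions VIF; a convenient identity is that each feature's VIF equals the corresponding diagonal entry of the inverse correlation matrix. A NumPy-only sketch (the synthetic data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = 0.95 * a + rng.normal(scale=0.1, size=500)  # nearly collinear with a
c = rng.normal(size=500)                        # independent feature
X = np.column_stack([a, b, c])

# VIF_i = i-th diagonal entry of the inverse correlation matrix
R = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(R))
print(np.round(vif, 1))  # a and b: large VIF; c: close to 1
```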
VIII. Common Pitfalls and Best Practices
🚫 Pitfall 1: Expecting Scaling to Change the Scores
Problem: It is tempting to standardize features so their F-scores become "comparable" — but the F-ratio is already scale-invariant, so scaling leaves the scores untouched. What scaling does affect is the model that runs after the filter.
Solution: Standardize inside the pipeline for the downstream estimator's benefit
# F-scores are unchanged by scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# f_classif(X, y) and f_classif(X_scaled, y) return identical scores;
# scaling matters for the classifier that follows, not for the filter
🚫 Pitfall 2: Treating All High Scores as Important
Problem: High F-score doesn't guarantee practical importance.
Solution: Check p-values and effect size
# Filter by both F-score AND p-value
significant_features = results[
(results['Fisher_Score'] > 10) & # High score
(results['P_Value'] < 0.05) # Statistically significant
]
🚫 Pitfall 3: Ignoring Multicollinearity
Problem: Correlated features both score high but provide redundant information.
Solution: Check correlation before selection
# Check correlation matrix
correlation_matrix = X.corr()
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
for j in range(i):
if abs(correlation_matrix.iloc[i, j]) > 0.9:
high_corr_pairs.append((
correlation_matrix.columns[i],
correlation_matrix.columns[j],
correlation_matrix.iloc[i, j]
))
if high_corr_pairs:
print("⚠️ Highly correlated features detected:")
for feat1, feat2, corr in high_corr_pairs:
print(f" {feat1} <-> {feat2}: {corr:.3f}")
🚫 Pitfall 4: Using with Imbalanced Classes
Problem: Rare classes can distort variance calculations.
Solution: Consider class weighting or resampling
from collections import Counter
class_counts = Counter(y)
print(f"Class distribution: {class_counts}")
if min(class_counts.values()) / max(class_counts.values()) < 0.1:
print("⚠️ Severe class imbalance detected!")
print("Consider: SMOTE, class_weight='balanced', or stratified sampling")
✅ Best Practices
- Standardize features in the pipeline — the F-ratio is scale-invariant, but downstream models are not
- Check p-values in addition to F-scores (statistical significance ≠ practical importance)
- Visualize distributions to verify linear separability assumption
- Combine with correlation analysis to remove redundant high-scoring features
- Use as first filter in a multi-stage selection pipeline
- Validate with cross-validation to ensure selected features generalize
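The last point deserves emphasis: performing selection inside each CV fold avoids selection leakage. A sketch on synthetic data (the dataset parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(f_classif, k=5)),   # refit on each training fold
    ('clf', LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)     # selection happens per fold
print(scores.mean())
```

Because the pipeline is refit within each fold, the filter never sees the fold's test data, giving an honest estimate of generalization.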
IX. Fisher's Score vs. Other Methods
| Method | Type | Captures Non-linearity | Feature Interactions | Speed | Best For |
|---|---|---|---|---|---|
| Fisher's Score | Filter | ❌ No | ❌ No | ⚡⚡⚡ Fast | Linear classification |
| Chi-Square | Filter | ❌ No | ❌ No | ⚡⚡⚡ Fast | Categorical features |
| Mutual Information | Filter | ✅ Yes | ❌ No | ⚡⚡ Medium | Non-linear relationships |
| RFE | Wrapper | ✅ Yes | ✅ Yes | 🐌 Slow | Model-specific optimization |
| Lasso | Embedded | ❌ No | ⚠️ Limited | ⚡⚡ Medium | High-dimensional linear models |
| Tree-based Importance | Embedded | ✅ Yes | ✅ Yes | ⚡⚡ Medium | Tree ensemble models |
X. Relationship to sklearn Functions
In scikit-learn, Fisher's Score is implemented through ANOVA F-test functions:
from sklearn.feature_selection import f_classif, f_regression
# For classification
f_scores_class, p_values_class = f_classif(X, y)
# Returns: (Fisher's Score, p-value from ANOVA F-test)
# For regression
f_scores_reg, p_values_reg = f_regression(X, y)
# Returns: (F-statistic, p-value from F-test)
What you get:
- f_scores → Fisher's Score (magnitude of separation/correlation)
- p_values → Statistical significance (probability of observing this separation by chance)
Practical usage:
- For feature selection → Rank by F-scores (higher = better)
- For statistical validation → Filter by p-values (< 0.05 = significant)
# Combined approach
selector = SelectKBest(f_classif, k=10) # Keep top 10 by F-score
X_selected = selector.fit_transform(X, y)
# Get both scores and p-values
scores = selector.scores_
pvalues = selector.pvalues_
# Filter by both criteria
selected_idx = (scores > np.median(scores)) & (pvalues < 0.05)
XI. Summary
Quick Reference Card
Fisher's Score Feature Selection:
✓ Supervised filter method (requires target labels)
✓ Measures between-class vs within-class variance
✓ Higher score = better class separation
✓ Scale-invariant score — standardize for the downstream model, not the filter
✓ Check p-values for statistical significance
✓ Works for classification (f_classif) and regression (f_regression)
✓ Assumes linear relationships and independence
✓ Use as first filter, then apply wrapper/embedded methods
Recommended Workflow
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
# Complete pipeline
pipeline = Pipeline([
('scaler', StandardScaler()), # 1. Normalize
('fisher', SelectKBest(f_classif, k=20)), # 2. Fisher's Score filter
('model', RandomForestClassifier()) # 3. Final model
])
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)