Information Gain
Imagine you’re playing the game 20 Questions. You need to guess an object by asking yes/no questions.
- Some questions are powerful: “Is it alive?” immediately cuts the possibilities in half.
- Other questions are weak: “Does it weigh more than 2 pounds?” might not eliminate much.
In machine learning, features are like the questions. And Information Gain (IG) helps us figure out which questions (features) give the biggest clue about the answer (target).
What is Information Gain?
Information Gain measures the homogeneity we achieve after splitting a node of data. In statistical terms, it is the reduction in entropy achieved by splitting the data at that node. In short, think of Information Gain as a measure of how much a feature reduces uncertainty about the final outcome.
- High Information Gain = This feature is incredibly useful! It dramatically cleans up and sorts the data, making it much easier to make a prediction.
- Low Information Gain = This feature isn’t very helpful. Splitting the data based on it leaves you almost as confused as you were before.
The primary job of Information Gain is feature selection. The feature with the highest Information Gain is the one selected as the next split in the Decision Tree. It’s the feature that provides the most “bang for your buck” in terms of clarity. Basically, it is the mathematical tool a decision tree uses to pick the most informative feature to ask about first.
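As a sketch of this ranking idea (the Iris dataset here is just an illustrative choice, not part of the original example), scikit-learn's `mutual_info_classif` estimates how much information each feature carries about the target; the highest-scoring feature would be the first "question" a tree asks:

```python
# Minimal sketch: rank features by their estimated information about the
# target using scikit-learn's mutual_info_classif.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

scores = mutual_info_classif(X, y, random_state=0)
ranking = pd.Series(scores, index=X.columns).sort_values(ascending=False)
print(ranking)  # the top feature is the best first "question" to ask
```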
The Key to Information Gain: Entropy
- To understand Information Gain, you must first understand the concept it’s built upon: ML_AI/_feature_engineering/feature_selection/approaches/Entropy.
- Additionally, it is worth reading the math behind “Information Gain”.
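As a quick illustration of the entropy side of the calculation, a minimal sketch in plain Python (the 50/50 and 90/10 class splits are made-up examples):

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

# A 50/50 class split is maximally uncertain...
print(entropy([0.5, 0.5]))  # 1.0 bit
# ...while a 90/10 split is already fairly pure.
print(entropy([0.9, 0.1]))  # ~0.469 bits
```

Information Gain is simply the drop from the first kind of number towards the second after a split.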
Use Cases
- Decision trees — IG is the heart of how decision trees split nodes.
- Text classification — finding words that reduce uncertainty about categories.
- Pre-filtering features — to eliminate irrelevant features before training.
- Medical research — identifying which patient attributes most strongly predict disease.
🧠 Requirements & Data Compatibility for Feature Selection
- Categorical features? ✅ Works for both numeric and categorical features (with encoding).
- Regression targets? ❌ Pure IG is for classification. For regression, use variants like mutual_info_regression.
- Transformations required? Sometimes — categorical → numeric encoding.
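A hedged sketch of the regression variant mentioned above, using synthetic data (made up for illustration) where the target depends only on the first feature:

```python
# mutual_info_regression plays the role for continuous targets that
# information gain plays for classification.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=500)  # depends only on column 0

scores = mutual_info_regression(X, y, random_state=0)
print(scores)  # column 0 should score clearly highest
```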
Univariate, Bivariate, or Multivariate?
- Univariate: Information Gain is commonly used here — one feature vs. target.
- Bivariate: You could check IG between two features, but it’s less common.
- Multivariate: Mutual Information is often used as a generalization for multiple variables.
🏆 Strategic Advantages
- Captures non-linear relationships (unlike correlation).
- Works for both numeric and categorical features (with proper encoding).
- Theoretically grounded — based on information theory.
⚠️ Constraints
- Computationally expensive on large datasets.
- Needs discretization for continuous features in some implementations.
- Biased towards features with more categories (e.g., unique IDs).
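The bias towards high-cardinality features can be demonstrated with a toy sketch (the tiny target and ID column below are made up): a unique-ID column puts every row in its own subset, so conditional entropy collapses to zero and IG hits its maximum.

```python
import pandas as pd
from math import log2

def entropy(s):
    """Shannon entropy of a pandas Series of labels, in bits."""
    return -sum(p * log2(p) for p in s.value_counts(normalize=True))

def information_gain(feature, target):
    """Entropy of the target minus the weighted entropy within each feature value."""
    ig = entropy(target)
    for value, subset in target.groupby(feature):
        ig -= len(subset) / len(target) * entropy(subset)
    return ig

y = pd.Series([0, 1, 0, 1, 0, 1])
unique_id = pd.Series([1, 2, 3, 4, 5, 6])  # a meaningless row ID

print(information_gain(unique_id, y))  # 1.0, the maximum possible here
```

The ID "explains" the target perfectly on the training data while carrying zero predictive meaning.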
🚨 Caution — Common Misconceptions
- Don’t confuse IG with Mutual Information — IG is basically MI between feature and target, but used specifically in the feature selection context.
- Overfitting risk — features with many unique values may show high IG but be meaningless.
- Encoding matters — categorical features need one-hot encoding or similar transformation.
Difference Between Information Gain and Mutual Information
- Information Gain (IG): Specific to feature vs. target in classification.
- Mutual Information (MI): General measure of dependency between any two variables (can be continuous or discrete, univariate or multivariate).
So:
- IG = a special case of MI, used in feature selection.
- MI = the more general, broader concept.
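On discrete data the two quantities coincide numerically. A small sketch (the toy feature and target arrays are made up) using scikit-learn's `mutual_info_score`, which reports the value in nats (natural log) rather than bits:

```python
from sklearn.metrics import mutual_info_score

feature = ['a', 'a', 'b', 'b', 'b', 'a']
target = [0, 0, 1, 1, 0, 0]

# For discrete variables this is the same quantity as information gain,
# up to the choice of log base.
mi = mutual_info_score(feature, target)
print(mi)  # ≈ 0.318 nats
```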
Final Thoughts
Information Gain is like asking: “Which feature gives me the most useful hint about the answer?” It shines in decision trees and classification tasks where uncertainty needs to be reduced step by step. It is best suited for Decision Trees, Random Forests, and models where feature–target dependency is key.
It’s not perfect, and it has its quirks — but when used carefully, it’s a powerful filter for finding the features that really matter.
Code Example
```python ln=false
# Step 1: Import Libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import KBinsDiscretizer
```
```python ln=false
# Step 2: Load Dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

X.sample(1)
```
```python ln=false
# Step 3: Convert Features → Categorical (Binning)
# Information Gain requires categorical variables.
discretizer = KBinsDiscretizer(
    n_bins=5,
    encode='ordinal',
    strategy='quantile',
    quantile_method='averaged_inverted_cdf'
)
X_binned = discretizer.fit_transform(X)
X_cat = pd.DataFrame(X_binned, columns=X.columns)

X_cat.sample(3)
```
```python ln=false
from math import log2

# Step 4: Define Entropy Function
def entropy(y):
    probs = y.value_counts(normalize=True)
    return -sum(p * log2(p) for p in probs)

# Step 5: Define Conditional Entropy
def conditional_entropy(X, y):
    cond_entropy = 0
    for value in X.unique():
        subset = y[X == value]
        weight = len(subset) / len(y)
        cond_entropy += weight * entropy(subset)
    return cond_entropy

# Step 6: Compute Information Gain
def information_gain(X, y):
    return entropy(y) - conditional_entropy(X, y)
```
```python ln=false
# Step 7: Calculate IG for Every Feature
ig_scores = {}
for column in X_cat.columns:
    ig_scores[column] = information_gain(X_cat[column], y)

ig_df = pd.DataFrame(ig_scores.items(), columns=["Feature", "Information Gain"])
ig_df = ig_df.sort_values(by="Information Gain", ascending=False)
ig_df.head(10)
```
```python ln=false
# Plot the Information Gain bar plot
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="whitegrid")
sns.set(rc={'figure.figsize': (12, 8)})
sns.barplot(x=ig_df['Information Gain'], y=ig_df['Feature'])
plt.title('Information Gain of Features')
plt.xlabel('Information Gain')
plt.ylabel('Feature Name')
plt.show()
```
<img src="Learning/Stats/Pictures/ig-5.png" width="50%">