Information Gain

Imagine you’re playing the game 20 Questions. You need to guess an object by asking yes/no questions.

In machine learning, features are like the questions. And Information Gain (IG) helps us figure out which questions (features) give the biggest clue about the answer (target).

What is Information Gain?

Information Gain measures how much homogeneity we achieve by splitting a data node. In statistical terms, it is the reduction in entropy obtained by splitting the node. In short, think of Information Gain as a measure of how much a feature reduces uncertainty about the final outcome.

The primary job of Information Gain is feature selection. The feature with the highest Information Gain is selected as the next split in the Decision Tree: it is the feature that provides the most "bang for your buck" in terms of clarity. In other words, IG is the mathematical tool that lets the tree pick the most informative question to ask first.

The Key to Information Gain: Entropy
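
Entropy is the measure of impurity in a set of labels: H(S) = -sum(p_i * log2(p_i)), where p_i is the proportion of class i. A two-class node split 50/50 has entropy 1 bit (maximum uncertainty), while a pure node has entropy 0. A minimal sketch using only the standard library (the full pandas-based version appears in the code example later in this article):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

mixed = ["yes", "no", "yes", "no"]
print(entropy(mixed))  # 1.0: a 50/50 split is maximally uncertain
# A completely pure node (all one class) has entropy 0.
```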

Use Cases

  1. Decision trees — IG is the heart of how decision trees split nodes.
  2. Text classification — finding words that reduce uncertainty about categories.
  3. Pre-filtering features — to eliminate irrelevant features before training.
  4. Medical research — identifying which patient attributes most strongly predict disease.
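
The first use case is easy to see in practice: scikit-learn's DecisionTreeClassifier splits by Information Gain when you pass criterion="entropy". A small sketch on the bundled iris dataset (assuming scikit-learn is available):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="entropy" makes every split maximize Information Gain
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(X, y)

# The root node holds the single most informative split the tree found
print("root feature index:", tree.tree_.feature[0])
print("root threshold:", tree.tree_.threshold[0])
```

On iris, the root split lands on one of the petal features, since either of them separates one species perfectly and therefore yields the largest entropy reduction.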

🧠 Requirements & Data Compatibility for Feature Selection

Univariate, Bivariate, or Multivariate?

🏆 Strategic Advantages

  1. Captures non-linear relationships (unlike correlation).
  2. Works for both numeric and categorical features (with proper encoding).
  3. Theoretically grounded — based on information theory.

⚠️ Constraints

  1. Computationally expensive on large datasets.
  2. Needs discretization for continuous features in some implementations.
  3. Biased towards features with more categories (e.g., unique IDs).
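
The third constraint is easy to demonstrate: a feature with a unique value per row (like an ID column) makes every subset pure, so its conditional entropy is 0 and its IG equals the full entropy of the target, the maximum possible, even though the feature is useless for prediction. A small stdlib sketch (the entropy and conditional-entropy logic mirrors the code example later in this article):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature, target):
    # IG = H(target) - weighted entropy of target within each feature value
    n = len(target)
    cond = 0.0
    for v in set(feature):
        subset = [t for f, t in zip(feature, target) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(target) - cond

target = [0, 1, 0, 1, 0, 1, 0, 1]
row_id = list(range(8))            # unique per row, like an ID column
noise  = [0, 0, 0, 0, 1, 1, 1, 1]  # carries no signal about this target

print(info_gain(row_id, target))  # 1.0: maximal IG, yet useless for prediction
print(info_gain(noise, target))   # 0.0
```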

🚨 Caution — Common Misconceptions

  1. Don’t confuse IG with Mutual Information — IG is basically MI between feature and target, but used specifically in the feature selection context.
  2. Overfitting risk — features with many unique values may show high IG but be meaningless.
  3. Encoding matters — categorical features need one-hot encoding or similar transformation.

Difference Between Information Gain and Mutual Information
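
As noted in the misconceptions above, the IG between a feature and the target is mathematically the same quantity as their mutual information; the difference is mostly one of context (tree splitting and feature selection versus general dependence between any two variables). A sketch comparing a hand-rolled IG against sklearn.metrics.mutual_info_score, which reports MI in nats, so we convert to bits:

```python
from collections import Counter
from math import log, log2

from sklearn.metrics import mutual_info_score

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature, target):
    n = len(target)
    cond = 0.0
    for v in set(feature):
        subset = [t for f, t in zip(feature, target) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(target) - cond

feature = [0, 0, 1, 1, 2, 2]
target  = [0, 0, 0, 1, 1, 1]

ig = info_gain(feature, target)                        # in bits
mi_bits = mutual_info_score(feature, target) / log(2)  # sklearn returns nats

print(round(ig, 6), round(mi_bits, 6))  # the two values agree
```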

Final Thoughts

Information Gain is like asking: "Which feature gives me the most useful hint about the answer?" It shines in decision trees and classification tasks where uncertainty needs to be reduced step by step, and it is best suited for Decision Trees, Random Forests, and other models where the feature-target dependency is key.

It’s not perfect, and it has its quirks — but when used carefully, it’s a powerful filter for finding the features that really matter.


Code Example

```python ln=false
# Step 1: Import Libraries
import pandas as pd
import numpy as np

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import KBinsDiscretizer

# Step 2: Load Dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

X.sample(1)

# Step 3: Convert Features → Categorical (Binning)
# This manual Information Gain computation requires categorical variables.
discretizer = KBinsDiscretizer(
    n_bins=5,
    encode='ordinal',
    strategy='quantile',
    quantile_method='averaged_inverted_cdf'
)

X_binned = discretizer.fit_transform(X)

X_cat = pd.DataFrame(X_binned, columns=X.columns)
X_cat.sample(3)
```
```python ln=false
from math import log2

# Step 4: Define Entropy Function
def entropy(y):
    probs = y.value_counts(normalize=True)
    return -sum(p * log2(p) for p in probs)

# Step 5: Define Conditional Entropy
def conditional_entropy(X, y):
    cond_entropy = 0
    for value in X.unique():
        subset = y[X == value]
        weight = len(subset) / len(y)
        cond_entropy += weight * entropy(subset)
    return cond_entropy

# Step 6: Compute Information Gain
def information_gain(X, y):
    return entropy(y) - conditional_entropy(X, y)
```


```python ln=false
# Step 7: Calculate IG for Every Feature
ig_scores = {}

for column in X_cat.columns:
    ig = information_gain(X_cat[column], y)
    ig_scores[column] = ig

ig_df = pd.DataFrame(ig_scores.items(), columns=["Feature", "Information Gain"])
ig_df = ig_df.sort_values(by="Information Gain", ascending=False)

ig_df.head(10)
```
```python ln=false
# Plot the Information Gain bar plot
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="whitegrid")
sns.set(rc={'figure.figsize': (12, 8)})
sns.barplot(x=ig_df['Information Gain'], y=ig_df['Feature'])
plt.title('Information Gain of Features')
plt.xlabel('Information Gain')
plt.ylabel('Feature Name')
plt.show()
```

<img src="Learning/Stats/Pictures/ig-5.png" width="50%">