Mutual Information and Information Gain

I. Entropy 》II. Joint Entropy 》III. Conditional Entropy 》IV. Mutual Information 》V. Information Gain

IV. Mutual Information (MI)

What is Mutual Information?

Mutual Information (MI) quantifies the amount of information obtained about one random variable by observing another random variable. In simpler terms, it measures how much knowing the value of variable $X$ reduces uncertainty about variable $Y$ (and vice versa).

Conditional Entropy is the "workhorse" behind Mutual Information (MI), a metric commonly used in feature selection to rank how relevant a feature $X$ is to a target $Y$.

Mathematical Properties

Symmetry

I (X; Y) = I (Y; X)

Unlike conditional entropy, MI is symmetric. It doesn't matter which variable you consider first.

Relationship to Entropy

I (X; Y) = H (X) + H (Y) - H (X, Y)

This can be visualized as a Venn diagram:

$H (X)$ = Total uncertainty in $X$
$H (Y)$ = Total uncertainty in $Y$
$H (X, Y)$ = Joint uncertainty
MI = The overlapping region (shared information)
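
The entropy relationship above can be checked numerically. The following is a minimal sketch, assuming the joint distribution is given as a small table `joint[x][y]`; the table values and the helper names `entropy` and `mutual_information` are illustrative, not taken from any library.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]      # marginal of Y (column sums)
    pxy = [p for row in joint for p in row]     # flattened joint distribution
    return entropy(px) + entropy(py) - entropy(pxy)

# Hypothetical joint distribution of two binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

mi = mutual_information(joint)
mi_swapped = mutual_information([list(col) for col in zip(*joint)])
print(f"I(X;Y) = {mi:.4f} bits, I(Y;X) = {mi_swapped:.4f} bits")
```

Transposing the table swaps the roles of $X$ and $Y$, so the two printed values agree, which is the symmetry property stated earlier.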

Alternative Formulations

All these are equivalent:

Via Conditional Entropy:

I (X; Y) = H (Y) - H (Y ∣ X) = H (X) - H (X ∣ Y)

Via Joint Entropy:

I (X; Y) = H (X) + H (Y) - H (X, Y)

Via Both Conditional Entropies:

I (X; Y) = H (X, Y) - H (X ∣ Y) - H (Y ∣ X)
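
As a sanity check, the formulations above can be evaluated on one joint distribution to confirm that they agree. This is a sketch under assumptions: the joint table is made up, and `cond_entropy` computes $H(Y ∣ X)$ directly from its definition as a weighted average of per-row entropies, not through any of the identities being tested.

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def cond_entropy(joint):
    """H(Y|X) for joint[x][y]: sum over x of p(x) * H(Y | X = x)."""
    total = 0.0
    for row in joint:
        px = sum(row)
        if px > 0:
            total += px * H([p / px for p in row])
    return total

# Hypothetical joint distribution; rows index X, columns index Y.
joint = [[0.30, 0.20],
         [0.10, 0.40]]
transposed = [list(col) for col in zip(*joint)]  # rows index Y instead

h_x = H([sum(r) for r in joint])
h_y = H([sum(r) for r in transposed])
h_xy = H([p for r in joint for p in r])
h_y_given_x = cond_entropy(joint)
h_x_given_y = cond_entropy(transposed)

forms = [
    h_y - h_y_given_x,                  # via conditional entropy
    h_x - h_x_given_y,                  # via conditional entropy (other direction)
    h_x + h_y - h_xy,                   # via joint entropy
    h_xy - h_x_given_y - h_y_given_x,   # via both conditional entropies
]
for value in forms:
    print(f"{value:.6f}")
```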

Intuitive Interpretation

Scenario	MI Value	Interpretation
$X$ and $Y$ are independent	$I (X; Y) = 0$	No shared information
$X$ and $Y$ are perfectly dependent (each one determines the other)	$I (X; Y) = H (X) = H (Y)$	Complete information overlap
$X$ and $Y$ are partially related	$0 < I (X; Y) < \min(H (X), H (Y))$	Some shared information
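
The three scenarios in the table can be reproduced with small joint distributions. This is an illustrative sketch: the three tables are hypothetical, chosen so that both variables are binary with $H(X) = H(Y) = 1$ bit.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return entropy(px) + entropy(py) - entropy(p for row in joint for p in row)

scenarios = {
    "independent":         [[0.25, 0.25], [0.25, 0.25]],  # p(x, y) = p(x) p(y)
    "perfectly dependent": [[0.5, 0.0], [0.0, 0.5]],      # Y always equals X
    "partially related":   [[0.4, 0.1], [0.1, 0.4]],      # Y usually equals X
}

results = {name: mutual_information(j) for name, j in scenarios.items()}
for name, value in results.items():
    print(f"{name:20s} I(X;Y) = {value:.4f} bits")
```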

V. Information Gain (IG)

Entropy tells us how impure a set of data is. Information Gain (IG) tells us how much that impurity is reduced after we split the data on a particular feature.

Information Gain (IG) is the reduction in entropy achieved by partitioning a dataset based on an attribute.

In other words, IG measures how much "information" a feature provides about the target class. A feature that creates very pure subgroups (low entropy) after a split has a high Information Gain.

Ever wonder

How a machine learning model, like a decision tree, makes a decision?
How it learns to navigate complex data and make accurate predictions?

The answer lies in two concepts borrowed from information theory: Entropy and Information Gain (IG).

In decision trees (like ID3, C4.5), Information Gain is the practical application of Mutual Information and the standard criterion for choosing the best feature to split on. It measures how much entropy is removed from the target variable after partitioning the data on a specific feature, that is, the change from a "state of ignorance" to a "state of knowledge" after the split.

The Mechanism: It compares the entropy of the parent node to the weighted average of the entropies of the child nodes, where each child is weighted by the fraction of the parent's samples it receives.
The Strategy: At each node, the tree-building algorithm picks the feature that provides the highest Information Gain (the one that reduces the "impurity" the most).
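
The mechanism above can be sketched in a few lines. This is a minimal illustration, assuming class labels are stored in plain Python lists; the labels and the function names `entropy` and `information_gain` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the sample-weighted average of the child entropies."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with 5 "yes" and 5 "no" labels, split into two children.
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]

gain = information_gain(parent, children)
pure_gain = information_gain(parent, [["yes"] * 5, ["no"] * 5])
print(f"mixed split: {gain:.4f} bits, pure split: {pure_gain:.4f} bits")
```

A split that produces perfectly pure children removes all of the parent's entropy, so its gain equals the parent entropy (1 bit here).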

★ Mathematical Formula and derivation

MI is calculated by subtracting the "uncertainty remaining" from the "total uncertainty":

$$\begin{aligned}
\text{Information Gain}(\text{Target } S, \text{Independent variable } X) &= \text{Entropy}(S) - \text{Conditional Entropy}(S \mid X) \\
I(X; Y) &= H(Y) - H(Y \mid X) && \dots (1) \\
I(X; Y) &= H(X) - H(X \mid Y) && \dots (2) \\
I(X; Y) &= H(X) + H(Y) - H(X, Y) && \dots (3) \\
I(X; Y) &= H(X, Y) - H(X \mid Y) - H(Y \mid X) && \dots (4)
\end{aligned}$$

$I(X; Y)$ ➛ The Mutual Information between $X$ and $Y$, which is the Information Gain of feature $X$ with respect to target $Y$.
$H(X)$ and $H(Y)$ ➛ The Shannon Entropy (original uncertainty) of $X$ and $Y$ respectively, before the split.
$H(X, Y)$ ➛ The Joint Entropy of $X$ and $Y$.
$H(Y|X)$ and $H(X|Y)$ ➛ The Conditional Entropy of $Y$ given $X$ and of $X$ given $Y$ respectively (the remaining uncertainty after the split).
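
The step from one form to the next is not spelled out above. A short sketch of the derivation, assuming only the chain rule for entropy, $H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$:

$$\begin{aligned}
H(Y \mid X) &= H(X, Y) - H(X) && \text{(chain rule, rearranged)} \\
I(X; Y) &= H(Y) - H(Y \mid X) && \text{form (1)} \\
&= H(Y) - \big(H(X, Y) - H(X)\big) \\
&= H(X) + H(Y) - H(X, Y) && \text{form (3)} \\
&= \big(H(X, Y) - H(Y \mid X)\big) + \big(H(X, Y) - H(X \mid Y)\big) - H(X, Y) \\
&= H(X, Y) - H(X \mid Y) - H(Y \mid X) && \text{form (4)}
\end{aligned}$$

Form (2) follows in the same way as form (1), with the roles of $X$ and $Y$ exchanged.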

Where is it used?

Feature Selection (Mutual Information): MI is calculated by subtracting the "uncertainty remaining" from the "total uncertainty". In this context, we look for features $X$ that leave the lowest Conditional Entropy $H(Y|X)$ for our target $Y$, and therefore have the highest MI. $$I(X; Y) = H(Y) - H(Y|X)$$
Machine Learning (Decision Trees): Algorithms like ID3 use Information Gain to split nodes (C4.5 uses a normalized variant, the gain ratio), i.e. to choose which question to ask first at a node so that the data is split most efficiently.
Information Gain is simply the reduction in entropy: $$\text{Gain} = H(\text{Parent}) - H(\text{Children}|\text{Split Condition})$$
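
Both uses come down to scoring each candidate feature and keeping the best one. The following is a minimal sketch, assuming a tiny made-up dataset stored as a list of dicts; the column names (`outlook`, `windy`, `play`) and the helper `gain` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, feature, target):
    """H(target) - H(target | feature), estimated from the rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])   # one child per feature value
    parent = [row[target] for row in rows]
    remaining = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(parent) - remaining

# Hypothetical toy dataset; "play" is the target.
rows = [
    {"outlook": "sunny", "windy": "no",  "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
    {"outlook": "sunny", "windy": "no",  "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

scores = {f: gain(rows, f, "play") for f in ("outlook", "windy")}
best = max(scores, key=scores.get)
print(scores, "-> split on", best)
```

Here `windy` leaves the class proportions unchanged in both children, so its gain is zero, while `outlook` isolates a pure "sunny" group and is chosen for the split.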

Mutual Information vs Information Gain

Aspect	Mutual Information	Information Gain
Context	General measure between any two variables	Specific to decision trees
Formula	$I (X; Y) = H (Y) - H (Y ∣ X)$	$\text{Gain} (S, A) = H (S) - H (S ∣ A)$
Usage	Feature selection, dependency analysis	Node splitting in decision trees
Symmetry	Yes: $I (X; Y) = I (Y; X)$	Mathematically symmetric, but used in one direction (we care about reducing target entropy)

Key Insight: Information Gain is just Mutual Information applied in the context of decision trees, where:

$S$ = Target variable (class labels)
$A$ = Attribute/feature used for splitting
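
The equivalence can be checked numerically by computing the same quantity in two ways: once as a decision-tree gain (parent entropy minus weighted child entropies) and once as Mutual Information from the entropy identity. This is a sketch with made-up data; the lists `A` and `S` are hypothetical attribute values and class labels.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of a list of hashable values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# Hypothetical attribute values and class labels for eight samples.
A = ["a", "a", "a", "b", "b", "b", "b", "c"]
S = [1, 1, 0, 0, 0, 0, 1, 1]

# Decision-tree view: Gain(S, A) = H(S) - sum_v p(A = v) * H(S | A = v)
gain = entropy(S)
for value in set(A):
    subset = [s for a, s in zip(A, S) if a == value]
    gain -= len(subset) / len(S) * entropy(subset)

# Information-theory view: I(A; S) = H(A) + H(S) - H(A, S)
mi = entropy(A) + entropy(S) - entropy(list(zip(A, S)))

print(f"Gain(S, A) = {gain:.6f}, I(A; S) = {mi:.6f}")
```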