I. Entropy
II. Joint Entropy
III. Conditional Entropy
IV. Mutual Information
V. Information Gain


IV. Mutual Information (MI)

What is Mutual Information?

Mutual Information (MI) quantifies the amount of information obtained about one random variable by observing another random variable. In simpler terms, it measures how much knowing the value of variable X reduces uncertainty about variable Y (and vice versa).
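As a concrete illustration, MI can be computed directly from a joint distribution via the standard definition I(X;Y) = Σ p(x,y) · log₂[p(x,y) / (p(x)p(y))]. A minimal Python sketch (the distributions below are made up for illustration):

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum over (x, y) of p(x,y) * log2(p(x,y) / (p(x)p(y)))."""
    # Marginals p(x) and p(y) from the joint table {(x, y): probability}.
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    mi = 0.0
    for (x, y), p in joint.items():
        if p > 0:  # terms with p(x,y) = 0 contribute nothing
            mi += p * math.log2(p / (px[x] * py[y]))
    return mi

# Independent fair coins: knowing X tells us nothing about Y.
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(mutual_information(indep))   # 0.0

# Perfectly correlated coins: X determines Y completely.
corr = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(corr))    # 1.0 bit
```

The two extreme cases here match the interpretation table further below: independent variables share zero information, while perfectly correlated variables share H(X) = 1 bit.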

Conditional Entropy is the "workhorse" behind Mutual Information (MI), which is the primary metric used in feature selection to rank how relevant a feature X is to a target Y.

Mathematical Properties

Symmetry
I(X;Y)=I(Y;X)

Unlike conditional entropy, MI is symmetric. It doesn't matter which variable you consider first.

Relationship to Entropy
I(X;Y) = H(X) + H(Y) - H(X,Y)

This can be visualized as a Venn diagram: H(X) and H(Y) are two overlapping circles, and I(X;Y) is their overlap.

Alternative Formulations

All these are equivalent:

  1. Via Conditional Entropy:
I(X;Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)
  2. Via Joint Entropy:
I(X;Y) = H(X) + H(Y) - H(X,Y)
  3. Via Both Conditional Entropies:
I(X;Y) = H(X,Y) - H(X|Y) - H(Y|X)
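The equivalence of these formulations can be verified numerically. A small sketch, using a hypothetical joint distribution and the chain rule H(Y|X) = H(X,Y) - H(X):

```python
import math

def H(dist):
    """Shannon entropy in bits of a probability dict {outcome: p}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical joint distribution p(x, y), chosen for illustration.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {0: 0.5, 1: 0.5}  # marginal of X
py = {0: 0.5, 1: 0.5}  # marginal of Y

h_x, h_y, h_xy = H(px), H(py), H(joint)
h_y_given_x = h_xy - h_x          # chain rule: H(Y|X) = H(X,Y) - H(X)
h_x_given_y = h_xy - h_y

via_cond  = h_y - h_y_given_x                    # formulation 1
via_joint = h_x + h_y - h_xy                     # formulation 2
via_both  = h_xy - h_x_given_y - h_y_given_x     # formulation 3
print(via_cond, via_joint, via_both)  # all three agree
```

All three values coincide (up to floating-point error), since each formulation is an algebraic rearrangement of the same identity.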

Intuitive Interpretation

| Scenario | MI Value | Interpretation |
|---|---|---|
| X and Y are independent | I(X;Y) = 0 | No shared information |
| X and Y are perfectly correlated | I(X;Y) = H(X) = H(Y) | Complete information overlap |
| X and Y are partially related | 0 < I(X;Y) < H(X) | Some shared information |

V. Information Gain (IG)

Entropy tells us how impure a set of data is. Information Gain (IG) tells us how much that impurity is reduced after we split the data on a particular feature.

Information Gain (IG) is the reduction in entropy achieved by partitioning a dataset based on an attribute.

In other words, IG measures how much "information" a feature provides about the target class. A feature that creates very pure subgroups (low entropy) after a split has a high Information Gain.

Ever wonder how a decision tree decides which feature to split on first?

The answer lies in a powerful pair of concepts borrowed from information theory: Entropy and Information Gain (IG).


In decision trees (such as ID3 and C4.5), Information Gain is the practical application of Mutual Information. It measures the change in entropy from a "state of ignorance" to a "state of knowledge" after a split, and it is the standard metric for choosing the best feature to split on: the one that removes the most entropy from the target variable after partitioning the data.

★ Mathematical Formula and Derivation

MI is calculated by subtracting the "uncertainty remaining" from the "total uncertainty":

Information Gain(Target S, Independent variable X) = Entropy(S) - Conditional Entropy(S|X)

I(X;Y) = H(Y) - H(Y|X)              (1)
I(X;Y) = H(X) - H(X|Y)              (2)
I(X;Y) = H(X) + H(Y) - H(X,Y)       (3)
I(X;Y) = H(X,Y) - H(X|Y) - H(Y|X)   (4)
Where is it used?

  • Feature Selection (Mutual Information): As noted above, MI is calculated by subtracting the "uncertainty remaining" from the "total uncertainty". In this context, we use it to find features X that result in the lowest Conditional Entropy for our target Y. $$I(X; Y) = H(Y) - H(Y|X)$$
  • Machine Learning (Decision Trees): Algorithms like ID3 and C4.5 use Information Gain to split nodes, i.e., to choose which question to ask first at a node so that the data is split most efficiently.
    Information Gain is simply the reduction in entropy: $$\text{Gain} = H(\text{Parent}) - H(\text{Children}|\text{Split Condition})$$
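The gain formula above can be sketched for a categorical split: compute the parent entropy, then subtract the weighted entropy of each child subset. The dataset below is a made-up toy example, not from any particular source:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of the class distribution in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature_idx, label_idx=-1):
    """Gain = H(parent) - sum over values v of p(v) * H(child_v)."""
    parent = [r[label_idx] for r in rows]
    gain = entropy(parent)
    for v in set(r[feature_idx] for r in rows):
        child = [r[label_idx] for r in rows if r[feature_idx] == v]
        gain -= (len(child) / len(rows)) * entropy(child)
    return gain

# Hypothetical toy dataset: (outlook, play?)
rows = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
        ("rain", "yes"), ("rain", "no"), ("overcast", "yes")]
print(information_gain(rows, 0))  # ≈ 0.667 bits
```

Here the "sunny" and "overcast" children are pure (entropy 0), so almost all the remaining uncertainty comes from the mixed "rain" subset; a feature producing purer children yields a higher gain.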

Mutual Information vs Information Gain

| Aspect | Mutual Information | Information Gain |
|---|---|---|
| Context | General measure between any two variables | Specific to decision trees |
| Formula | I(X;Y) = H(Y) - H(Y\|X) | Gain(S,A) = H(S) - H(S\|A) |
| Usage | Feature selection, dependency analysis | Node splitting in decision trees |
| Symmetry | Yes: I(X;Y) = I(Y;X) | Not necessarily (we care about reducing target entropy) |
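The correspondence in the table can be checked numerically: on the same sample, the decision-tree gain and the mutual information between feature and label coincide. A sketch in Python (the dataset values are invented for illustration):

```python
import math
from collections import Counter

def entropy(items):
    """H in bits of the empirical distribution over `items` (any hashables)."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

# Toy sample of (feature value, class label) pairs — hypothetical data.
data = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
        ("rain", "yes"), ("rain", "yes"), ("rain", "no"),
        ("overcast", "yes"), ("sunny", "no")]

labels = [y for _, y in data]
features = [x for x, _ in data]

# Information Gain: parent entropy minus weighted child entropies.
gain = entropy(labels)
for v in set(features):
    subset = [y for x, y in data if x == v]
    gain -= (len(subset) / len(data)) * entropy(subset)

# Mutual information via I(X;Y) = H(X) + H(Y) - H(X,Y); entropy(data)
# is the joint empirical entropy, since each element is an (x, y) pair.
mi = entropy(features) + entropy(labels) - entropy(data)
print(round(gain, 9) == round(mi, 9))  # True: IG is MI in this setting
```

Both routes measure the same quantity on the empirical distribution, which is exactly the "Key Insight" below the table.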

Key Insight: Information Gain is just Mutual Information applied in the context of decision trees, where the target class S plays the role of Y and the splitting attribute A plays the role of X.