I. Entropy
II. Joint Entropy
III. Conditional Entropy
IV. Mutual Information
V. Information Gain


I. Entropy (Shannon Entropy)

In information theory, Shannon Entropy (H) is the mathematical measure of the average uncertainty, surprise, or information contained within a random variable.

In other words, Shannon Entropy is the fundamental measure of randomness or impurity in your data. In Data Science, we use it to understand how "spread out" or "mixed" a distribution is.

It forms the conceptual foundation for Mutual Information, which measures how much of this entropy is reduced when we know another variable.

Example

  • High Entropy → More Randomness → Heterogeneous. The data is unpredictable (e.g., a fair coin toss).
  • Low Entropy → Less Randomness → Homogeneous. The data is predictable (e.g., a coin that almost always lands heads).
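To make the contrast concrete, here is a minimal sketch in Python (the helper `entropy_bits` is illustrative, not from the article):

```python
import math

def entropy_bits(probs):
    # Shannon entropy in bits; zero-probability outcomes contribute nothing
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))    # fair coin: 1.0 bit (high entropy, unpredictable)
print(entropy_bits([0.99, 0.01]))  # heavily biased coin: ~0.08 bits (low entropy, predictable)
```

The fair coin hits the maximum possible entropy for two outcomes (1 bit), while the biased coin is nearly deterministic, so its entropy is close to zero.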

★ The Formula

For a discrete random variable X with possible outcomes {x_1, ..., x_n}, the formula is:

H(X) = − Σ_{i=1}^{n} P(x_i) log_b P(x_i)

Where

  1. Probability P(x_i): This is the likelihood of a specific outcome x_i occurring.
  2. Surprise/Information Content log_b P(x_i):
    • Information is inversely proportional to probability. If an event is 100% certain, it provides zero information when it happens.
    • If an event is rare, its occurrence is very "surprising" and contains high information.
  3. The Negative Sign (−): Since probabilities are between 0 and 1, their logarithms are negative. The leading negative sign ensures the final entropy value is positive.
  4. The Base (b):
    • If b=2, entropy is measured in bits.
    • If b=e (natural log), it is measured in nats.
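The formula above translates almost line-for-line into code. The sketch below (the function name `shannon_entropy` is my own, not from the article) supports both bases from point 4:

```python
import math

def shannon_entropy(probs, base=2):
    """H(X) = -sum over i of P(x_i) * log_b P(x_i).

    base=2 gives bits; base=math.e gives nats.
    Terms with P(x_i) == 0 are skipped (their limit contribution is 0).
    """
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair 4-sided die: four equally likely outcomes
print(shannon_entropy([0.25] * 4))          # 2.0 bits
print(shannon_entropy([0.25] * 4, math.e))  # ~1.386 nats (= ln 4)
```

Note the two results describe the same uncertainty; they differ only by the constant factor ln 2, exactly as switching logarithm bases predicts.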

★ How to Interpret "Entropy" Value?

Think of this as the "Chaos Level" of your target variable before you know anything else.

★ The Weather Example

Imagine you live in a place where the weather is almost always Sunny.
Dataset: 9 days of Sun, 1 day of Rain.

H(Weather) = −(0.9 · log_2 0.9) − (0.1 · log_2 0.1) ≈ (0.9 × 0.152) + (0.1 × 3.32) ≈ 0.137 + 0.332 = 0.469 bits

The Intuition: An entropy of 0.469 bits is low (the maximum for two outcomes is 1 bit), which indicates the weather is predictable. If you tell someone "It's sunny today," they aren't very surprised. However, notice that the "Rain" term in the math (0.332) actually contributes more to the total entropy than the "Sun" term (0.137).
Why? Because the rare event (Rain) carries more "surprise" or "information value" when it actually happens!
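The per-outcome contributions can be checked directly. This short sketch reproduces the weather calculation and separates the two terms (variable names are my own):

```python
import math

# Weather dataset from the example: 9 sunny days, 1 rainy day
p_sun, p_rain = 0.9, 0.1

# Each outcome's contribution to H(Weather), in bits
sun_term = -p_sun * math.log2(p_sun)     # ~0.137 bits
rain_term = -p_rain * math.log2(p_rain)  # ~0.332 bits

# The rare event (Rain) contributes more, despite occurring far less often
print(round(sun_term, 3), round(rain_term, 3), round(sun_term + rain_term, 3))
# prints: 0.137 0.332 0.469
```

Running this confirms the hand calculation: the single rainy day accounts for roughly 70% of the total entropy, because its surprise (−log_2 0.1 ≈ 3.32 bits) is so much larger than the sun's (−log_2 0.9 ≈ 0.152 bits).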