Decision Tree Classifier
A Decision Tree is a supervised machine learning algorithm that builds a flowchart-like tree structure to make predictions by recursively partitioning data based on the most informative feature at each step.
I. Components
| Component | Role |
|---|---|
| Root Node | Topmost node; represents the full dataset; holds the single best split |
| Internal (Decision) Node | Tests one feature (e.g., Age > 30?); routes samples left or right |
| Leaf (Terminal) Node | No further split; holds the predicted class label (majority class) |
| Branch / Edge | The outcome of a test (True / False, or a threshold range) |
II. Algorithm Steps
The algorithm is greedy and recursive: it makes the locally best split at every node without backtracking.
Step 1 – Start at the root with the full dataset.
Step 2 – For every feature, evaluate every possible split threshold and compute the impurity reduction (Information Gain): IG = Impurity(parent) − Σ (n_child / n_parent) × Impurity(child).
Step 3 – Choose the feature + threshold that gives the highest Information Gain.
Step 4 – Split the data and create two child nodes (CART) or multiple children (ID3 / C4.5).
Step 5 – Recurse on each child node until a stopping criterion is met:
- Node is pure (all same class)
- max_depth reached
- min_samples_split / min_samples_leaf not satisfied
- min_impurity_decrease threshold not met
Step 6 – Assign the majority class of remaining samples to each leaf.
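The impurity computation behind Steps 2–3 can be sketched in plain Python. This is a minimal illustration with Gini impurity; the function names are my own, not from any library:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array: 1 - sum(p_i^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(feature, labels, threshold):
    """Impurity reduction from splitting on feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    weighted_child = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted_child

# Toy data: the feature cleanly separates the two classes at 30
age = np.array([22, 25, 28, 35, 40, 45])
buys = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(age, buys, 30))  # 0.5: parent Gini is 0.5, both children pure
```

A real implementation repeats this for every feature and every candidate threshold, then recurses on the children, exactly as the steps above describe.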
III. Common Algorithms Used
Overview
| Algorithm | Split Type | Criterion | Pruning | Data Types | Used for | Outliers |
|---|---|---|---|---|---|---|
| CART (✅ Default) | Binary only | Gini or MSE | Cost-Complexity | Numerical + Categorical | Classification, Regression | Very Robust |
| ID3 | Multi-way | Information Gain (Entropy) | Post-Pruning | Categorical only | Classification | Sensitive |
| C4.5 | Multi-way | Gain Ratio | Post-Pruning | Both | Classification | Sensitive |
| CHAID | Multi-way | Chi-Square / F-test | Pre-Pruning (Early Stop) | Both | Classification, Regression | Sensitive |
★ CART — Classification and Regression Trees
Speciality:
- Strictly binary splits;
- Handles both classification (Gini/Entropy) and regression (MSE/MAE). In the sklearn library these are handled by DecisionTreeClassifier and DecisionTreeRegressor respectively.
Advantages and Disadvantages
| ✅ Pros | ⚠️ Cons |
|---|---|
| Binary splits → lower overfitting risk | Deeper trees than multi-way algorithms |
| Handles numerical + categorical data | Greedy; no guarantee of global optimum |
| Native pruning support (ccp_alpha) | |
★ ID3 — (Iterative Dichotomiser 3)
Speciality:
Pioneer algorithm. Picks the feature with the highest Information Gain (entropy-based). Produces multi-way splits — one branch per unique category value.
Advantages and Disadvantages
| ✅ Pros | ⚠️ Cons / 🚨 Caution |
|---|---|
| Simple and highly interpretable: you can literally draw the tree and explain it to a client | Categorical data only: it struggles with continuous numbers |
| Multi-way splits can be more natural | Bias: it prefers features with many categories (like "Date" or "ID number"), even if they aren't actually useful |
| No preprocessing: works great with categorical data (Yes/No, Red/Blue) | Overfitting: no built-in pruning; ID3 tends to build very deep, complex trees that "memorize" noise |
| Fast: it uses a greedy approach, making the best choice at each moment | |
🚨 Caution: Never use raw IDs, zip codes, or timestamps as features with ID3 — they will always be selected first and create useless, overfitting branches.
★ C4.5
Speciality:
Successor to ID3. It is more sophisticated than ID3, handles "noisy" data better, and is much more efficient. Replaces Information Gain with Gain Ratio to penalise high-cardinality features. Handles continuous attributes via threshold splitting. Grows the tree as large as possible and then prunes it back.
Gain Ratio
Instead of just looking at Information Gain, C4.5 also calculates Split Information, which measures how "broad" the split is:
SplitInfo(A) = − Σᵢ (|Sᵢ| / |S|) × log₂(|Sᵢ| / |S|)
GainRatio(A) = Gain(A) / SplitInfo(A)
By putting Split Information in the denominator, C4.5 effectively "taxes" features that have too many unique outcomes.
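The "tax" on broad splits can be demonstrated with a small sketch (helper names are my own, not from any library):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature, labels):
    """C4.5-style Gain Ratio for a categorical feature (illustrative sketch)."""
    n = len(labels)
    gain = entropy(labels)
    split_info = 0.0
    for value in np.unique(feature):
        subset = labels[feature == value]
        w = len(subset) / n
        gain -= w * entropy(subset)   # information gain accumulates here
        split_info -= w * np.log2(w)  # "broadness" of the split
    return gain / split_info if split_info > 0 else 0.0

# A useless high-cardinality feature (one branch per row) gets taxed:
ids = np.array([0, 1, 2, 3, 4, 5])  # unique ID per sample
weather = np.array(["sun", "sun", "sun", "rain", "rain", "rain"])
y = np.array([1, 1, 1, 0, 0, 0])
print(gain_ratio(ids, y), gain_ratio(weather, y))
```

Both features have identical Information Gain (1 bit), but the ID feature's large Split Information drags its Gain Ratio down, so the meaningful weather feature wins.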
The Key Upgrades (Why it beat ID3)
A. Handling Continuous Attributes (Numbers)
ID3 could only handle categories (e.g., "Weather: Sunny, Rainy"). C4.5 can handle numbers (e.g., "Temperature: 72.5°F").
How? It sorts the numbers and creates a threshold split (e.g., Temperature > 72.5 vs. Temperature ≤ 72.5).
B. Handling Missing Values
In real datasets, some rows are always missing data. ID3 would crash or ignore them. C4.5 uses a probabilistic approach, assigning a weight to those records based on the distribution of the known values.
C. Pruning (The "Anti-Overfitting" Tool)
Decision trees often grow too deep, "memorizing" noise in the training data. C4.5 uses Post-Pruning: it builds the full tree and then looks at branches that don't actually help with prediction accuracy and "snips" them off.
D. Gain Ratio (Solving the "Bias" Problem)
ID3 had a massive flaw: it loved features with many unique values (like a "Date" or "Transaction ID"). These have high "Information Gain" but zero predictive power.
How? C4.5 uses the Gain Ratio, which penalizes features that have too many branches, ensuring the tree picks meaningful variables instead.
Choose when: You need multi-way splits but want to avoid the high-cardinality bias of ID3.
Advantages and Disadvantages
| ✅ Pros | ⚠️ Cons |
|---|---|
| Fixes ID3's cardinality bias | Slower than CART |
| Handles continuous features and missing data | Not natively in sklearn |
| Supports post-pruning (C5.0 successor) |
Today (Apr 2026), while we often use even more advanced versions like C5.0 (which is faster and uses less memory) or Random Forests, C4.5 remains the benchmark algorithm for "Explainable AI." Because the tree is pruned and uses Gain Ratio, the resulting logic is usually very clean and easy for a human to read.
Code example
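C4.5 itself is not implemented in sklearn, but criterion="entropy" reproduces its information-theoretic splitting (without Gain Ratio or multi-way splits), and ccp_alpha stands in for its "grow large, then prune back" philosophy. A rough, hedged stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Entropy criterion + pruning: an sklearn approximation of C4.5's approach,
# not the actual C4.5 algorithm (binary splits only, no Gain Ratio).
clf = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01, random_state=42)
clf.fit(X, y)

# Print the tree rules to inspect the learned thresholds
print(export_text(clf, feature_names=load_iris().feature_names))
print(clf.score(X, y))
```

For a faithful C4.5/C5.0 implementation you would need a dedicated package outside sklearn.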
★ CHAID — Chi-square Automatic Interaction Detection
Speciality:
- Unlike ID3 or C4.5, which use Information Theory (Entropy), CHAID uses Formal Statistics (Chi-square tests) to determine how to split your data.
The Math:
CHAID calculates the Pearson Chi-square statistic for every potential split.
- For Categorical Targets: It measures the independence between the feature and the target.
- For Numerical Targets: It uses F-tests (ANOVA) instead of Chi-square to see if the means of the groups are significantly different.
Differentiating Factors
- CHAID is more conservative: it uses significance testing (p-values). If a split isn't statistically significant (typically at the 0.05 level), CHAID simply stops growing that branch. This prevents the tree from seeing "patterns" that are actually just random noise.
- Naturally produces multi-way splits. Very interpretable in business settings.
- Category Merging: Before splitting, CHAID looks at the categories of a feature. If two categories (like "Gold" and "Platinum" credit card holders) show nearly identical behavior, CHAID will merge them into a single group before making the split. This keeps the tree simple and "clean."
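The significance test at the heart of CHAID can be illustrated with scipy (CHAID itself is not in sklearn; this only shows the chi-square split decision, with made-up counts):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = feature categories, columns = target classes
# e.g., card tier vs. churn (counts are invented for illustration)
observed = np.array([
    [90, 10],   # "Standard": 90 stay, 10 churn
    [40, 60],   # "Gold":     40 stay, 60 churn
])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.1f}, p={p_value:.2e}")

# CHAID-style decision: split only if the association is significant
if p_value < 0.05:
    print("Split on this feature")
else:
    print("Stop growing this branch")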
Choose when:
You need statistically validated splits, survey analysis, or marketing segmentation, e.g.:
- Credit Risk: Banks love CHAID because they can explain exactly why a "p-value" triggered a rejection.
- Direct Marketing: If you want to know which demographic (Age + Income + Location) is most likely to buy an ETF, CHAID creates easy-to-read segments.
- Non-Linear Relationships: CHAID is great at finding interactions between variables (e.g., "Age" only matters if "Income" is above a certain level).
Advantages and Disadvantages
| ✅ Pros | ⚠️ Cons |
|---|---|
| Statistically grounded splits | Requires sufficient sample size per split |
| Multi-way splits → shallower, more readable trees | Not in sklearn |
| Great for nominal / ordinal data | Can miss complex interactions |
IV. Split Criteria (Metrics)
Set via the criterion= parameter in DecisionTreeClassifier.
i. Gini Impurity — criterion="gini" (default)
- Range: 0 (pure) → 0.5 (binary, perfectly mixed)
- Intuition: Probability of incorrectly labelling a randomly picked sample using the node's class distribution.
- Speed: ⚡ Fastest — no logarithm required.
Use when: General-purpose classification; default choice; speed matters.
ii. Entropy — criterion="entropy"
- Range: 0 (pure) → 1 (binary, perfectly mixed) → log₂(C) (C-class max)
- Intuition: Average information needed to identify the class of a sample — measures disorder.
- Speed: ~10–20% slower than Gini due to log calculation.
Use when: You want information-theoretic splits; studying feature selection; replicating ID3/C4.5 behaviour.
📌 In practice: Gini and Entropy almost always produce near-identical trees and accuracy. Start with Gini; switch to Entropy only if cross-validation shows a meaningful difference.
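The two metrics can be computed directly to see how similarly they behave (plain NumPy sketch; helper names are my own):

```python
import numpy as np

def gini(p):
    """Gini impurity from a class-probability vector."""
    return 1.0 - np.sum(np.asarray(p) ** 2)

def entropy(p):
    """Shannon entropy in bits; 0*log(0) treated as 0."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # both 0 for a pure node
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0 (binary maxima)
print(round(gini([0.7, 0.3]), 2), round(entropy([0.7, 0.3]), 2))
```

Note that both are minimised at purity and maximised at a uniform mix; they only differ in scale and curvature, which is why the resulting trees are usually near-identical.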
iii. Log Loss — criterion="log_loss" (sklearn ≥ 1.1)
- Intuition: Penalises confident wrong predictions heavily; directly optimises probability calibration rather than classification boundaries.
- Speed: Slowest of the three.
Use when: You need well-calibrated probabilities for downstream use — risk scoring, medical diagnosis, cost-sensitive decisions where predict_proba output matters more than raw accuracy.
V. Binary vs. Multi-way Splits
i. Binary Split (CART — sklearn default)
Every node asks exactly one yes/no question. Categorical features are tested as feature ∈ {subset} vs feature ∉ {subset}.
- Produces a deeper but narrower tree.
- Lower overfitting risk — small categories get grouped together rather than isolated branches.
Outlook == Overcast?
├── Yes → [Pure: Play=Yes] ✓ Stop
└── No → [Mixed: Sunny + Rainy] → split further
ii. Multi-way Split (ID3 / C4.5 / CHAID)
Creates one branch per unique value of the chosen feature simultaneously.
- Produces a wider but shallower tree.
- Higher overfitting risk — high-cardinality features produce many tiny, pure branches that don't generalise.
Outlook?
├── Sunny → [Mixed] → split further
├── Overcast → [Pure: Play=Yes] ✓ Stop
└── Rainy → [Mixed] → split further
Comparison
| | Binary (CART) | Multi-way (ID3 / C4.5 / CHAID) |
|---|---|---|
| Branches per node | Always 2 | One per unique category value |
| Tree shape | Deep, narrow | Shallow, wide |
| Overfitting risk | Lower | Higher on high-cardinality features |
| Sklearn | ✅ Native | ❌ Not supported |
| Best for | General use, continuous features | Categorical-heavy, interpretable trees |
VI. Pros and Cons
| ✅ Pros | ⚠️ Cons |
|---|---|
| Highly interpretable — visualise and explain to non-experts | Unstable — small data changes can produce a completely different tree |
| No feature scaling required | Prone to overfitting — unbounded trees memorise noise |
| Handles numerical and categorical data | Greedy — local optimal splits do not guarantee the globally best tree |
| Fast to train and predict | Biased toward high-cardinality features (especially with entropy/ID3) |
| Captures non-linear relationships naturally | Axis-aligned decision boundaries — struggles with diagonal patterns |
| Natural multi-class support | Large trees become hard to interpret |
| Provides feature importances | High variance — single DTs are weak learners, hence used in ensembles (RF, XGBoost) |
VII. Overfitting, Bias–Variance Trade-Off & Pruning
The Core Problem
An unconstrained decision tree grows until every leaf is pure — perfectly fitting the training data but failing on unseen data.
Unconstrained tree: Train acc = 100%, Test acc = 65% ← Overfit
max_depth=4 tree: Train acc = 85%, Test acc = 83% ← Generalises ✅
max_depth=2 tree: Train acc = 72%, Test acc = 71% ← Underfit
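The pattern above can be reproduced with a quick depth sweep (dataset choice is illustrative; exact numbers will differ from the figures above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# None = unconstrained (overfits), 4 = tuned, 2 = shallow (underfits)
for depth in [None, 4, 2]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={clf.score(X_tr, y_tr):.2f}, "
          f"test={clf.score(X_te, y_te):.2f}")
```

The unconstrained tree reaches 100% train accuracy while its test accuracy lags; a moderate depth narrows that gap.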
Bias–Variance Trade-Off
| Condition | Tree Depth | Bias | Variance | Result |
|---|---|---|---|---|
| Too shallow | Small | High ↑ | Low ↓ | Underfitting |
| Just right | Tuned | Balanced | Balanced | ✅ Good generalisation |
| Too deep | Large | Low ↓ | High ↑ | Overfitting |
Approach 1 — Pre-Pruning (Hyperparameter Constraints)
Stop the tree from growing too large during training.
| Parameter | Type | Effect | Typical Range |
|---|---|---|---|
| max_depth | int | Hard cap on tree depth | 3–10 |
| min_samples_split | int or float | Min samples needed to attempt a split | 2–20 (int) / 0.01–0.1 (float) |
| min_samples_leaf | int or float | Min samples required in any leaf | 1–20 (int) / 0.01–0.05 (float) |
| min_impurity_decrease | float | Split only if IG ≥ this threshold | 0.0–0.05 |
| max_leaf_nodes | int | Cap on total leaves; grows best-first by IG | 10–100 |
| max_features | int / float / "sqrt" / "log2" | Features evaluated per split; adds randomness | "sqrt" for RF-like behaviour |
max_leaf_nodes — grows the tree best-first (highest IG first) rather than depth-first, then stops when the leaf count is reached. Often produces a better shape than max_depth alone because it always expands the most informative branch next.
max_features — at each split, only a random subset of features is evaluated. Reduces correlation between splits, acts as mild regularisation, and is the key mechanism that makes Random Forest effective.
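A typical pre-pruned configuration combining these constraints (the values are illustrative starting points, not universal defaults):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    max_depth=5,                  # hard depth cap
    min_samples_split=10,         # need >= 10 samples to attempt a split
    min_samples_leaf=5,           # every leaf keeps >= 5 samples
    min_impurity_decrease=0.01,   # ignore near-useless splits
    max_features="sqrt",          # random feature subset per split
    random_state=42,
).fit(X, y)

print(clf.get_depth(), clf.get_n_leaves())
```

Each constraint alone is a blunt instrument; combined and tuned via cross-validation they keep the tree small without starving it.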
Approach 2 — Post-Pruning with ccp_alpha
Train a full tree first, then prune back branches whose cost-complexity score falls below the threshold ccp_alpha.
Cost-Complexity Criterion:
R_α(T) = R(T) + α × |leaves(T)|
Where R(T) is the total impurity of the tree's leaves, |leaves(T)| is the number of leaves, and α (the ccp_alpha parameter) is the complexity penalty — larger α prunes more aggressively.
| ccp_alpha value | Effect |
|---|---|
| 0.0 | No pruning — full tree (overfits) |
| Small (0.001–0.01) | Light pruning |
| Large (> 0.05) | Heavy pruning (may underfit) |
Always select ccp_alpha via cross-validation using the pruning path — never guess.
Approach 3 — GridSearchCV
Cross-validate over multiple hyperparameters simultaneously to find the optimal combination.
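A minimal GridSearchCV sketch (the grid values and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid, cv=5, scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

After fitting, search.best_estimator_ is a ready-to-use tree refit on the full data with the winning combination.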
VIII. Special Parameters Deep-Dive
1. ccp_alpha — Cost-Complexity Pruning
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
# path.ccp_alphas → array of alpha values from full tree → root-only tree
# path.impurities → corresponding total impurity at each alpha
Train one tree per alpha value, plot accuracy vs alpha, and pick the alpha that maximises cross-validation accuracy.
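Putting that selection loop into code (the dataset and split are illustrative; this is a sketch of the procedure, not the only way to do it):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
alphas = np.clip(path.ccp_alphas, 0, None)  # guard against float round-off

# One tree per alpha; keep the alpha with the best CV accuracy
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=42),
                          X_train, y_train, cv=5).mean()
          for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
print(f"best ccp_alpha = {best_alpha:.4f}")

final = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42).fit(X_train, y_train)
print(f"test accuracy = {final.score(X_test, y_test):.3f}")
```

In practice you would also plot scores against alphas to see the accuracy plateau before pruning becomes destructive.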
2. class_weight — Handling Class Imbalance
Adjusts the impurity calculation so minority classes are treated as more important during splitting.
- By setting class_weight, you are telling the math: "Every time you misclassify a minority sample, it hurts more than misclassifying a majority sample."
- The tree will then favor splits that isolate the minority class, even if it makes the overall accuracy slightly lower.
# Auto-balance by inverse class frequency
DecisionTreeClassifier(class_weight='balanced')
# Manual weights (minority class gets 5× the weight)
DecisionTreeClassifier(class_weight={0: 1, 1: 5})
| Situation | Recommendation |
|---|---|
| Balanced classes | None (default) |
| Moderately imbalanced | 'balanced' |
| Severely imbalanced / cost-sensitive fraud/medical | Manual dict, e.g., {0:1, 1:10} |
📌 Using class_weight is equivalent to oversampling the minority class but without duplicating rows.
🚫 If your dataset is extremely imbalanced (e.g., 1 in 10,000), class weights can make the Decision Tree very "jittery" and prone to overfitting on those few minority samples.
3. max_features — Feature Subsampling per Split
| Value | Features considered at each split | Effect |
|---|---|---|
| None (default) | All | Deterministic; always finds the best split at each node |
| "sqrt" | √n_features | Randomizes splits — Random Forest default |
| "log2" | log₂(n_features) | More aggressive randomization |
| int | Exact number | Fine-grained control |
| float | Fraction, e.g. 0.5 = 50% | Proportional subsampling |
Use "sqrt" when using a single DT as a weak learner, or to add regularisation without building a full ensemble.
4. max_leaf_nodes — Best-First Growth Cap
- Without max_leaf_nodes → depth-first, may waste splits on unimportant branches
- With max_leaf_nodes → always expands the most informative branch next
- Bias–variance: often gives a better bias–variance balance than max_depth alone, especially on imbalanced or multi-class data.
Growth order: at each step the leaf with the highest potential IG is expanded — not simply left-to-right by depth. This gives a more efficient use of the node budget.
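A quick comparison of the two growth strategies under a similar node budget (dataset and values are illustrative; results vary by data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Depth-first cap vs. best-first leaf cap with a comparable budget
depth_tree = DecisionTreeClassifier(max_depth=4, random_state=42)
leaf_tree = DecisionTreeClassifier(max_leaf_nodes=16, random_state=42)

for name, tree in [("max_depth=4", depth_tree), ("max_leaf_nodes=16", leaf_tree)]:
    score = cross_val_score(tree, X, y, cv=5).mean()
    print(f"{name}: CV accuracy = {score:.3f}")
```

A depth-4 binary tree can hold at most 16 leaves, so the two budgets are comparable; the best-first tree is free to spend more of that budget on its most informative branch.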
IX. Code Examples
- Train, Evaluate & Visualise: Open in Colab
- Bias–Variance Plot: Sweeping max_depth: Open in Colab
- Post-Pruning: ccp_alpha Path: Open in Colab
- GridSearchCV Hyperparameter Tuning: Open in Colab
- Feature Importance Plot: Open in Colab
X. Summary Cheat-Sheet
| Question | Answer |
|---|---|
| Default sklearn criterion? | gini |
| When to use log_loss? | When calibrated probabilities matter (risk, medical) |
| Binary vs multi-way? | sklearn = binary (CART); ID3/C4.5 = multi-way |
| First step to prevent overfitting? | Set max_depth + validate with CV |
| When to use ccp_alpha? | Post-training pruning — select via pruning path plot |
| When to use class_weight='balanced'? | Any imbalanced classification task |
| When to use max_features='sqrt'? | Regularisation or building trees for ensembles |
| max_leaf_nodes vs max_depth? | max_leaf_nodes is often better — grows highest-IG splits first |
| ID3 biggest risk? | High-cardinality features dominate (use C4.5 or CART instead) |
Key attributes available after fitting:
clf.tree_.max_depth # actual depth reached
clf.get_depth() # same via public API
clf.get_n_leaves() # number of leaf nodes
clf.feature_importances_ # Gini-based importance per feature
clf.classes_ # class labels in order
XI. Questions and Answers
1. How to pick the values for class_weight?
Option A: The "Balanced" Auto-Pilot (Recommended)
Most libraries (like Scikit-Learn) have a built-in setting called class_weight='balanced'.
The Formula:
It automatically calculates weights inversely proportional to class frequencies:
weight_c = n_samples / (n_classes × n_samples_c)
Example: If you have 100 samples (90 "Safe", 10 "Risk"):
- Weight for Safe: 100 / (2 × 90) ≈ 0.56
- Weight for Risk: 100 / (2 × 10) = 5.0
- Result: The model treats each "Risk" instance as roughly 9× more important than a "Safe" instance.
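The balanced formula weight_c = n_samples / (n_classes × n_samples_c) can be checked directly with sklearn's own utility:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 90 "Safe" (class 0) and 10 "Risk" (class 1) samples
y = np.array([0] * 90 + [1] * 10)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

print(weights)                             # ~[0.56, 5.0]: 100/(2*90) and 100/(2*10)
print(round(weights[1] / weights[0], 1))   # 9.0: Risk counts ~9x more than Safe
```

This is exactly what DecisionTreeClassifier(class_weight='balanced') computes internally before weighting the impurity calculation.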
Option B: Manual "Cost-Based" Weighting
Sometimes "Balanced" isn't enough. In finance, the cost of a False Negative (missing a default) is much higher than a False Positive (denying a good loan).
You can pass a dictionary: class_weight={0: 1, 1: 20}.
- Pick this when you have a specific business cost associated with an error. If a missed "Class 1" event costs the firm 20× what a false alarm costs, a 1:20 weight ratio is mathematically sound.
How manually setting class_weight changes the math (Gini/Entropy)?
When you add weights, the "purity" calculation changes.
Instead of: Gini = 1 − Σᵢ pᵢ², with pᵢ = nᵢ / n (raw class proportions),
It becomes: Gini_w = 1 − Σᵢ p̃ᵢ², with p̃ᵢ = wᵢ·nᵢ / Σⱼ wⱼ·nⱼ (weighted class proportions).
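A sketch of that weighted Gini computation (my own helper, not a library function):

```python
import numpy as np

def weighted_gini(labels, class_weight):
    """Gini impurity with per-class weights: 1 - sum(p_w^2),
    where p_w uses weighted counts instead of raw counts."""
    classes, counts = np.unique(labels, return_counts=True)
    w_counts = np.array([class_weight[c] * n for c, n in zip(classes, counts)])
    p = w_counts / w_counts.sum()
    return 1.0 - np.sum(p ** 2)

node = np.array([0] * 90 + [1] * 10)  # 90 majority, 10 minority samples
print(round(weighted_gini(node, {0: 1, 1: 1}), 2))  # 0.18: node looks fairly pure
print(round(weighted_gini(node, {0: 1, 1: 9}), 2))  # 0.5: maximally impure once weighted
```

With weights applied, a node the unweighted tree would happily leave alone suddenly looks impure, so the tree keeps splitting until the minority class is isolated.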