Decision Tree

I. Common Algorithms Used

★ CART — Classification and Regression Trees

Speciality:

The scikit-learn default. Builds strictly binary trees using Gini impurity (classification) or MSE (regression), handles both numerical and categorical data, and supports cost-complexity pruning (ccp_alpha).
Advantages and Disadvantages
| ✅ Pros | ⚠️ Cons |
|---|---|
| Binary splits → lower overfitting risk | Deeper trees than multi-way algorithms |
| Handles numerical + categorical data | Greedy; no guarantee of global optimum |
| Native pruning support (ccp_alpha) | |
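A minimal sketch of CART in scikit-learn, showing the ccp_alpha pruning knob in action. The synthetic dataset and the alpha value 0.01 are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, illustrative data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unpruned tree: grows until the leaves are pure (tends to overfit).
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruned tree: ccp_alpha > 0 trades training accuracy for a simpler tree.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("unpruned depth:", full.get_depth(), "leaves:", full.get_n_leaves())
print("pruned   depth:", pruned.get_depth(), "leaves:", pruned.get_n_leaves())
```

Increasing ccp_alpha prunes more aggressively; a common workflow is to scan candidate alphas with cross-validation and keep the smallest tree whose test accuracy holds up.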

★ ID3 — Iterative Dichotomiser 3

Speciality:

The pioneering decision-tree algorithm. At each node it picks the feature with the highest Information Gain (entropy-based) and produces multi-way splits — one branch per unique category value. ID3 can also be adapted to regression problems by replacing Information Gain with Standard Deviation Reduction (SDR).
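The Information Gain computation is small enough to sketch from scratch. The entropy and information_gain helpers and the toy weather rows below are illustrative, not from any library.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Parent entropy minus the size-weighted entropy of each branch."""
    n = len(labels)
    branches = {}
    for row, label in zip(rows, labels):
        branches.setdefault(row[feature], []).append(label)
    weighted = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(labels) - weighted

# Toy data: "outlook" perfectly separates the labels, so its gain is maximal.
rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, "outlook"))  # 1.0 — a perfect split
```

ID3 simply evaluates this score for every remaining feature at each node and branches on the winner.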

Advantages and Disadvantages
| ✅ Pros | ⚠️ Cons |
|---|---|
| Simple and highly interpretable: you can literally draw the tree and explain it to a client | Categorical data only: struggles with continuous numbers (like 100.50) unless you bin them into categories first |
| Multi-way splits can be more natural | Bias: prefers features with many categories (like "Date" or "ID number"), even if they aren't actually useful |
| No preprocessing: works great with categorical data (Yes/No, Red/Blue) | Overfitting: no built-in pruning; tends to build very deep, complex trees that "memorize" noise |
| Fast: greedy approach, making the best choice at each moment | |

🚨 Caution: Never use raw IDs, zip codes, or timestamps as features with ID3 — they will always be selected first and create useless, overfitting branches.
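The caution can be demonstrated numerically: a unique-ID column isolates every row into its own pure branch, so it achieves maximal Information Gain despite carrying zero predictive signal. The helpers and data below are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain of splitting `labels` by the branches that `values` induces."""
    n = len(labels)
    branches = {}
    for v, label in zip(values, labels):
        branches.setdefault(v, []).append(label)
    weighted = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(labels) - weighted

labels = ["yes", "no", "yes", "no", "yes", "no"]
ids = [1, 2, 3, 4, 5, 6]                                  # unique ID per row — no real signal
weather = ["sun", "sun", "sun", "rain", "rain", "rain"]   # a plausible real feature

# The ID column "wins" with the maximum possible gain of 1.0,
# even though it cannot generalize to unseen rows.
print("gain(id):     ", information_gain(ids, labels))      # 1.0
print("gain(weather):", information_gain(weather, labels))
```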


★ C4.5

Speciality:

Successor to ID3. More sophisticated, handles "noisy" data better, and much more efficient. It replaces Information Gain with Gain Ratio to penalise high-cardinality features, handles continuous attributes via threshold splitting, and grows the tree as large as possible before pruning it back.

Gain Ratio

Instead of just looking at Information Gain, C4.5 calculates Split Information, which measures how "broad" the split is:

Gain Ratio(A) = Information Gain(A) / Split Information(A)

By putting "Split Information" in the denominator, C4.5 effectively "taxes" features that have too many unique outcomes.
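The "tax" is easy to see in code. Below, an illustrative gain_ratio helper shows a unique-ID feature (raw gain 1.0) losing half its score to its broad split, while an equally informative binary feature keeps the full score.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_information(values):
    """Entropy of the branch sizes themselves — high for broad splits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(values, labels):
    n = len(labels)
    branches = {}
    for v, label in zip(values, labels):
        branches.setdefault(v, []).append(label)
    gain = entropy(labels) - sum(len(b) / n * entropy(b) for b in branches.values())
    return gain / split_information(values)

labels = ["yes", "no", "yes", "no"]
ids = [1, 2, 3, 4]             # 4 branches → split information = 2 bits
binary = ["a", "b", "a", "b"]  # 2 branches that match the labels perfectly

print("gain ratio(id):    ", gain_ratio(ids, labels))     # 0.5 — taxed
print("gain ratio(binary):", gain_ratio(binary, labels))  # 1.0 — untaxed
```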

The Key Upgrades (Why it beat ID3)
A. Handling Continuous Attributes (Numbers)

ID3 could only handle categories (e.g., "Weather: Sunny, Rainy"). C4.5 can handle numbers (e.g., "Temperature: 72.5°F").

How? It sorts the numbers and creates a threshold (e.g., Temp > 70 vs. Temp ≤ 70) based on which split provides the best information gain.
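The threshold search can be sketched in a few lines: sort the values, try the midpoint between each adjacent pair as a candidate "value ≤ t" cut, and keep the best gain. The best_threshold helper and the temperature data are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary cut 'value <= t'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = entropy(labels)
    best = (None, -1.0)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint candidate
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = parent - (len(left) / n * entropy(left)
                         + len(right) / n * entropy(right))
        if gain > best[1]:
            best = (t, gain)
    return best

temps = [64, 68, 70, 72.5, 75, 81]
play  = ["yes", "yes", "yes", "no", "no", "no"]
print(best_threshold(temps, play))  # (71.25, 1.0): split at Temp <= 71.25
```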

B. Handling Missing Values

In real datasets, some rows are always missing data. ID3 would crash or ignore them. C4.5 uses a probabilistic approach, assigning a weight to those records based on the distribution of the known values.
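A toy sketch of the weighting idea (not C4.5's exact bookkeeping): a record whose split feature is missing is sent down every branch with a fractional weight equal to that branch's share of the observed values.

```python
from collections import Counter

# Observed (non-missing) values of the split feature — illustrative data.
known = ["sunny", "sunny", "rain", "overcast", "rain", "sunny"]
counts = Counter(known)
total = sum(counts.values())

# A record missing "outlook" contributes fractionally to each branch,
# in proportion to how often each branch value was actually seen.
fractions = {branch: c / total for branch, c in counts.items()}
print(fractions)  # e.g. half the record's weight goes down the "sunny" branch
```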

C. Pruning (The "Anti-Overfitting" Tool)

Decision trees often grow too deep, "memorizing" noise in the training data. C4.5 uses Post-Pruning: it builds the full tree and then looks at branches that don't actually help with prediction accuracy and "snips" them off.

D. Gain Ratio (Solving the "Bias" Problem)

ID3 had a massive flaw: it loved features with many unique values (like a "Date" or "Transaction ID"). These have high "Information Gain" but zero predictive power.

How? C4.5 uses the Gain Ratio, which penalizes features that have too many branches, ensuring the tree picks meaningful variables instead.

Choose when: You need multi-way splits but want to avoid the high-cardinality bias of ID3.

Advantages and Disadvantages
| ✅ Pros | ⚠️ Cons |
|---|---|
| Fixes ID3's cardinality bias | Slower than CART |
| Handles continuous features and missing data | Not natively in sklearn |
| Supports post-pruning | Superseded by the faster C5.0 |

Today (Apr 2026), while we often use even more advanced versions like C5.0 (which is faster and uses less memory) or Random Forests, C4.5 remains the benchmark algorithm for "Explainable AI." Because the tree is pruned and uses Gain Ratio, the resulting logic is usually very clean and easy for a human to read.

Code example
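C4.5 itself is not implemented in scikit-learn (Weka's J48 is the classic implementation). The closest built-in stand-in is CART with the entropy criterion plus cost-complexity pruning — binary splits only, and no Gain Ratio, so treat this as an approximation rather than real C4.5.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion="entropy" mimics C4.5's entropy-based scoring;
# ccp_alpha stands in for C4.5's post-pruning (the value 0.02 is illustrative).
clf = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.02, random_state=0)
clf.fit(X, y)

# The pruned tree prints as readable if/else rules — the "Explainable AI" appeal.
print(export_text(
    clf, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]))
```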

★ CHAID — Chi-square Automatic Interaction Detection

Speciality:

CHAID calculates the Pearson chi-square statistic for every potential split.

Differentiating Factors
  1. CHAID is more conservative: it uses Significance Testing (p-values). If a split isn't statistically significant (usually p>0.05), CHAID simply stops growing that branch. This prevents the tree from seeing "patterns" that are actually just random noise.
  2. Naturally produces multi-way splits. Very interpretable in business settings.
  3. Category Merging: Before splitting, CHAID looks at the categories of a feature. If two categories (like "Gold" and "Platinum" credit card holders) show nearly identical behavior, CHAID will merge them into a single group before making the split. This keeps the tree simple and "clean."
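CHAID's core test is an ordinary chi-square test on a contingency table of feature category vs. outcome, which scipy provides directly; CHAID itself is not in sklearn. The counts below are invented to mirror the "Gold"/"Platinum" merging example.

```python
from scipy.stats import chi2_contingency

#                 responded  did not respond
table = [[30, 70],    # "Gold" card holders
         [32, 68],    # "Platinum" — behaves almost identically to Gold
         [70, 30]]    # "Basic" — clearly different

# Test the full split: a tiny p-value means the split is significant.
chi2, p, dof, _ = chi2_contingency(table)
print(f"full split: chi2={chi2:.1f}, p={p:.4f}")

# CHAID would also test Gold vs Platinum alone; a large p-value here
# means the two categories get merged into one group before splitting.
_, p_gp, _, _ = chi2_contingency(table[:2])
print(f"Gold vs Platinum: p={p_gp:.2f}  (not significant → merge)")
```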
Choose when:

You need statistically validated splits — for example, in survey analysis or marketing segmentation.

Advantages and Disadvantages
| ✅ Pros | ⚠️ Cons |
|---|---|
| Statistically grounded splits | Requires sufficient sample size per split |
| Multi-way splits → shallower, more readable trees | Not in sklearn |
| Great for nominal / ordinal data | Can miss complex interactions |

II. Summary

Overview

| Algorithm | Split Type | Criterion | Pruning | Data Types | Used for | Outliers |
|---|---|---|---|---|---|---|
| CART (✅ Default) | Binary only | Gini or MSE | Cost-Complexity | Numerical + Categorical | Classification, Regression | Very Robust |
| ID3 | Multi-way | Information Gain (Entropy) | None built-in | Categorical only | Classification | Sensitive |
| C4.5 | Multi-way | Gain Ratio | Post-Pruning | Both | Classification, Regression | Sensitive |
| CHAID | Multi-way | Chi-Square / F-test | Pre-Pruning (Early Stop) | Both | Classification, Regression | Sensitive |