Decision Tree - Fitting it Right
I. Overfitting, Bias–Variance Trade-Off & Pruning
★ The Core Problem
An unconstrained decision tree grows until every leaf is pure — perfectly fitting the training data but failing on unseen data.
Unconstrained tree: Train acc = 100%, Test acc = 65% ← Overfit
max_depth=4 tree: Train acc = 85%, Test acc = 83% ← Generalises ✅
max_depth=2 tree: Train acc = 72%, Test acc = 71% ← Underfit
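The gap above can be reproduced in a few lines (a sketch on a synthetic dataset; the exact accuracies will differ from the illustrative numbers quoted above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (flip_y adds label noise, making overfitting visible)
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

for depth in (None, 4, 2):  # unconstrained, "just right", too shallow
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
```

The unconstrained tree reaches 100% train accuracy (every leaf pure) while its test accuracy lags well behind.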
★ Bias–Variance Trade-Off
| Condition | Tree Depth | Bias | Variance | Result |
|---|---|---|---|---|
| Too shallow | Small | High ↑ | Low ↓ | Underfitting |
| Just right | Tuned | Balanced | Balanced | ✅ Good generalisation |
| Too deep | Large | Low ↓ | High ↑ | Overfitting |
Approach 1 — Pre-Pruning (Hyperparameter Constraints)
Stop the tree from growing too large during training.
| Parameter | Type | Effect | Typical Range |
|---|---|---|---|
| max_depth | int | Hard cap on tree depth | 3–10 |
| min_samples_split | int or float | Min samples needed to attempt a split | 2–20 (int) / 0.01–0.1 (float) |
| min_samples_leaf | int or float | Min samples required in any leaf | 1–20 (int) / 0.01–0.05 (float) |
| min_impurity_decrease | float | Split only if IG ≥ this threshold | 0.0–0.05 |
| max_leaf_nodes | int | Cap on total leaves; grows best-first by IG | 10–100 |
| max_features | int / float / "sqrt" / "log2" | Features evaluated per split; adds randomness | "sqrt" for RF-like behaviour |
max_leaf_nodes — grows the tree best-first (highest IG first) rather than depth-first, then stops when the leaf count is reached. Often produces a better shape than max_depth alone because it always expands the most informative branch next.
max_features — at each split, only a random subset of features is evaluated. Reduces correlation between splits, acts as mild regularisation, and is the key mechanism that makes Random Forest effective.
Approach 2 — Post-Pruning with ccp_alpha
Train a full tree first, then prune back branches whose cost-complexity score falls below a threshold.
Cost-Complexity Criterion:
R_α(T) = R(T) + α · |T̃|
where R(T) is the total impurity of the tree's leaves, |T̃| is the number of leaves, and α (= ccp_alpha) is the complexity penalty: larger α favours smaller trees.
| ccp_alpha value | Effect |
|---|---|
| 0.0 (default) | No pruning — full tree (overfits) |
| Small (0.001–0.01) | Light pruning |
| Large (> 0.05) | Heavy pruning (may underfit) |
Always select ccp_alpha via cross-validation using the pruning path — never guess.
Approach 3 — Hyperparameter Tuning
Cross-validate over multiple hyperparameters simultaneously to find the optimal combination.
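A minimal sketch using scikit-learn's GridSearchCV (the grid values below are illustrative placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 5, 10],
    "ccp_alpha": [0.0, 0.001, 0.01],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_)
```

GridSearchCV evaluates every combination; for larger grids, RandomizedSearchCV samples a fixed number of combinations instead.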
II. Sklearn: Parameters Deep-Dive
1. ccp_alpha — Cost-Complexity Pruning
tree = DecisionTreeClassifier(random_state=42)
path = tree.cost_complexity_pruning_path(X_train, y_train)
# path.ccp_alphas → array of alpha values, from full tree up to root-only tree
# path.impurities → corresponding total leaf impurity at each alpha
Train one tree per alpha value, plot accuracy vs alpha, and pick the alpha that maximises cross-validation accuracy.
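Putting those steps together (a self-contained sketch; the synthetic dataset stands in for your own X_train, y_train):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_classification(n_samples=400, random_state=42)

# Pruning path: candidate alphas from the fully grown tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

# One tree per candidate alpha, scored by 5-fold cross-validation
cv_scores = [
    cross_val_score(
        DecisionTreeClassifier(random_state=42, ccp_alpha=a), X_train, y_train, cv=5
    ).mean()
    for a in path.ccp_alphas
]

best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
print(best_alpha)
```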
2. class_weight — Handling Class Imbalance
Adjusts the impurity calculation so minority classes are treated as more important during splitting.
- Setting class_weight tells the impurity math: "misclassifying a minority sample hurts more than misclassifying a majority sample."
- The tree then favours splits that isolate the minority class, even if overall accuracy drops slightly.
# Auto-balance by inverse class frequency
DecisionTreeClassifier(class_weight='balanced')
# Manual weights (minority class gets 5× the weight)
DecisionTreeClassifier(class_weight={0: 1, 1: 5})
| Situation | Recommendation |
|---|---|
| Balanced classes | None (default) |
| Moderately imbalanced | 'balanced' |
| Severely imbalanced / cost-sensitive fraud/medical | Manual dict, e.g., {0:1, 1:10} |
📌 Using class_weight is equivalent to oversampling the minority class but without duplicating rows.
🚫 If your dataset is extremely imbalanced (e.g., 1 in 10,000), class weights can make the Decision Tree very "jittery" and prone to overfitting on those few minority samples.
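To see what 'balanced' resolves to, sklearn exposes the same computation directly: weights are n_samples / (n_classes × class count), i.e. inverse class frequency. A quick check:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 90/10 imbalance: class 1 is the minority
y = np.array([0] * 90 + [1] * 10)

weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)  # class 1 gets 9× the weight of class 0
```

This is exactly what class_weight='balanced' feeds into the impurity calculation.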
3. max_features — Feature Subsampling per Split
| Value | Features considered at each split | Effect |
|---|---|---|
| None (default) | All | Deterministic, always finds the global best split |
| "sqrt" | √n_features | Randomizes splits — Random Forest default |
| "log2" | log2(n_features) | More aggressive randomization |
| int | Exact number | Fine-grained control |
| float | Fraction, e.g. 0.5 = 50% | Proportional subsampling |
Use "sqrt" when using a single DT as a weak learner, or to add regularisation without building a full ensemble.
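The fitted tree exposes the resolved per-split budget via its max_features_ attribute (a small check on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=16, random_state=42)

# With 16 features, "sqrt" evaluates only 4 random candidates per split
tree = DecisionTreeClassifier(max_features="sqrt", random_state=42).fit(X, y)
print(tree.max_features_)  # → 4
```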
4. max_leaf_nodes — Best-First Growth Cap
- Without max_leaf_nodes → depth-first growth, which may waste splits on unimportant branches
- With max_leaf_nodes → always expands the most informative branch next
- Often gives a better bias–variance balance than max_depth alone, especially on imbalanced or multi-class data
Growth order: at each step the leaf with the highest potential IG is expanded — not simply left-to-right by depth. This gives a more efficient use of the node budget.
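A quick sketch of the cap in action (synthetic data; the budget of 8 leaves is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Best-first growth: each split is spent on the leaf with the highest
# potential impurity reduction, stopping once the leaf budget is reached
tree = DecisionTreeClassifier(max_leaf_nodes=8, random_state=42).fit(X, y)
print(tree.get_n_leaves())
```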