Decision Tree - Fitting it Right
I. Overfitting, Bias–Variance Trade-Off & Pruning
★ The Core Problem
An unconstrained decision tree grows until every leaf is pure — perfectly fitting the training data but failing on unseen data.
Unconstrained tree: Train acc = 100%, Test acc = 65% ← Overfit
max_depth=4 tree: Train acc = 85%, Test acc = 83% ← Generalises ✅
max_depth=2 tree: Train acc = 72%, Test acc = 71% ← Underfit
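The gap above can be reproduced in a few lines (a sketch on a synthetic dataset; the exact accuracies will differ from the illustrative numbers quoted above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (flip_y adds label noise, making overfitting visible)
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

for depth in (None, 4, 2):  # unconstrained, "just right", too shallow
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
```

The unconstrained tree reaches 100% train accuracy (every leaf pure) while its test accuracy lags well behind.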
★ Bias–Variance Trade-Off
| Condition | Tree Depth | Bias | Variance | Result |
|---|---|---|---|---|
| Too shallow | Small | High ↑ | Low ↓ | Underfitting |
| Just right | Tuned | Balanced | Balanced | ✅ Good generalisation |
| Too deep | Large | Low ↓ | High ↑ | Overfitting |
Approach 1 — Pre-Pruning (Hyperparameter Constraints)
Stop the tree from growing too large during training.
| Parameter | Type | Effect | Typical Range |
|---|---|---|---|
| max_depth | int | Hard cap on tree depth | 3–10 |
| min_samples_split | int or float | Min samples needed to attempt a split | 2–20 (int) / 0.01–0.1 (float) |
| min_samples_leaf | int or float | Min samples required in any leaf | 1–20 (int) / 0.01–0.05 (float) |
| min_impurity_decrease | float | Split only if IG ≥ this threshold | 0.0–0.05 |
| max_leaf_nodes | int | Cap on total leaves; grows best-first by IG | 10–100 |
| max_features | int / float / "sqrt" / "log2" | Features evaluated per split; adds randomness | "sqrt" for RF-like behaviour |
max_leaf_nodes — grows the tree best-first (highest IG first) rather than depth-first, then stops when the leaf count is reached. Often produces a better shape than max_depth alone because it always expands the most informative branch next.
max_features — at each split, only a random subset of features is evaluated. Reduces correlation between splits, acts as mild regularisation, and is the key mechanism that makes Random Forest effective.
Approach 2 — Post-Pruning with ccp_alpha
Train a full tree first, then prune back branches whose cost-complexity score falls below a threshold.
Cost-Complexity Criterion:
R_α(T) = R(T) + α · |T̃|
where R(T) is the total impurity of the tree's leaves, |T̃| is the number of leaves, and α (= ccp_alpha) is the complexity penalty: larger α favours smaller trees.
| ccp_alpha value | Effect |
|---|---|
| 0.0 (default) | No pruning — full tree (overfits) |
| Small (0.001–0.01) | Light pruning |
| Large (> 0.05) | Heavy pruning (may underfit) |
Always select ccp_alpha via cross-validation using the pruning path — never guess.
Approach 3 — Hyperparameter Tuning
Cross-validate over multiple hyperparameters simultaneously to find the optimal combination.
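A minimal sketch using scikit-learn's GridSearchCV (the grid values below are illustrative placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 5, 10],
    "ccp_alpha": [0.0, 0.001, 0.01],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_)
```

GridSearchCV evaluates every combination; for larger grids, RandomizedSearchCV samples a fixed number of combinations instead.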
II. Sklearn: Parameters Deep-Dive
1. ccp_alpha — Cost-Complexity Pruning
tree = DecisionTreeClassifier(random_state=42)
path = tree.cost_complexity_pruning_path(X_train, y_train)
# path.ccp_alphas → array of alpha values, from full tree up to root-only tree
# path.impurities → corresponding total leaf impurity at each alpha
Train one tree per alpha value, plot accuracy vs alpha, and pick the alpha that maximises cross-validation accuracy.
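Putting those steps together (a self-contained sketch; the synthetic dataset stands in for your own X_train, y_train):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_classification(n_samples=400, random_state=42)

# Pruning path: candidate alphas from the fully grown tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

# One tree per candidate alpha, scored by 5-fold cross-validation
cv_scores = [
    cross_val_score(
        DecisionTreeClassifier(random_state=42, ccp_alpha=a), X_train, y_train, cv=5
    ).mean()
    for a in path.ccp_alphas
]

best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
print(best_alpha)
```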
2. class_weight — Handling Class Imbalance
Adjusts the impurity calculation so minority classes are treated as more important during splitting.
- Setting class_weight tells the impurity math: "misclassifying a minority sample hurts more than misclassifying a majority sample."
- The tree then favours splits that isolate the minority class, even if overall accuracy drops slightly.
# Auto-balance by inverse class frequency
DecisionTreeClassifier(class_weight='balanced')
# Manual weights (minority class gets 5× the weight)
DecisionTreeClassifier(class_weight={0: 1, 1: 5})
| Situation | Recommendation |
|---|---|
| Balanced classes | None (default) |
| Moderately imbalanced | 'balanced' |
| Severely imbalanced / cost-sensitive fraud/medical | Manual dict, e.g., {0:1, 1:10} |
📌 Using class_weight is equivalent to oversampling the minority class but without duplicating rows.
🚫 If your dataset is extremely imbalanced (e.g., 1 in 10,000), class weights can make the Decision Tree very "jittery" and prone to overfitting on those few minority samples.
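To see what 'balanced' resolves to, sklearn exposes the same computation directly: weights are n_samples / (n_classes × class count), i.e. inverse class frequency. A quick check:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 90/10 imbalance: class 1 is the minority
y = np.array([0] * 90 + [1] * 10)

weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)  # class 1 gets 9× the weight of class 0
```

This is exactly what class_weight='balanced' feeds into the impurity calculation.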
3. max_features — Feature Subsampling per Split
| Value | Features considered at each split | Effect |
|---|---|---|
| None (default) | All | Deterministic, always finds the global best split |
| "sqrt" | √n_features | Randomizes splits — Random Forest default |
| "log2" | log2(n_features) | More aggressive randomization |
| int | Exact number | Fine-grained control |
| float | Fraction, e.g. 0.5 = 50% | Proportional subsampling |
Use "sqrt" when using a single DT as a weak learner, or to add regularisation without building a full ensemble.
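The fitted tree exposes the resolved per-split budget via its max_features_ attribute (a small check on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=16, random_state=42)

# With 16 features, "sqrt" evaluates only 4 random candidates per split
tree = DecisionTreeClassifier(max_features="sqrt", random_state=42).fit(X, y)
print(tree.max_features_)  # → 4
```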
4. max_leaf_nodes — Best-First Growth Cap
- Without max_leaf_nodes → depth-first growth, which may waste splits on unimportant branches
- With max_leaf_nodes → always expands the most informative branch next
- Often gives a better bias–variance balance than max_depth alone, especially on imbalanced or multi-class data
Growth order: at each step the leaf with the highest potential IG is expanded — not simply left-to-right by depth. This gives a more efficient use of the node budget.
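A quick sketch of the cap in action (synthetic data; the budget of 8 leaves is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Best-first growth: each split is spent on the leaf with the highest
# potential impurity reduction, stopping once the leaf budget is reached
tree = DecisionTreeClassifier(max_leaf_nodes=8, random_state=42).fit(X, y)
print(tree.get_n_leaves())
```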