Decision Tree - Fitting it Right

I. Overfitting, Bias–Variance Trade-Off & Pruning

★ The Core Problem

An unconstrained decision tree grows until every leaf is pure — perfectly fitting the training data but failing on unseen data.

Unconstrained tree:  Train acc = 100%,  Test acc = 65%  ← Overfit
max_depth=4 tree:    Train acc = 85%,   Test acc = 83%  ← Generalises ✅
max_depth=2 tree:    Train acc = 72%,   Test acc = 71%  ← Underfit
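A quick sketch of this pattern on synthetic data — the dataset (via make_classification) and the exact accuracies are illustrative, not the figures quoted above:

```python
# Sketch: unconstrained vs depth-capped trees on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

for depth in (None, 4, 2):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    # Unconstrained (depth=None) hits 100% train accuracy; capped trees trade
    # a little train accuracy for a smaller train/test gap.
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
```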

★ Bias–Variance Trade-Off

Condition     Tree Depth   Bias       Variance   Result
Too shallow   Small        High ↑     Low ↓      Underfitting
Just right    Tuned        Balanced   Balanced   Good generalisation ✅
Too deep      Large        Low ↓      High ↑     Overfitting

Approach 1 — Pre-Pruning (Hyperparameter Constraints)

Stop the tree from growing too large during training.

Parameter              Type                            Effect                                          Typical Range
max_depth              int                             Hard cap on tree depth                          3–10
min_samples_split      int or float                    Min samples needed to attempt a split           2–20 (int) / 0.01–0.1 (float)
min_samples_leaf       int or float                    Min samples required in any leaf                1–20 (int) / 0.01–0.05 (float)
min_impurity_decrease  float                           Split only if IG ≥ this threshold               0.0–0.05
max_leaf_nodes         int                             Cap on total leaves; grows best-first by IG     10–100
max_features           int / float / "sqrt" / "log2"   Features evaluated per split; adds randomness   "sqrt" for RF-like behaviour

max_leaf_nodes — grows the tree best-first (highest IG first) rather than depth-first, then stops when the leaf count is reached. Often produces a better shape than max_depth alone because it always expands the most informative branch next.

max_features — at each split, only a random subset of features is evaluated. Reduces correlation between splits, acts as mild regularisation, and is the key mechanism that makes Random Forest effective.
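A pre-pruned tree combining several of the constraints above might look like the following — the specific values are illustrative starting points, not tuned settings:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                 # hard cap on depth
    min_samples_split=10,        # need >= 10 samples to attempt a split
    min_samples_leaf=5,          # every leaf keeps >= 5 samples
    min_impurity_decrease=0.01,  # split only if impurity drops by >= 0.01
    max_leaf_nodes=30,           # best-first growth, capped at 30 leaves
    max_features="sqrt",         # random feature subset at each split
    random_state=42,
)
```

After fitting, `tree.get_depth()` and `tree.get_n_leaves()` confirm the caps were respected.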

Approach 2 — Post-Pruning with ccp_alpha

Train a full tree first, then prune away the subtrees whose effective complexity parameter falls below the chosen threshold α — weakest links first.

Cost-Complexity Criterion:

R_α(T) = R(T) + α·|T|

where R(T) is the total misclassification rate of the tree's leaves, |T| is the number of leaves, and α sets the complexity penalty.
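A worked example of the criterion, with made-up numbers for illustration: at a given α, the subtree with the lower penalised cost is preferred.

```python
# Illustrative numbers only: full tree fits better (lower R) but pays
# a larger complexity penalty than the pruned subtree.
alpha = 0.01
r_full   = 0.05 + alpha * 12   # R(T) = 0.05, 12 leaves -> 0.17
r_pruned = 0.09 + alpha * 4    # R(T) = 0.09,  4 leaves -> 0.13
# The pruned subtree has the lower penalised cost, so pruning wins here.
print(r_full, r_pruned)
```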

ccp_alpha value      Effect
0.0                  No pruning — full tree (overfits)
Small (0.001–0.01)   Light pruning
Large (> 0.05)       Heavy pruning (may underfit)

Always select ccp_alpha via cross-validation using the pruning path — never guess.

Approach 3 — Hyperparameter Tuning

Cross-validate over multiple hyperparameters simultaneously to find the optimal combination.
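A minimal sketch with GridSearchCV on synthetic data — the grid values are illustrative, not recommendations:

```python
# Sketch: tuning several hyperparameters jointly with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=42)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={
        "max_depth": [3, 5, 7, None],
        "min_samples_leaf": [1, 5, 10],
        "ccp_alpha": [0.0, 0.001, 0.01],
    },
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```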


II. Sklearn: Parameters Deep-Dive

1. ccp_alpha — Cost-Complexity Pruning

from sklearn.tree import DecisionTreeClassifier

path = DecisionTreeClassifier(random_state=42) \
           .cost_complexity_pruning_path(X_train, y_train)
# path.ccp_alphas → array of alpha values, from full tree (0.0) up to root-only tree
# path.impurities → corresponding total leaf impurity at each alpha

Train one tree per alpha value, plot accuracy vs alpha, and pick the alpha that maximises cross-validation accuracy.
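The selection loop described above can be sketched as follows; the data here is synthetic for illustration, and tiny negative alphas (floating-point artefacts of the path computation) are clipped to zero as a precaution:

```python
# Sketch: one tree per candidate alpha, scored with 5-fold cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_classification(n_samples=400, random_state=42)

path = DecisionTreeClassifier(random_state=42) \
           .cost_complexity_pruning_path(X_train, y_train)

# Guard against tiny negative alphas from floating-point error.
alphas = np.clip(path.ccp_alphas, 0.0, None)

scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=42),
                          X_train, y_train, cv=5).mean()
          for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
print(best_alpha)
```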

2. class_weight — Handling Class Imbalance

Adjusts the impurity calculation so minority classes are treated as more important during splitting.

# Auto-balance by inverse class frequency
DecisionTreeClassifier(class_weight='balanced')

# Manual weights  (minority class gets 5× the weight)
DecisionTreeClassifier(class_weight={0: 1, 1: 5})

Situation                                               Recommendation
Balanced classes                                        None (default)
Moderately imbalanced                                   'balanced'
Severely imbalanced / cost-sensitive (fraud, medical)   Manual dict, e.g., {0: 1, 1: 10}

📌 Using class_weight is equivalent to oversampling the minority class but without duplicating rows.
🚫 If your dataset is extremely imbalanced (e.g., 1 in 10,000), class weights can make the Decision Tree very "jittery" and prone to overfitting on those few minority samples.
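What 'balanced' actually computes is n_samples / (n_classes × bincount(y)); a quick check with sklearn's helper (the 9:1 toy labels are made up for illustration):

```python
# Sketch: inspecting the weights that class_weight='balanced' would assign.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)   # 9:1 imbalance
w = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
# The minority class receives 9x the weight of the majority class.
print(w)
```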

3. max_features — Feature Subsampling per Split

Value            Features considered at each split   Effect
None (default)   All n features                      Deterministic; always finds the global best split
"sqrt"           √n_features                         Randomizes splits — Random Forest default
"log2"           log2(n_features)                    More aggressive randomization
int              Exact number                        Fine-grained control
float            Fraction, e.g. 0.5 = 50%            Proportional subsampling
Use "sqrt" when using a single DT as a weak learner, or to add regularisation without building a full ensemble.

4. max_leaf_nodes — Best-First Growth Cap

Growth order: at each step the leaf with the highest potential IG is expanded — not simply left-to-right by depth. This gives a more efficient use of the node budget.
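A sketch of the budget in action — with a leaf cap, the tree can grow deep along informative branches while the total leaf count stays fixed (synthetic data, illustrative only):

```python
# Sketch: best-first growth under a leaf budget of 8.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0).fit(X, y)
# Leaf count respects the cap; depth is whatever the best-first order produced.
print(tree.get_n_leaves(), tree.get_depth())
```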