Decision Tree Classifier

A Decision Tree is a supervised machine learning algorithm that builds a flowchart-like tree structure to make predictions by recursively partitioning data based on the most informative feature at each step.

Decision trees create a model that predicts the label by evaluating a tree of if-then-else true/false feature questions, and estimating the minimum number of questions needed to assess the probability of making a correct decision.

Decision trees can be used for classification to predict a category, or regression to predict a continuous numeric value. In the simple example below, a decision tree is used to estimate a house price (the label) based on the size and number of bedrooms (the features).

ML_AI/images/dtree-1.png

Img Src. nvdia.com

I. Components

Component Role
Root Node Topmost node; represents the full dataset; holds the single best split
Internal (Decision) Node Tests one feature (e.g., Age > 30?); routes samples left or right
Leaf (Terminal) Node No further split; holds the predicted class label (majority class)
Branch / Edge The outcome of a test (True / False, or a threshold range)

II. Algorithm Steps

The algorithm is greedy and recursive: it makes the locally best split at every node without backtracking.

Step 1 – Start at the root with the full dataset.

Step 2 – For every feature, evaluate every possible split threshold and compute the impurity reduction (Information Gain):

Information Gain=Impurity(Parent)knknImpurity(Childk)

Step 3 – Choose the feature + threshold that gives the highest Information Gain.

Step 4 – Split the data and create two child nodes (CART) or multiple children (ID3 / C4.5).

Step 5 – Recurse on each child node until a stopping criterion is met:

Step 6 – Assign the majority class of remaining samples to each leaf.

★ Algorithms used
Algorithm Split Type Criterion Pruning Data Types Used for Outliers
CART
(✅ Default)
Binary only Gini or MSE Cost-Complexity Numerical + Categorical Classification, Regression Very Robust
ID3 Multi-way Information Gain (Entropy) Post-Pruning Categorical only Classification Sensitive
C4.5 Multi-way Gain Ratio Post-Pruning Both Classification, Regression Sensitive
CHAID Multi-way Chi-Square / F-test Pre-Pruning (Early Stop) Both Sensitive

III. Split Criteria (Metrics)

Set via the criterion= parameter in DecisionTreeClassifier.

i. Gini Impurity — criterion="gini" (default)

Gini=1i=1Cpi2

Use when: General-purpose classification; default choice; speed matters.

ii. Entropy — criterion="entropy"

Entropy=i=1Cpilog2(pi)

Use when: You want information-theoretic splits; studying feature selection; replicating ID3/C4.5 behaviour.

📌 In practice: Gini and Entropy almost always produce near-identical trees and accuracy. Start with Gini; switch to Entropy only if cross-validation shows a meaningful difference.

iii. Log Loss — criterion="log_loss" (sklearn ≥ 1.1)

Log Loss=i=1Cyilog(pi)

Use when: You need well-calibrated probabilities for downstream use — risk scoring, medical diagnosis, cost-sensitive decisions where predict_proba output matters more than raw accuracy.

IV. Binary vs. Multi-way Splits

i. Binary Split (CART — sklearn default)

Every node asks exactly one yes/no question. Categorical features are tested as feature ∈ {subset} vs feature ∉ {subset}.

Outlook == Overcast?
   ├── Yes → [Pure: Play=Yes]  ✓ Stop
   └── No  → [Mixed: Sunny + Rainy] → split further

Sklearn: ✅ Supported

ii. Multi-way Split (ID3 / C4.5 / CHAID)

Creates one branch per unique value of the chosen feature simultaneously.

Outlook?
   ├── Sunny    → [Mixed] → split further
   ├── Overcast → [Pure: Play=Yes] ✓ Stop
   └── Rainy    → [Mixed] → split further

Sklearn: ❌ Not supported

V. Pros and Cons

✅ Pros ⚠️ Cons
Highly interpretable — visualise and explain to non-experts Unstable — small data changes can produce a completely different tree
No feature scaling required Prone to overfitting — unbounded trees memorise noise
Handles numerical and categorical data Greedy — local optimal splits do not guarantee the globally best tree
Fast to train and predict Biased toward high-cardinality features (especially with entropy/ID3)
Captures non-linear relationships naturally Axis-aligned decision boundaries — struggles with diagonal patterns
Natural multi-class support Large trees become hard to interpret
Provides feature importances High variance — single DTs are weak learners, hence used in ensembles (RF, XGBoost)

VI. Code Examples

  1. Train, Evaluate & Visualise: Open in ColabOpen in Colab
  2. Bias–Variance Plot: Sweeping max_depth: Open in ColabOpen in Colab
  3. Post-Pruning: ccp_alpha Path: Open in ColabOpen in Colab
  4. GridSearchCV Hyperparameter Tuning: Open in ColabOpen in Colab
  5. Feature Importance Plot: Open in ColabOpen in Colab

Key attributes available after fitting:

clf.tree_.max_depth        # actual depth reached
clf.get_depth()            # same via public API
clf.get_n_leaves()         # number of leaf nodes
clf.feature_importances_   # Gini-based importance per feature
clf.classes_               # class labels in order

VII. Overfitting, Bias–Variance Trade-Off & pruning

VIII. Questions and Answers

1. How to pick the values for class_weight?

Most libraries (like Scikit-Learn) have a built-in setting called class_weight='balanced'.

The Formula:

It automatically calculates weights inversely proportional to class frequencies:

Wj=nsamplesnclasses×nsamples,j
Option B: Manual "Cost-Based" Weighting

Sometimes "Balanced" isn't enough. In finance, the cost of a False Negative (missing a default) is much higher than a False Positive (denying a good loan).

You can pass a dictionary: class_weight={0: 1, 1: 20}.

2. How manually setting class_weight changes the Math (Gini/Entropy)?

When you add weights, the "Purity" calculation changes.

Instead of:

P(i)=Count of Class iTotal Count

It becomes:

P(i)=Weights of Class iAll Weights

XI. External References