AdaBoost (Adaptive Boosting)

AdaBoost (Adaptive Boosting) is an ensemble machine learning algorithm that sequentially combines multiple weak learners, typically decision stumps (trees with only one split and two terminal leaves), into a single strong classifier.

Every time a stump is built, the algorithm's adaptive re-weighting mechanism looks at which data points it misclassified, increases their weight (importance), and lowers the importance of the correctly classified points.

👉 Random Forest vs AdaBoost

In a Random Forest, each tree is grown to full size. Some trees might be bigger than others, but there is no predetermined maximum depth.
Each decision tree in a Random Forest can use multiple variables.
In a Random Forest, each tree has an equal vote in the final classification.
Each tree in a Random Forest is built independently of the others.
In contrast, in AdaBoost the trees are usually just a node and two leaves (a.k.a. a "stump").
A stump can use only one variable to make a decision, and thus each stump is a "weak learner".
In a Forest of Stumps made with AdaBoost, some stumps get more say in the final classification than others.
In a Forest of Stumps made with AdaBoost, order is important: the errors that the first stump makes influence how the second stump is built, and so on.

Reference: AdaBoost vs Random Forest

Stumps

Stumps can only use one variable for making decision, and thus each stump is a "weak learner"

Three Core Principles of AdaBoost

Combines Weak Learners: AdaBoost merges many "weak learners" (models slightly better than random guessing) to make classifications. The weak learners are almost always decision stumps (trees with only one split).
Weighted Voting: Some stumps get more say in the final classification than others, based on their accuracy. Better-performing stumps have higher influence.
Sequential Learning: Each stump is built by taking the previous stump's mistakes into account. Misclassified samples receive higher weights, forcing the next stump to focus on these harder cases.

How AdaBoost Works: Step-by-Step

The Core Mechanism

AdaBoost is like a persistent teacher who keeps creating customized quizzes for students who struggle, while letting students who already understand practice on their own. The algorithm focuses its attention where it's needed most.

The Algorithm Process

Step 1: Initialize Sample Weights

Begin with all training samples having equal importance (weight)
Each sample weight: $w_{i} = \frac{1}{N}$ where $N$ is the total number of samples
Example: For 8 samples, each weight = $\frac{1}{8} = 0.125$
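
As a quick illustration of Step 1, the uniform initialization can be written in a couple of lines. The variable names are chosen for this sketch only.

```python
# Toy setup matching the example above: 8 training samples.
n_samples = 8

# Every sample starts with the same weight, 1/N.
weights = [1.0 / n_samples] * n_samples

print(weights[0])    # 0.125
print(sum(weights))  # 1.0
```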

Step 2: Build a Weak Learner (Decision Stump)

Train a weak learner, typically a decision stump (one split decision tree)
Evaluate each feature to find the best split using a criterion like Gini index or entropy
Select the feature and threshold that give the lowest weighted error
Even this simple model will get some predictions right and some wrong
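
The stump search in Step 2 can be sketched as a brute-force loop over features and thresholds. This is a minimal illustration, not a reference implementation: it scores splits by weighted misclassification error directly instead of Gini or entropy, it assumes labels are encoded as -1 and +1, and `fit_stump` and the toy data are hypothetical names introduced here.

```python
def fit_stump(X, y, weights):
    """Return (feature, threshold, polarity, error) with the lowest weighted error.

    X: list of feature rows, y: labels in {-1, +1}, weights: one weight per row.
    Assumption of this sketch: splits are scored by weighted error, not Gini.
    """
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            for polarity in (1, -1):
                # polarity = 1 predicts +1 above the threshold, -1 below it.
                error = 0.0
                for row, label, w in zip(X, y, weights):
                    pred = polarity if row[j] > threshold else -polarity
                    if pred != label:
                        error += w
                if best is None or error < best[3]:
                    best = (j, threshold, polarity, error)
    return best

# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
y = [-1, -1, 1, 1]
weights = [0.25] * 4

stump = fit_stump(X, y, weights)
print(stump)  # (0, 2.5, 1, 0.0)
```

Scanning midpoints between sorted distinct values keeps the candidate set small while still covering every distinct way a single threshold can partition the data.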

Step 3: Calculate Total Error

Measure the weighted error rate of this classifier
Error is the sum of weights of misclassified samples: $ϵ = \frac{\sum_{i \in misclassified} w_{i}}{\sum_{i = 1}^{N} w_{i}}$
This weighted error tells us how much to trust this classifier.
For equal weights, this simplifies to the proportion of misclassified samples.
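
A small sketch of the weighted error from Step 3, using made-up predictions for 8 samples. The helper name `weighted_error` and the example weights are assumptions for illustration.

```python
def weighted_error(y_true, y_pred, weights):
    """Sum of the weights on misclassified samples divided by the total weight."""
    wrong = sum(w for t, p, w in zip(y_true, y_pred, weights) if t != p)
    return wrong / sum(weights)

y_true = [1, 1, -1, -1, 1, -1, 1, -1]
y_pred = [1, -1, -1, -1, 1, -1, 1, 1]   # the 2nd and 8th samples are wrong

# With equal weights the error is just the fraction misclassified: 2 of 8.
equal = [1 / 8] * 8
error = weighted_error(y_true, y_pred, equal)
print(error)  # 0.25

# With unequal weights the same two mistakes can cost far more.
skewed = [0.05, 0.40, 0.05, 0.05, 0.05, 0.05, 0.05, 0.30]
print(round(weighted_error(y_true, y_pred, skewed), 2))  # 0.7
```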

Step 4: Calculate Classifier's "Amount of Say" (Alpha)

Determine how much influence this classifier should have in the final prediction
The formula is: $$\alpha = \frac{1}{2} \ln\left(\frac{1-\epsilon}{\epsilon}\right)$$
Understanding Alpha ( $α$ ):
If $ϵ \to 0$ (perfect classifier): $α$ becomes large and positive → strong influence
If $ϵ = 0.5$ (random guessing): $α = 0$ → no influence
If $ϵ \to 1$ (always wrong): $α$ becomes large and negative → vote gets flipped

Key Insight: The relationship is logarithmic in the odds $\frac{1 - ϵ}{ϵ}$, so influence grows sharply as the error approaches 0. A classifier with 10% error gets $α \approx 1.10$, more than five times the $α \approx 0.20$ of one with 40% error.
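
To make the behaviour of $α$ concrete, here is a short sketch that evaluates the formula at a few error rates. The function name `amount_of_say` and the small clipping constant `eps` (which keeps the logarithm finite when the error is exactly 0 or 1) are assumptions of this example, not part of the formula above.

```python
import math

def amount_of_say(error, eps=1e-10):
    """alpha = 0.5 * ln((1 - error) / error), clipped so the log stays finite."""
    error = min(max(error, eps), 1 - eps)
    return 0.5 * math.log((1 - error) / error)

for err in (0.1, 0.4, 0.5, 0.9):
    print(err, round(amount_of_say(err), 3))
# 0.1 1.099
# 0.4 0.203
# 0.5 0.0
# 0.9 -1.099
```

The values for 10% and 90% error are mirror images: a stump that is wrong 90% of the time is as informative as one that is right 90% of the time, once its vote is flipped.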

Step 5: Update Sample Weights (The Adaptive Part)

This is where the "adaptive" in AdaBoost happens:

For misclassified samples:

$$w_{i}^{new} = w_{i}^{old} \times e^{α}$$

For correctly classified samples:

$$w_{i}^{new} = w_{i}^{old} \times e^{- α}$$

Then normalize all weights so they sum to 1:

$$w_{i}^{normalized} = \frac{w_{i}^{new}}{\sum_{j = 1}^{N} w_{j}^{new}}$$

What This Does:

Misclassified samples get exponentially higher weights (become more important)
Correctly classified samples get lower weights (become less important)
Next classifier will focus more on the currently misclassified samples
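
A worked sketch of the update and normalization for 8 equally weighted samples where only the fourth one is misclassified. `update_weights` is a name invented for this example.

```python
import math

def update_weights(weights, misclassified, alpha):
    """Scale each weight by e^alpha (wrong) or e^-alpha (right), then normalize."""
    scaled = [w * math.exp(alpha if wrong else -alpha)
              for w, wrong in zip(weights, misclassified)]
    total = sum(scaled)
    return [w / total for w in scaled]

weights = [1 / 8] * 8
misclassified = [False, False, False, True, False, False, False, False]

error = sum(w for w, wrong in zip(weights, misclassified) if wrong)  # 0.125
alpha = 0.5 * math.log((1 - error) / error)

new_weights = update_weights(weights, misclassified, alpha)
print([round(w, 4) for w in new_weights])
```

The single misclassified sample ends up with a weight of about 0.5 and the other seven with about 0.071 each. With this choice of $α$, the misclassified samples always hold half of the total weight after normalization, which is why the stump that was just built would look like a coin flip on the re-weighted data.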

Step 6: Create New Training Dataset

Use updated sample weights as a probability distribution
Resample the dataset with replacement based on these weights
Samples with higher weights appear more frequently in the new dataset
The new dataset has the same size but emphasizes harder-to-classify examples
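
The resampling in Step 6 can be sketched with the standard library's `random.choices`, which draws with replacement in proportion to the given weights. The `resample` helper, the `seed` parameter, and the toy data are assumptions for illustration.

```python
import random

def resample(X, y, weights, seed=None):
    """Draw len(X) rows with replacement, with probability proportional to weight."""
    rng = random.Random(seed)
    picked = rng.choices(range(len(X)), weights=weights, k=len(X))
    return [X[i] for i in picked], [y[i] for i in picked], picked

# Hypothetical data: row 3 was misclassified and now carries half the weight.
X = [[float(i)] for i in range(8)]
y = [1, 1, -1, -1, 1, -1, 1, -1]
weights = [1 / 14, 1 / 14, 1 / 14, 0.5, 1 / 14, 1 / 14, 1 / 14, 1 / 14]

new_X, new_y, picked = resample(X, y, weights, seed=0)
print(picked)  # row 3 should tend to appear several times
```

The returned index list is only there to make the duplication visible. A real pipeline would keep just the resampled rows and labels.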

Step 7: Repeat

Train the next weak learner on the resampled (or re-weighted) dataset
Continue for a fixed number of iterations, or stop early once the weighted error reaches 0.5 (no better than random guessing)

Step 8: Make Final Predictions

Each weak learner votes for a class
Votes are weighted by their $α$ values (amount of say)
For a sample $x$, the final prediction is: $H (x) = sign (\sum_{t = 1}^{T} α_{t} h_{t} (x))$ where $h_{t} (x) \in \{-1, +1\}$ is the prediction of the $t$-th weak learner
The class with the highest weighted vote wins
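
A minimal sketch of the weighted vote. The three stumps and their $α$ values are made up for illustration, and breaking a tie at exactly 0 in favour of +1 is an arbitrary choice of this example.

```python
def predict(x, stumps, alphas):
    """Sign of the alpha-weighted sum of stump votes; each stump returns -1 or +1."""
    score = sum(alpha * stump(x) for stump, alpha in zip(stumps, alphas))
    return 1 if score >= 0 else -1

# Three hypothetical stumps on a two-feature input, with made-up alphas.
stumps = [
    lambda x: 1 if x[0] > 2.5 else -1,
    lambda x: 1 if x[1] > 1.0 else -1,
    lambda x: 1 if x[0] > 4.0 else -1,
]
alphas = [1.10, 0.20, 0.42]

# Votes are +1, -1, -1, so the score is 1.10 - 0.20 - 0.42 = 0.48.
print(predict([3.0, 0.5], stumps, alphas))  # 1
```

Two of the three stumps vote for -1 here, yet the prediction is +1 because the first stump has more say than the other two combined.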

Visual Example

👉 Excellent visual example ➛ StatQuest with Josh Starmer - AdaBoost

AdaBoost Characteristics

Strengths

Elegantly Simple: Easy to understand and implement
Minimal Tuning: Only need to choose number of iterations (rounds)
Strong Theory: Training error is proven to fall exponentially with the number of rounds, provided every weak learner does better than random guessing
Fast Training: Each weak learner is very simple (often just one split)
Works Out-of-Box: Good results without extensive hyperparameter tuning
Handles Non-linearity: Can capture complex decision boundaries with simple stumps
No Need for Feature Scaling: Tree-based weak learners are scale-invariant

Weaknesses

Outlier Sensitive: Noise and outliers get increasing weight, causing overfitting
Binary Classification Focus: Designed for two-class problems; extensions to multi-class are less elegant
Limited to Classification: The core algorithm targets classification; regression requires a modified variant such as AdaBoost.R2
Surpassed by Modern Methods: Gradient boosting libraries such as XGBoost and LightGBM often achieve better accuracy
Can Overfit: Running too many iterations without stopping can harm generalization
Sequential Training: Each learner depends on the previous one, so training cannot be parallelized across learners the way a Random Forest can
Sensitive to Label Noise: Mislabeled data points can severely hurt performance

Python Implementation - Demo
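
What follows is a compact from-scratch sketch of the eight steps, written for illustration and not as a reference implementation. It assumes labels in {-1, +1}, uses the weighted-error variant (the weights enter the stump's error directly, with no resampling), and scores splits by weighted misclassification error instead of Gini or entropy. All function names and the six-point toy dataset are invented for this example.

```python
import math

def stump_predict(stump, row):
    """A stump is (feature index, threshold, polarity); it returns -1 or +1."""
    feature, threshold, polarity = stump
    return polarity if row[feature] > threshold else -polarity

def fit_stump(X, y, w):
    """Search every feature, threshold and polarity for the lowest weighted error."""
    best_stump, best_err = None, float("inf")
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for low, high in zip(values, values[1:]):
            for polarity in (1, -1):
                stump = (j, (low + high) / 2, polarity)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(stump, row) != yi)
                if err < best_err:
                    best_stump, best_err = stump, err
    return best_stump, best_err

def adaboost_fit(X, y, n_rounds=10, eps=1e-10):
    n = len(X)
    w = [1.0 / n] * n                              # Step 1: equal weights
    ensemble = []                                  # list of (alpha, stump)
    for _ in range(n_rounds):
        stump, err = fit_stump(X, y, w)            # Steps 2 and 3
        if stump is None or err >= 0.5:            # no better than chance
            break
        clipped = max(err, eps)                    # avoid dividing by zero
        alpha = 0.5 * math.log((1 - clipped) / clipped)   # Step 4
        ensemble.append((alpha, stump))
        if err < eps:                              # perfect stump: nothing left to fix
            break
        # Step 5: y * h(x) is +1 when correct and -1 when wrong.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    """Step 8: sign of the alpha-weighted vote."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy data: no single stump separates the classes, but three stumps together do.
X = [[1, 1], [1, 3], [2, 1], [2, 3], [3, 1], [3, 3]]
y = [1, 1, -1, 1, -1, -1]

def training_accuracy(n_rounds):
    model = adaboost_fit(X, y, n_rounds=n_rounds)
    hits = sum(adaboost_predict(model, row) == label for row, label in zip(X, y))
    return hits / len(X)

for rounds in (1, 2, 3):
    print(rounds, training_accuracy(rounds))
```

On this toy set a single stump classifies five of the six points correctly, and after three rounds the weighted vote classifies all six. For real work, a maintained library implementation is the safer choice.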

Key Mathematical Formulas Reference

| Component | Formula | Description |
| --- | --- | --- |
| Initial Weights | $w_{i} = \frac{1}{N}$ | All samples start with equal weight |
| Weighted Error | $ϵ_{t} = \frac{\sum_{i \in misclassified} w_{i}}{\sum_{i = 1}^{N} w_{i}}$ | Proportion of weighted misclassifications |
| Stump Influence (Alpha) | $α_{t} = \frac{1}{2} \ln (\frac{1 - ϵ_{t}}{ϵ_{t}})$ | How much say this stump gets in final vote |
| Update Weights (Correct) | $w_{i}^{new} = w_{i}^{old} \times e^{- α_{t}}$ | Decrease weight for correctly classified |
| Update Weights (Wrong) | $w_{i}^{new} = w_{i}^{old} \times e^{α_{t}}$ | Increase weight for misclassified |
| Normalize Weights | $w_{i}^{normalized} = \frac{w_{i}^{new}}{\sum_{j = 1}^{N} w_{j}^{new}}$ | Ensure weights sum to 1 |
| Final Prediction | $H (x) = sign (\sum_{t = 1}^{T} α_{t} h_{t} (x))$ | Weighted vote of all stumps |

Questions and Answers

1. How the New Dataset is Created 🔄

There are two primary methods that packages use to feed this "new dataset" into the next stump:

Method A: Proportional Resampling (The Roulette Wheel)

The algorithm creates a brand-new dataset of size $N$ by sampling with replacement from the original data. The probability of picking any single row is exactly equal to its updated weight.

Example: Imagine Data Point #5 was misclassified, so its weight ballooned to 0.40 (40%), while Data Point #2 was easy to classify, so its weight shrank to 0.01 (1%).

When drawing the new dataset, Data Point #5 will likely be copied into the new dataset multiple times, while Data Point #2 might be left out entirely. The next stump is forced to focus on Data Point #5 because it now appears everywhere!
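
The arithmetic behind this example can be checked in a few lines. The dataset size of 8 is an assumption (the example does not state it); the two weights are the ones quoted above.

```python
n = 8                        # assumed dataset size, not given in the example
w_hard, w_easy = 0.40, 0.01  # weights of Data Point #5 and Data Point #2

# Each of the n draws picks a given row with probability equal to its weight.
expected_copies_hard = n * w_hard        # about 3.2 copies on average
p_easy_missing = (1 - w_easy) ** n       # about 0.92: usually left out entirely

print(round(expected_copies_hard, 2), round(p_easy_missing, 2))
```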

Method B: Weighted Loss Function

Instead of physically copying rows, the original dataset remains identical, but the formula used to calculate the split's impurity (like Gini or Entropy) multiplies each row's penalty by its weight. Misclassifying a high-weight row penalizes the stump heavily.
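
A minimal sketch of a weight-aware impurity, assuming Gini as the criterion. The helper names are hypothetical, and real libraries implement this inside their tree-building code.

```python
def weighted_gini(labels, weights):
    """Gini impurity where each row counts in proportion to its weight."""
    total = sum(weights)
    if total == 0:
        return 0.0
    impurity = 1.0
    for cls in set(labels):
        share = sum(w for label, w in zip(labels, weights) if label == cls) / total
        impurity -= share * share
    return impurity

def split_impurity(left, right):
    """Weight-averaged impurity of two leaves; each side is (labels, weights)."""
    w_left, w_right = sum(left[1]), sum(right[1])
    return (w_left * weighted_gini(*left)
            + w_right * weighted_gini(*right)) / (w_left + w_right)

labels = [1, 1, -1]
print(round(weighted_gini(labels, [1 / 3, 1 / 3, 1 / 3]), 3))  # 0.444
print(round(weighted_gini(labels, [0.1, 0.1, 0.8]), 3))        # 0.32
```

The same three rows give a different impurity once the third row carries most of the weight, so the split search is pulled toward thresholds that handle that row correctly without any rows being copied.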

2. What will happen if the next stump still gets that higher weight point wrong?

If you use Method A (Resampling) and a specific hard-to-classify data point gets duplicated five times into the new dataset, what will happen if the next stump still gets that point wrong?

Answer
The weights will barely change (or not change at all). This is one of the more interesting mathematical quirks of AdaBoost. Here is why:

High Error Rate: If that duplicated point makes up a large share of the dataset and the new stump still gets it wrong, the stump's overall error rate will be high (close to $50\%$).
Low Stump Power ($α$): A stump with an error rate near $50\%$ is no better than a random guess, so its voting power ($α$) drops close to $0$.
Small Update: Since the weight update formula multiplies the misclassified weights by $e^{α}$, and $e^{0} = 1$, the weights barely change. The algorithm essentially "gives up" on that round because the stump couldn't learn anything useful.
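
The shrinking update is easy to see numerically. This sketch prints the factor $e^{α}$ applied to misclassified weights for a few error rates; the function name is invented for the example.

```python
import math

def misclassified_multiplier(error):
    """Factor e^alpha applied to a misclassified sample's weight before normalizing."""
    alpha = 0.5 * math.log((1 - error) / error)
    return math.exp(alpha)

for err in (0.10, 0.30, 0.45, 0.49, 0.50):
    print(f"error={err:.2f}  multiplier={misclassified_multiplier(err):.3f}")
# error=0.10  multiplier=3.000
# error=0.30  multiplier=1.528
# error=0.45  multiplier=1.106
# error=0.49  multiplier=1.020
# error=0.50  multiplier=1.000
```

As the error approaches 50%, the multiplier approaches 1, so a near-random stump leaves the weights almost untouched.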