Hyperparameter Tuning

I. Parameters vs. Hyperparameters

Before tuning anything, it's important to understand what you're actually tuning.

|  | Parameters | Hyperparameters |
| --- | --- | --- |
| What they are | Values learned by the model during training | Values set by you before training begins |
| Examples | Weights in a neural network, coefficients in linear regression | Learning rate, tree depth, number of estimators |
| How they're set | Optimized automatically (e.g., via Gradient Descent) | Chosen manually or via a tuning strategy |

Why does this matter? Your model's parameters are only as good as the hyperparameters that govern the training process. A poorly tuned learning rate can prevent your model from converging entirely. A tree that's too deep will memorize training data and fail on new data. Hyperparameter tuning is the process of systematically finding the settings that produce the best generalizing model.
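As a quick illustration, here is a minimal sketch using sklearn's decision tree (consistent with the examples later in this document): `max_depth` is a hyperparameter we choose; the tree structure and split thresholds are parameters the model learns.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameter: chosen by us before training begins
clf = DecisionTreeClassifier(max_depth=3, random_state=0)

# Parameters: the split thresholds and tree structure learned during fit
clf.fit(X, y)
print("Depth reached:", clf.get_depth())       # bounded by the hyperparameter
print("Learned nodes:", clf.tree_.node_count)  # learned, not chosen
```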

II. The Role of Cross-Validation in Tuning

All tuning methods below rely on cross-validation (CV) to evaluate each hyperparameter combination honestly.

Instead of evaluating on a single train/test split (which can be lucky or unlucky), k-fold cross-validation:

  1. Splits the training data into k equal folds
  2. Trains on k − 1 folds and validates on the remaining fold
  3. Repeats k times, rotating which fold is held out
  4. Reports the average validation score across all k runs

This gives a much more reliable estimate of how each hyperparameter combination will perform on unseen data. The cv=5 parameter you see in every sklearn example below is doing exactly this.
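For instance, `cross_val_score` runs this k-fold procedure directly; a minimal sketch on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# cv=5: five folds, five fits, one validation score per held-out fold
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("Per-fold scores:", scores)
print("Mean CV score:  ", scores.mean())
```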

⚠️ A word of caution: Tuning hyperparameters on your validation set and then reporting that score as your final result is a form of data leakage. Always keep a final held-out test set that is never used during tuning. Evaluate on it exactly once, at the end.

III. Tuning Methods

1. Grid Search

Concept: Define an explicit list of values for each hyperparameter. Grid search then evaluates every possible combination: it is an exhaustive search.

When to use it: When your hyperparameter space is small and you can afford the computation. Good for final fine-tuning around a known good region.

| Pros | Cons |
| --- | --- |
| Exhaustive: guaranteed to find the best combination among the exact values you provided. | Curse of dimensionality: compute scales exponentially. 3 parameters with 5 values each = 5³ = 125 fits; add one more parameter and you're at 625. |
| Parallelizable: each fit is independent, so all combinations can run simultaneously across multiple GPUs or servers. | Wastes resources: it spends equal time testing combinations that earlier results already suggest are terrible. |
| Simplicity: incredibly easy to code, understand, and explain to stakeholders. | Rigid boundaries: it will never find a better value that lies *between* your grid points. |

GridSearchCV — Sklearn Implementation

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [4, 5, 6, 7, 8]
}
# 2 x 5 = 10 combinations, each evaluated 5 times via CV = 50 model fits

clf = GridSearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1  # use all available CPU cores
)

clf.fit(X_train, y_train)

print("Best parameters:", clf.best_params_)
print("Best CV score:  ", clf.best_score_)

2. Randomized Search

Concept: Instead of trying every combination, randomly sample a fixed number of combinations from the hyperparameter space. You control the budget via n_iter.

When to use it: When your hyperparameter space is large and grid search would be too slow. A good first pass to identify which hyperparameters matter most and where good values tend to live.

| Pros | Cons |
| --- | --- |
| Highly efficient: often finds a near-optimal model in a fraction of the time Grid Search takes. | No guarantees: because it relies on chance, it might miss the best combination entirely. |
| Finds in-between values: by sampling from continuous distributions, it can test highly specific numbers (e.g., a learning rate of 0.0314) that a rigid grid would miss. | Blind search: like Grid Search, it does not learn from past results, so it may sample a poor region multiple times by chance. |
| Strict budget control: you dictate exactly how many iterations it runs, capping your maximum compute cost. | Suboptimal for small spaces: if you only have a few values to test, Grid Search is safer. |

RandomizedSearchCV — Sklearn Implementation

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# Instead of all 2 x 4 x 3 x 3 = 72 combinations, we evaluate only n_iter=10

clf = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(),
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)

clf.fit(X_train, y_train)

print("Best parameters:", clf.best_params_)
print("Best CV score:  ", clf.best_score_)

3. Successive Halving

Concept: Start with many hyperparameter combinations but allocate very little data to each. After each round ("halving"), eliminate the bottom half of candidates and give the survivors more data and more training time. Repeat until one winner remains.

When to use it: When you have a large dataset and a large search space. It's significantly faster than standard Grid or Random Search because poor candidates are eliminated early before wasting full training resources on them.
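The schedule is easy to see with a toy calculation. A sketch below assumes a halving factor of 2 and a resource budget that doubles each round (sklearn's actual scheduling differs in detail):

```python
# Toy halving schedule: keep the top half each round, double the resources
n_candidates = 16
resources = 100          # e.g., training samples per candidate
rounds = 0
while n_candidates > 1:
    rounds += 1
    print(f"Round {rounds}: {n_candidates} candidates x {resources} samples")
    n_candidates //= 2   # eliminate the bottom half
    resources *= 2       # survivors earn more data
print(f"Winner found after {rounds} rounds")
```

Sixteen candidates each receive a cheap evaluation in round 1, but only one candidate ever trains on the full resource budget.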

| Pros | Cons |
| --- | --- |
| Massive throughput: you can screen thousands of combinations without spending full compute on the bad ones. | The "late bloomer" risk: some models start with terrible validation scores but become excellent given enough time; halving kills these prematurely. |
| Highly resource-efficient: concentrates most of your computing budget on the most promising candidates. | Noisy early data: if the initial subset of data is unrepresentative, the algorithm may promote the wrong models to the next round. |
| Scalable: easily wraps around Grid or Randomized Search to speed them up. | Parameter sensitivity: you must carefully balance the "halving factor" against the "minimum resources," which adds complexity to the setup. |

Sklearn provides two variants: HalvingGridSearchCV (exhaustive candidate list) and HalvingRandomSearchCV (randomly sampled candidates).

HalvingRandomSearchCV — Sklearn Implementation

from sklearn.tree import DecisionTreeClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: required to enable the import below
from sklearn.model_selection import HalvingRandomSearchCV

param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

clf = HalvingRandomSearchCV(
    estimator=DecisionTreeClassifier(),
    param_distributions=param_dist,
    factor=2,        # halve candidates each round
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)

clf.fit(X_train, y_train)

print("Best parameters:", clf.best_params_)
print("Best CV score:  ", clf.best_score_)

4. Bayesian Optimization

Problem: For the actual machine learning model, we want to minimize (or maximize) an objective function:

f(θ) = validation loss (or error), where θ are the hyperparameters

But f(θ) is expensive to evaluate (each point requires a full training run), has no known gradient, and is observed only through noisy validation scores, so we cannot optimize it directly. Bayesian Optimization works around this with two components:

  1. The Surrogate Model (The "Map")
    Bayesian Optimization builds a probabilistic surrogate model of the objective function (validation score vs. hyperparameters). In most professional libraries, this is powered by a Gaussian Process (GP).
    • The GP looks at the hyperparameters you have tested so far and draws a curve predicting how the model will perform across all the un-tested parameters.
  2. The Acquisition Function (The "Decision Maker")
    Now that you have a map with predictions and uncertainty, you need a strategy for where to test next. This is handled by the Acquisition Function.
    The Acquisition Function scans the Surrogate Model and calculates a score for every possible hyperparameter combination by balancing two competing forces:
    • Exploitation: Testing parameters in areas where the Surrogate Model predicts excellent performance (drilling where you already found traces of oil).
    • Exploration: Testing parameters in areas where the Surrogate Model is highly uncertain (drilling in completely uncharted territory, just in case there is a massive hidden reserve).
      A standard mathematical formula used here is Expected Improvement (EI), which quantifies exactly how much better a new point is likely to be compared to the best result you have found so far.
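Expected Improvement can be computed directly from the surrogate's predicted mean and uncertainty. A small sketch below, where the candidate values (`mu`, `sigma`, `f_best`) are made-up numbers for illustration:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    # EI for maximization: expected amount by which a point beats f_best
    sigma = np.maximum(sigma, 1e-12)          # guard against zero uncertainty
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

# Surrogate predictions at three candidate points (assumed values)
mu     = np.array([0.90, 0.85, 0.80])   # predicted mean validation score
sigma  = np.array([0.01, 0.10, 0.30])   # predicted uncertainty
f_best = 0.88                           # best score observed so far

ei = expected_improvement(mu, sigma, f_best)
print("EI per candidate:", ei)
print("Test next: candidate", int(np.argmax(ei)))
```

Note how the most uncertain candidate wins here even though its predicted mean is the lowest: with enough uncertainty, exploration outweighs exploitation.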

When to use it: When each model fit is expensive (large datasets, deep networks) and you can only afford a limited number of evaluations. It typically finds near-optimal hyperparameters in far fewer iterations than random search.

| Pros | Cons |
| --- | --- |
| Learns from results: it avoids wasting time on combinations its internal math predicts will fail. | Inherently sequential: each new suggestion depends on the results of previous evaluations, which makes it difficult to parallelize across multiple GPUs. |
| Best for expensive models: if a single training run takes 6 hours (deep learning, massive XGBoost ensembles), Bayesian Optimization minimizes the total number of runs needed. | Overhead cost: the Bayesian math itself takes a few seconds to calculate. If your model trains in 0.1 seconds, the tuning engine takes longer than the training. |
| Handles complex spaces: excellent at navigating continuous hyperparameter distributions to find the exact decimal sweet spot. | Complex implementation: harder to set up from scratch, usually requiring specialized libraries (Optuna, Hyperopt, SMAC). |

BayesSearchCV — Sklearn Implementation

# Install dependency first: pip install scikit-optimize

from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define continuous and categorical search spaces
search_space = {
    'C':      Real(1e-6, 1e+6, prior='log-uniform'),  # regularization strength
    'gamma':  Real(1e-6, 1e+1, prior='log-uniform'),  # kernel coefficient
    'kernel': Categorical(['linear', 'rbf'])
}

opt = BayesSearchCV(
    estimator=SVC(),
    search_spaces=search_space,
    n_iter=50,           # number of hyperparameter evaluations
    scoring='accuracy',
    cv=5,
    random_state=42,
    n_jobs=-1
)

opt.fit(X_train, y_train)

print("Best parameters:", opt.best_params_)
print("Best CV score:  ", opt.best_score_)

# Evaluate once on the held-out test set
y_pred = opt.best_estimator_.predict(X_test)
print("Test accuracy:  ", accuracy_score(y_test, y_pred))

Search space types (from skopt.space):

  • Real: continuous values, optionally sampled log-uniformly (e.g., C, gamma above)
  • Integer: whole-number values (e.g., tree depth)
  • Categorical: unordered choices (e.g., kernel type)

IV. When to Use Which Method

| Method | Search Space Size | Compute Budget | Best For |
| --- | --- | --- | --- |
| Grid Search | Small | High | Final fine-tuning in a known good region |
| Randomized Search | Large | Medium | First-pass exploration, identifying important hyperparameters |
| Successive Halving | Large | Low–Medium | Large datasets where early stopping saves time |
| Bayesian Optimization | Any | Low (expensive fits) | Deep learning, SVMs, or any model where each fit is costly |

Practical workflow:

  1. Start with Randomized Search to identify which hyperparameters matter and roughly where good values tend to live.
  2. Narrow the space and use Grid Search or Bayesian Optimization to fine-tune.
  3. Evaluate the final model on your held-out test set — exactly once.
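The workflow above can be sketched end to end; the search spaces below are illustrative, not a recipe:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     train_test_split)
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Step 1: broad randomized search over a wide space
wide = {'max_depth': [2, 4, 6, 8, 10], 'min_samples_leaf': [1, 2, 4, 8]}
stage1 = RandomizedSearchCV(DecisionTreeClassifier(random_state=0), wide,
                            n_iter=8, cv=5, random_state=42)
stage1.fit(X_train, y_train)

# Step 2: narrow grid search around the stage-1 winner
d = stage1.best_params_['max_depth']
narrow = {'max_depth': [max(1, d - 1), d, d + 1],
          'min_samples_leaf': [stage1.best_params_['min_samples_leaf']]}
stage2 = GridSearchCV(DecisionTreeClassifier(random_state=0), narrow, cv=5)
stage2.fit(X_train, y_train)

# Step 3: evaluate exactly once on the held-out test set
print("Test accuracy:", stage2.score(X_test, y_test))
```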

Question and Answers

1. Detailed Workflow of Bayesian Search for Hyperparameter Tuning

Bayesian Optimization (BO) is an iterative, mathematically driven process that treats hyperparameter tuning as the optimization of an unknown "black-box" function. Here is the complete step-by-step loop:

Step 1. Define the Objective Function and Search Space

Before the loop begins, you must explicitly define what the algorithm is searching through and what it is trying to achieve.

Step 2. Initialize (The "Cold Start")

Because BO learns from past evaluations, it requires a baseline of initial data points to build its first map.

Step 3. Build the Surrogate Model (The "Map")

Using the historical dataset, BO trains a cheap mathematical model to approximate the expensive true objective function.

Step 4. Optimize the Acquisition Function (Strategize)

The Acquisition Function acts as the "policy" that scans the Surrogate Model to decide the best hyperparameter combination to test next (θ). It calculates a score by balancing two competing forces:

  • Exploitation: testing areas where the surrogate predicts excellent performance.
  • Exploration: testing areas where the surrogate is highly uncertain.

Step 5. Execute (Evaluate the True Objective)

Take the single recommended configuration (θ) chosen by the Acquisition Function and train your actual, computationally expensive machine learning model. Calculate the true validation error f(θ).

Step 6. Update the Surrogate Model

Take the new real-world result pair (θ,f(θ)) from Step 5 and append it to your historical dataset. The Surrogate Model is then re-trained (its posterior distribution is updated), making its "map" of the hyperparameter space much more accurate.

Step 7. Loop and Terminate

Repeat Steps 3 through 6. The process stops when a predefined stopping criterion is met: a maximum number of iterations, an exhausted time or compute budget, or no meaningful improvement over several consecutive iterations.
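The full loop can be sketched on a toy 1-D problem. Below, a cheap quadratic `objective` stands in for an expensive validation run, a Gaussian Process is the surrogate (Step 3), and Expected Improvement is the acquisition function (Step 4); all constants are illustrative:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

# Toy objective standing in for "validation score vs. hyperparameter",
# expensive in real life, instant here. Peak at theta = 0.6.
def objective(theta):
    return -(theta - 0.6) ** 2

grid = np.linspace(0, 1, 101).reshape(-1, 1)   # Step 1: the search space

# Step 2: cold start with a few random evaluations
rng = np.random.default_rng(0)
X_obs = rng.uniform(0, 1, 3).reshape(-1, 1)
y_obs = objective(X_obs).ravel()

for _ in range(10):
    # Step 3: fit the surrogate to everything observed so far
    gp = GaussianProcessRegressor(alpha=1e-6, normalize_y=True).fit(X_obs, y_obs)
    mu, sigma = gp.predict(grid, return_std=True)

    # Step 4: Expected Improvement balances exploitation vs. exploration
    f_best = y_obs.max()
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - f_best) / sigma
    ei = (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

    # Steps 5-6: evaluate the chosen point and update the history
    theta_next = grid[np.argmax(ei)].reshape(1, 1)
    X_obs = np.vstack([X_obs, theta_next])
    y_obs = np.append(y_obs, objective(theta_next))

# Step 7: report the best configuration observed
print("Best theta found:", X_obs[np.argmax(y_obs), 0])
```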