Hyperparameter Tuning
I. Parameters vs. Hyperparameters
Before tuning anything, it's important to understand what you're actually tuning.
| | Parameters | Hyperparameters |
|---|---|---|
| What they are | Values learned by the model during training | Values set by you before training begins |
| Examples | Weights in a neural network, coefficients in linear regression | Learning rate, tree depth, number of estimators |
| How they're set | Optimized automatically (e.g., via Gradient Descent) | Chosen manually or via a tuning strategy |
Why does this matter? Your model's parameters are only as good as the hyperparameters that govern the training process. A poorly tuned learning rate can prevent your model from converging entirely. A tree that's too deep will memorize training data and fail on new data. Hyperparameter tuning is the process of systematically finding the settings that produce the best generalizing model.
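The distinction is easy to see in code. Below is a minimal sketch (LogisticRegression on the iris dataset, both chosen purely for illustration): `C` is a hyperparameter we pick before training, while the coefficients are parameters the model learns inside `fit()`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Hyperparameter: chosen by us *before* training begins
model = LogisticRegression(C=0.5, max_iter=1000)

# Parameters: learned *by* the model during fit()
model.fit(X, y)
print("Hyperparameter C:", model.C)                      # still 0.5; we set it
print("Learned coefficients shape:", model.coef_.shape)  # the model found these
```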
II. The Role of Cross-Validation in Tuning
All tuning methods below rely on cross-validation (CV) to evaluate each hyperparameter combination honestly. Instead of evaluating on a single train/test split (which can be lucky or unlucky), k-fold CV:
- Splits the training data into k equal folds
- Trains on k−1 folds and validates on the remaining fold
- Repeats k times, rotating which fold is held out
- Reports the average validation score across all k runs
This gives a much more reliable estimate of how each hyperparameter combination will perform on unseen data. The cv=5 parameter you see in every sklearn example below is doing exactly this, with k=5.
⚠️ A word of caution: Tuning hyperparameters on your validation set and then reporting that score as your final result is a form of data leakage. Always keep a final held-out test set that is never used during tuning. Evaluate on it exactly once, at the end.
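As a concrete sketch of what cv=5 does, the snippet below scores a single hyperparameter setting across five folds (the iris dataset and the fixed max_depth are illustrative choices, not recommendations):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# One hyperparameter setting, evaluated with 5-fold CV:
# five fits, each validated on a different held-out fold
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=4, random_state=0), X, y, cv=5
)
print("Per-fold scores:", scores)        # five values, one per fold
print("Mean CV score:  ", scores.mean())
```

Every tuning method below repeats exactly this evaluation once per hyperparameter combination.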
III. Tuning Methods
1. Grid Search
Concept: Define an explicit list of values for each hyperparameter. Grid search evaluates every possible combination — it is an exhaustive search.
When to use it: When your hyperparameter space is small and you can afford the computation. Good for final fine-tuning around a known good region.
| Pros | Cons |
|---|---|
| Exhaustive: Guaranteed to find the absolute best combination within the exact values you provided. | Curse of Dimensionality: Adding just one new parameter multiplies the compute time; the cost scales exponentially. 3 parameters with 5 values each = 5³ = 125 combinations. |
| Parallelizable: Since each test is independent, you can run all combinations simultaneously across multiple GPUs or servers. | Wastes Resources: It spends equal time testing combinations it should logically know are terrible. |
| Simplicity: Incredibly easy to code, understand, and explain to stakeholders. | Rigid Boundaries: It will never find a better value that exists "between" your grid points. |
GridSearchCV — Sklearn Implementation
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [4, 5, 6, 7, 8]
}

# 2 x 5 = 10 combinations, each evaluated 5 times via CV = 50 model fits
clf = GridSearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1  # use all available CPU cores
)
clf.fit(X_train, y_train)

print("Best parameters:", clf.best_params_)
print("Best CV score: ", clf.best_score_)
```
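Beyond best_params_, the fitted search object exposes cv_results_, which records the mean score and rank of every combination. A self-contained sketch (iris dataset assumed purely for illustration):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

clf = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [4, 5, 6, 7, 8]},
    cv=5,
)
clf.fit(X, y)

# cv_results_ holds one row per combination: mean/std of the fold scores, rank, timings
results = pd.DataFrame(clf.cv_results_)
print(results[['param_criterion', 'param_max_depth',
               'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score').head())
```

Inspecting the full table, not just the winner, shows how flat or peaked the score surface is around the best region.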
2. Randomized Search
Concept: Instead of trying every combination, randomly sample a fixed number of combinations from the hyperparameter space. You control the budget via n_iter.
When to use it: When your hyperparameter space is large and grid search would be too slow. A good first pass to identify which hyperparameters matter most and where good values tend to live.
| Pros | Cons |
|---|---|
| Highly Efficient: Often finds a near-optimal model in a fraction of the time it takes Grid Search. | No Guarantees: Because it relies on chance, it might miss the absolute global maximum. |
| Finds In-Between Values: By using continuous distributions, it can test highly specific numbers (e.g., a learning rate of 0.0314) that a rigid grid would miss. | Blind Search: Like Grid Search, it does not learn from its mistakes. It might test a terrible parameter space multiple times by random chance. |
| Strict Budget Control: You dictate exactly how many iterations it runs, capping your maximum compute cost. | Suboptimal for Small Spaces: If you only have a few parameters to test, Grid Search is safer. |
RandomizedSearchCV — Sklearn Implementation
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Instead of all 4x3x3 = 36 combinations, we only evaluate n_iter=10
clf = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(),
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)
clf.fit(X_train, y_train)

print("Best parameters:", clf.best_params_)
print("Best CV score: ", clf.best_score_)
```
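To actually get those in-between values, pass scipy.stats distributions instead of fixed lists. A sketch (the ranges and the ccp_alpha parameter are illustrative choices, not recommendations):

```python
from scipy.stats import randint, uniform
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Distributions instead of fixed lists: each draw can land between grid points
param_dist = {
    'max_depth': randint(2, 12),          # integers in [2, 11]
    'min_samples_split': randint(2, 20),
    'ccp_alpha': uniform(0.0, 0.05),      # continuous pruning strength
}

clf = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=15,
    cv=5,
    random_state=42,
)
clf.fit(X, y)
print("Best parameters:", clf.best_params_)
```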
3. Successive Halving
Concept: Start with many hyperparameter combinations but allocate very little data to each. After each round ("halving"), eliminate the bottom half of candidates and give the survivors more data and more training time. Repeat until one winner remains.
When to use it: When you have a large dataset and a large search space. It's significantly faster than standard Grid or Random Search because poor candidates are eliminated early before wasting full training resources on them.
| Pros | Cons |
|---|---|
| Massive Throughput: You can screen thousands of combinations simultaneously without spending full compute on the bad ones. | The "Late Bloomer" Risk: Some models start with terrible validation scores but become excellent if given enough time. Halving kills these prematurely. |
| Highly Resource Efficient: Focuses 90% of your computing budget on the top 10% of candidates. | Noisy Early Data: If the initial subset of data is unrepresentative, the algorithm might promote the wrong models to the next round. |
| Scalable: Easily wraps around Grid or Randomized Search to instantly speed them up. | Parameter Sensitivity: You must carefully balance the "halving factor" and the "minimum resources," which adds complexity to the setup. |
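A back-of-the-envelope sketch of the budget arithmetic, assuming 16 starting candidates, a minimum resource of 100 samples, and factor=2 (all numbers illustrative, pure Python, no sklearn):

```python
def halving_schedule(n_candidates, min_resources, factor=2):
    """List (candidates, samples-per-candidate) for each halving round."""
    rounds = []
    resources = min_resources
    while n_candidates >= 1:
        rounds.append((n_candidates, resources))
        if n_candidates == 1:
            break
        n_candidates = max(1, n_candidates // factor)  # keep the top 1/factor
        resources *= factor                            # survivors get more data
    return rounds

for n, r in halving_schedule(n_candidates=16, min_resources=100):
    print(f"{n:>2} candidates x {r:>4} samples each = {n * r:>5} samples this round")
```

With factor=2 every round costs roughly the same total number of samples, so screening 16 candidates costs about 5x one full-resource fit rather than 16x.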
Sklearn provides two variants:
- HalvingGridSearchCV — applies successive halving to an exhaustive grid
- HalvingRandomSearchCV — applies successive halving to a random sample (best of both worlds)
HalvingRandomSearchCV — Sklearn Implementation
```python
from sklearn.tree import DecisionTreeClassifier
# Halving search is still experimental and must be enabled explicitly
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV

param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

clf = HalvingRandomSearchCV(
    estimator=DecisionTreeClassifier(),
    param_distributions=param_dist,
    factor=2,  # halve candidates each round
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)
clf.fit(X_train, y_train)

print("Best parameters:", clf.best_params_)
print("Best CV score: ", clf.best_score_)
```
4. Bayesian Optimization
Problem: We want to maximize (or minimize) an objective function f(x) — the validation score of the actual machine learning model trained with hyperparameters x. But f(x) is:
- expensive to evaluate (train + validate),
- noisy (CV randomness),
- has no closed-form gradient.
Solution
- The Surrogate Model (The "Map")
Instead of evaluating f(x) everywhere, Bayesian Optimization builds a probabilistic surrogate model of the objective function (validation score vs. hyperparameters). In most professional libraries, this is powered by a Gaussian Process (GP). The GP looks at the hyperparameters you have tested so far and draws a curve predicting how the model will perform across all the un-tested parameters.
- The Acquisition Function (The "Decision Maker")
Now that you have a map with predictions and uncertainty, you need a strategy for where to test next. This is handled by the Acquisition Function.
The Acquisition Function scans the Surrogate Model and calculates a score for every possible hyperparameter combination by balancing two competing forces:
- Exploitation: Testing parameters in areas where the Surrogate Model predicts excellent performance (drilling where you already found traces of oil).
- Exploration: Testing parameters in areas where the Surrogate Model is highly uncertain (drilling in completely uncharted territory, just in case there is a massive hidden reserve).
A standard mathematical formula used here is Expected Improvement (EI), which quantifies exactly how much better a new point is likely to be compared to the best result you have found so far.
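Expected Improvement has a closed form when the surrogate's prediction at a point is Gaussian with mean μ and standard deviation σ. A minimal sketch of the maximization form, using only the standard library (the ξ margin and all numbers are illustrative):

```python
from math import erf, exp, pi, sqrt

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Closed-form EI for maximization: expected amount by which a new
    point beats f_best, given a Gaussian prediction N(mu, sigma^2).
    xi is a small margin that nudges the search toward exploration."""
    if sigma == 0.0:
        return 0.0                                   # no uncertainty, no improvement
    z = (mu - f_best - xi) / sigma
    pdf = exp(-0.5 * z * z) / sqrt(2 * pi)           # standard normal density
    cdf = 0.5 * (1.0 + erf(z / sqrt(2)))             # standard normal CDF
    return (mu - f_best - xi) * cdf + sigma * pdf

# Exploitation: a confident prediction slightly above the best so far
print(expected_improvement(mu=0.93, sigma=0.01, f_best=0.92))
# Exploration: an uncertain prediction below the best still earns some EI
print(expected_improvement(mu=0.88, sigma=0.10, f_best=0.92))
```

Note that the second, uncertain point scores positive EI even though its mean is below the incumbent: uncertainty alone makes a point worth probing.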
When to use it: When each model fit is expensive (large datasets, deep networks) and you can only afford a limited number of evaluations. It typically finds near-optimal hyperparameters in far fewer iterations than random search.
| Pros | Cons |
|---|---|
| Learns from Mistakes: It never wastes time on combinations that its internal math predicts will fail. | Inherently Sequential: Because each new evaluation depends on the results of all previous ones, it is difficult to parallelize across multiple GPUs. |
| Best for Expensive Models: If a single training run takes 6 hours (like Deep Learning or massive XGBoost ensembles), Bayesian minimizes the total number of runs needed. | Overhead Cost: The Bayesian math itself takes a few seconds to calculate. If your model trains in 0.1 seconds, the tuning engine takes longer than the model training. |
| Handles Complex Spaces: Excellent at navigating continuous hyperparameter distributions to find the exact decimal sweet spot. | Complex Implementation: Harder to set up from scratch, usually requiring specialized libraries (Optuna, Hyperopt, SMAC). |
BayesSearchCV — scikit-optimize Implementation
```python
# Install dependency first: pip install scikit-optimize
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define continuous and categorical search spaces
search_space = {
    'C': Real(1e-6, 1e+6, prior='log-uniform'),      # regularization strength
    'gamma': Real(1e-6, 1e+1, prior='log-uniform'),  # kernel coefficient
    'kernel': Categorical(['linear', 'rbf'])
}

opt = BayesSearchCV(
    estimator=SVC(),
    search_spaces=search_space,
    n_iter=50,  # number of hyperparameter evaluations
    scoring='accuracy',
    cv=5,
    random_state=42,
    n_jobs=-1
)
opt.fit(X_train, y_train)

print("Best parameters:", opt.best_params_)
print("Best CV score: ", opt.best_score_)

# Evaluate once on the held-out test set
y_pred = opt.best_estimator_.predict(X_test)
print("Test accuracy: ", accuracy_score(y_test, y_pred))
```
Search space types (from skopt.space):
- Real(low, high, prior='log-uniform') — continuous range; the log scale is ideal for values spanning orders of magnitude (e.g., learning rate, C)
- Integer(low, high) — discrete integer range (e.g., tree depth, number of estimators)
- Categorical([...]) — unordered discrete choices (e.g., kernel type, activation function)
IV. When to Use Which Method
| Method | Search Space Size | Compute Budget | Best For |
|---|---|---|---|
| Grid Search | Small | High | Final fine-tuning in a known good region |
| Randomized Search | Large | Medium | First-pass exploration, identifying important hyperparameters |
| Successive Halving | Large | Low–Medium | Large datasets where early stopping saves time |
| Bayesian Optimization | Any | Low (expensive fits) | Deep learning, SVMs, or any model where each fit is costly |
Practical workflow:
- Start with Randomized Search to identify which hyperparameters matter and roughly where good values tend to live.
- Narrow the space and use Grid Search or Bayesian Optimization to fine-tune.
- Evaluate the final model on your held-out test set — exactly once.
Questions and Answers
1. Detailed Workflow of Bayesian Search for Hyperparameter Tuning
Bayesian Optimization (BO) is an iterative, mathematically driven process that treats hyperparameter tuning as the optimization of an unknown "black-box" function. Here is the complete step-by-step loop:
Step 1. Define the Objective Function and Search Space
Before the loop begins, you must explicitly define what the algorithm is searching through and what it is trying to achieve.
- The Objective Function f(x): The actual metric you want to minimize (e.g., Validation RMSE, Log Loss) or maximize (e.g., F1-Score, Accuracy).
- The Search Space: The boundaries for your hyperparameters x. This space can mix continuous, integer, and categorical values.
  - Example: a continuous learning rate (often searched on a logarithmic scale).
  - Example: an integer tree depth.
  - Example: a categorical kernel type.
Step 2. Initialize (The "Cold Start")
Because BO learns from past evaluations, it requires a baseline of initial data points to build its first map.
- Evaluate a small number of configurations (usually 5 to 10) using Random Search or Latin Hypercube Sampling (LHS) (which ensures a more even spread across the search space than pure random selection).
- Run the actual model training and validation for each configuration and store the results as a dataset of (hyperparameters, score) pairs.
Step 3. Build the Surrogate Model (The "Map")
Using the historical dataset, BO trains a cheap mathematical model to approximate the expensive true objective function.
- Common Surrogate Models:
- Gaussian Process (GP): The classic BO approach; excellent for continuous, low-to-medium dimensional spaces.
- Tree-structured Parzen Estimator (TPE): Highly efficient for categorical/conditional hyperparameters (the default in Optuna and Hyperopt).
- Random Forests: Used in SMAC, great for heavily categorical spaces.
- What it outputs: For any untested hyperparameter combination x, a GP surrogate provides a probability distribution:
  - Mean prediction: μ(x) (What it thinks the score will be).
  - Uncertainty: σ(x) (How confident it is in that guess).
Step 4. Optimize the Acquisition Function (Strategize)
The Acquisition Function acts as the "policy" that scans the Surrogate Model to decide the best hyperparameter combination to test next, balancing:
- Exploitation: Testing parameters where the predicted performance looks excellent (e.g., low μ(x) for error minimization).
- Exploration: Testing parameters in completely uncharted regions (high σ(x)) just in case a better configuration is hidden there.
- Common Acquisition Functions:
  - Expected Improvement (EI): The industry standard. It calculates the mathematical expectation of how much better a new point will be compared to the current best-known point.
  - Probability of Improvement (PI): Measures the likelihood that a new point will beat the current best, regardless of by how much.
  - Upper/Lower Confidence Bound (UCB/LCB): Directly trades off mean and variance. For minimization: LCB(x) = μ(x) − κ·σ(x), where κ controls the exploration level.
Step 5. Execute (Evaluate the True Objective)
Take the single recommended configuration, train and validate the actual model with it, and record the true score it achieves.
Step 6. Update the Surrogate Model
Add the new real-world result pair (hyperparameters, score) to the historical dataset and refit the surrogate model, sharpening its predictions for the next round.
Step 7. Loop and Terminate
Repeat Steps 3 through 6. The process stops when a predefined stopping criterion is met:
- Maximum number of evaluations (e.g., n_trials = 100).
- Time budget exhausted (e.g., timeout = 3600 seconds).
- Early stopping (the Acquisition Function determines that the Expected Improvement of any new point is practically zero).
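The steps above can be sketched end-to-end on a 1-D toy objective, using sklearn's GaussianProcessRegressor as the surrogate and Expected Improvement as the acquisition function (the toy objective, kernel, and all settings are illustrative, not a production implementation):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(42)

def objective(x):
    """Toy stand-in for an expensive CV score (1-D, to be maximized)."""
    return float(np.sin(3 * x) + 0.5 * x)

# Step 1: search space, discretized to a grid for simplicity
grid = np.linspace(0.0, 3.0, 300).reshape(-1, 1)

# Step 2: cold start with a few random evaluations
X_obs = rng.uniform(0.0, 3.0, size=(4, 1))
y_obs = np.array([objective(x[0]) for x in X_obs])

for trial in range(10):
    # Step 3: fit the GP surrogate to the history
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                  alpha=1e-6,        # jitter for numerical stability
                                  normalize_y=True)
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(grid, return_std=True)

    # Step 4: Expected Improvement over the best observed score
    f_best = y_obs.max()
    z = (mu - f_best) / np.maximum(sigma, 1e-9)
    ei = (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

    # Steps 5-6: evaluate the most promising point, append to history
    x_next = grid[np.argmax(ei)]
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next[0]))

# Step 7: stop after a fixed number of trials and report the incumbent
print("Best x found:", X_obs[np.argmax(y_obs)][0])
print("Best score:  ", y_obs.max())
```

In a real tuning run, objective() would be a full cross-validated model fit, which is exactly why the surrogate-driven loop pays off: each of the 10 trials is chosen deliberately rather than at random.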