Hyperparameter Tuning
I. Parameters vs. Hyperparameters
Before tuning anything, it's important to understand what you're actually tuning.
| | Parameters | Hyperparameters |
|---|---|---|
| What they are | Values learned by the model during training | Values set by you before training begins |
| Examples | Weights in a neural network, coefficients in linear regression | Learning rate, tree depth, number of estimators |
| How they're set | Optimized automatically (e.g., via Gradient Descent) | Chosen manually or via a tuning strategy |
Why does this matter? Your model's parameters are only as good as the hyperparameters that govern the training process. A poorly tuned learning rate can prevent your model from converging entirely. A tree that's too deep will memorize training data and fail on new data. Hyperparameter tuning is the process of systematically finding the settings that produce the best generalizing model.
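The distinction is easy to see in code. Below is a minimal sketch (LogisticRegression on the iris dataset, both chosen purely for illustration): `C` is a hyperparameter we pick before training, while the coefficients are parameters the model learns inside `fit()`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Hyperparameter: chosen by us *before* training begins
model = LogisticRegression(C=0.5, max_iter=1000)

# Parameters: learned *by* the model during fit()
model.fit(X, y)
print("Hyperparameter C:", model.C)                      # still 0.5; we set it
print("Learned coefficients shape:", model.coef_.shape)  # the model found these
```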
II. The Role of Cross-Validation in Tuning
All tuning methods below rely on cross-validation (CV) to evaluate each hyperparameter combination honestly. Instead of evaluating on a single train/test split (which can be lucky or unlucky), k-fold CV:
- Splits the training data into k equal folds
- Trains on k−1 folds and validates on the remaining fold
- Repeats k times, rotating which fold is held out
- Reports the average validation score across all k runs
This gives a much more reliable estimate of how each hyperparameter combination will perform on unseen data. The cv=5 parameter you see in every sklearn example below is doing exactly this, with k=5.
⚠️ A word of caution: Tuning hyperparameters on your validation set and then reporting that score as your final result is a form of data leakage. Always keep a final held-out test set that is never used during tuning. Evaluate on it exactly once, at the end.
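As a concrete sketch of what cv=5 does, the snippet below scores a single hyperparameter setting across five folds (the iris dataset and the fixed max_depth are illustrative choices, not recommendations):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# One hyperparameter setting, evaluated with 5-fold CV:
# five fits, each validated on a different held-out fold
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=4, random_state=0), X, y, cv=5
)
print("Per-fold scores:", scores)        # five values, one per fold
print("Mean CV score:  ", scores.mean())
```

Every tuning method below repeats exactly this evaluation once per hyperparameter combination.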
III. Tuning Methods
1. Grid Search
Concept: Define an explicit list of values for each hyperparameter. Grid search evaluates every possible combination — it is an exhaustive search.
When to use it: When your hyperparameter space is small and you can afford the computation. Good for final fine-tuning around a known good region.
| Pros | Cons |
|---|---|
| Exhaustive: Guaranteed to find the absolute best combination within the exact values you provided. | Curse of Dimensionality: Adding just one new parameter multiplies the compute time; the cost scales exponentially. 3 parameters with 5 values each = 5³ = 125 combinations. |
| Parallelizable: Since each test is independent, you can run all combinations simultaneously across multiple GPUs or servers. | Wastes Resources: It spends equal time testing combinations it should logically know are terrible. |
| Simplicity: Incredibly easy to code, understand, and explain to stakeholders. | Rigid Boundaries: It will never find a better value that exists "between" your grid points. |
GridSearchCV — Sklearn Implementation
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [4, 5, 6, 7, 8]
}

# 2 x 5 = 10 combinations, each evaluated 5 times via CV = 50 model fits
clf = GridSearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1  # use all available CPU cores
)
clf.fit(X_train, y_train)

print("Best parameters:", clf.best_params_)
print("Best CV score: ", clf.best_score_)
```
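Beyond best_params_, the fitted search object exposes cv_results_, which records the mean score and rank of every combination. A self-contained sketch (iris dataset assumed purely for illustration):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

clf = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [4, 5, 6, 7, 8]},
    cv=5,
)
clf.fit(X, y)

# cv_results_ holds one row per combination: mean/std of the fold scores, rank, timings
results = pd.DataFrame(clf.cv_results_)
print(results[['param_criterion', 'param_max_depth',
               'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score').head())
```

Inspecting the full table, not just the winner, shows how flat or peaked the score surface is around the best region.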
2. Randomized Search
Concept: Instead of trying every combination, randomly sample a fixed number of combinations from the hyperparameter space. You control the budget via n_iter.
When to use it: When your hyperparameter space is large and grid search would be too slow. A good first pass to identify which hyperparameters matter most and where good values tend to live.
| Pros | Cons |
|---|---|
| Highly Efficient: Often finds a near-optimal model in a fraction of the time it takes Grid Search. | No Guarantees: Because it relies on chance, it might miss the absolute global maximum. |
| Finds In-Between Values: By using continuous distributions, it can test highly specific numbers (e.g., a learning rate of 0.0314) that a rigid grid would miss. | Blind Search: Like Grid Search, it does not learn from its mistakes. It might test a terrible parameter space multiple times by random chance. |
| Strict Budget Control: You dictate exactly how many iterations it runs, capping your maximum compute cost. | Suboptimal for Small Spaces: If you only have a few parameters to test, Grid Search is safer. |
RandomizedSearchCV — Sklearn Implementation
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Instead of all 4x3x3 = 36 combinations, we only evaluate n_iter=10
clf = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(),
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)
clf.fit(X_train, y_train)

print("Best parameters:", clf.best_params_)
print("Best CV score: ", clf.best_score_)
```
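To actually get those in-between values, pass scipy.stats distributions instead of fixed lists. A sketch (the ranges and the ccp_alpha parameter are illustrative choices, not recommendations):

```python
from scipy.stats import randint, uniform
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Distributions instead of fixed lists: each draw can land between grid points
param_dist = {
    'max_depth': randint(2, 12),          # integers in [2, 11]
    'min_samples_split': randint(2, 20),
    'ccp_alpha': uniform(0.0, 0.05),      # continuous pruning strength
}

clf = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=15,
    cv=5,
    random_state=42,
)
clf.fit(X, y)
print("Best parameters:", clf.best_params_)
```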
3. Successive Halving
Concept: Start with many hyperparameter combinations but allocate very little data to each. After each round ("halving"), eliminate the bottom half of candidates and give the survivors more data and more training time. Repeat until one winner remains.
When to use it: When you have a large dataset and a large search space. It's significantly faster than standard Grid or Random Search because poor candidates are eliminated early before wasting full training resources on them.
| Pros | Cons |
|---|---|
| Massive Throughput: You can screen thousands of combinations simultaneously without spending full compute on the bad ones. | The "Late Bloomer" Risk: Some models start with terrible validation scores but become excellent if given enough time. Halving kills these prematurely. |
| Highly Resource Efficient: Focuses 90% of your computing budget on the top 10% of candidates. | Noisy Early Data: If the initial subset of data is unrepresentative, the algorithm might promote the wrong models to the next round. |
| Scalable: Easily wraps around Grid or Randomized Search to instantly speed them up. | Parameter Sensitivity: You must carefully balance the "halving factor" and the "minimum resources," which adds complexity to the setup. |
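A back-of-the-envelope sketch of the budget arithmetic, assuming 16 starting candidates, a minimum resource of 100 samples, and factor=2 (all numbers illustrative, pure Python, no sklearn):

```python
def halving_schedule(n_candidates, min_resources, factor=2):
    """List (candidates, samples-per-candidate) for each halving round."""
    rounds = []
    resources = min_resources
    while n_candidates >= 1:
        rounds.append((n_candidates, resources))
        if n_candidates == 1:
            break
        n_candidates = max(1, n_candidates // factor)  # keep the top 1/factor
        resources *= factor                            # survivors get more data
    return rounds

for n, r in halving_schedule(n_candidates=16, min_resources=100):
    print(f"{n:>2} candidates x {r:>4} samples each = {n * r:>5} samples this round")
```

With factor=2 every round costs roughly the same total number of samples, so screening 16 candidates costs about 5x one full-resource fit rather than 16x.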
Sklearn provides two variants:
- HalvingGridSearchCV — applies successive halving to an exhaustive grid
- HalvingRandomSearchCV — applies successive halving to a random sample (best of both worlds)
HalvingRandomSearchCV — Sklearn Implementation
```python
from sklearn.tree import DecisionTreeClassifier
# Halving search is still experimental and must be enabled explicitly
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV

param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

clf = HalvingRandomSearchCV(
    estimator=DecisionTreeClassifier(),
    param_distributions=param_dist,
    factor=2,  # halve candidates each round
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)
clf.fit(X_train, y_train)

print("Best parameters:", clf.best_params_)
print("Best CV score: ", clf.best_score_)
```
4. Bayesian Optimization
Problem: We want to maximize (or minimize) an objective function f(x) — the validation score of the actual machine learning model trained with hyperparameters x. But f(x) is:
- expensive to evaluate (train + validate),
- noisy (CV randomness),
- has no closed-form gradient.
Solution
- The Surrogate Model (The "Map")
Instead of evaluating f(x) everywhere, Bayesian Optimization builds a probabilistic surrogate model of the objective function (validation score vs. hyperparameters). In most professional libraries, this is powered by a Gaussian Process (GP). The GP looks at the hyperparameters you have tested so far and draws a curve predicting how the model will perform across all the un-tested parameters.
- The Acquisition Function (The "Decision Maker")
Now that you have a map with predictions and uncertainty, you need a strategy for where to test next. This is handled by the Acquisition Function.
The Acquisition Function scans the Surrogate Model and calculates a score for every possible hyperparameter combination by balancing two competing forces:
- Exploitation: Testing parameters in areas where the Surrogate Model predicts excellent performance (drilling where you already found traces of oil).
- Exploration: Testing parameters in areas where the Surrogate Model is highly uncertain (drilling in completely uncharted territory, just in case there is a massive hidden reserve).
A standard mathematical formula used here is Expected Improvement (EI), which quantifies exactly how much better a new point is likely to be compared to the best result you have found so far.
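Expected Improvement has a closed form when the surrogate's prediction at a point is Gaussian with mean μ and standard deviation σ. A minimal sketch of the maximization form, using only the standard library (the ξ margin and all numbers are illustrative):

```python
from math import erf, exp, pi, sqrt

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Closed-form EI for maximization: expected amount by which a new
    point beats f_best, given a Gaussian prediction N(mu, sigma^2).
    xi is a small margin that nudges the search toward exploration."""
    if sigma == 0.0:
        return 0.0                                   # no uncertainty, no improvement
    z = (mu - f_best - xi) / sigma
    pdf = exp(-0.5 * z * z) / sqrt(2 * pi)           # standard normal density
    cdf = 0.5 * (1.0 + erf(z / sqrt(2)))             # standard normal CDF
    return (mu - f_best - xi) * cdf + sigma * pdf

# Exploitation: a confident prediction slightly above the best so far
print(expected_improvement(mu=0.93, sigma=0.01, f_best=0.92))
# Exploration: an uncertain prediction below the best still earns some EI
print(expected_improvement(mu=0.88, sigma=0.10, f_best=0.92))
```

Note that the second, uncertain point scores positive EI even though its mean is below the incumbent: uncertainty alone makes a point worth probing.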
When to use it: When each model fit is expensive (large datasets, deep networks) and you can only afford a limited number of evaluations. It typically finds near-optimal hyperparameters in far fewer iterations than random search.
| Pros | Cons |
|---|---|
| Learns from Mistakes: It never wastes time on combinations that its internal math predicts will fail. | Inherently Sequential: Because each new evaluation depends on the results of all previous ones, it is difficult to parallelize across multiple GPUs. |
| Best for Expensive Models: If a single training run takes 6 hours (like Deep Learning or massive XGBoost ensembles), Bayesian minimizes the total number of runs needed. | Overhead Cost: The Bayesian math itself takes a few seconds to calculate. If your model trains in 0.1 seconds, the tuning engine takes longer than the model training. |
| Handles Complex Spaces: Excellent at navigating continuous hyperparameter distributions to find the exact decimal sweet spot. | Complex Implementation: Harder to set up from scratch, usually requiring specialized libraries (Optuna, Hyperopt, SMAC). |
BayesSearchCV — scikit-optimize Implementation
```python
# Install dependency first: pip install scikit-optimize
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define continuous and categorical search spaces
search_space = {
    'C': Real(1e-6, 1e+6, prior='log-uniform'),      # regularization strength
    'gamma': Real(1e-6, 1e+1, prior='log-uniform'),  # kernel coefficient
    'kernel': Categorical(['linear', 'rbf'])
}

opt = BayesSearchCV(
    estimator=SVC(),
    search_spaces=search_space,
    n_iter=50,  # number of hyperparameter evaluations
    scoring='accuracy',
    cv=5,
    random_state=42,
    n_jobs=-1
)
opt.fit(X_train, y_train)

print("Best parameters:", opt.best_params_)
print("Best CV score: ", opt.best_score_)

# Evaluate once on the held-out test set
y_pred = opt.best_estimator_.predict(X_test)
print("Test accuracy: ", accuracy_score(y_test, y_pred))
```
Search space types (from skopt.space):
- Real(low, high, prior='log-uniform') — continuous range; the log scale is ideal for values spanning orders of magnitude (e.g., learning rate, C)
- Integer(low, high) — discrete integer range (e.g., tree depth, number of estimators)
- Categorical([...]) — unordered discrete choices (e.g., kernel type, activation function)
IV. When to Use Which Method
| Method | Search Space Size | Compute Budget | Best For |
|---|---|---|---|
| Grid Search | Small | High | Final fine-tuning in a known good region |
| Randomized Search | Large | Medium | First-pass exploration, identifying important hyperparameters |
| Successive Halving | Large | Low–Medium | Large datasets where early stopping saves time |
| Bayesian Optimization | Any | Low (expensive fits) | Deep learning, SVMs, or any model where each fit is costly |
Practical workflow:
- Start with Randomized Search to identify which hyperparameters matter and roughly where good values tend to live.
- Narrow the space and use Grid Search or Bayesian Optimization to fine-tune.
- Evaluate the final model on your held-out test set — exactly once.
Questions and Answers
1. Detailed Workflow of Bayesian Search for Hyperparameter Tuning
Bayesian Optimization (BO) is an iterative, mathematically driven process that treats hyperparameter tuning as the optimization of an unknown "black-box" function. Here is the complete step-by-step loop:
Step 1. Define the Objective Function and Search Space
Before the loop begins, you must explicitly define what the algorithm is searching through and what it is trying to achieve.
- The Objective Function f(x): The actual metric you want to minimize (e.g., Validation RMSE, Log Loss) or maximize (e.g., F1-Score, Accuracy).
- The Search Space: The boundaries for your hyperparameters x. This space can mix continuous, integer, and categorical values.
  - Example: a continuous learning rate (often searched on a logarithmic scale).
  - Example: an integer tree depth.
  - Example: a categorical kernel type.
Step 2. Initialize (The "Cold Start")
Because BO learns from past evaluations, it requires a baseline of initial data points to build its first map.
- Evaluate a small number of configurations (usually 5 to 10) using Random Search or Latin Hypercube Sampling (LHS) (which ensures a more even spread across the search space than pure random selection).
- Run the actual model training and validation for each configuration and store the results as a dataset of (hyperparameters, score) pairs.
Step 3. Build the Surrogate Model (The "Map")
Using the historical dataset, BO trains a cheap mathematical model to approximate the expensive true objective function.
- Common Surrogate Models:
- Gaussian Process (GP): The classic BO approach; excellent for continuous, low-to-medium dimensional spaces.
- Tree-structured Parzen Estimator (TPE): Highly efficient for categorical/conditional hyperparameters (the default in Optuna and Hyperopt).
- Random Forests: Used in SMAC, great for heavily categorical spaces.
- What it outputs: For any untested hyperparameter combination x, a GP surrogate provides a probability distribution:
  - Mean prediction: μ(x) (What it thinks the score will be).
  - Uncertainty: σ(x) (How confident it is in that guess).
Step 4. Optimize the Acquisition Function (Strategize)
The Acquisition Function acts as the "policy" that scans the Surrogate Model to decide the best hyperparameter combination to test next, balancing:
- Exploitation: Testing parameters where the predicted performance looks excellent (e.g., low μ(x) for error minimization).
- Exploration: Testing parameters in completely uncharted regions (high σ(x)) just in case a better configuration is hidden there.
- Common Acquisition Functions:
  - Expected Improvement (EI): The industry standard. It calculates the mathematical expectation of how much better a new point will be compared to the current best-known point.
  - Probability of Improvement (PI): Measures the likelihood that a new point will beat the current best, regardless of by how much.
  - Upper/Lower Confidence Bound (UCB/LCB): Directly trades off mean and variance. For minimization: LCB(x) = μ(x) − κ·σ(x), where κ controls the exploration level.
Step 5. Execute (Evaluate the True Objective)
Take the single recommended configuration, train and validate the actual model with it, and record the true score it achieves.
Step 6. Update the Surrogate Model
Add the new real-world result pair (hyperparameters, score) to the historical dataset and refit the surrogate model, sharpening its predictions for the next round.
Step 7. Loop and Terminate
Repeat Steps 3 through 6. The process stops when a predefined stopping criterion is met:
- Maximum number of evaluations (e.g., n_trials = 100).
- Time budget exhausted (e.g., timeout = 3600 seconds).
- Early stopping (the Acquisition Function determines that the Expected Improvement of any new point is practically zero).
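The steps above can be sketched end-to-end on a 1-D toy objective, using sklearn's GaussianProcessRegressor as the surrogate and Expected Improvement as the acquisition function (the toy objective, kernel, and all settings are illustrative, not a production implementation):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(42)

def objective(x):
    """Toy stand-in for an expensive CV score (1-D, to be maximized)."""
    return float(np.sin(3 * x) + 0.5 * x)

# Step 1: search space, discretized to a grid for simplicity
grid = np.linspace(0.0, 3.0, 300).reshape(-1, 1)

# Step 2: cold start with a few random evaluations
X_obs = rng.uniform(0.0, 3.0, size=(4, 1))
y_obs = np.array([objective(x[0]) for x in X_obs])

for trial in range(10):
    # Step 3: fit the GP surrogate to the history
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                  alpha=1e-6,        # jitter for numerical stability
                                  normalize_y=True)
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(grid, return_std=True)

    # Step 4: Expected Improvement over the best observed score
    f_best = y_obs.max()
    z = (mu - f_best) / np.maximum(sigma, 1e-9)
    ei = (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

    # Steps 5-6: evaluate the most promising point, append to history
    x_next = grid[np.argmax(ei)]
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next[0]))

# Step 7: stop after a fixed number of trials and report the incumbent
print("Best x found:", X_obs[np.argmax(y_obs)][0])
print("Best score:  ", y_obs.max())
```

In a real tuning run, objective() would be a full cross-validated model fit, which is exactly why the surrogate-driven loop pays off: each of the 10 trials is chosen deliberately rather than at random.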