Hyperparameter Tuning
Learn how to optimize model hyperparameters using grid search, random search, and Bayesian optimization
Introduction
Hyperparameter tuning is the process of finding the optimal configuration of hyperparameters for a machine learning model. Unlike model parameters (weights and biases) that are learned during training, hyperparameters are set before training begins and control the learning process itself.
The choice of hyperparameters can dramatically affect model performance, making hyperparameter optimization a crucial step in the machine learning pipeline.
What are Hyperparameters?
Definition
Hyperparameters are configuration settings that control the behavior of machine learning algorithms. They are not learned from data but must be specified before training begins.
Common Hyperparameters
Neural Networks
- Learning rate: Step size of each gradient update
- Batch size: Number of samples per gradient update
- Number of layers: Depth of the network
- Hidden units: Width of each layer
- Dropout rate: Regularization strength
- Activation functions: Type of non-linearity
Tree-based Models
- Max depth: Maximum tree depth
- Min samples split: Minimum samples to split a node
- Number of estimators: Number of trees (for ensembles)
- Learning rate: Shrinkage parameter (for boosting)
Support Vector Machines
- C parameter: Regularization strength
- Kernel type: Linear, RBF, polynomial
- Gamma: Kernel coefficient
- Degree: Polynomial degree
The Hyperparameter Optimization Problem
Search Space
The hyperparameter space is the set of all possible hyperparameter configurations (a small sketch appears after this list). This space can be:
- Continuous: Learning rate ∈ [0.001, 0.1]
- Discrete: Batch size ∈ {16, 32, 64, 128}
- Categorical: Optimizer ∈ {Adam, SGD, RMSprop}
- Conditional: Some parameters only matter when others have specific values
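For concreteness, here is a minimal sketch of how such a mixed search space could be written down in Python. The parameter names, ranges, and the tuple-based encoding are purely illustrative and are not any specific library's API:
# Illustrative mixed search space; names, ranges, and encoding are examples only
search_space = {
    "learning_rate": ("log_uniform", 1e-3, 1e-1),            # continuous
    "batch_size":    ("choice", [16, 32, 64, 128]),           # discrete
    "optimizer":     ("choice", ["Adam", "SGD", "RMSprop"]),  # categorical
    # Conditional: momentum is only meaningful when optimizer == "SGD"
    "momentum":      ("uniform", 0.0, 0.99, {"only_if": {"optimizer": "SGD"}}),
}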
Objective Function
The goal is to find hyperparameters θ that minimize (or maximize) an objective function:
θ* = argmin_{θ ∈ Θ} f(θ)
Where:
- θ represents the hyperparameters
- Θ is the search space
- f(θ) is the validation performance
Challenges
- Expensive Evaluation: Each configuration requires full model training
- High Dimensionality: Many hyperparameters to optimize
- Mixed Types: Continuous, discrete, and categorical parameters
- Non-convex: Multiple local optima
- Noisy: Stochastic training introduces noise
Search Strategies
1. Manual Search
Approach: Human expert manually tries different configurations
Pros:
- Leverages domain knowledge
- Can incorporate intuition
- Good for initial exploration
Cons:
- Time-consuming
- Not systematic
- Prone to bias
- Doesn't scale
2. Grid Search
Approach: Exhaustively search over a predefined grid of values
from itertools import product
learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [16, 32, 64]
hidden_units = [64, 128, 256]
# Try all 3×3×3 = 27 combinations
for lr, batch_size, units in product(learning_rates, batch_sizes, hidden_units):
    ...  # train and evaluate a model with this configuration (placeholder)
Pros:
- Systematic and reproducible
- Guaranteed to find best in grid
- Easy to parallelize
- Good for low-dimensional spaces
Cons:
- Exponential growth with dimensions
- Wastes computation on irrelevant parameters
- Fixed grid may miss optimal values
- Curse of dimensionality
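In practice the grid loop above is rarely hand-rolled; scikit-learn's GridSearchCV (listed under Tools and Libraries below) combines it with cross-validation. A minimal sketch, using a small SVM grid as a placeholder for whatever model and parameter ranges you actually care about:
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}  # 3 × 3 = 9 combinations

# Exhaustively evaluates every combination with 3-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)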
3. Random Search
Approach: Randomly sample configurations from the search space
Pros:
- More efficient than grid search
- Better coverage of important parameters
- Easy to implement and parallelize
- Works well in high dimensions
Cons:
- No guarantee of finding optimum
- May waste time on poor regions
- Doesn't learn from previous evaluations
- Still requires many evaluations
Why Random Search Works:
- Not all hyperparameters matter equally; usually only a few dominate performance
- Random search better explores important dimensions
- Avoids grid search's regular spacing limitations
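A minimal scikit-learn sketch of random search with RandomizedSearchCV; the random-forest model and the integer ranges below are placeholders chosen only to keep the example self-contained:
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)
param_distributions = {
    "n_estimators":      randint(50, 300),   # sampled, not enumerated
    "max_depth":         randint(2, 12),
    "min_samples_split": randint(2, 20),
}

# Draws 30 random configurations instead of enumerating a full grid
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions, n_iter=30, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)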
4. Bayesian Optimization
Approach: Use probabilistic model to guide search
Components:
- Surrogate Model: A probabilistic model (typically a Gaussian process) approximates f(θ)
- Acquisition Function: Decides which point to evaluate next
- Optimization: Find the θ that maximizes the acquisition function
Process:
- Evaluate a few random points
- Fit Gaussian Process to observed data
- Use acquisition function to select next point
- Evaluate new point and update GP
- Repeat until budget exhausted
Pros:
- Sample efficient
- Learns from previous evaluations
- Balances exploration vs exploitation
- Works well with expensive evaluations
Cons:
- More complex to implement
- GP inference can be expensive
- Assumes smoothness
- May struggle with high dimensions
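The evaluate-update loop above can be sketched with Optuna (listed under Tools and Libraries below). One assumption to note: Optuna's default sampler is TPE, a sequential model-based method in the same spirit, rather than a Gaussian process; the SVM objective is only a placeholder for a real training run:
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

def objective(trial):
    # Suggest a configuration from the search space
    c = trial.suggest_float("C", 1e-2, 1e2, log=True)
    gamma = trial.suggest_float("gamma", 1e-3, 1e1, log=True)
    # Objective f(θ): mean cross-validated accuracy
    return cross_val_score(SVC(C=c, gamma=gamma), X, y, cv=3).mean()

# The sampler proposes each new point based on all previous trials
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)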
Cross-Validation for Hyperparameter Tuning
The Problem
Using the same data for both hyperparameter selection and performance estimation leads to optimistic bias: the model appears better than it actually is.
Solution: Nested Cross-Validation
Outer Loop (Performance Estimation):
    For each outer fold:
        Inner Loop (Hyperparameter Selection):
            For each hyperparameter configuration:
                Train on the inner training set
                Validate on the inner validation set
        Select best hyperparameters
        Train final model with best hyperparameters
        Test on outer test set
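A minimal scikit-learn sketch of this pattern, where GridSearchCV plays the inner loop and cross_val_score the outer loop; the SVM and its grid are placeholders:
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # hyperparameter selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # performance estimation

# Inner loop: pick hyperparameters; outer loop: estimate generalization performance
inner_search = GridSearchCV(SVC(), param_grid, cv=inner_cv)
outer_scores = cross_val_score(inner_search, X, y, cv=outer_cv)
print(outer_scores.mean())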
Data Splits
- Training Set: Fit model parameters
- Validation Set: Select hyperparameters
- Test Set: Estimate final performance
K-Fold Cross-Validation
- Split data into K folds
- For each fold:
- Use K-1 folds for training
- Use 1 fold for validation
- Average performance across folds
Acquisition Functions
1. Upper Confidence Bound (UCB)
UCB(θ) = μ(θ) + β × σ(θ)
- μ(θ): Predicted mean performance
- σ(θ): Predicted uncertainty
- β: Exploration parameter
Intuition: Select points with high predicted performance OR high uncertainty
2. Expected Improvement (EI)
EI(θ) = E[max(f(θ) - f*, 0)]
- f*: Current best observed value
- Probability-weighted improvement over the current best
3. Probability of Improvement (PI)
PI(θ) = P(f(θ) > f*)
- Probability that θ will improve over current best
- Simple but can be too greedy
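To make the three formulas concrete, here is a small NumPy/SciPy sketch that evaluates them from a surrogate's predicted mean μ(θ) and standard deviation σ(θ), assuming a maximization problem with Gaussian predictive distributions; the example numbers are made up:
import numpy as np
from scipy.stats import norm

def ucb(mu, sigma, beta=2.0):
    # Upper Confidence Bound: favor high predicted mean OR high uncertainty
    return mu + beta * sigma

def expected_improvement(mu, sigma, f_best):
    # EI(θ) = E[max(f(θ) - f*, 0)] under a Gaussian predictive distribution
    z = (mu - f_best) / np.maximum(sigma, 1e-12)
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

def probability_of_improvement(mu, sigma, f_best):
    # PI(θ) = P(f(θ) > f*)
    return norm.sf((f_best - mu) / np.maximum(sigma, 1e-12))

# Surrogate predictions at three candidate points, with current best f* = 0.78
mu = np.array([0.80, 0.75, 0.70])
sigma = np.array([0.02, 0.10, 0.20])
print(ucb(mu, sigma))
print(expected_improvement(mu, sigma, 0.78))
print(probability_of_improvement(mu, sigma, 0.78))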
Interactive Demo
Use the controls below to explore different hyperparameter tuning strategies:
Search Strategy Comparison
- Grid Search: See systematic exploration
- Random Search: Observe random sampling
- Bayesian Optimization: Watch intelligent search
Parameters to Tune
- Max Iterations: Balance between thoroughness and efficiency
- CV Folds: Trade-off between reliability and computation
- Scoring Metric: Choose appropriate evaluation criterion
What to Observe
- Optimization Progress: How quickly does each strategy find good configurations?
- Best Score Evolution: How does the best found score improve over time?
- Parameter Importance: Which hyperparameters matter most?
- Convergence: When does the search stop improving?
Advanced Techniques
1. Multi-fidelity Optimization
Idea: Use cheaper approximations to guide expensive evaluations
Examples:
- Train for fewer epochs initially
- Use smaller datasets for screening
- Lower resolution for images
Benefits:
- Faster screening of poor configurations
- More evaluations within budget
- Better exploration of space
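A hand-rolled sketch of the screening idea: score every configuration with a cheap budget first, then spend the full budget only on the best few. The train_and_score helper is hypothetical; here it just returns a random number so the sketch runs:
import random

def train_and_score(config, epochs):
    # Hypothetical placeholder for a real training routine that returns a validation score.
    # A random value keeps the sketch runnable; replace with actual training.
    return random.random()

def screen_then_refine(configs, cheap_epochs=5, full_epochs=100, keep=5):
    # Low-fidelity pass: rough scores for every configuration using a small budget
    ranked = sorted(configs, key=lambda c: train_and_score(c, cheap_epochs), reverse=True)
    # High-fidelity pass: spend the full budget only on the top `keep` candidates
    return max(ranked[:keep], key=lambda c: train_and_score(c, full_epochs))

configs = [{"lr": lr, "units": u} for lr in (0.001, 0.01, 0.1) for u in (64, 128, 256)]
print(screen_then_refine(configs))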
2. Population-based Training
Idea: Evolve a population of models simultaneously
Process:
- Train multiple models in parallel
- Periodically evaluate performance
- Replace worst performers with mutations of best
- Continue training evolved population
Benefits:
- Online hyperparameter adaptation
- Efficient use of parallel resources
- Can adapt hyperparameters during training
3. Hyperband
Idea: Combine random search with early stopping
Process:
- Allocate budget across configurations and training time
- Start many configurations with small budgets
- Progressively eliminate poor performers
- Give more budget to promising configurations
Benefits:
- Efficient resource allocation
- Good performance without tuning
- Principled early stopping
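One accessible way to try this idea is Optuna's HyperbandPruner, which stops unpromising trials early based on intermediate values reported during training. The objective below is a toy stand-in for a real training loop, with a synthetic "learning curve":
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    score = 0.0
    for step in range(100):              # stands in for training epochs
        score += lr * (1 - score)        # toy learning curve, not a real model
        trial.report(score, step)        # report intermediate performance
        if trial.should_prune():         # Hyperband decides whether to stop early
            raise optuna.TrialPruned()
    return score

study = optuna.create_study(direction="maximize",
                            pruner=optuna.pruners.HyperbandPruner())
study.optimize(objective, n_trials=30)
print(study.best_params)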
Practical Considerations
1. Search Space Design
Continuous Parameters:
- Use a log scale for learning rates: [1e-5, 1e-1] (see the sketch after this subsection)
- Use a linear scale for dropout: [0.0, 0.5]
Discrete Parameters:
- Powers of 2 for batch sizes: {16, 32, 64, 128}
- Reasonable ranges for layer sizes
Categorical Parameters:
- Include relevant options only
- Consider conditional dependencies
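A small sketch of why the log scale matters: sampling uniformly on a log scale (for example with scipy.stats.loguniform) spreads trials across orders of magnitude, while uniform sampling over [1e-5, 1e-1] puts almost everything near the upper end:
from scipy.stats import loguniform, uniform

log_samples = loguniform(1e-5, 1e-1).rvs(size=1000, random_state=0)
lin_samples = uniform(loc=1e-5, scale=1e-1 - 1e-5).rvs(size=1000, random_state=1)

# Fraction of samples below 1e-3: roughly half on the log scale,
# but only about 1% on the linear scale
print((log_samples < 1e-3).mean(), (lin_samples < 1e-3).mean())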
2. Budget Allocation
Time Constraints:
- Set maximum wall-clock time
- Use early stopping for poor configurations
- Parallelize when possible
Computational Resources:
- Balance number of configurations vs training time
- Use cheaper proxies when possible
- Consider cloud computing for large searches
3. Reproducibility
Random Seeds:
- Fix seeds for data splitting
- Use different seeds for model initialization
- Document all random sources
Logging:
- Record all hyperparameter configurations
- Save intermediate results
- Track computational resources used
Common Pitfalls
1. Data Leakage
Problem: Using test data for hyperparameter selection
Solution: Proper train/validation/test splits
2. Overfitting to Validation Set
Problem: Too many hyperparameter evaluations on the same validation set
Solution: Nested cross-validation or a held-out test set
3. Ignoring Computational Cost
Problem: Optimizing only for performance, not efficiency
Solution: Multi-objective optimization considering speed/memory
4. Poor Search Space
Problem: Too narrow or too wide parameter ranges
Solution: Start with literature values, expand based on results
Evaluation Metrics
Classification
- Accuracy: Overall correctness
- F1-Score: Balance of precision and recall
- AUC-ROC: Ranking quality
- Log-Loss: Probability calibration
Regression
- MSE: Mean squared error
- MAE: Mean absolute error
- R²: Explained variance
- MAPE: Mean absolute percentage error
Considerations
- Choose a metric aligned with the business objective (the sketch after this list shows how to plug one into a search)
- Consider class imbalance for classification
- Use multiple metrics for comprehensive evaluation
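A minimal sketch of plugging a specific metric into a search: passing scoring="f1" to GridSearchCV selects hyperparameters by F1 instead of accuracy, which matters on the deliberately imbalanced toy dataset below (the model and grid are placeholders):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Imbalanced toy data: about 90% of samples in one class
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
param_grid = {"C": [0.01, 0.1, 1, 10]}

# Hyperparameters are ranked by F1 rather than accuracy
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)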
Tools and Libraries
Python Libraries
- Scikit-learn: GridSearchCV, RandomizedSearchCV
- Optuna: Bayesian optimization framework
- Hyperopt: Tree-structured Parzen Estimators
- Ray Tune: Scalable hyperparameter tuning
Cloud Services
- Google Cloud AI Platform: Managed hyperparameter tuning
- AWS SageMaker: Automatic model tuning
- Azure Machine Learning: HyperDrive
Specialized Tools
- Weights & Biases: Experiment tracking and sweeps
- MLflow: ML lifecycle management
- Neptune: Experiment management
Case Study: Neural Network Tuning
Problem Setup
- Binary classification task
- Neural network with 2 hidden layers
- 1000 training samples, 200 validation samples
Hyperparameters to Tune
- Learning rate: [1e-4, 1e-1] (log scale)
- Batch size: {16, 32, 64, 128}
- Hidden units: [32, 512] (log scale)
- Dropout rate: [0.0, 0.5]
- Optimizer: {Adam, SGD, RMSprop}
Results
- Grid Search: 5×4×5×5×3 = 1500 configurations
- Random Search: 100 random samples
- Bayesian Optimization: 50 evaluations
Findings:
- Random search reached 95% of grid search's best performance in 1/15 of the time
- Bayesian optimization achieved the best overall performance in 1/30 of the time
- Learning rate and hidden units were most important parameters
Further Reading
Foundational Papers
- "Random Search for Hyper-Parameter Optimization" (Bergstra & Bengio, 2012)
- "Practical Bayesian Optimization of Machine Learning Algorithms" (Snoek et al., 2012)
- "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization" (Li et al., 2017)
Advanced Topics
- Multi-objective hyperparameter optimization
- Transfer learning for hyperparameter tuning
- Neural architecture search
- Automated machine learning (AutoML)
Practical Resources
- Hyperparameter tuning best practices
- Case studies in different domains
- Benchmarking different optimization methods
- Cost-effective tuning strategies
Hyperparameter tuning is both an art and a science, requiring careful consideration of the problem domain, available resources, and optimization objectives. Modern automated approaches can significantly improve model performance while reducing the manual effort required.