What Is Bayesian Hyperparameter Optimization? With Tutorial.
All your burning questions about Bayesian hyperparameter optimization answered, with a tutorial.
Weights & Biases recently launched Sweeps, a sophisticated way to optimize your model hyperparameters. And since then the team has been getting a lot of questions about Bayesian hyperparameter optimization:
- Is it faster than random search?
- When should I use it?
- How does it work exactly?
In this post, I’ll try to answer some of your most pressing questions about Bayesian hyperparameter optimization. I’ll also show you how to run a Bayesian search in 3 steps using Weights & Biases.
What I'll Be Covering
- What Are Hyperparameters?
- What Is Grid Search?
- What Is Random Search?
- What Is Bayesian Hyperparameter Optimization?
- Bayesian Hyperparameter Tuning: Nuts & Bolts
- Expected Improvement
- Surrogate Model
- The Pitfalls Of Bayesian Hyperparameter Optimization
- Bayesian Hyperparameter Optimization In Action
- Bonus: Create Sweeps At Blazing Speeds
You can find the accompanying code here and see a comparison of Bayesian, grid and random sweeps here.

Before diving in, let's answer the fundamental question for those newer to the space.
What Are Hyperparameters?
Hyperparameters in a machine learning model are the knobs you turn to optimize its performance. Given a specific task, it’s tricky to find the right hyperparameter combination for a machine learning model.
What’s even more troublesome is that machine learning models are very sensitive to their hyperparameter configurations: the performance you get with one configuration can look very different once that configuration changes.
To address these problems, we resort to hyperparameter tuning. The general process of this is roughly:
- Select the hyperparameters to be tuned (there can be several hyperparameters in a machine learning model).
- Specify a grid of acceptable values for the specified hyperparameters or specify distributions that would generate the acceptable values.
- Train a machine learning model for each of the different hyperparameter configurations resulting from the above two steps.
- Select the model that performs the best from the pool of many models.
Although there are many niche techniques that help us tune hyperparameters effectively, the most predominant ones are:
- Grid search
- Random search
What Is Grid Search?
In grid search, we start by defining a grid containing the hyperparameters along with the lists of acceptable values we want the search process to try. Following is a sample hyperparameter grid for a Logistic Regression model:
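For instance, with scikit-learn such a grid might look like the sketch below (the specific hyperparameters and values are illustrative assumptions, not a recommendation):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    'penalty': ['l1', 'l2'],                            # 2 values
    'C': [0.1, 1.0, 10.0],                              # 3 values
    'class_weight': [None, 'balanced', {0: 1, 1: 2}],   # 3 values
    'max_iter': [100, 200, 500, 1000],                  # 4 values
}
search = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=3)
# search.fit(X_train, y_train) would train one model per combination in the grid.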

In this case, the search process would end up training a total of 2x3x3x4 = 72 different Logistic Regression models.
What Is Random Search?
In random search, instead of defining a grid of hyperparameter values, we specify distributions from which acceptable values for the specified hyperparameters can be sampled. The main difference from grid search is that rather than trying every possible combination of hyperparameters, each combination is sampled randomly, with the hyperparameter values drawn from the distributions we specify at the beginning of the random search process.
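As a small sketch of what that looks like in practice, here is one way to do it with scikit-learn's RandomizedSearchCV (the distributions below are illustrative assumptions):
from scipy.stats import loguniform, randint
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'C': loguniform(1e-3, 1e2),        # continuous values sampled on a log scale
    'max_iter': randint(100, 1000),    # integers sampled uniformly
    'penalty': ['l1', 'l2'],           # discrete choices sampled uniformly
}
search = RandomizedSearchCV(LogisticRegression(solver='liblinear'),
                            param_distributions, n_iter=20, cv=3)
# Each of the 20 iterations samples one hyperparameter combination from these distributions.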
In the general hyperparameter optimization framework we just discussed, were you able to spot anything that could be improved?
The different hyperparameter configurations that we specify are all independent of each other. Could those configurations be made a bit more informed? What if we could guide the hyperparameter search process using past results (in this case, the results from running different hyperparameter configurations)? Would we be able to make the tuning process better? Let’s find out!
To begin our journey, let's answer the question for clarity:
What Is Bayesian Hyperparameter Optimization?
Bayesian hyperparameter optimization is a technique for finding the best settings for the "knobs" of your machine learning model – the hyperparameters – that control its performance. Unlike traditional methods like those noted above, which try every possible combination blindly, Bayesian optimization uses a smart and efficient approach to guide its search, informed by previous evaluations.
Bayesian Hyperparameter Tuning: Nuts & Bolts
When it comes to using Bayesian principles in hyperparameter tuning the following steps are generally followed:
- Pick a combination of hyperparameter values (our belief) and train the machine learning model.
- Get the evidence (i.e., the score of the model)
- Update our belief based on that evidence, so that the next hyperparameter combination we pick is more likely to improve the model.
- Terminate when a stopping criterion is met (generally when a loss quantity is minimized or classification accuracy is maximized). A minimal sketch of this loop is shown below.
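Here is that loop as a bare-bones, illustrative skeleton. The helpers train_and_score and suggest_next_config are hypothetical placeholders (not a real API), and the search space is made up:
import random

search_space = {'learning_rate': [1e-4, 1e-3, 1e-2], 'batch_size': [32, 64, 128]}
history = []  # (hyperparameter combination, score) pairs - our accumulated evidence

def train_and_score(config):
    # Placeholder: train the model with `config` and return its validation score.
    return random.random()

def suggest_next_config(history, search_space):
    # Placeholder: a real implementation would fit a surrogate model to `history`
    # and return the combination that maximizes an acquisition function.
    return {name: random.choice(values) for name, values in search_space.items()}

for step in range(20):                                    # stopping criterion: a fixed budget
    config = suggest_next_config(history, search_space)   # pick a combination (our belief)
    score = train_and_score(config)                       # gather the evidence
    history.append((config, score))                       # update our belief
best_config, best_score = max(history, key=lambda pair: pair[1])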
To understand this in a bit more detail, let’s introduce the 250+ year old Bayes’ rule. It helps us predict Y given X; what is the probability of John playing soccer tomorrow provided it rains today, for example. So, Y here is the event that John plays soccer tomorrow, and X denotes that it rained today. In probabilistic literature, you can denote it like so -
P(Y | X)
It reads as: what is the probability of Y given X? Now, this has an RHS as well:
P(Y | X) = P(X | Y) × P(Y) / P(X)
where,
- P(X) is the probability of observing this new evidence.
- P(X|Y) is the probability of observing the new evidence X given the event Y that we care about.
- P(Y) is the initial hypothesis about the event Y that we care about (treat it like an initial belief about the event Y).
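To make this concrete with some made-up numbers: if our prior belief is P(Y) = 0.6, the likelihood is P(X | Y) = 0.3, and the overall probability of the evidence is P(X) = 0.4, then Bayes’ rule gives P(Y | X) = (0.3 × 0.6) / 0.4 = 0.45 - observing the rain lowers our belief that John will play from 0.6 to 0.45.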
So, if we try to apply the same principle to hyperparameter tuning, the following would be the equation for it -
P(metric | hyperparameter combination) = P(hyperparameter combination | metric) × P(metric) / P(hyperparameter combination)
So, now, each of the above terms would be:
- P(metric | hyperparameter combination) gives the probability of the given metric being minimized/maximized given the combination of hyperparameter values.
- P(hyperparameter combination | metric) is the probability of a certain hyperparameter combination if the given metric is minimized/maximized.
- P(metric) is our initial belief about the metric (the prior probability of observing that metric value).
- P(hyperparameter combination) is the probability of getting that particular hyperparameter combination.
Now, as the different hyperparameter configurations are explored, each one takes advantage of the results from the previous ones, and with each passing run the machine learning model gets trained with a better hyperparameter combination. A question that might still be bugging you is: which part of the hyperparameter space should we explore next in order to improve the most? Let’s find that out.
The Idea Of Informed Search In Bayesian Hyperparameter Tuning
Take a look at the following figure, which shows how the training accuracy of a certain neural network varies with the number of epochs -

We can clearly see that we get better results with more epochs. So, the question now is: how can we leverage this knowledge in the hyperparameter optimization process?
Bayesian hyperparameter optimization allows us to do this by building a probabilistic model for the objective function we are trying to minimize/maximize to train our machine learning model.
Examples of such objective functions are not scary - accuracy, root mean squared error, and so on. This probabilistic model is referred to as a surrogate model for the objective function in this case. It is represented as P(metric | hyperparameter combination) or more generically P(y|x).
Experiments show that this surrogate model is much easier to optimize than the objective function. It’s important to remember that the next set of hyperparameters in the hyperparameter optimization process is chosen to perform the best on the surrogate function. That set is then further evaluated using the actual objective function. An important catch here is -
The lower the number of calls to the objective function, the more efficient the hyperparameter search will be. This is because, as the dimensionality of the data increases, evaluating the objective function becomes more computationally costly. Having the surrogate model helps us reduce the number of calls to the actual objective function by choosing a promising set of hyperparameters.
This is exactly why Bayesian hyperparameter optimization is preferable when the hyperparameter tuning task includes a lot of different combinations.
Let’s go back to the way we humans approach things in general - we first build an initial model (also called a prior) of the world we are about to step in, and based on our experiences of interactions (evidence) we update that model (the updated model is called posterior). Now, let’s replicate this to tune hyperparameters!
We start with an initial estimate of the hyperparameters and we update it gradually based on the past results. Consider the following to be a representation of that initial estimate -

The black line is the initial estimate made by the surrogate model, and the red line represents the true values of the objective function. As we proceed, the surrogate model manages to mimic the true objective function much more closely:

The gray areas represent the uncertainty of the surrogate model, which is defined by a standard deviation and a mean.
To select the next set of hyperparameters in this process, we need to devise a way to return a probabilistic score of that set of hyperparameters. The better the score, the more likely the set of hyperparameters is to be selected. This is generally done via expected improvement. Other methods include probability of improvement, lower confidence bound, and so on.
Expected Improvement
Expected improvement works by introducing a threshold value for the objective function, and we are tasked with finding a set of hyperparameters that beats that threshold. So, mathematically it would be -
EI_y^*(x) = ∫ max(y - y^*, 0) p(y | x) dy
where -
- y^* is a threshold value for the objective function
- y is the actual value of the objective function
- p(y|x) is the surrogate model
The above equation enforces the following: if, for a certain x (a combination of hyperparameters), p(y|x) suggests that y^* > y, then that hyperparameter combination will not yield a better score than the threshold; but if it’s the opposite, then it’s worth pursuing that combination of hyperparameters.
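To make this a bit more tangible, here is a small sketch (an assumption for illustration, not the article's code) of the closed-form expected improvement when the surrogate's prediction at x is Gaussian with mean mu and standard deviation sigma, using the "higher metric is better" convention of this post. A common choice for the threshold y^* is the best metric value observed so far:
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_star):
    # EI of a Gaussian prediction N(mu, sigma^2) over the threshold y_star,
    # in the "larger is better" setting.
    sigma = np.maximum(sigma, 1e-12)   # guard against a zero standard deviation
    z = (mu - y_star) / sigma
    return (mu - y_star) * norm.cdf(z) + sigma * norm.pdf(z)

# Example: the surrogate predicts accuracy 0.93 +/- 0.02 for some candidate,
# and the best accuracy seen so far is 0.92.
print(expected_improvement(mu=0.93, sigma=0.02, y_star=0.92))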
One last piece of the puzzle remains to be added, however - the surrogate model.
Surrogate Model
The task of constructing a surrogate model is generally modeled as a regression problem where we feed the data as input (with a set of hyperparameters) and it returns an approximation of the objective function parameterized by a mean and a standard deviation. The common choices for surrogate models are:
- Gaussian Process Regression
- Tree-structured Parzen Estimator
Let’s talk about the first one briefly.
The Gaussian Process works by constructing a joint probability distribution over the inputs (the hyperparameter settings) and the corresponding values of the objective function. In that way, with enough iterations, it is able to capture an effective estimate of the objective function.
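To see the surrogate in action, here is a self-contained toy sketch - not the article's code, and the 1-D objective is made up - that combines a Gaussian Process surrogate from scikit-learn with the expected-improvement rule above:
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Stand-in for an expensive metric, e.g. validation accuracy vs. a hyperparameter.
    return np.sin(3 * x) + 0.5 * x

X = np.array([[0.2], [1.5], [2.8]])                 # a few initial evaluations
y = objective(X).ravel()
candidates = np.linspace(0.0, 3.0, 200).reshape(-1, 1)

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)                                     # fit the surrogate to past results
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - y.max()) / sigma
    ei = (mu - y.max()) * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
    x_next = candidates[np.argmax(ei)].reshape(1, -1)         # most promising candidate
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

print("best x:", X[np.argmax(y)].item(), "best metric:", y.max())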
If all of the above seemed a bit heavy, you can simply stick to the idea that the main objective of Bayesian reasoning is to become “less wrong” with more data.
The Pitfalls Of Bayesian Hyperparameter Optimization
Even though Bayesian hyperparameter optimization makes the most sense compared to the other approaches to hyperparameter tuning, it has downsides:
- The Bayesian search process is sequential in nature, so it’s extremely hard to parallelize, which might be necessary to scale.
- Defining a well-suited surrogate model can be challenging, and the surrogate has hyperparameters of its own.
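That said, you rarely have to build the surrogate yourself. Libraries such as Hyperopt ship the Tree-structured Parzen Estimator mentioned above, and W&B Sweeps (used below) runs the Bayesian search for you. A minimal Hyperopt sketch, assuming the library is installed and reusing the search space we will sweep over later, might look like this:
from hyperopt import STATUS_OK, fmin, hp, tpe

space = {
    'layers': hp.choice('layers', [32, 64, 96, 128, 256]),
    'batch_size': hp.choice('batch_size', [32, 64, 96, 128]),
    'epochs': hp.choice('epochs', [5, 10, 15]),
}

def objective(params):
    # Train your model with `params` here and return the quantity to minimize,
    # e.g. 1 - validation accuracy. The constant below is just a placeholder.
    val_accuracy = 0.0
    return {'loss': 1 - val_accuracy, 'status': STATUS_OK}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=30)
print(best)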
With that, let's see
Bayesian Hyperparameter Optimization In Action
I have made the code snippets shown in this section available as a Colab notebook here (no setup is required to run it).
Before diving into the code that deals with Bayesian hyperparameter optimization, let’s put together the components we need.
We will be using the Keras library for our experiments (more specifically tf.keras with a TensorFlow 2.0 environment). We will use the FashionMNIST dataset and a shallow convolutional neural network as our machine learning model. Our humble model is defined using -
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    GlobalAveragePooling2D(),
    Dense(config.layers, activation=tf.nn.relu),
    Dense(10, activation='softmax')
])
The images of the dataset come as 28x28 pixels, but we need to reshape them to 28x28x1 to make them work with the Conv2D layer of Keras. Proceeding further, here is what our hyperparameter search configuration looks like -
sweep_config = {
    'method': method,
    'metric': {
        'name': 'accuracy',
        'goal': 'maximize'
    },
    'parameters': {
        'layers': {'values': [32, 64, 96, 128, 256]},
        'batch_size': {'values': [32, 64, 96, 128]},
        'epochs': {'values': [5, 10, 15]}
    }
}
We'll run the same experiments with three different hyperparameter optimization methods: a thorough search (grid), a random sampling approach (random), and a smart guesser (Bayesian). Our goal is to maximize our objective function (metric) by fine-tuning key parameters in our model: the number of units in the dense layer, the batch size, and the number of training cycles (epochs).
W&B makes tuning hyperparameters a breeze with its Sweeps feature. It's like a guided search engine for your model's best settings. Once you've prepared the data, defined your model, and set up the search parameters, W&B takes care of the rest with just a few clicks.
sweep_id = wandb.sweep(sweep_config, project='project-name')
wandb.agent(sweep_id, function=train)
The train function is responsible for training the model with the specified hyperparameters.
def train():
    # Prepare data tuples
    (X_train, y_train) = train_images, train_labels
    (X_test, y_test) = test_images, test_labels

    # Default values for hyperparameters we're going to sweep over
    configs = {
        'layers': 128,
        'batch_size': 64,
        'epochs': 5,
        'method': METHOD
    }

    # Initialize a new wandb run
    wandb.init(project='hyperparameter-sweeps-comparison', config=configs)

    # Config is a variable that holds and saves hyperparameters and inputs
    config = wandb.config

    # Define the model
    model = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation='relu'),
        GlobalAveragePooling2D(),
        Dense(config.layers, activation=tf.nn.relu),
        Dense(10, activation='softmax')
    ])

    # Compile the model
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    # Train the model
    model.fit(X_train, y_train,
              epochs=config.epochs,
              batch_size=config.batch_size,
              validation_data=(X_test, y_test),
              callbacks=[WandbCallback(data_type="image",
                                       validation_data=(X_test, y_test),
                                       labels=labels)])
With each run the hyperparameter combination is updated and is made available via the config argument of wandb. Diving deep into the details of hyperparameter sweeps using W&B is out of the scope of this article, but if you are interested, I recommend the following articles -
As mentioned earlier, we will be running the sweeps using three different methods - grid search, random search and Bayesian search. Let’s do a battle of the three!
The beauty of using W&B is that it creates separate Sweeps pages each time you kickstart a new sweep. The runs under the grid search experiment are available here to visualize. The major plot to notice is the following:

Before I proceed any further, let me show you plots I got from random search and Bayesian search -


From the above three plots, the most important thing to notice is the number of runs each method took to reach the maximum val_accuracy. A different combination of hyperparameters is referred to as a different run here. Clearly, the Bayesian search takes the fewest runs to get to our desired result.
An important note here: for both random search and Bayesian search, you need to manually terminate the search process once you get to the desired result. You can easily do that by going to the respective Sweeps page while the sweep is running and navigating to Sweep Controls -

Upon clicking that you will get a page like this and from there, you can have complete control over your sweeps -


Now, on the respective workspace, you will have a collective overview of the first ten runs that came from running the different methods -

Another point in favor of Bayesian hyperparameter optimization is that if we group all the important metrics like loss, val_loss, accuracy, and val_accuracy, we clearly see the supremacy of the Bayesian search process.

Bonus: Create Sweeps At Blazing Speeds
On your Sweeps Overview page (a sample), you will find a Create a sweep button, and when clicked, it will show something like this:

It lets you take an existing W&B Sweeps project with some runs and quickly generate a new sweep based on your configuration variables.
Now, what I would do is -
- Download the default configuration file (sweep.yaml) (you can configure your own).
- Package the train function in a Python script and name it as train.py. You can check out the script here.
- Follow the instructions from the above snapshot.
And voila! Your sweep should be up and running -

So, how Bayesian are you?
That’s it for this article. If the topic of hyperparameter optimization interests you, you should check out James Bergstra’s works -
I hope you got a good introduction to the Bayesian search process for tuning hyperparameters for machine learning models. I cannot wait to see how well Bayesian search works for your projects. Tell me about them in the comments below!