
EML: Energy-Based learning

Notes about the development and integration of Energy-Based surrogate models.
Created on August 19 | Last edited on October 21

Introduction

The main limitation of the EMLOpt algorithm developed so far is that its surrogate model does not predict high uncertainty in regions far from the dataset, which prevents the search from exploring unsampled regions.
To fix this behaviour we embedded the minimum L1 distance from the dataset in the MILP objective function, but this leads to extremely high solver times as the number of samples and dimensions grows.
Although we developed several workarounds to improve solver performance, this work aims to find a variance model that can handle sample distance without adding constraints to the MILP model.



While the probabilistic regressor can make accurate predictions only in regions covered by the dataset, models like the Gaussian Process are designed to increase the uncertainty in regions far from the samples. However, the GP is based on kernels that are hard to embed in a MILP model due to their many nonlinearities.

This work documents a feasibility study about the employment of an Energy Based regression model as a surrogate model.

Sources:

In particular, the code from the toy problem was rewritten in TensorFlow and refactored to ease the embedding of the NN in the MILP model.



Energy Based Learning

Energy-Based Models capture dependencies between variables by associating a scalar energy with each configuration of the variables. Low energy is assigned to configurations that are likely to occur.

Inference consists of fixing the values of the observed variables (X) and finding the values of the remaining ones (Y) that minimize the energy via an optimization process (usually gradient descent).
Since with EML we already have to solve an optimization problem, inference comes for free.

Because energies are uncalibrated, the Gibbs distribution is used to turn the energy scores into a normalized probability distribution:

$p(y \mid x) = \frac{e^{-E(x,y)}}{\int_{Y} e^{-E(x,y')} \, dy'}$

When the cardinality of Y is high, it is infeasible to compute the partition function in the denominator, thus Monte Carlo sampling is used to approximate the integral evaluation.
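To make the normalization concrete, here is a small NumPy sketch of the Gibbs distribution with a Monte Carlo estimate of the partition function. The toy energy function and all names are illustrative assumptions, not the model from these notes:

```python
import numpy as np

def energy(x, y):
    # Toy energy for illustration only (an assumption, not the real NN):
    # low energy near y = sin(x), i.e. that curve is the "likely" region.
    return (y - np.sin(x)) ** 2

def gibbs_density(x, y, y_samples):
    """Approximate p(y|x) = exp(-E(x,y)) / Z(x).

    The partition function Z(x) = integral of exp(-E(x,y')) dy' is
    estimated with uniform Monte Carlo over the y_samples range:
    Z(x) ~= (hi - lo) * mean(exp(-E(x, y_samples))).
    """
    lo, hi = y_samples.min(), y_samples.max()
    z = (hi - lo) * np.exp(-energy(x, y_samples)).mean()
    return np.exp(-energy(x, y)) / z
```

With a dense sample grid the estimated density peaks where the energy is lowest, which is the behaviour the rest of this section relies on.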

While direct regression models (e.g. MSE regression) predict the mean of a fixed-variance Gaussian distribution, probabilistic regression models predict the full set of parameters of a predetermined distribution. EBMs, instead, can learn any distribution without such a prior:


Energy model vs Probabilistic model

The paper employs Monte Carlo importance sampling as a contrastive term that prevents the energy score from being low everywhere (model collapse). This is useful for our requirement of high uncertainty (high energy score) in regions far from the samples.

The only hyperparameter required (besides the usual ones for training the NN) is the variance of these Gaussian distributions.


Loss used in the paper. Sampling employs a mixture of M Gaussians (q) with fixed variance and means equal to the true sample values.
The "holes" in the samples correspond to regions with high uncertainty (high energy). The probabilistic regressor would instead just output a constant variance, bridging the gap.
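The importance-sampled loss described above can be sketched as follows. For simplicity the mixture proposal is reduced to a single Gaussian per training sample, and `energy_fn` is a stand-in for the NN; both simplifications, and all names, are assumptions rather than the exact code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def importance_sampling_nll(energy_fn, x, y_true, sigma=0.1, m=128):
    """NLL of one sample under p(y|x) = exp(-E(x,y)) / Z(x).

    NLL = E(x, y_true) + log Z(x), where log Z(x) is approximated by
    importance sampling: log( (1/M) sum_m exp(-E(x, y_m)) / q(y_m) ),
    with proposal q a Gaussian centred on the true label whose fixed
    variance sigma^2 is the single extra hyperparameter noted above.
    """
    y_m = rng.normal(y_true, sigma, size=m)  # samples from q
    log_q = -0.5 * ((y_m - y_true) / sigma) ** 2 \
            - np.log(sigma * np.sqrt(2.0 * np.pi))
    # Stable log-sum-exp of the importance weights exp(-E) / q
    log_w = -energy_fn(x, y_m) - log_q
    log_z = np.logaddexp.reduce(log_w) - np.log(m)
    return energy_fn(x, y_true) + log_z
```

The contrastive effect is visible in the two terms: the first pushes the energy of the true sample down, while the log-partition term pushes the energy of the proposal samples up, so the score cannot be low everywhere.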

EBM Pros:

  • More expressive than a unimodal probabilistic regressor
  • Handles sample distance in the variance model
  • At inference time an EBM has to solve an optimization problem to find the Y that maximizes the score. Since the EML method already solves an optimization problem, inference comes for free.


EBM Cons:

  • Requires upper and lower bounds for the Y values (not a big deal)
  • Higher training time (can be mitigated by training on a GPU)
  • More difficult to train (the loss can become NaN due to the denominator)


MILP embedding

The NN employed is the usual MLP with ReLU activations, which allows easy embedding in the MILP model with the EML library.

The main challenge is replicating the Gibbs normalization that turns the raw scores into a probability distribution.




We can choose to ignore the probabilistic interpretation and simply scale the scores into the [0, 1] interval using the upper and lower bounds computed in the bound-propagation phase.

In order to find the lowest and most confident value (exploitation), we design an objective that linearly combines the value (y) and the confidence score (s), as in the UCB acquisition function:


$\min_{x \in X} \ \ y - s$
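Outside the MILP model, this exploitation objective can be sketched in a few lines, assuming the score bounds come from the bound-propagation phase; all names are illustrative:

```python
def exploit_objective(y, s, s_lo, s_hi):
    """Exploitation objective y - s_norm, where the raw NN score s is
    min-max scaled into [0, 1] using the (s_lo, s_hi) bounds produced
    by bound propagation, so value and confidence are comparable."""
    s_norm = (s - s_lo) / (s_hi - s_lo)
    return y - s_norm

# Illustrative usage: among (value, raw score) candidates, pick the one
# that minimizes the combined objective.
candidates = [(0.9, 2.0), (1.0, 9.0), (1.2, 10.0)]
best = min(candidates, key=lambda c: exploit_objective(c[0], c[1], 0.0, 10.0))
```

In the MILP model the same expression would appear directly as the objective function, with y and s being decision variables tied to the embedded NN.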

However, we obtain poor performance, mainly because the partition function is necessary to obtain "sharp" predictions:


Comparison between the raw scores scaled into the [0, 1] interval and the probability distribution obtained with the Gibbs normalization.

In order to find the configuration with the lowest confidence (exploration), a solution could be to solve the optimization problem:


$\min_{x \in X} \ \max_{y \in Y} \ s$

where s is the raw score of the NN.
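As a sanity check independent of the MILP embedding, the min-max objective can be illustrated with a brute-force grid version (not the MILP formulation; the score function here is a stand-in for the NN):

```python
import numpy as np

def explore_min_max(score_fn, xs, ys):
    """Brute-force illustration of min_x max_y s(x, y): for each candidate
    x take the highest score over the y grid, then return the x whose
    highest score is lowest, i.e. the input the model is least confident
    about anywhere along Y."""
    inner = np.array([max(score_fn(x, y) for y in ys) for x in xs])
    return xs[int(np.argmin(inner))]
```

This enumeration obviously does not scale, which is why the feasibility of the same min over max structure inside a MILP solver still has to be assessed.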

However, further studies are necessary to determine the feasibility of this formulation in a MILP solver.

Another solution could be to implement the sampling strategy in MILP, but it would require embedding the NN multiple times (once for each sample along Y) and could be problematic to scale up.



Conclusions

The implementation of the work done so far is an ugly notebook: link github

The challenge is to design an objective for the MILP model that normalizes the score properly and/or obtains meaningful results (perhaps by exploring other acquisition-function ideas).

At that point it will be possible to scale up, test the algorithm on harder instances, and compare it with the old method.