EML: Energy-Based Learning
Introduction
While the probabilistic regressor can make accurate predictions only in regions covered by the dataset, models like the Gaussian Process are designed to increase the uncertainty in regions far from the samples. However, the GP is based on kernels that are hard to embed in a MILP due to the high number of nonlinearities.
This work documents a feasibility study on employing an Energy-Based regression model as a surrogate model.
Sources:
- A Tutorial on Energy-Based Learning (paper)
- The model implemented is described in the paper: https://arxiv.org/abs/1909.12297
In particular, the code for the toy problem was rewritten in TensorFlow and refactored to ease the embedding of the NN in the MILP model.
Energy Based Learning
Energy-Based Models capture dependencies between variables by associating a scalar energy with each configuration of the variables. Low energies are assigned to configurations that are likely to occur.
Inference consists of fixing the values of the observed variables (X) and finding the values of the remaining variables (Y) that minimize the energy via an optimization process (usually gradient descent).
Since with EML we already have to solve an optimization problem, inference comes for free.
Because energies are uncalibrated, the Gibbs distribution is used to turn the energy scores into a normalized probability distribution:
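The Gibbs distribution referred to above takes the standard form (reconstructed here in the low-energy-is-likely convention, with β an inverse-temperature constant):

```latex
P(Y \mid X) = \frac{e^{-\beta E(Y, X)}}{\int_{y} e^{-\beta E(y, X)} \, dy}
```

The integral in the denominator is the partition function.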

When the cardinality of Y is high, computing the partition function in the denominator is infeasible, so Monte Carlo sampling is used to approximate the integral.
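As a concrete illustration, the importance-sampling estimate of the partition function can be sketched in NumPy (the 1-D toy energy and Gaussian proposal below are assumptions of this sketch, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(y):
    # Toy 1-D energy with a single minimum at y = 1 (for illustration only).
    return (y - 1.0) ** 2

# Importance sampling: Z = integral exp(-E(y)) dy
#                        = E_q[ exp(-E(y)) / q(y) ],  y ~ q.
# Proposal q(y) = N(0, 2^2).
mu, sigma = 0.0, 2.0
ys = rng.normal(mu, sigma, size=100_000)
q = np.exp(-0.5 * ((ys - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
Z_est = np.mean(np.exp(-energy(ys)) / q)

# The exact value here is sqrt(pi) ~ 1.77, so the estimate can be checked.
print(Z_est)
```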
While direct regression models (e.g., MSE regression) predict the mean of a fixed-variance Gaussian distribution, probabilistic regression models predict the full set of parameters of a predetermined distribution. EBMs, instead, can learn any distribution without such a prior:

The paper employs Monte Carlo importance sampling as a contrastive term to prevent the energy score from being low everywhere (model collapse). This can be useful for our need to have high uncertainty (a high energy score) in regions far from the samples.
The only additional hyperparameter required (besides the usual ones for training the NN) is the variance of the Gaussian proposal distributions used for sampling.
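Putting the two previous points together, the Monte Carlo approximation of the negative log-likelihood with a Gaussian proposal centered on the true target can be sketched as follows (the toy energy and single-sigma proposal are simplifying assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(y):
    # Stand-in for the NN energy E(y, x) at a fixed x (toy form).
    return (y - 1.0) ** 2

def mc_nll(y_true, n_samples=1024, sigma=1.0):
    # -log p(y_true | x) = E(y_true) + log Z, with Z estimated by
    # importance sampling from q = N(y_true, sigma^2).
    ys = rng.normal(y_true, sigma, size=n_samples)
    q = np.exp(-0.5 * ((ys - y_true) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    # In practice a log-sum-exp formulation is safer here: the plain
    # log-of-mean can under/overflow and produce the NaN losses noted
    # among the cons.
    log_Z = np.log(np.mean(np.exp(-energy(ys)) / q))
    return energy(y_true) + log_Z

print(mc_nll(1.0))  # close to log(sqrt(pi)) ~ 0.57 for this toy energy
```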


EBM Pros:
- More expressive than a unimodal probabilistic regressor
- Captures the distance from the training samples in its uncertainty estimate
- At inference time an EBM has to solve an optimization problem to find the Y that maximizes the score. Since the EML method already solves an optimization problem, inference comes for free.
EBM Cons:
- Requires upper and lower bounds for the Y values (not a big deal)
- Higher training time (can be mitigated by training on GPU)
- More difficult to train (the loss can become NaN due to the denominator)
MILP embedding
The NN employed is a standard MLP with ReLU activations, which allows an easy embedding in the MILP model via the EML library.
The main challenge is replicating the Gibbs normalization that turns the raw scores into a probability distribution.
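For reference, each ReLU unit h = max(0, wᵀx + b) admits the standard big-M MILP encoding below (the generic textbook formulation; the EML library's exact encoding may differ), where m and M are lower and upper bounds on the pre-activation obtained from bound propagation:

```latex
h \ge w^\top x + b, \qquad h \ge 0, \qquad
h \le w^\top x + b - m\,(1 - \delta), \qquad
h \le M \delta, \qquad \delta \in \{0, 1\}
```

When the binary variable δ is 1 the unit is active and h equals the pre-activation; when δ is 0 the constraints force h = 0.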

We can choose to ignore the probabilistic interpretation and simply scale the scores into the [0, 1] interval using the upper and lower bounds computed in the bound-propagation phase.
To find the lowest and most confident value (exploitation), we design an objective that linearly combines the value (y) and the confidence score (s), as in the UCB acquisition function:
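A minimal sketch of this scaling and linear combination (the candidate values, bounds lb/ub, and trade-off weight λ below are all made up for illustration):

```python
import numpy as np

# Hypothetical candidate points: objective value y and raw NN score s
# (here higher s means more confident).
y = np.array([0.2, 0.5, 0.9])
s = np.array([-3.0, 4.0, 1.0])

# Scale the raw scores into [0, 1] using the bounds from bound propagation
# (lb and ub are assumed known).
lb, ub = -5.0, 5.0
s_hat = (s - lb) / (ub - lb)

# UCB-like exploitation objective: prefer low y and high confidence.
lam = 0.5
obj = y - lam * s_hat
best = int(np.argmin(obj))
print(best, obj)
```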
However, this yields poor performance, mainly because the partition function is necessary to obtain "sharp" predictions:

To find the configuration with the lowest confidence (exploration), a solution could be to solve the optimization problem:
where s is the raw score output by the NN.
However, further study is needed to determine its feasibility in a MILP solver.
Another solution could be to implement the sampling strategy in MILP, but this would require embedding the NN multiple times (once for each sample along Y) and could be problematic to scale up.
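The multiple-embedding idea amounts to a discretized Gibbs normalization: evaluate the score on a fixed grid of y values (one NN copy per grid point in the MILP) and normalize with a softmax. A NumPy sketch with a toy score function:

```python
import numpy as np

def score(y):
    # Stand-in for the embedded NN's raw score at a fixed x (toy form).
    return -(y - 1.0) ** 2

# Fixed grid of y values; in MILP, each point needs its own copy of
# the embedded NN, which is why this scales poorly.
grid = np.linspace(-5.0, 5.0, 51)
scores = score(grid)

# Softmax over the grid approximates the Gibbs normalization;
# subtracting the max is the usual trick for numerical stability.
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(grid[np.argmax(probs)])  # mode of the discretized distribution
```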
Conclusions
The implementation of the work done so far is an ugly notebook: link github
The challenge is to design an objective for the MILP model that normalizes the score properly and/or yields meaningful results (perhaps by exploring other acquisition-function ideas).
At that point it will be possible to scale up, test the algorithm on a harder instance, and compare it with the old method.