
Enhancing Performance with SciPy Optimize and W&B: A Deep Dive

In this article, we provide a brief overview of the scipy.optimize submodule with interactive visualizations, using Weights & Biases to help us out along the way.
Created on June 12|Last edited on November 18
Examples of various loss landscapes of deep learning models. Source: https://losslandscape.com/gallery/
“Nothing happens in the world without a law of optimization having a hand in it” ― Leonhard Euler
Optimization is a field of applied mathematics that involves the task of finding out the maximum or minimum value that a particular function can attain. These algorithms are used throughout deep learning to train models, with perhaps the most well-known being stochastic mini-batch gradient descent.
In this article, we'll be exploring the SciPy Optimize submodule with interactive visualizations, and we'll be using W&B along the way.

Understanding Optimization Problems

Terminology and Notations

  • Objective function: This is our target function, which we aim to maximize or minimize through the optimization algorithm. Usually denoted by f(x). Common objective functions in machine learning include the least squares error function.
  • Global maxima or minima: This is the maximum or minimum value that the objective function can attain over its entire domain. For a global minimum at \bar{x}, it's mathematically defined to be
\large f(\bar{x}) \leq f(x) \hspace{1em} \forall \, x

  • Local maxima or minima: This is the maximum or minimum value that the objective function obtains in a small neighborhood of the domain. It's mathematically defined to be
\large f(\bar{x}) \leq f(x) \hspace{1em} \forall \, x \in \mathcal{N}, \hspace{1em} \mathcal{N} = \text{Neighborhood}

...where the neighborhood is simply a region around some point in the domain.
There can be multiple local minimizers (or maximizers) of an objective function. There is also the notion of a strict local minimum (or maximum), where the point is strictly less (or greater) than all other nearby points, and an isolated local minimum (or maximum), which is the only local minimum (or maximum) in its neighborhood.
If we use the word "maxima" or "minima" we will be referring to the global maximum or minimum.
💡
The problem statement of optimization is typically of the following form:
\large \min \, f(x) \hspace{2em} x \in C \subseteq \mathbb{R}^n

Throughout this article, we'll consider all functions to be smooth, meaning that their second derivatives exist and are continuous (the mathematician lingo being "sufficiently nice").

Types of Optimization Problems

Optimization problems can be broadly classified into the following categories:
  • Unconstrained Problems: If the domain of the objective function is taken to be all of \mathbb{R}^n, then the problem is unconstrained.
  • Constrained Problems: If the domain of the objective function is taken to be a subset C \subset \mathbb{R}^n, then the problem is constrained.
There's also another way of classifying optimization problems: discrete and continuous optimization.

Key Features of SciPy Optimize

Overview of available algorithms

SciPy's optimize module provides us with a number of algorithms and constraints for solving various optimization problems.
There are a variety of methods available to us, such as newton, secant, Nelder-Mead, CG and BFGS. We'll go into more detail about some of these methods later in the article.
Constraints are broadly classified into two categories in the optimize module: LinearConstraint and NonlinearConstraint.
There is also a special class called Bounds for simple bound constraints.
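As a quick illustration, here's how these constraint objects can be constructed (the specific bounds and coefficients below are arbitrary examples):

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, NonlinearConstraint

# Simple box bounds: 0 <= x0 <= 1 and -2 <= x1 <= 2
bounds = Bounds([0.0, -2.0], [1.0, 2.0])

# Linear constraint: 1 <= x0 + 2*x1 <= 3
linear = LinearConstraint([[1.0, 2.0]], [1.0], [3.0])

# Nonlinear constraint: x0^2 + x1 <= 1
nonlinear = NonlinearConstraint(lambda x: x[0] ** 2 + x[1], -np.inf, 1.0)
```

Any of these can then be passed to scipy.optimize.minimize through its bounds and constraints arguments.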

Introduction to various optimization techniques

The most common strategy to optimize a given objective function is iterative: starting from an initial point, the algorithm generates a sequence of steps, ensuring that the objective value at each new iterate is less than at the previous one. There are two broad classes of methods for deciding where to move from any given step in the process.
  1. Line Search: In this class of methods, the algorithm chooses a direction and searches for the minima along this direction from the current step.
  2. Trust Region: In this class of methods, the information we have about the objective function is used to construct a model function that behaves similarly to the true objective. Since the model is only a good approximation near the current point, we limit the search for the next step to a neighborhood (the trust region) of the current step.
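To make the line-search idea concrete, here is a minimal sketch of a backtracking (Armijo) line search; this illustrates the principle and is not SciPy's actual implementation:

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, direction, alpha=1.0, rho=0.5, c=1e-4):
    """Shrink the step size alpha until the Armijo sufficient-decrease
    condition f(x + alpha*d) <= f(x) + c * alpha * (grad . d) holds."""
    fx = f(x)
    slope = grad_f(x) @ direction  # directional derivative, negative for descent
    while f(x + alpha * direction) > fx + c * alpha * slope:
        alpha *= rho  # step too long: backtrack
    return alpha

# Example: quadratic f(x) = x.x, steepest-descent direction at x = (1, 1)
f = lambda x: x @ x
grad = lambda x: 2 * x
x0 = np.array([1.0, 1.0])
step = backtracking_line_search(f, grad, x0, -grad(x0))
```

Here rho controls how aggressively the step shrinks and c is the usual small sufficient-decrease constant.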

Solving Unconstrained Optimization Problems

Overview of Unconstrained Optimization

Unconstrained optimization problems are of the following general form:
\large \min \, f(x) \hspace{2em} x \in \mathbb{R}^n

The main question in the study of optimization is how to characterize a local minimum. This characterization allows us to develop computational techniques for solving these problems. There are two kinds of conditions that we need:
  • Necessary conditions: conditions that are necessary, but not sufficient, for a point to be a maximum or a minimum
  • Sufficient conditions: conditions that are sufficient for a point to be a maximum or a minimum

Overview of algorithms for unconstrained optimization

There are a number of algorithms available to us for solving unconstrained optimization problems:
  • Nelder-Mead Simplex method: ('nelder-mead')
  • Broyden-Fletcher-Goldfarb-Shanno method: ('BFGS')
  • Newton-Conjugate Gradient method: ('Newton-CG')
There are also some algorithms which rely on solving a trust-region subproblem. Some of them are 'trust-ncg', 'trust-krylov' and 'trust-exact'.
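A minimal sketch of calling these solvers through scipy.optimize.minimize, using an arbitrary convex quadratic as the objective:

```python
import numpy as np
from scipy.optimize import minimize

# Minimize f(x) = (x0 - 1)^2 + (x1 + 2)^2, whose unique minimizer is (1, -2)
f = lambda x: (x[0] - 1) ** 2 + (x[1] + 2) ** 2
grad = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] + 2)])

x0 = np.zeros(2)
res_nm = minimize(f, x0, method="Nelder-Mead")           # derivative-free
res_bfgs = minimize(f, x0, method="BFGS", jac=grad)      # quasi-Newton
res_ncg = minimize(f, x0, method="Newton-CG", jac=grad)  # requires the gradient
```

Nelder-Mead is derivative-free, while BFGS and Newton-CG make use of the gradient (Newton-CG requires it).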

Examples and use cases

Let's look at perhaps one of the most famous optimization algorithms relevant to machine learning: Stochastic Gradient Descent (SGD). It is a first-order iterative optimization method commonly used for training deep learning models. It is quite a simple method, involving the following steps:
  • Select an initial learning rate (commonly denoted as \alpha) and initialize the neural network with some set of parameter values \theta.
  • Update the parameters of the model using the gradient of the objective function J(\theta) computed on some data
\large \theta_{i + 1} = \theta_i - \alpha \nabla_{\theta} J(\theta)

  • Repeat the above step until a local minimum is reached
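The loop above can be sketched in NumPy on a toy linear-regression problem; the data-generating process, learning rate, and batch size here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: y = 3*x + noise
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)

theta = np.zeros(1)  # initialize parameters
alpha = 0.1          # learning rate

for step in range(50):
    # Sample a mini-batch and take a gradient step on its mean squared error
    idx = rng.integers(0, len(X), size=32)
    pred = X[idx] @ theta
    grad = 2 * X[idx].T @ (pred - y[idx]) / len(idx)
    theta = theta - alpha * grad
```

With a batch size of 1 this is classic SGD; larger batches give the mini-batch variant mentioned earlier. The parameter theta should end up close to the true slope of 3.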

Handling Constrained Optimization Problems

Overview of Constrained Optimization

Constrained optimization problems are of the following general form:
\large \min \, f(x) \hspace{2em} x \in C \subset \mathbb{R}^n

\large \text{subject to} \hspace{1em} g_i(x) \leq 0 \hspace{2em} i = 1, 2, \dots, m

\large \hspace{4.5em} h_j(x) = 0 \hspace{2em} j = 1, 2, \dots, k

Those constraints are referred to as the inequality and equality constraints, respectively.

Overview of algorithms for constrained optimization

SciPy Optimize provides us with some algorithms for solving problems of constrained optimization, such as:
  • Trust-Region Constrained Algorithm ('trust-constr'): This algorithm uses the trust-region approach described above, together with bounds and constraints on the variables.
  • Sequential Least Squares Programming ('SLSQP')
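A small sketch of both solvers on an arbitrary toy problem (minimize x0² + x1² subject to x0 + x1 ≥ 1, whose solution is (0.5, 0.5)):

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

# Objective: distance squared from the origin
f = lambda x: x[0] ** 2 + x[1] ** 2

# SLSQP uses a dict-based constraint spec, where "ineq" means fun(x) >= 0
cons = {"type": "ineq", "fun": lambda x: x[0] + x[1] - 1.0}
res_slsqp = minimize(f, [2.0, 0.0], method="SLSQP", constraints=[cons])

# trust-constr takes the NonlinearConstraint class instead: 1 <= x0 + x1 <= inf
nlc = NonlinearConstraint(lambda x: x[0] + x[1], 1.0, np.inf)
res_tc = minimize(f, [2.0, 0.0], method="trust-constr", constraints=[nlc])
```

Both solvers should land on the constrained minimizer (0.5, 0.5), where the constraint is active.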

Global Optimization

It is very difficult to find the global minimum (or maximum) of a function. Most naturally occurring phenomena that can be modeled have a lot of fluctuations and can often "trap" optimizers in local minima (or maxima). In deep learning, it would be pretty great if we were able to find the global minimum in every training run.
But as you can imagine, the loss landscapes of typical deep learning models are highly non-convex and contain lots of local minima. And if we don't choose our hyperparameters, such as the step size or learning rate, correctly, we can get stuck.
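SciPy's optimize module also ships global strategies such as basinhopping, which repeatedly perturbs the current point and re-runs a local minimizer, giving it a chance to escape local minima. A sketch on an arbitrary multimodal 1-D function:

```python
import numpy as np
from scipy.optimize import basinhopping, minimize

# A 1-D function with several local minima
f = lambda x: x[0] ** 2 + 10 * np.sin(x[0])

# A plain local minimizer only finds the basin it starts in
local = minimize(f, [6.0])

# basinhopping hops between basins by perturbing the point and
# re-running the local minimizer, keeping the best result seen
result = basinhopping(f, [6.0], niter=200, stepsize=2.0, seed=1)
```

The best value found by basinhopping can never be worse than the purely local result, since its first step is the same local minimization.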
This is why we get different results if we train our models again. This can mostly be avoided if you set the random seeds correctly each time and keep track of your hyperparameters.
Weights & Biases can help you to do so by logging the configurations. For more reference, refer to the following articles:


Practical Tips and Best Practices for Using SciPy Optimize

Preprocessing and scaling data

Preprocessing and scaling can greatly help optimization algorithms. For instance, in unconstrained optimization, we mostly don't know the value of the function at all points of the domain, and even where we do, computing it would be expensive. If we preprocess and scale the data to a common range, the optimizer can make faster progress.
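As a small illustration, standardizing a feature before fitting keeps the parameters on comparable scales; the data-generating process below is an arbitrary assumption:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# A feature on a large scale (think house sizes in square feet)
x = rng.uniform(500, 4000, size=100)
y = 0.002 * x + 1.0 + 0.05 * rng.normal(size=100)

# Standardize the feature to zero mean and unit variance first
x_std = (x - x.mean()) / x.std()

# Mean squared error of a linear fit on the standardized feature
loss = lambda w: np.mean((w[0] * x_std + w[1] - y) ** 2)
loss_grad = lambda w: np.array([
    2 * np.mean(x_std * (w[0] * x_std + w[1] - y)),
    2 * np.mean(w[0] * x_std + w[1] - y),
])

res = minimize(loss, np.zeros(2), method="BFGS", jac=loss_grad)
```

On the standardized feature, both parameters live on similar scales, so the problem is well conditioned and BFGS converges quickly.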

Handling noisy or uncertain objective functions

If the phenomenon we're modeling is noisy, either due to errors in measurement or the nature of the phenomenon itself, the most common way to optimize such objective functions is two-fold:
  • A shallower step aimed at identifying sub-regions where the minima could be.
  • A much deeper step aimed at finding the local minima of the identified subregions.
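This two-step strategy can be sketched with SciPy's differential_evolution as the shallow global stage and a local minimizer as the deep refinement stage; the 1-D Rastrigin-style function below is an illustrative choice:

```python
import numpy as np
from scipy.optimize import differential_evolution, minimize

# 1-D Rastrigin function: many local minima, global minimum at x = 0
f = lambda x: x[0] ** 2 + 10 - 10 * np.cos(2 * np.pi * x[0])

# Shallow step: a coarse global search identifies the promising sub-region
coarse = differential_evolution(f, bounds=[(-5.0, 5.0)], tol=0.1,
                                polish=False, seed=1)

# Deep step: a local optimizer refines the candidate that was found
refined = minimize(f, coarse.x, method="Nelder-Mead")
```

Setting polish=False keeps the first stage deliberately coarse, so the refinement is done explicitly by the second stage.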

Efficient usage of optimization algorithms and parameter tuning

Optimizing deep learning models is a difficult task, but there are some best practices that lead to optimal performance. We go over some of them below:
  • Choose a well-established optimizer for your particular task. While SGD is the most popular, it is not the most efficient algorithm for many problems; a good default is to stick to the optimizer used in SOTA research for tasks similar to yours.
  • Attend to all hyperparameters of the optimizer. Yes, all of them, no matter how minuscule they might seem: tuning every hyperparameter of the optimizer is what yields the best results.

Performance Evaluation and Comparison using W&B

Metrics for evaluation of various optimization algorithms

There are a number of ways in which one could compare various optimization algorithms:
  • Number of steps needed to reach the minima
  • Number of times it was needed to evaluate the function
  • Time taken to optimize
  • Memory used
In the example below, we'll use the number of iterations and the number of function evaluations.

Comparison with other libraries and frameworks

SciPy optimize is one of many libraries available to us for optimization. Some other popular libraries include:
  • pymanopt: Python toolbox for optimization on Riemannian manifolds with support for automatic differentiation
  • optax: a gradient processing and optimization library for JAX, maintained by DeepMind.
  • jaxopt: Hardware accelerated, batchable and differentiable optimizers in JAX maintained by a team in Google Research.
  • geoopt: Riemannian Adaptive Optimization Methods with PyTorch optim

Benchmarks and case studies

We can use Weights & Biases' Tables to benchmark and compare various optimization algorithms. We'll benchmark and compare some algorithms using the Rosenbrock function of N variables as described in the SciPy documentation.
\large f(x) = \sum_{i=1}^{N-1} 100 (x_{i+1} - x_i^2)^2 + (1 - x_i)^2

It is known that the minimum value of this function is obtained when x_i = 1 for all i. In the below table, we can see how various algorithms perform on this objective function:

[Embedded W&B Table: Run set]
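A sketch of how such a comparison can be produced: each solver runs on the Rosenbrock function (scipy.optimize ships rosen and its gradient rosen_der), and the resulting rows could then be logged, e.g. via wandb.Table:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Starting point used in the SciPy tutorial for the 5-variable Rosenbrock
x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

rows = []
for method in ["Nelder-Mead", "BFGS", "Newton-CG"]:
    # Gradient-based methods get the analytic derivative rosen_der
    jac = rosen_der if method != "Nelder-Mead" else None
    res = minimize(rosen, x0, method=method, jac=jac)
    rows.append((method, res.nit, res.nfev, res.fun))

# Each row (method, iterations, function evaluations, final value) could be
# logged to W&B, e.g. wandb.Table(columns=["method", "nit", "nfev", "fun"],
# data=[list(r) for r in rows])
```

All three methods should drive the objective essentially to zero, but with very different iteration and evaluation counts.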


Conclusion

In this article, we gave a brief overview of SciPy optimize and showed how using Weights & Biases to monitor your metrics can lead to valuable insights.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and from-scratch code implementations, let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.

Iterate on AI agents and models faster. Try Weights & Biases today.