
Enhancing Performance with SciPy Optimize and W&B: A Deep Dive

In this article, we provide a brief overview of the scipy.optimize submodule with interactive visualizations, using Weights & Biases to help us out along the way.
Created on June 12|Last edited on November 18
Examples of various loss landscapes of deep learning models. Source: https://losslandscape.com/gallery/
“Nothing happens in the world without a law of optimization having a hand in it” ― Leonhard Euler
Optimization is a field of applied mathematics that involves the task of finding out the maximum or minimum value that a particular function can attain. These algorithms are used throughout deep learning to train models, with perhaps the most well-known being stochastic mini-batch gradient descent.
In this article, we'll be exploring the SciPy Optimize submodule with interactive visualizations, and we'll be using W&B along the way.

Understanding Optimization Problems

Terminology and Notations

  • Objective function: This is our target function, which we aim to maximize or minimize through the optimization algorithm. Usually denoted by f(x). Common objective functions in machine learning include the least squares error function.
  • Global maxima or minima: This is the maximum or minimum value that the objective function can attain over its entire domain. For a global minimum at \bar{x}, it's mathematically defined to be
\large f(\bar{x}) \leq f(x) \hspace{1em} \forall \, x

  • Local maxima or minima: This is the maximum or minimum value that the objective function obtains in a small neighborhood of the domain. It's mathematically defined to be
\large f(\bar{x}) \leq f(x) \hspace{1em} \forall \, x \in \mathcal{N}, \hspace{1em} \mathcal{N} = \text{Neighborhood}

...where the neighborhood is simply a region around some point in the domain.
There can be multiple local minimizers (or maximizers) of an objective function. There is also the notion of a strict local minimum (or maximum), where the point is strictly less (or greater) than all other nearby points, and an isolated local minimum (or maximum), which is the only local minimum (or maximum) in its neighborhood.
If we use the word "maxima" or "minima" we will be referring to the global maximum or minimum.
💡
The problem statement of optimization is typically of the following form:
\large \min \, f(x) \hspace{2em} x \in C \subseteq \mathbb{R}^n

Throughout this article, we'll consider all functions to be smooth, meaning that their second derivatives exist and are continuous (the mathematician lingo being "sufficiently nice").

Types of Optimization Problems

Optimization problems can be broadly classified into the following categories:
  • Unconstrained Problems: If the domain of the objective function is taken to be all of \mathbb{R}^n, then the problem is unconstrained.
  • Constrained Problems: If the domain of the objective function is taken to be a subset C \subset \mathbb{R}^n, then the problem is constrained.
There's also another way of classifying optimization problems: discrete and continuous optimization.

Key Features of SciPy Optimize

Overview of available algorithms

SciPy's optimize module provides us with a number of algorithms and constraints for solving various optimization problems.
There are a variety of methods available to us, such as newton, secant, Nelder-Mead, CG and BFGS. We'll go into more detail about some of these methods later in the article.
Constraints are broadly classified into two categories in the optimize module: LinearConstraint and NonlinearConstraint.
There is also a special class called Bounds for simple bound constraints.
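As a quick illustration, here's how these constraint objects can be constructed (the specific bounds and coefficients below are arbitrary examples):

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, NonlinearConstraint

# Simple box bounds: 0 <= x0 <= 1 and -2 <= x1 <= 2
bounds = Bounds([0.0, -2.0], [1.0, 2.0])

# Linear constraint: 1 <= x0 + 2*x1 <= 3
linear = LinearConstraint([[1.0, 2.0]], [1.0], [3.0])

# Nonlinear constraint: x0^2 + x1 <= 1
nonlinear = NonlinearConstraint(lambda x: x[0] ** 2 + x[1], -np.inf, 1.0)
```

Any of these can then be passed to scipy.optimize.minimize through its bounds and constraints arguments.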

Introduction to various optimization techniques

The most common strategy to optimize a given objective function is iterative: starting from an initial point, the algorithm generates a sequence of steps, ensuring that the objective value at each new iterate is less than at the previous one. There are two broad classes of methods for deciding where to move from any given step in the process.
  1. Line Search: In this class of methods, the algorithm chooses a direction and searches for the minima along this direction from the current step.
  2. Trust Region: In this class of methods, the information we have about the objective function is used to construct a model function that behaves similarly to the true objective. Since the model is only a good approximation near the current point, we limit the search for the next step to a neighborhood (the trust region) of the current step.
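To make the line-search idea concrete, here is a minimal sketch of a backtracking (Armijo) line search; this illustrates the principle and is not SciPy's actual implementation:

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, direction, alpha=1.0, rho=0.5, c=1e-4):
    """Shrink the step size alpha until the Armijo sufficient-decrease
    condition f(x + alpha*d) <= f(x) + c * alpha * (grad . d) holds."""
    fx = f(x)
    slope = grad_f(x) @ direction  # directional derivative, negative for descent
    while f(x + alpha * direction) > fx + c * alpha * slope:
        alpha *= rho  # step too long: backtrack
    return alpha

# Example: quadratic f(x) = x.x, steepest-descent direction at x = (1, 1)
f = lambda x: x @ x
grad = lambda x: 2 * x
x0 = np.array([1.0, 1.0])
step = backtracking_line_search(f, grad, x0, -grad(x0))
```

Here rho controls how aggressively the step shrinks and c is the usual small sufficient-decrease constant.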

Solving Unconstrained Optimization Problems

Overview of Unconstrained Optimization

Unconstrained optimization problems are of the following general form:
\large \min \, f(x) \hspace{2em} x \in \mathbb{R}^n

The main question in the study of optimization is how to characterize a local minimum. This characterization allows us to develop computational techniques for solving these problems. There are two kinds of conditions that we need:
  • Necessary conditions: conditions that are necessary, but not sufficient, for a point to be a maximum or a minimum
  • Sufficient conditions: conditions that are sufficient for a point to be a maximum or a minimum

Overview of algorithms for unconstrained optimization

There are a number of algorithms available to us for solving unconstrained optimization problems:
  • Nelder-Mead Simplex method: ('nelder-mead')
  • Broyden-Fletcher-Goldfarb-Shanno method: ('BFGS')
  • Newton-Conjugate Gradient method: ('Newton-CG')
There are also some algorithms which rely on solving a trust-region subproblem. Some of them are 'trust-ncg', 'trust-krylov' and 'trust-exact'.
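A minimal sketch of calling these solvers through scipy.optimize.minimize, using an arbitrary convex quadratic as the objective:

```python
import numpy as np
from scipy.optimize import minimize

# Minimize f(x) = (x0 - 1)^2 + (x1 + 2)^2, whose unique minimizer is (1, -2)
f = lambda x: (x[0] - 1) ** 2 + (x[1] + 2) ** 2
grad = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] + 2)])

x0 = np.zeros(2)
res_nm = minimize(f, x0, method="Nelder-Mead")           # derivative-free
res_bfgs = minimize(f, x0, method="BFGS", jac=grad)      # quasi-Newton
res_ncg = minimize(f, x0, method="Newton-CG", jac=grad)  # requires the gradient
```

Nelder-Mead is derivative-free, while BFGS and Newton-CG make use of the gradient (Newton-CG requires it).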

Examples and use cases

Let's look at perhaps one of the most famous optimization algorithms relevant to machine learning: Stochastic Gradient Descent (SGD). It is a first-order iterative optimization method commonly used for training deep learning models. It is quite a simple method, involving the following steps:
  • Select an initial learning rate (commonly denoted as \alpha) and initialize the neural network with some set of parameter values \theta.
  • Update the parameters of the model using the gradient of the objective function J(\theta) computed on some data
\large \theta_{i + 1} = \theta_i - \alpha \nabla_{\theta} J(\theta)

  • Repeat the above step until a local minimum is reached
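The loop above can be sketched in NumPy on a toy linear-regression problem; the data-generating process, learning rate, and batch size here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: y = 3*x + noise
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)

theta = np.zeros(1)  # initialize parameters
alpha = 0.1          # learning rate

for step in range(50):
    # Sample a mini-batch and take a gradient step on its mean squared error
    idx = rng.integers(0, len(X), size=32)
    pred = X[idx] @ theta
    grad = 2 * X[idx].T @ (pred - y[idx]) / len(idx)
    theta = theta - alpha * grad
```

With a batch size of 1 this is classic SGD; larger batches give the mini-batch variant mentioned earlier. The parameter theta should end up close to the true slope of 3.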

Handling Constrained Optimization Problems

Overview of Constrained Optimization

Constrained optimization problems are of the following general form:
\large \min \, f(x) \hspace{2em} x \in C \subset \mathbb{R}^n

\large \text{subject to} \hspace{1em} g_i(x) \leq 0 \hspace{2em} i = 1, 2, \dots, m

\large \hspace{4.5em} h_j(x) = 0 \hspace{2em} j = 1, 2, \dots, k

Those constraints are referred to as the inequality and equality constraints, respectively.

Overview of algorithms for constrained optimization

SciPy Optimize provides us with some algorithms for solving problems of constrained optimization, such as:
  • Trust-Region Constrained Algorithm ('trust-constr'): This algorithm uses the trust-region approach described above, together with bounds and constraints on the variables.
  • Sequential Least Squares Programming ('SLSQP')
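A small sketch of both solvers on an arbitrary toy problem (minimize x0² + x1² subject to x0 + x1 ≥ 1, whose solution is (0.5, 0.5)):

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

# Objective: distance squared from the origin
f = lambda x: x[0] ** 2 + x[1] ** 2

# SLSQP uses a dict-based constraint spec, where "ineq" means fun(x) >= 0
cons = {"type": "ineq", "fun": lambda x: x[0] + x[1] - 1.0}
res_slsqp = minimize(f, [2.0, 0.0], method="SLSQP", constraints=[cons])

# trust-constr takes the NonlinearConstraint class instead: 1 <= x0 + x1 <= inf
nlc = NonlinearConstraint(lambda x: x[0] + x[1], 1.0, np.inf)
res_tc = minimize(f, [2.0, 0.0], method="trust-constr", constraints=[nlc])
```

Both solvers should land on the constrained minimizer (0.5, 0.5), where the constraint is active.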

Global Optimization

It is very difficult to find the global minimum (or maximum) of a function. Most naturally occurring phenomena that can be modeled have a lot of fluctuations and can often "trap" optimizers in local minima (or maxima). In deep learning, it would be pretty great if we were able to find the global minimum in every training run.
But as you can imagine, the loss landscapes of typical deep learning models are highly non-convex and contain lots of local minima. And if we don't choose our hyperparameters, such as the step size or learning rate, correctly, we can get stuck.
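SciPy's optimize module also ships global strategies such as basinhopping, which repeatedly perturbs the current point and re-runs a local minimizer, giving it a chance to escape local minima. A sketch on an arbitrary multimodal 1-D function:

```python
import numpy as np
from scipy.optimize import basinhopping, minimize

# A 1-D function with several local minima
f = lambda x: x[0] ** 2 + 10 * np.sin(x[0])

# A plain local minimizer only finds the basin it starts in
local = minimize(f, [6.0])

# basinhopping hops between basins by perturbing the point and
# re-running the local minimizer, keeping the best result seen
result = basinhopping(f, [6.0], niter=200, stepsize=2.0, seed=1)
```

The best value found by basinhopping can never be worse than the purely local result, since its first step is the same local minimization.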
This is why we get different results if we train our models again. This can mostly be avoided if you set the random seeds correctly each time and keep track of your hyperparameters.
Weights & Biases can help you to do so by logging the configurations. For more reference, refer to the following articles:


Practical Tips and Best Practices for Using SciPy Optimize

Preprocessing and scaling data

Preprocessing and scaling can greatly help optimization algorithms. For instance, in unconstrained optimization, we mostly don't know the value of the function at all points of the domain, and even where we do, computing it would be expensive. If we preprocess and scale the data to a common range, the optimizer can make faster progress.
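As a small illustration, standardizing a feature before fitting keeps the parameters on comparable scales; the data-generating process below is an arbitrary assumption:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# A feature on a large scale (think house sizes in square feet)
x = rng.uniform(500, 4000, size=100)
y = 0.002 * x + 1.0 + 0.05 * rng.normal(size=100)

# Standardize the feature to zero mean and unit variance first
x_std = (x - x.mean()) / x.std()

# Mean squared error of a linear fit on the standardized feature
loss = lambda w: np.mean((w[0] * x_std + w[1] - y) ** 2)
loss_grad = lambda w: np.array([
    2 * np.mean(x_std * (w[0] * x_std + w[1] - y)),
    2 * np.mean(w[0] * x_std + w[1] - y),
])

res = minimize(loss, np.zeros(2), method="BFGS", jac=loss_grad)
```

On the standardized feature, both parameters live on similar scales, so the problem is well conditioned and BFGS converges quickly.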

Handling noisy or uncertain objective functions

If the phenomenon we're modeling is noisy, either due to errors in measurement or the nature of the phenomenon itself, the most common way to optimize such objective functions is two-fold:
  • A shallower step aimed at identifying sub-regions where the minima could be.
  • A much deeper step aimed at finding the local minima of the identified subregions.
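This two-step strategy can be sketched with SciPy's differential_evolution as the shallow global stage and a local minimizer as the deep refinement stage; the 1-D Rastrigin-style function below is an illustrative choice:

```python
import numpy as np
from scipy.optimize import differential_evolution, minimize

# 1-D Rastrigin function: many local minima, global minimum at x = 0
f = lambda x: x[0] ** 2 + 10 - 10 * np.cos(2 * np.pi * x[0])

# Shallow step: a coarse global search identifies the promising sub-region
coarse = differential_evolution(f, bounds=[(-5.0, 5.0)], tol=0.1,
                                polish=False, seed=1)

# Deep step: a local optimizer refines the candidate that was found
refined = minimize(f, coarse.x, method="Nelder-Mead")
```

Setting polish=False keeps the first stage deliberately coarse, so the refinement is done explicitly by the second stage.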

Efficient usage of optimization algorithms and parameter tuning

Optimizing deep learning models is a difficult task, but there are some best practices that lead to optimal performance. We go over some of them below:
  • Choose a well-established optimizer for your particular task. While SGD is the most popular, it is not the most efficient algorithm for many problems; a good default is to stick to the optimizer used in SOTA research for tasks similar to yours.
  • Attend to all hyperparameters of the optimizer. Yes, all of them, no matter how minuscule they might seem: tuning every hyperparameter of the optimizer is what yields the best results.

Performance Evaluation and Comparison using W&B

Metrics for evaluation of various optimization algorithms

There are a number of ways in which one could compare various optimization algorithms:
  • Number of steps needed to reach the minima
  • Number of times it was needed to evaluate the function
  • Time taken to optimize
  • Memory used
In the example below, we'll use the number of iterations and the number of function evaluations.

Comparison with other libraries and frameworks

SciPy optimize is one of many libraries available to us for optimization. Some other popular libraries include:
  • pymanopt: Python toolbox for optimization on Riemannian manifolds with support for automatic differentiation
  • optax: a gradient processing and optimization library for JAX, maintained by DeepMind.
  • jaxopt: Hardware accelerated, batchable and differentiable optimizers in JAX maintained by a team in Google Research.
  • geoopt: Riemannian Adaptive Optimization Methods with PyTorch optim

Benchmarks and case studies

We can use Weights & Biases' Tables to benchmark and compare various optimization algorithms. We'll benchmark and compare some algorithms using the Rosenbrock function of N variables as described in the SciPy documentation.
\large f(x) = \sum_{i=1}^{N-1} 100 (x_{i+1} - x_i^2)^2 + (1 - x_i)^2

It is known that the minimum value of this function is obtained when x_i = 1 for all i. In the below table, we can see how various algorithms perform on this objective function:

[Embedded W&B Table: Run set]
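A sketch of how such a comparison can be produced: each solver runs on the Rosenbrock function (scipy.optimize ships rosen and its gradient rosen_der), and the resulting rows could then be logged, e.g. via wandb.Table:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Starting point used in the SciPy tutorial for the 5-variable Rosenbrock
x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

rows = []
for method in ["Nelder-Mead", "BFGS", "Newton-CG"]:
    # Gradient-based methods get the analytic derivative rosen_der
    jac = rosen_der if method != "Nelder-Mead" else None
    res = minimize(rosen, x0, method=method, jac=jac)
    rows.append((method, res.nit, res.nfev, res.fun))

# Each row (method, iterations, function evaluations, final value) could be
# logged to W&B, e.g. wandb.Table(columns=["method", "nit", "nfev", "fun"],
# data=[list(r) for r in rows])
```

All three methods should drive the objective essentially to zero, but with very different iteration and evaluation counts.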


Conclusion

In this article, we gave a brief overview of SciPy optimize and showed how using Weights & Biases to monitor your metrics can lead to valuable insights.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and from-scratch code implementations, let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.

Iterate on AI agents and models faster. Try Weights & Biases today.