Improving Your Deep Learning Workflows with SLURM and Weights & Biases
Observability of jobs, model performance, and much more using two well-known tools
Thanks to the rise of advanced computing capabilities and the ever-decreasing cost of compute, more and more businesses and organizations are leveraging deep learning to speed up, or completely automate, formerly manual processes. Because classical machine learning models are often more brittle and do not let us leverage transfer learning the way deep learning models do, many of these organizations choose deep learning (DL) for their model-building needs. Thanks to the large volumes of data and the many-layered architectures of deep neural networks, engineers and researchers are often able to solve tasks with DL that cannot be solved any other way.
Deep learning has revolutionized the field of artificial intelligence, paving the way for intelligent machines that can perceive, learn, and reason like humans: from image recognition and real-time object detection in dynamic scenes to DL-powered communication tools that help us find information and create new knowledge. The versatility of DL compared to traditional ML models remains a huge advantage of the field, and as models grow into foundation models that need only a handful of training examples, they can derive insights and solutions from raw data with far less human labor.
The field of machine and deep learning is rapidly evolving, and organizations need the right tools to keep up with the pace of change. Among the essential tools for data scientists and machine and deep learning practitioners are job schedulers and experiment tracking software. Many companies use SLURM as their primary job scheduler, and many rely on Weights & Biases (W&B) to manage and visualize their machine learning experiments.
However, for organizations that use SLURM and W&B together, the integration can be challenging. In this post, we'll explore how to use SLURM and W&B together effectively to improve your machine learning workflows.
SLURM for Deep Learning
SLURM is an open-source, highly scalable [workload manager and job scheduler](https://slurm.schedmd.com/overview.html) for high-performance computing, available on nearly all Linux distributions. It is a logical choice for many users: it is designed to handle a large number of jobs, both in queues and on compute nodes, and it supports synchronous parallel jobs as well as job arrays. That makes it a great fit for workloads that require massively parallel computation, such as most ML, DL, and AI workloads. By using SLURM as a scheduler, you can rapidly dispatch hundreds of thousands, or even millions, of tasks in parallel, scaling your deep learning frameworks to tens of thousands of compute cores.
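To make the job-array pattern concrete, here is a minimal sketch (not an official example) of a training script that a SLURM job array might launch, for instance with `sbatch --array=0-3` wrapped around it. Each array task reads `SLURM_ARRAY_TASK_ID` to pick its own hyperparameter and logs to W&B; the project name and the placeholder training loop are assumptions.

```python
import os

import wandb

# Hypothetical hyperparameter grid; each SLURM array task trains one variant.
LEARNING_RATES = [1e-2, 1e-3, 1e-4, 1e-5]

def main():
    # SLURM exposes the array index of this task as an environment variable.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
    lr = LEARNING_RATES[task_id % len(LEARNING_RATES)]

    run = wandb.init(
        project="slurm-array-demo",  # assumed project name
        config={"learning_rate": lr, "slurm_task_id": task_id},
    )

    # Placeholder training loop; substitute your real model and data here.
    for epoch in range(10):
        fake_loss = 1.0 / (epoch + 1)
        run.log({"epoch": epoch, "loss": fake_loss})

    run.finish()

if __name__ == "__main__":
    main()
```

Because each array task is scheduled independently, SLURM can dispatch all variants in parallel while W&B collects the resulting runs in a single project for comparison.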
These tasks, while integral to many deep learning workflows, are not the only ones an HPC user will encounter.
Why We Need a Helper When Working with SLURM Deep Learning Workloads
SLURM, LSF, and Ray are all commonly used for running DL and AI workloads, whether your team is in industry or academia. Many organizations have legacy reasons for continuing to manage resources with SLURM, even if their primary compute tasks are deep learning workflows that SLURM was not designed for. Below, we cover a few ways to reduce pain points and bottlenecks, and to increase observability and auditability, by pairing deep learning tools (such as Weights & Biases) with your deep learning jobs.
Static Allocation Becomes Challenging
In a static allocation model, resources are allocated to a job at the time the job is submitted. This means that the resources allocated to the job cannot be changed dynamically during the job's execution. While this model works well for many scientific simulations and parallel computing applications, it is not well-suited for deep learning workflows.
Deep learning workloads are highly variable and dynamic, with the resource requirements of a job changing rapidly as it progresses. This makes it difficult to allocate resources statically at the time of submission. For example, a deep learning job may start with low memory requirements but may quickly require more memory as the model gets deeper and more complex.
Additionally, deep learning workflows often involve distributed training across multiple nodes or GPUs, which requires dynamic resource allocation and management. SLURM's static allocation model can make it difficult to manage these distributed resources effectively and efficiently.
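To give a sense of the bookkeeping involved, here is a minimal sketch of how a PyTorch training script might map the environment variables SLURM sets for each task onto `torch.distributed`'s rendezvous. It assumes the batch script also exports `MASTER_ADDR` and `MASTER_PORT` (for example, derived from the first hostname in the job's node list); the helper name is our own.

```python
import os

import torch
import torch.distributed as dist

def init_distributed_from_slurm():
    """Map SLURM environment variables onto torch.distributed's rendezvous.

    Assumes the batch script exports MASTER_ADDR and MASTER_PORT, e.g. taken
    from the first hostname in the job's node list.
    """
    rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
    world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks
    local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node

    dist.init_process_group(
        backend="nccl",
        init_method="env://",  # reads MASTER_ADDR / MASTER_PORT
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(local_rank)
    return rank, local_rank, world_size
```

This only covers process-group setup; the allocation itself (how many nodes and GPUs the job receives) is still fixed by SLURM when the job is submitted.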
To overcome these challenges, many organizations have turned to other job schedulers and resource managers that are better suited for deep learning workflows, such as Kubernetes, Apache Mesos, or specialized tools like Horovod with Ray, mentioned in this Uber Engineering blog post. In fact, a community-developed project combines Ray Tune, SLURM, and `wandb`: Ray Tune acts as the parent framework, SLURM instantiates the jobs, and `wandb` logs and visualizes the results. Your team or org may want to explore these tools, as they support dynamic resource allocation and management, making it easier to handle the highly variable and dynamic resource requirements of deep learning workflows.
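For orientation, here is a minimal sketch of that pattern: Ray Tune drives the trials and forwards their results to W&B through its logger callback. The project name and toy objective are placeholders, the import path for the callback differs between Ray versions, and the surrounding SLURM job would typically start a Ray cluster on its allocated nodes before this script runs, so treat this as a sketch rather than a drop-in recipe.

```python
from ray import tune
# Import path varies with Ray version; in recent 2.x releases the W&B
# callback lives under ray.air.integrations.wandb.
from ray.air.integrations.wandb import WandbLoggerCallback

def trainable(config):
    # Placeholder objective; replace with a real training loop. Returning a
    # dict reports it as the trial's final result (intermediate-reporting
    # APIs differ between Ray versions, so we keep it simple here).
    score = sum(config["lr"] * step for step in range(20))
    return {"score": score}

tune.run(
    trainable,
    config={"lr": tune.grid_search([1e-2, 1e-3, 1e-4])},
    callbacks=[WandbLoggerCallback(project="ray-tune-slurm-demo")],  # assumed project name
)
```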
SLURM Was Not Built for MLOps
No doubt about it: SLURM is a tremendously useful tool at the beginning of the ML or DL model lifecycle. After we’ve curated our training, test, and validation data - potentially large jobs themselves, which may have benefited from SLURM resource scheduling - we’re ready to train our models. SLURM has great utility in the training stage; beyond that, however, such as in the deployment and inference stages, other tools are better suited.
A team that probably did not yet exist when your org decided to use SLURM, Ray, or Kubernetes is the MLOps team. Straddling several departments - data science, DevOps, site reliability engineering (SRE) - with a dash of good old hands-on IT problem-solving, MLOps teams ensure that models are deployed properly and that the models running in production are doing just that: running. MLOps staff often also oversee the creation of automated pipelines to retrain models, sometimes on triggers such as data drift or model drift. By introducing automation and monitoring throughout the model-building process, the field of MLOps strives to unify the development and operations (DevOps) of machine learning ecosystems.
The key advantage of a unified orchestration platform that can execute training workloads both in the research and development phase and in production cannot be overstated. A unified solution that can construct, train, and launch models is essential for enhancing organizational agility and facilitating a swift transition from research to production. To achieve this, look for a platform that is compatible with contemporary cloud environments and can accommodate the distinctive inference requirements of machine learning.
Closing Thoughts
SLURM is a widely used job scheduler and resource manager in HPC environments, but its static allocation model is not always well-suited for deep learning workflows. Deep learning workloads are highly variable and dynamic, with rapidly changing resource requirements. This can make it difficult to allocate resources statically at submission time, especially when jobs involve distributed training across many nodes or GPUs. This is where Weights & Biases (W&B) comes in. W&B is an experiment tracking tool that has now added model lifecycle management capabilities, making it an ideal complement to a deep learning workflow that uses SLURM as the resource scheduler. Additionally, [the recently-released Launch product](https://wandb.ai/site/launch) may allow your engineers to trial provisional workloads before scaling up to full SLURM-managed workflows. By providing advanced monitoring and visualization tools, W&B lets teams track their deep learning experiments and model performance in real time, regardless of the resource allocation method used.

With W&B, teams can easily manage the entire deep learning workflow, from data preparation and model training to deployment and inferencing. W&B's capabilities in experiment tracking, model visualization, and monitoring can help your model-building teams as well as the ML operationalization teams ensure that models are deployed and running correctly, and that they remain optimized over time.
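As a small illustration of what that tracking can look like in a SLURM context, here is a minimal sketch of a run that records its SLURM job metadata in the W&B config and tags, so a model can later be traced back to the job, array task, and nodes that produced it. The project name and hyperparameters are placeholders.

```python
import os

import wandb

# Record the SLURM context alongside the run so experiments stay auditable:
# which job, which array task, and which nodes produced a given model.
slurm_meta = {
    "slurm_job_id": os.environ.get("SLURM_JOB_ID"),
    "slurm_array_task_id": os.environ.get("SLURM_ARRAY_TASK_ID"),
    "slurm_job_nodelist": os.environ.get("SLURM_JOB_NODELIST"),
    "slurm_job_partition": os.environ.get("SLURM_JOB_PARTITION"),
}

run = wandb.init(
    project="slurm-training",  # assumed project name
    config={"learning_rate": 1e-3, **slurm_meta},
    tags=[f"slurm-{slurm_meta['slurm_job_id']}"],
)

# ... training loop, logging metrics with run.log({...}) ...

run.finish()
```

With that metadata attached, runs can be filtered and compared by job ID, array task, or partition in the W&B workspace.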
Overall, W&B is a welcome addition to a SLURM workflow, providing the monitoring and visualization capabilities necessary for managing deep learning experiments effectively. By combining the benefits of SLURM's powerful job scheduling and resource management capabilities with W&B's advanced experiment tracking and monitoring tools, teams can streamline their deep learning workflows and ensure that their models are deployed and running correctly.
`wandb` and SLURM references
- A handy slurm-to-wandb library from a Systems Complexity graduate of the University of Groningen: https://github.com/dunnkers/slurm-to-wandb
- Setting up a burstable SLURM cluster on AWS
- Initiating and scaling a W&B Sweep 🧹 across multiple nodes on SLURM (a minimal sketch of this pattern follows this list)
- Adding SLURM nodes to a previously initiated W&B Sweep
- Examples provided in both PyTorch and TensorFlow
- Using Ray Tune, Optuna, and SLURM with `wandb` as the experiment tracker - https://github.com/klieret/ray-tune-slurm-demo - from postdoctoral researchers in Princeton's Research Computing group
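As referenced above, one common way to scale a sweep across SLURM nodes is to create the sweep once and have every SLURM task run an agent against the shared sweep ID. The sketch below illustrates that pattern; the project name, sweep configuration, toy training function, and the `SWEEP_ID` environment variable (which we assume the batch script exports) are all placeholders.

```python
import os

import wandb

def train():
    # Inside an agent, wandb.init() receives the sweep's suggested config.
    run = wandb.init()
    lr = run.config.learning_rate
    for epoch in range(5):
        run.log({"epoch": epoch, "loss": 1.0 / (epoch + 1) * lr * 1e3})
    run.finish()

if __name__ == "__main__":
    # Create the sweep once (e.g. on a login node) and share its ID with every
    # SLURM task; here we assume the batch script exports it as SWEEP_ID.
    sweep_id = os.environ.get("SWEEP_ID")
    if sweep_id is None:
        sweep_config = {
            "method": "random",
            "metric": {"name": "loss", "goal": "minimize"},
            "parameters": {"learning_rate": {"min": 1e-5, "max": 1e-2}},
        }
        sweep_id = wandb.sweep(sweep_config, project="slurm-sweep-demo")
    # Each SLURM task runs an agent that pulls trials from the same sweep.
    wandb.agent(sweep_id, function=train, count=10, project="slurm-sweep-demo")
```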