Introducing Priority-Based Job Management with W&B Launch
"Maximizing AI compute efficiency: W&B Launch unveils priority queueing for ML experiments.
ML engineers at companies like OpenAI, Cohere, and NVIDIA use Weights & Biases to run more deep learning experiments faster. In 2023, many of these organizations invested heavily in new GPUs to scale up their ML teams' training runs. But even with lots of hardware, there are frequently more experiments that MLEs want to run than can run at once, especially on the scarcest and most powerful hardware (your H100s, TPU v5s, and Superpods).
This is where job prioritization with W&B Launch can help. Launch already lets you easily move jobs between compute environments to take advantage of available resources. Now, you can launch jobs with a specified priority level, allowing more important jobs to run before less important ones.
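If you submit jobs from code rather than through the UI, the effect is the same. Here is a minimal sketch using the Launch SDK's launch_add helper; the job and queue names are placeholders, and the exact import path and the integer priority mapping are assumptions to verify against the current W&B Launch docs.

```python
# Minimal sketch: enqueue a Launch job with a priority.
# Assumptions: the import path, the placeholder job/queue names, and the
# integer priority mapping (lower value = more urgent) may differ from the
# current SDK; consult the W&B Launch docs.
from wandb.sdk.launch import launch_add

queued_run = launch_add(
    job="my-entity/my-project/train-job:latest",  # placeholder job artifact
    queue_name="gpu-queue",                       # placeholder queue name
    priority=1,                                   # assumed mapping, e.g. High
)
```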
Let’s take a look at an example.
Jumping the queue
We’ve filled up this queue with four Medium (default) priority jobs. Since the queue can process two jobs at once, we have two running and two waiting in the Queued state:

When we submit a job, we can pick a higher priority. This could reflect the importance of the training run or, more commonly, the amount of resources needed for the runs.

After we submit the job, it jumps ahead of the two previously submitted Medium jobs in the queue.

When we submit two more jobs, one Low and one Critical priority, they appear at the bottom and the top of the queue, respectively.

The result is that our highest-priority jobs, the Critical and High, get picked up first, jumping ahead of the Medium and Low priority jobs on the queue.
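The ordering rule is simple: priority level first, then submission order within a priority. This toy model (plain Python, not W&B internals) reproduces the pop order from the walkthrough above:

```python
import heapq
import itertools

# Toy model of the ordering shown above (not W&B internals): jobs are
# popped by priority level first, then FIFO within the same priority.
PRIORITY = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}

_submission_order = itertools.count()  # tie-breaker: preserves FIFO per level
queue: list[tuple[int, int, str]] = []

def submit(name: str, priority: str = "Medium") -> None:
    heapq.heappush(queue, (PRIORITY[priority], next(_submission_order), name))

for i in range(1, 5):             # four Medium (default) jobs
    submit(f"medium-{i}")
submit("high-1", "High")          # jumps ahead of the queued Medium jobs
submit("low-1", "Low")            # falls to the bottom
submit("critical-1", "Critical")  # goes straight to the top

while queue:
    print(heapq.heappop(queue)[2])
# critical-1, high-1, medium-1 ... medium-4, low-1
```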
The MLOps View
From an MLOps leader's perspective, we want to turn the considerable (and ever-increasing) dollars we spend on ML infrastructure into productive ML experiments and, ultimately, live ML-powered services.
Beyond generally putting more important workloads first, prioritization unlocks a valuable use case for fixed AI infrastructure: saturating otherwise-idle hardware with hyperparameter sweeps that important training jobs can preempt, with the sweeps picking up where they left off afterwards. Combined with the smarter, more scalable Sweeps on Launch features, this makes squeezing out maximum performance, or automatically tuning LLM prompt chains, low-hanging fruit for ML teams.
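As a concrete sketch of that backfill pattern, under the same assumptions as the earlier launch_add example, you might enqueue cheap tuning trials at Low priority so they soak up idle capacity and yield to anything more urgent:

```python
# Sketch of the backfill pattern: enqueue tuning trials at Low priority so
# they run only when nothing more urgent is waiting in the queue.
# launch_add's import path, the job/queue names, the config override shape,
# and the priority value are assumptions, as in the earlier sketch.
from wandb.sdk.launch import launch_add

for lr in (1e-4, 3e-4, 1e-3):
    launch_add(
        job="my-entity/my-project/tune-job:latest",  # placeholder job
        queue_name="gpu-queue",                      # placeholder queue
        config={"overrides": {"run_config": {"learning_rate": lr}}},
        priority=3,  # assumed mapping for Low: yields to Medium and above
    )
```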
Launch also provides a historical record of who submitted what jobs, with what priority, and, thanks to being integrated with W&B's system of record for experiments and models, with what results. This facilitates a fact-driven analysis of how resources can and should be most productively allocated.
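For instance, the public API can already pull every run in a project and total up compute time per submitter; the entity/project path below is a placeholder, and joining in each run's queue priority is left as an exercise.

```python
import collections
import wandb

# Minimal sketch of a fact-driven allocation review using the public API.
# "my-entity/my-project" is a placeholder; _runtime is the run's wall-clock
# duration in seconds, recorded in each run's summary.
api = wandb.Api()

hours_by_user = collections.Counter()
for run in api.runs("my-entity/my-project"):
    hours_by_user[run.user.username] += run.summary.get("_runtime", 0) / 3600

for user, hours in hours_by_user.most_common():
    print(f"{user}: {hours:.1f} run-hours")
```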
Going Forward
Prioritization is available today for all new Launch queues. To add it to an existing queue, please get in touch with W&B support. This is the first in a sequence of releases allowing large teams of ML engineers to share today's multi-million-dollar AI compute clusters efficiently.
Until the next release, we’re excited to hear your feedback!