Introducing Serverless RL
The easiest and fastest way to train AI agents with reinforcement learning without worrying about GPUs
Try a quickstart now

Reinforcement learning (RL) is a highly effective way to fine-tune large language models (LLMs) for agentic tasks, but access to GPUs has been a major barrier. Today we are launching the industry’s first Serverless RL offering as part of W&B Training, powered by CoreWeave. The goal is simple: make RL accessible to every business and every AI engineer. W&B Training includes ART, a flexible RL fine-tuning framework; RULER, a universal verifier; and a fully managed serverless RL backend on CoreWeave, so you can run RL loops without provisioning, configuring, or operating infrastructure.

Here is a brief overview. For the full story on why we built this and how to get started, read my in-depth post on OpenPipe. You can also get started with the quickstart here.
Why RL matters for building AI agents
Reinforcement learning is the second stage in the modern agentic LLM pipeline: pretraining on large corpora, then post-training with techniques like RL to ground, align, and unlock agentic behavior. RLHF popularized this phase, and newer methods such as Group Relative Policy Optimization (GRPO) improve efficiency and stability.
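For intuition, GRPO skips the separate value model that RLHF-style PPO needs: it samples a group of rollouts for each prompt and scores every rollout relative to its group’s mean reward. Here is a minimal sketch of that group-relative advantage step (the names and numbers are illustrative, not tied to any particular library):

```python
import numpy as np

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages for one group of rollouts sampled from the same
    prompt: each rollout is scored relative to the group's mean reward."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts for one prompt, scored by a verifier (for example, RULER).
# Higher-reward rollouts get positive advantages, lower ones negative.
print(group_relative_advantages([0.2, 0.9, 0.4, 0.9]))
```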
For agentic tasks, RL often determines whether a model ships to production or stays a demo. Advanced AI labs already run this playbook, yet the models they release rarely meet real-world user targets out of the box for the businesses trying to adopt them. On-the-job RL is what lifts reliability, latency, and cost to production levels for those businesses.
The bottleneck and the solution: the industry’s first Serverless RL
Despite its value, RL has been hard for most companies to adopt because GPUs are scarce and orchestration is complex. Well-funded AI labs can procure GPUs, tune clusters, and maintain custom RL stacks. Most other companies cannot: they spend days evaluating providers, reserving capacity, writing launch scripts, and tuning clusters.
W&B Training introduces the industry’s first publicly available Serverless RL as a direct fix. You get instant access to CoreWeave GPU capacity with elastic scaling and no provisioning. The service integrates the open-source Agent Reinforcement Trainer (ART) and the RULER universal verifier, so you can start RL post-training in minutes without thinking about GPUs or infrastructure.
How it works
The first release of W&B Training includes three components:
- ART integration: The popular open-source Agent Reinforcement Trainer (ART) is integrated into the service. Learn more in the ART GitHub repo.
- RULER: The open-source universal verifier for RL rewards is included through the ART integration.
- Serverless RL: A serverless reinforcement learning (RL) backend that runs on CoreWeave GPU clusters, with orchestration and autoscaling handled for you.
Simply write your environment and agent code. Add RL with a few lines of ART and provide the scenarios you want the agent to train on (for example, the questions a user might ask an email agent). Then select Serverless RL as the backend for your ART training loop. That’s it! Sit back as your agent explores the environment, collects rewards, and updates the model’s low-rank adaptation (LoRA) weights.
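As a rough picture of what that looks like in code, here is a condensed training-loop sketch in the spirit of the ART quickstart. Treat it as illustrative: the serverless backend class name, the model identifier attribute, and some arguments below are assumptions, so follow the linked quickstart and the ART repo for the exact, current API.

```python
import asyncio
import art  # open-source Agent Reinforcement Trainer (ART)

async def rollout(model: art.TrainableModel, scenario: str) -> art.Trajectory:
    """Run the agent once on a scenario and record the conversation."""
    client = model.openai_client()  # OpenAI-compatible client backed by W&B Inference
    messages = [
        {"role": "system", "content": "You are an email assistant."},
        {"role": "user", "content": scenario},
    ]
    completion = await client.chat.completions.create(
        model=model.name,  # model identifier; check the quickstart for the exact attribute
        messages=messages,
    )
    messages.append(completion.choices[0])  # ART accepts the returned choice alongside plain messages
    # Reward left at 0.0 here; a verifier such as RULER can score each
    # trajectory group after the rollouts are collected.
    return art.Trajectory(messages_and_choices=messages, reward=0.0)

async def main() -> None:
    model = art.TrainableModel(
        name="email-agent",                      # your agent's name
        project="serverless-rl-demo",            # hypothetical W&B project
        base_model="Qwen/Qwen2.5-14B-Instruct",  # example base model
    )
    backend = art.ServerlessBackend()  # assumed name for the managed Serverless RL backend
    await model.register(backend)

    scenarios = [
        "Find the invoice Alice sent last week",
        "Summarize my unread email from today",
    ]
    for _ in range(10):  # you control when and how long to train
        groups = await art.gather_trajectory_groups(
            art.TrajectoryGroup(rollout(model, s) for _ in range(8)) for s in scenarios
        )
        await model.train(groups)  # one GRPO update to the LoRA weights

asyncio.run(main())
```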
Behind the scenes, Serverless RL routes your rollout requests from ART to W&B Inference and sends training steps to a separate GPU cluster. With ART, you control the RL loop—when and how long to train, and when to switch to inference. Serverless RL hot-swaps your trained LoRA on W&B Inference between training steps to keep updates on-policy, so the trained LoRA stays close to the inference version used for rollouts.

Serverless RL automatically logs metrics and traces at every training step to Weights & Biases so you can inspect and monitor convergence and stability. It also handles checkpointing the LoRA head and saving the final version as an artifact in W&B Registry.
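Because the metrics land in a normal W&B run, you can also pull them programmatically with the public `wandb` API, for example to script a convergence check. A small sketch, assuming hypothetical entity/project/run names and a `train/reward` metric key:

```python
import wandb

api = wandb.Api()
# Hypothetical path: substitute the run created by your Serverless RL job.
run = api.run("my-team/serverless-rl-demo/abc123xy")

# Per-step training metrics as a pandas DataFrame; the metric key below is
# an assumed example, not a guaranteed name.
history = run.history(keys=["train/reward"])
print(history.tail())
```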
And best of all, production inference is integrated, so you don’t have to download and deploy your post-trained models. When you’re ready, use the Serverless RL API to run production inference—just specify the LoRA weights to apply, and the service will dynamically load your trained weights from W&B Registry at inference time.
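In practice that call looks like any OpenAI-compatible chat completion pointed at W&B Inference. The sketch below makes assumptions about the endpoint and, in particular, the LoRA identifier format; the API docs referenced at the end of this post have the authoritative details.

```python
from openai import OpenAI

# Assumed base URL and model/LoRA identifier format -- check the W&B
# Inference / Serverless RL API docs for the exact values.
client = OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key="YOUR_WANDB_API_KEY",
)

response = client.chat.completions.create(
    # Hypothetical identifier pointing at the trained LoRA in W&B Registry.
    model="my-team/serverless-rl-demo/email-agent:latest",
    messages=[{"role": "user", "content": "Find the invoice Alice sent last week"}],
)
print(response.choices[0].message.content)
```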
For a detailed description of how Serverless RL maximizes training GPU utilization and delivers low latency at low cost (you pay only for generated tokens at each training step, not for idle GPU time), see our in-depth blog post.
What you can do with Serverless RL
Ship AI agents to production: Move from POC to production with dependable, high-performance agents that meet reliability and latency goals while keeping unit economics in check.
Start training in minutes: Launch RL jobs quickly with ART and a GRPO harness so you focus on objectives and rewards rather than plumbing and cluster ops.
Capacity when you need it: Tap CoreWeave GPUs on demand and scale with your workload. The service automatically scales up during training bursts and down to zero when idle so cost tracks usage.
No ops burden: We operate and harden the CoreWeave-backed GPUs. You write the agent and environment code on your laptop and let us run the distributed training and manage infrastructure on your behalf.
Better performance at lower cost: Utilization-aware scheduling keeps GPUs busy, shares resources during rollouts, and bills only for incremental tokens. In our ART-E tests, wall-clock time improved by about 1.4x and costs fell by up to 40% with no quality loss.
Faster feedback loop: Training and inference run on separate, always-on CoreWeave instances, so edits to rollouts or the RL loop apply in seconds and shorten your iterate-train-ship cycle.
Production inference: With LoRA checkpoints versioned as artifacts and Serverless RL’s production-grade inference, cutovers from training to production and subsequent training resumptions are instantaneous, paving the way for continuous online learning for your agents.
Built-in observability: Serverless RL integrates with Weights & Biases, automatically capturing training metrics and trajectory traces so you can inspect how your agent learns and adjust the training script, agent harness, rewards, and environment to drive convergence.
This is just the beginning
GPU access and scaling have been the biggest blockers to RL adoption, so we started there. With ART and RULER, we’ve also addressed key RL algorithm and reward-design challenges. Our ultimate vision is to deliver continuous online learning, where models learn from experience like a new teammate does. We will keep adding layers of abstraction that automate more of the RL stack, from GPUs to tokens and everything in between.
Get started
Follow the step-by-step tutorial to train your first model. You can also dive into the API docs and review pricing.