Introducing Serverless LoRA Inference
Bring your own LoRA to serve fine-tuned models on W&B Inference
W&B Inference now lets you serve custom fine-tuned models on fully managed CoreWeave GPU clusters—no infrastructure to provision or manage. Use our OpenAI-compatible Chat Completions API, reference your LoRA artifact in W&B Models along with the base model name, and we’ll dynamically load your adapter onto a preloaded base model at request time and return the response.
It just takes a few lines of code to run production inference with Serverless LoRA Inference:
```python
#!/usr/bin/env python3
from openai import OpenAI
import os

# ---- Config ----
WB_TEAM = os.getenv("WB_TEAM", "<your-team>")
WB_PROJECT = os.getenv("WB_PROJECT", "<your-project>")
API_KEY = os.getenv("API_KEY", "<your-api-key>")  # or WANDB_API_KEY

# ---- Client & Model ----
# Reference the LoRA adapter stored as a W&B Models artifact
model_name = f"wandb-artifact:///{WB_TEAM}/{WB_PROJECT}/qwen_lora:latest"

client = OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=API_KEY,
    project=f"{WB_TEAM}/{WB_PROJECT}",
)

# ---- Run ----
def main() -> None:
    resp = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": "Say 'Hello World!'"}],
    )
    print(resp.choices[0].message.content)

if __name__ == "__main__":
    main()
```
Why customize models?
New LLMs top benchmarks constantly, but for company-specific tasks a state-of-the-art model is often too slow or too expensive. For example, a restaurant's order-taking voice agent must be simultaneously fast, accurate, and low-cost.
Reinforcement learning can post-train smaller models to deliver comparable or better reliability at a fraction of the latency and cost. In our email agent tests, a reinforcement-trained Qwen3 14B achieved 96% accuracy versus 86% for Gemini 2.5 Pro, while running 12x faster at 14x lower cost. As a result, teams moving from demo to production almost always need to post-train a smaller model on proprietary data.

Low Rank Adaptation (LoRA) and its role in fine-tuning
LoRA is a training technique that makes fine-tuning practical. The core idea is that updating only a tiny fraction of a base model's parameters can deliver performance close to full fine-tuning while preserving the general world knowledge and reasoning the model learned in pretraining. LoRA achieves this by inserting small low-rank adapter matrices at selected layers and training only those.
Think of adapters as a "git diff": they store just the weight deltas between the base model and your fine-tuned variant. Because adapters are tiny compared to the base, you can load and unload them dynamically on a single preloaded model, enabling efficient training and inference across hundreds of customized variants. That small footprint is also what makes attaching an adapter to an individual request practical.
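To make that concrete, here is a minimal sketch of how a LoRA update modifies a single weight matrix. It follows the standard low-rank formulation; the hidden size, rank, and scaling values are illustrative, not specific to W&B Inference:

```python
import numpy as np

# Illustrative sizes: one attention projection inside a transformer layer.
d_model, r, alpha = 4096, 16, 32          # hidden size, LoRA rank, scaling

W = np.random.randn(d_model, d_model)      # frozen base weight (never updated)
A = np.random.randn(r, d_model) * 0.01     # trainable low-rank factor
B = np.zeros((d_model, r))                 # trainable low-rank factor (starts at zero)

# The adapter is just the pair (A, B): about 2 * d_model * r values,
# instead of d_model^2 for fully fine-tuning this matrix.
delta_W = (alpha / r) * B @ A

# At inference, the effective weight is the frozen base plus the low-rank delta.
W_effective = W + delta_W

x = np.random.randn(d_model)
y = W_effective @ x                         # equivalent to W @ x + delta_W @ x
```

Because only A and B are stored, a rank-16 adapter for this 4096x4096 matrix is roughly 130K parameters versus about 16.8M for the full matrix, which is why adapters are cheap to version, ship, and swap.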
The problem with deploying fine-tuned models
When we introduced Serverless RL, customers loved the ability to flip between training and inference almost instantly. Building that yourself requires a complex pipeline to fetch the latest LoRA weights, hot-swap them for inference, resume training from specific adapters, and keep GPUs/CPUs fully utilized. We handle all of that.
Serverless RL also sparked interest from users who wanted to bring their own LoRA weights, whether trained on Weights & Biases or elsewhere.
Meet Serverless LoRA Inference
Serverless LoRA Inference lets you bring your fine-tuned LoRA weights from anywhere: W&B Models, W&B Serverless RL, your own training environment, or a third party. If you fine-tuned with LoRA on W&B Models or Serverless RL, you’re all set since your LoRA adapter is already a W&B Models artifact. If you trained elsewhere, simply upload it as an artifact using the wandb SDK. Then call our API with the base model and artifact reference.
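If your adapter was trained outside of W&B, the upload can look like the sketch below. The artifact name, type, and local directory are placeholders; check the docs for the exact adapter layout your base model expects:

```python
import wandb

# Log a locally trained LoRA adapter (e.g. a PEFT output directory with
# adapter_config.json and adapter weights) as a versioned W&B artifact.
run = wandb.init(entity="<your-team>", project="<your-project>", job_type="upload-lora")

artifact = wandb.Artifact(name="qwen_lora", type="model")  # name and type are illustrative
artifact.add_dir("./lora_adapter")                         # local adapter directory
run.log_artifact(artifact)

run.finish()

# The adapter is now addressable as:
#   wandb-artifact:///<your-team>/<your-project>/qwen_lora:latest
```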
W&B Inference hot-loads the adapter at request time onto a preloaded base model running on CoreWeave GPU clusters, executes your prompt, and returns the output. That means no clusters to provision, no images to build, and no endpoints to manage.
Because adapters are attached dynamically, you can roll out improvements by bumping the artifact version, enabling a clean handoff from training to inference and effortless serving of multiple fine-tuned variants without separate deployments.
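In practice, a rollout can be as simple as changing the artifact reference you pass as the model name. A brief sketch, reusing the `client` from the first example (the artifact name and versions here are hypothetical):

```python
# Pin a known-good adapter version for production traffic...
model_name = "wandb-artifact:///my-team/my-project/qwen_lora:v3"

# ...or track the newest logged version automatically via the alias.
model_name = "wandb-artifact:///my-team/my-project/qwen_lora:latest"

resp = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Summarize today's support tickets."}],
)
print(resp.choices[0].message.content)
```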
LoRA adapters have become the practical standard for managing and deploying customized model variants because they keep deployment cycles short. We use W&B Artifacts as a version-controlled storage layer for your fine-tuned LoRAs and serve them dynamically at inference time on powerful, hosted base models.
Key benefits
With Serverless LoRA Inference, you can deploy your custom weights instantly with no servers to provision and manage. Here’s why you should try it:
- Version-controlled serving: The artifact URI explicitly encodes the team, project, artifact name, and version, providing traceability back to the training run and hyperparameters that produced the LoRA weights (see the sketch after this list).
- Zero infra to manage: You avoid the complexity of setting up and scaling serving infrastructure for every LoRA iteration. The system handles dynamic loading and hot-swapping of your weights in the background.
- Faster iteration: Because only the small LoRA weights are updated and managed via artifacts, you can cycle from training to production validation instantly.
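As one way to use that traceability, the public wandb API can walk from a served adapter version back to the run that produced it. A minimal sketch, assuming the adapter was logged from a W&B run (artifact path and version are placeholders):

```python
import wandb

api = wandb.Api()

# Fetch the exact adapter version currently being served.
artifact = api.artifact("<your-team>/<your-project>/qwen_lora:v3", type="model")

# Recover the training run that logged it, along with its hyperparameters.
producer = artifact.logged_by()
if producer is not None:
    print("Trained by run:", producer.name, producer.url)
    print("Hyperparameters:", dict(producer.config))
```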
Getting started
Serverless LoRA Inference is now available in public preview. Read the docs and try the Serverless LoRA Inference demo notebook.