How Scaled Cognition used Weights & Biases to build reliable LLMs with 114% higher accuracy for regulated industries

"It was neat and easy to integrate. The Python SDK is very straightforward and the W&B platform provides useful and detailed monitoring for our training jobs—that's why we started using it."
Anthony Platanios
VP of Research

Scaled Cognition builds super-reliable intelligence—and the tooling to deploy it. Their proprietary LLM family, led by APT-1, powers AI customer support agents in regulated industries like banking and healthcare, where anything less than reliable isn’t an option.

Why enterprise CX requires a different approach to AI


General-purpose LLMs, while powerful, are often misaligned with the demands of enterprise customer experience (CX). Optimized for engagement, these models can fabricate responses and override company policies. Compounding the challenge, CX lacks objectively verifiable outputs, making quality evaluation more complex. Scaled Cognition needed a solution that could reliably enforce policy compliance while providing a robust framework to evaluate performance across nuanced scenarios.

Low friction with Weights & Biases

 
Scaled Cognition realized that solving this problem required fundamental changes both to how these models operate and to how they are trained. To this end, they built a novel synthetic data generation approach, combined it with modifications to the underlying models and training algorithms, and began training their own models with reliability as the primary objective.

When evaluating tools for training, Scaled Cognition’s research team considered several options before landing on Weights & Biases, and the decision came down to low friction. The Weights & Biases Python SDK plugged directly into their existing training workflow with minimal code changes, letting the team add metrics, track experiments, monitor system resources, and pull up live dashboards without rearchitecting their pipeline or managing complex integrations.

“It was neat and easy to integrate [using the Weights & Biases Python SDK],” said VP of Research Anthony Platanios. “The Python SDK is very straightforward and the W&B platform provides useful and detailed monitoring for our training jobs — that’s why we started using it.”

A single source of truth for model training


Weights & Biases serves as the central observability layer for Scaled Cognition’s model training pipeline. Every training run is logged through the W&B SDK, which captures the full suite of metrics the research team cares about, from custom loss functions to accuracy and consistency measures like PassK, which tracks how often the model produces the correct behavior across repeated trials for identical scenarios.
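Under the reading above, PassK can be computed as the fraction of scenarios whose K repeated trials all pass. This is an illustrative sketch of that interpretation, not Scaled Cognition’s implementation; the function and scenario names are hypothetical.

```python
from typing import Dict, List

def pass_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of scenarios where every repeated trial was correct.

    `trials` maps a scenario id to the pass/fail outcome of each of its
    K repeated runs. A scenario only counts toward PassK if the model
    behaves correctly on all K repetitions.
    """
    if not trials:
        return 0.0
    consistent = sum(all(outcomes) for outcomes in trials.values())
    return consistent / len(trials)

# Hypothetical scenario outcomes over K=3 repetitions each.
results = {
    "refund-policy": [True, True, True],   # correct on every repetition
    "card-dispute":  [True, False, True],  # one slip -> fails PassK
}
print(pass_k(results))  # → 0.5
```

A per-scenario boolean like this can be logged alongside losses each evaluation cycle, which is how a consistency metric ends up on the same live dashboard as training curves.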

The entire research team uses W&B, and runs are stored with persistent links in the team’s internal dashboards. When an experiment needs to be reviewed, the run log provides a complete, reproducible record of what happened.

For a team training custom LLMs on bare metal GPU clusters, hardware visibility is critical. Live system resource monitoring in W&B, covering GPU utilization, memory usage, and power consumption, has become one of the team's most valued capabilities. As Anthony puts it:

“System metrics around GPU memory utilization and power usage are super helpful for knowing if something is going wrong in subtle ways that are hard to notice if you’re just looking at loss trajectories and less fine-grained system metrics.”

Finally, Scaled Cognition’s broader MLOps pipeline, spanning workflow orchestration, data pipelines, and evaluation dashboards, is handled through internal tooling. W&B integrates cleanly into this ecosystem as the dedicated observability layer.

Delivering 114% higher model accuracy


The ability to monitor training runs in real time, and to add and iterate on metrics as the research agenda evolves, directly supports Scaled Cognition’s core mission of building higher-accuracy models. Their custom training stack uses non-standard loss functions designed to weight the importance and consequences of actions rather than the exact prediction of tokens. Tracking these novel losses alongside system health metrics in a unified, live dashboard has shortened the feedback loop between experiment design and insight.

The world’s leading companies are already using Scaled Cognition’s models and tools for the conversations they can’t afford to get wrong, twenty-four hours a day. APT-1 is extremely performant, but what sets it apart is how that performance holds up under repetition. General-purpose LLMs are typically evaluated on Pass¹ performance: getting the answer right once. Enterprise-scale CX demands far more: the same correct response across repeated executions, with zero tolerance for drift or policy deviation. At Pass⁵⁰, APT-1 achieves 114% higher accuracy than the best-performing general-purpose LLM. And for the classes of errors where Scaled Cognition provides hard guarantees, the model’s performance does not decay with repetition: it stays at 100%.

“For the kinds of things that we can never get wrong, performance stays at 100% no matter how many times you try,” Anthony said. “General-purpose LLMs always have some variance—even at temperature zero. That’s a reliability gap that enterprises in regulated industries cannot afford.”