Physical AI at GSK: Unlocking business value with foundation models and digital twins
When GSK set out to serve 1.4 billion patients annually across 37 manufacturing sites worldwide, with a goal of reaching 2.5 billion by 2031, Sander Timmer and his AI team faced a daunting reality: the pharmaceutical supply chain is one of the most complex decision networks on earth. Making medicines and vaccines takes years, cycles are long, and every decision has massive downstream impacts that ripple across continents.
“It’s not easy,” said Timmer. “You can’t make decisions in silos. You need to integrate the entire complex decision network, understand KPIs, map them together, and feed them to agentic systems so you can simulate the impact of decisions.”
What started as an effort to move beyond static BI dashboards has evolved into something far more ambitious: a comprehensive “Physical AI” strategy that spans agentic supply chains, real-time digital twins controlling living biological processes, and an AI infrastructure foundation supporting cutting-edge innovation across global manufacturing sites and complex workflows.
At the center of this transformation? Weights & Biases, serving as the unified platform that brings order to GSK’s sprawling multi-cloud, multi-vendor AI ecosystem spanning thousands of production models.
Read on to learn how GSK has built the world’s most sophisticated pharmaceutical AI operations platform, and why W&B’s Weave and Models became indispensable to making it all work.
The impossible complexity of pharmaceutical AI
GSK’s AI challenges are unique in their scale and consequences. The company’s operations span demand planning and forecasting, sourcing and procurement, logistics and transportation management, warehouse management, inventory optimization, and final mile delivery, all while maintaining the stringent quality controls and regulatory compliance that pharmaceutical manufacturing demands.
The team needed to overcome several critical challenges that made traditional AI approaches insufficient:
Zero tolerance for error: In pharmaceuticals, false negatives are absolutely unacceptable. When AI makes a decision affecting patient safety or drug quality, the basis and history of that decision must be recorded more precisely than human decision-making.
Extreme operational complexity: With manufacturing sites using Google Cloud, Microsoft Cloud, Databricks, NVIDIA Edge, and multiple other vendors, getting a unified view of AI operations across different platforms, sources, and deployment environments was extremely challenging.
The “black box” problem in regulated industries: Without comprehensive tracing, it’s impossible to understand why models produced specific outcomes. For audits, inspections, and continuous improvement, GSK needed complete visibility into every AI decision.
Constant foundation model evolution: The rapid pace of new LLMs and frameworks creates continuous regression risk—new features can unintentionally break existing agent capabilities and degrade performance in production.
Moving from reactive to proactive MLOps: GSK’s initial approach was entirely manual and reactive. Nine sites around the world ran local inference and stored results in a central database that humans validated, and accuracy was reviewed in monthly PDF reports. When problems emerged, it could take five months to get from discovery to a deployed fix.
Building the agentic supply chain
Rather than continue relying on BI dashboards where humans manually investigate every anomaly, Timmer and his team envisioned a fundamentally different paradigm: agentic systems that work with data, not just report on it.
“The previous approach was: person in charge views KPIs on a dashboard, if there’s an abnormality someone checks by phone or email, then makes decisions by mobilizing experience and intuition,” explained Timmer. “Now, agents automatically detect anomalies, present root cause hypotheses to the brand manager, simulate scenarios with downstream impacts, and people can focus on choosing which action to adopt.”
To power this vision, GSK collaborated with Microsoft to build AIGA, an internal framework for GenAI and agentic solutions. Rather than simply connecting LLMs, AIGA orchestrates a sophisticated flow: searching short-term memory, meeting notes, and standard operating procedures; planning which apps, tools, and data to access; then moving to inference and answer generation—with every step fully traceable.
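The retrieve, plan, and answer flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not AIGA itself: the function names, the in-memory `TRACE` list, and the keyword-matching retrieval are all hypothetical stand-ins. In practice a tracing library such as W&B Weave provides a decorator for this purpose; the hand-rolled `traced` decorator here only mimics the idea so the example runs without any external service.

```python
import functools
import time

TRACE = []  # hypothetical in-memory trace log; a real system would send this to a tracing backend


def traced(step_name):
    """Record the inputs, output, and duration of every call under step_name."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACE.append({
                "step": step_name,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "seconds": time.perf_counter() - start,
            })
            return result
        return wrapper
    return decorator


@traced("retrieve")
def retrieve(question, sources):
    # Toy keyword match standing in for a search over memory, meeting notes, and SOPs.
    words = set(question.lower().split())
    return [doc for doc in sources if words & set(doc.lower().split())]


@traced("plan")
def plan(question, context):
    # Hypothetical planner: choose a tool depending on whether any context was found.
    return ["summarize_context"] if context else ["ask_for_clarification"]


@traced("answer")
def answer(question, context, tools):
    # Placeholder for the LLM inference and answer-generation step.
    return f"{tools[0]}: {len(context)} supporting document(s)"


def run_agent(question, sources):
    context = retrieve(question, sources)
    tools = plan(question, context)
    return answer(question, context, tools)


sources = ["SOP: batch release checklist", "Meeting notes: vaccine demand forecast"]
print(run_agent("What is the demand forecast?", sources))
print([entry["step"] for entry in TRACE])
```

Because the decorator wraps each step rather than the agent as a whole, the trace shows which stage produced which intermediate result, which is the property that makes a flow like this auditable.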
But building agents that work in development is vastly different from running them reliably in production at pharmaceutical scale.
The agentic AI playbook with W&B Weave
GSK quickly discovered that operating LLM-based agents in production presented unique challenges. Evaluating subjective text outputs, preventing regression as the system evolves, and maintaining governance over which model versions are in production all proved significantly more complex than traditional ML operations.
“What we’ve been doing with Weights & Biases is really bringing in that core piece in the middle layer—the evaluation and tracing from W&B Weave inside our framework,” said Timmer. “Day-to-day, teams use our framework, so out of the box they get all the tracing from W&B Weave. From day one, everything is traced.”
GSK developed what they call the Agentic AI Playbook with three core pillars, all powered by W&B Weave:
Knowledge capture: User feedback is logged in the W&B Weave dashboard, allowing teams to review what humans like and dislike, creating a “Golden Dataset” from real operational feedback. This captures the semantic knowledge that’s impossible to evaluate with traditional metrics.
Rapid prototyping: Using reusable GenAI evaluation metrics and scorers in Weave, teams can quickly test new LLMs or different agent strategies against their Golden Dataset. “Because we have this golden dataset, we can do much more rapid prototyping,” said Timmer. “If there’s a new LLM or a different strategy we want to test, we can actually test it quickly and accurately with the easy scorers in W&B Weave.”
Monitoring and governance: All production traces are logged using W&B Weave, eliminating the black box problem. “W&B Weave has been invaluable in tracing and monitoring the products when they go live, so we no longer have a black box and can actually measure everything and fix things easily when needed,” said Timmer.
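The first two pillars can be illustrated with a small sketch: build a golden dataset from user feedback, then score candidate models against it. Everything here is an assumption made for illustration, including the feedback schema, the keyword-overlap scorer, and the two lookup-table "candidates" standing in for LLMs. It does not use the Weave API, and real evaluations would rely on far richer scorers.

```python
def build_golden_dataset(feedback_log):
    """Keep only the interactions that users approved."""
    return [
        {"question": row["question"], "expected": row["answer"]}
        for row in feedback_log
        if row["thumbs_up"]
    ]


def keyword_overlap_scorer(expected, output):
    """Toy scorer: fraction of the expected words that appear in the output."""
    expected_words = set(expected.lower().split())
    output_words = set(output.lower().split())
    return len(expected_words & output_words) / len(expected_words)


def evaluate(candidate, golden, scorer):
    """Mean score of a candidate over the golden dataset."""
    if not golden:
        raise ValueError("golden dataset is empty")
    scores = [scorer(ex["expected"], candidate(ex["question"])) for ex in golden]
    return sum(scores) / len(scores)


def make_candidate(answers):
    """Hypothetical stand-in for a model: looks answers up in a table."""
    return lambda question: answers.get(question, "")


# Illustrative feedback rows; the schema is an assumption.
feedback_log = [
    {"question": "Which site is short on vials?",
     "answer": "site b is short on vials", "thumbs_up": True},
    {"question": "When is the next batch due?",
     "answer": "next batch is due friday", "thumbs_up": True},
    {"question": "Why did yield drop?", "answer": "unknown", "thumbs_up": False},
]
golden = build_golden_dataset(feedback_log)

candidate_a = make_candidate({
    "Which site is short on vials?": "site b is short on vials",
    "When is the next batch due?": "the batch is due friday",
})
candidate_b = make_candidate({
    "Which site is short on vials?": "a site is short",
    "When is the next batch due?": "due soon",
})

for name, candidate in [("candidate_a", candidate_a), ("candidate_b", candidate_b)]:
    print(name, round(evaluate(candidate, golden, keyword_overlap_scorer), 2))
```

Keeping the dataset and the scorer fixed while swapping only the candidate is what allows a new LLM or agent strategy to be compared quickly against the current one.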
Physical AI: Digital twins controlling living biological processes
While agentic supply chains help to automate and manage planning and decision-making, GSK’s “Physical AI” strategy extends into the physical manufacturing processes themselves, in the notoriously complex world of biomanufacturing.
Vaccine manufacturing involves living cells that behave slightly differently each time, with countless variables including temperature, nutrition, raw material vendors, and process duration. Previously, processes followed fixed schedules—for example, always running for exactly three days. If something went wrong late in the process, manufacturers would have to start over from scratch.
To help solve this problem, GSK built digital twins operating at three levels of sophistication:
Digital Models: Simulations based on past data, leveraging in-silico experimentation to understand process behavior.
Digital Shadows: Real-time monitoring of current state and prediction of future state based on present sensor data.
Digital Twins: Real-time control actions for optimal outcomes, with systems recommending when to stop processes for maximum quality—such as “this batch should stop at 3 days plus 2 hours” rather than a fixed 3-day schedule.
The system uses real-time sensors feeding information to ML workflows that predict batch trajectories and recommend optimal actions. GSK has optimized 13 key products using this approach, improving both yield and product quality while reducing batch-to-batch variability.
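As a rough illustration of the step from prediction to recommended action, the sketch below picks the stop time that maximizes a predicted quality curve within an allowed window. The quadratic `predicted_quality` function, its peak at hour 74, and the 66 to 78 hour window are invented for the example; GSK's actual models and constraints are not described in this article.

```python
def predicted_quality(hour, peak_hour):
    """Hypothetical trajectory model: quality rises, peaks, then declines."""
    return 100.0 - 0.05 * (hour - peak_hour) ** 2


def recommend_stop_hour(predict, earliest, latest):
    """Return the hour in [earliest, latest] with the highest predicted quality."""
    return max(range(earliest, latest + 1), key=predict)


def format_duration(hours):
    """Express a number of hours as whole days plus remaining hours."""
    days, remainder = divmod(hours, 24)
    return f"{days} days plus {remainder} hours"


# Assume the model currently estimates that this batch will peak at hour 74.
stop = recommend_stop_hour(lambda h: predicted_quality(h, peak_hour=74), 66, 78)
print(format_duration(stop))  # prints "3 days plus 2 hours"
```

In a live setting the predicted curve would be re-estimated as new sensor readings arrive, so the recommended stop time can shift during the run instead of being fixed in advance.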
TwinOps: Managing thousands of models with W&B Registry
Creating effective digital twins was only half the challenge. The real breakthrough was building an operational framework—what GSK calls “TwinOps”—to manage, monitor, and continuously improve these models across a heterogeneous technology landscape.
“Originally we were building everything side by side in a single instance of AzureML or Databricks MLflow, but we had issues getting a unified view across all our sites and campuses,” said Timmer. “We started integrating W&B Weave and W&B Models to really start tracing when we have humans making decisions. Through W&B Weave, we can trace whether predictions were accepted or not.”
W&B Models, specifically the Registry, became the foundation for managing production models at scale. GSK uses the Registry to track what models were in production at which sites at what times, what performance they achieved, and to run retrospective experiments to see if different models would produce different results.
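The core question a registry like this answers is "which model version was in production at this site on this date?" The sketch below models that lookup with a plain in-memory class. It is not the W&B Registry API; the `Deployment` fields, model names, sites, dates, and accuracy figures are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    model: str
    version: str
    site: str
    start: date
    end: Optional[date]  # None means the version is still in production
    accuracy: float      # illustrative figure, not real data


class ModelRegistry:
    """Minimal in-memory stand-in for a central model registry."""

    def __init__(self):
        self._deployments = []

    def record(self, deployment):
        self._deployments.append(deployment)

    def in_production(self, site, on):
        """Return the deployments active at `site` on date `on` (end date exclusive)."""
        return [
            d for d in self._deployments
            if d.site == site and d.start <= on and (d.end is None or on < d.end)
        ]


registry = ModelRegistry()
registry.record(Deployment("plate-inspector", "v1", "site-a", date(2024, 1, 1), date(2024, 6, 1), 0.97))
registry.record(Deployment("plate-inspector", "v2", "site-a", date(2024, 6, 1), None, 0.99))
registry.record(Deployment("plate-inspector", "v1", "site-b", date(2024, 3, 1), None, 0.96))

for d in registry.in_production("site-a", date(2024, 3, 15)):
    print(d.model, d.version, d.accuracy)
```

Note that the record says nothing about where inference runs. Keeping deployment history separate from the serving platform is what lets one registry cover models hosted by different vendors.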
The team also built a Value Dashboard on top of the W&B infrastructure—a real-time view showing how much value each AI model generates in terms of additional doses produced and cost savings achieved.
“We make heavy use of the Registry so we know what models were in production at what site at what time, what was the performance,” explained Timmer. “We try to use the Registry and W&B as our core ML platform to really project how much money we’re actually making, how many extra doses we’re making, by using AI technologies.”
This matters enormously when operating at GSK’s scale. GSK forecasts that it will eventually manage 30,000 models in production, which makes a centralized registry essential for governance, auditing, and continuous improvement.
The single lens view: Unifying a complex multi-vendor ecosystem
Perhaps the most critical value W&B provides is solving GSK’s vendor fragmentation challenge. With Google Cloud, Azure, Databricks, C3, NVIDIA Edge, and constantly emerging startup products all running simultaneously across different sites, manually tracking AI operations would be impossible.
“We have a very complex ecosystem—we’re a very large manufacturing company so we have a lot of vendors, and we don’t have control of what vendors are using,” said Timmer. “It’s extremely hard to get a unified view of what’s going on with all these platforms.”
W&B provides what GSK calls a “single lens view”: a unified registry that serves as the single source of truth for which models are in production, regardless of where inference actually happens.
“What we’re trying to build with W&B is a single lens view where we can say: this is our Registry of Truth,” said Timmer. “These are the models that are in production. If we get an inspection or audit we can easily say these are the models we are using, that were used then, and it doesn’t matter where the inference is happening because we decoupled that. It’s really important to us because we’re scaling to thousands and thousands of models.”
This architecture allows GSK to operate on existing infrastructure with no vendor lock-in, creating no disruptions in model development while delivering a single source of truth for model monitoring and MLOps activities across the entire AI ecosystem.
From reactive PDFs to proactive automated MLOps
The transformation from reactive to proactive operations is perhaps best illustrated by GSK’s petri dish inspection system. To maintain sterile environments, GSK installs 3 million petri dishes annually at each site. Regulatory requirements originally mandated that two people check every dish; GSK has since moved to a “one person plus AI” system.
Creating an accurate deep learning model wasn’t the challenge. Operating it at scale across nine global sites processing 3 million images per year revealed the real problem: the system was entirely reactive. Inference happened at each site, results went to a central database, monthly PDF reports were reviewed in meetings, and when accuracy dropped, someone requested model fixes. It took five months to get from discovering a problem to deploying a fix.
With W&B, GSK built an automated, proactive MLOps system. All sites and inference results integrate into a single platform where agents automatically detect data drift and model drift. When accuracy starts declining, the system begins training the next model version—so by the time business stakeholders review the situation, they receive both a problem description and a model with suggested improvements.
“We’re trying to bring it all to a single platform where we can build agents that detect when data or model drift is happening,” said Timmer. “The development team can start building a new model, so by the time you report to business stakeholders that the model doesn’t work, you already have a new model.”
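One simple way to picture the proactive trigger is a rolling accuracy monitor over human-validated results. The sketch below is an assumption-laden illustration, not GSK's system: the window size, the threshold, and the label names are invented. It also covers only model drift as measured by accuracy; detecting data drift would additionally require comparing input distributions over time.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent `window` human-validated predictions."""

    def __init__(self, window, threshold):
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, prediction, human_label):
        self.results.append(prediction == human_label)

    def accuracy(self):
        if not self.results:
            return None  # no validated predictions yet
        return sum(self.results) / len(self.results)

    def needs_retraining(self):
        # Only decide once the window is full, to avoid reacting to a handful of samples.
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.accuracy() < self.threshold


monitor = AccuracyMonitor(window=5, threshold=0.8)

# Five correct predictions fill the window: no action needed.
for _ in range(5):
    monitor.record("clean", "clean")
print(monitor.needs_retraining())

# Two misses push rolling accuracy to 3/5, below the threshold.
for _ in range(2):
    monitor.record("clean", "contaminated")
print(monitor.needs_retraining())
```

When `needs_retraining()` turns true, an automated pipeline could start training the next model version straight away, so a candidate replacement already exists when stakeholders are told about the drop.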
The monthly PDF was replaced with a real-time dashboard where site managers instantly check daily processing volumes and accuracy rates, treating AI model status as a common operational language.
Building the foundation for humanoids and VLA
Looking even further ahead, GSK has begun exploring humanoids and Vision-Language-Action (VLA) models for manufacturing sites. Many processes, including appearance inspections, visual confirmations, and manual patrols, still require human labor. Combining robots with VLA models could automate these tasks.
But in pharmaceuticals, where false negatives are absolutely unacceptable, the barrier isn’t technical capability, but operational rigor. When machines make decisions affecting patient safety, every decision must be traceable: under what conditions, which model, what accuracy history, what was the basis for the decision?
“Current AI technology isn’t able to get the false-detection rate low enough yet—we have zero tolerance for false negatives,” said Timmer. “It’s critical to track and trace all inferences to ensure continuous model and system improvements and enable humans-in-the-loop.”
W&B’s experiment management, tracing, and registry systems provide the foundation GSK needs. Having the infrastructure to support complete records of every decision, every model version, and every accuracy measurement is critical before humanoids and VLA models can safely operate in pharmaceutical manufacturing.
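To make the false-negative requirement concrete: once every inference is traced alongside the human reviewer's verdict, the false-negative rate per model version can be computed directly from the log. The sketch below is a minimal example under assumed field names (`model_version`, `model_flagged`, `human_found_defect`) and made-up records.

```python
from collections import defaultdict


def false_negative_rate(records):
    """FNR = missed positives / all confirmed positives, using human review as ground truth."""
    positives = [r for r in records if r["human_found_defect"]]
    if not positives:
        return None  # undefined when no positives have been confirmed
    missed = [r for r in positives if not r["model_flagged"]]
    return len(missed) / len(positives)


def fnr_by_version(records):
    """Group traced inferences by model version and compute the FNR for each."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["model_version"]].append(record)
    return {version: false_negative_rate(rows) for version, rows in grouped.items()}


# Made-up trace records for illustration only.
records = [
    {"model_version": "v1", "model_flagged": True,  "human_found_defect": True},
    {"model_version": "v1", "model_flagged": False, "human_found_defect": True},
    {"model_version": "v1", "model_flagged": True,  "human_found_defect": True},
    {"model_version": "v1", "model_flagged": True,  "human_found_defect": True},
    {"model_version": "v2", "model_flagged": True,  "human_found_defect": True},
    {"model_version": "v2", "model_flagged": False, "human_found_defect": False},
]
print(fnr_by_version(records))  # prints {'v1': 0.25, 'v2': 0.0}
```

A metric like this depends on the human-in-the-loop review: without a reviewer's verdict on cases the model did not flag, missed positives never appear in the data.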
With W&B providing the unified platform that brings coherence to GSK’s complex multi-vendor ecosystem, the company continues pushing toward its ambitious goal: serving 2.5 billion patients annually by 2031, powered by AI systems that are not just intelligent, but transparent, auditable, and continuously improving.
“We’re building the foundation,” said Timmer. “AI is here and it’s reshaping pharma beyond imagination.”