Multi-domain large language model adaptation using synthetic data generation at Shell
As one of the world’s oldest multinational oil and gas companies, Shell counts among its most valuable assets a treasure trove of institutional domain knowledge accrued over decades of leading cutting-edge solutions in the space. Unfortunately, this knowledge has long been underutilized and fragile, vulnerable to employee departures and organizational silos, making it increasingly difficult for new researchers to connect the dots across decades of scientific work.
Injy Sarhan and her NLP research team set out to tackle a daunting challenge: turning institutional domain knowledge into an AI assistant that truly understood Shell’s unique language, domain expertise, and technical context in ways that off-the-shelf LLMs simply couldn’t.
Working in collaboration with NVIDIA, and supported by Weights & Biases, they built a research assistant powered by domain-adapted LLMs that could serve as an institutional memory system, helping new researchers become more efficient and accurate while reaching their goals faster. The result was a complex data processing and model tuning pipeline and a 30% improvement in domain-specific understanding vs. baseline models, powered by a massive 9 billion token dataset and a sophisticated synthetic data generation pipeline.
Read on to learn how Shell built this future-proof innovation pipeline and fine-tuned LLM adaptation at one of the world’s largest energy companies.
Solving the challenge of speaking Shell’s language with domain ingestion and knowledge reasoning pipelines
Shell’s technical knowledge spans highly specialized domains – petroleum engineering, chemical processing, materials science, corrosion analysis, and countless other areas with their own vocabularies, concepts, and reasoning patterns. Simply plugging an off-the-shelf LLM into a RAG system proved insufficient and unscalable.
The goal was to build in-house capability to fine-tune LLMs that truly understand Shell’s language and domain, creating a future-proof innovation pipeline that helps researchers work more efficiently. Sarhan’s approach divided the system into two main pillars: domain ingestion and knowledge reasoning. The long-term goal was to eventually reason at a high level across all of Shell’s data, but the team started with a focused archive of technical reports.
The pipeline consisted of four critical components, each supported by a carefully chosen software stack selected for its capability to handle Shell’s scale:
- Data preprocessing: Using NVIDIA’s NeMo Curator, MinerU, and NV-Ingest to handle AI-based text extraction and domain classification, starting from more than 300,000 Shell technical reports before ultimately refining the dataset to 154,000 technical reports and 20,000 chemical reports. Domain classification was the key to creating better benchmarks and higher-quality synthetic data.
- Domain-Adaptive Pretraining (DAPT): Leveraging NeMo Framework 1.0/2.0, DeepSpeed, PyTorch FSDP, and Hugging Face Transformers to ingest knowledge directly from extracted text into foundation models. This was also the most computationally intensive part of the entire pipeline.
- Instruction tuning: Using an LLM evaluation harness with NeMo integration and Weights & Biases to manage supervised fine-tuning (SFT) and to track comprehensive evaluations across both public and Shell-specific benchmarks.
“We built a custom dashboard supported by W&B logging all our experiments,” said Senior Researcher Avanindra Singh. “We also do a lot of hyperparameter tuning using W&B Sweeps to explore both impact and sensitivity. Hyperparameter optimization is so critical because we’re taking a foundation model that has already been trained and we need to ingest knowledge so it doesn’t forget its previous knowledge, but also generalizes and learns on new knowledge. We optimized for five parameters – context length, overlap split, learning rate, dropout, and weight decay.”
- Synthetic data generation and evaluation framework: The breakthrough came when Shell discovered that mixing supervised fine-tuning (SFT) with DAPT led to significantly better knowledge ingestion behavior. But the scale of this approach required generating massive amounts of synthetic training data – specifically, multiple-choice questions grounded in Shell’s technical reports. Deploying NIM embeddings alongside LangChain, with Llama 3.3 405B as the instruction model, Sarhan and team developed a framework able to process 1,000 documents in 24 hours using 64 GPUs, ultimately generating more than 2 million synthetic instructions across 154,000 documents.
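Shell hasn’t published its actual sweep setup, but a W&B Sweeps configuration covering the five hyperparameters Singh names might look like the sketch below. All ranges, the metric name, and the project name are illustrative assumptions, not Shell’s actual values.

```python
# Illustrative W&B Sweeps configuration over the five hyperparameters
# mentioned above. Ranges and metric name are assumptions.
sweep_config = {
    "method": "bayes",  # Bayesian search; "grid" or "random" also work
    "metric": {"name": "eval/loss", "goal": "minimize"},
    "parameters": {
        "context_length": {"values": [2048, 4096, 8192]},
        "overlap_split": {"values": [0.0, 0.1, 0.2]},
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-6,
            "max": 1e-4,
        },
        "dropout": {"values": [0.0, 0.05, 0.1]},
        "weight_decay": {"values": [0.0, 0.01, 0.1]},
    },
}

# Launching the sweep requires a logged-in wandb client, e.g.:
# import wandb
# sweep_id = wandb.sweep(sweep_config, project="dapt-tuning")  # hypothetical project
# wandb.agent(sweep_id, function=train_fn)                     # train_fn is user code
```

Each agent run then reads its assigned values from `wandb.config` and logs the metric named in the config so the sweep controller can steer the search.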
The process was also computationally intensive, with the team using a combination of on-prem clusters (10 nodes with 8 GPUs each, for 80 A100 GPUs) and AWS clusters (24 nodes with 8 GPUs each, for 192 H100 GPUs). Singh estimates the team racked up 277k compute hours over the initial three-month training period.
Building a comprehensive evaluation framework and feedback loop supported by W&B Weave
After generating such a massive amount of synthetic data for fine-tuning, the team also needed to verify its quality. Sarhan and her team used basic deduplication to remove 4.5% of the synthetic data, and used an LLM-as-a-judge to ensure the remaining questions were grounded in the source documents.
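Shell’s filtering code isn’t shown, but the deduplication step can be illustrated with a minimal hash-based sketch over generated questions. The data shape (dicts with a `"question"` field) and the case/whitespace normalization are assumptions for illustration; the LLM-as-a-judge grounding check would run as a separate pass afterward.

```python
import hashlib

def dedup_exact(questions, key=lambda q: q["question"]):
    """Drop exact duplicates (case/whitespace-insensitive), keeping the
    first occurrence. Returns (kept_items, fraction_removed)."""
    seen, kept = set(), []
    for q in questions:
        # Normalize before hashing so trivial variants collapse together
        digest = hashlib.sha256(key(q).strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(q)
    if not questions:
        return kept, 0.0
    return kept, (len(questions) - len(kept)) / len(questions)
```

At Shell’s scale this kind of pass reportedly removed 4.5% of the 2 million-plus synthetic instructions; fuzzy or semantic deduplication would catch paraphrased near-duplicates that exact hashing misses.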
The team also needed to address the critical challenge of benchmarking a domain-adapted model to ensure Shell’s technical domains were being correctly captured. They relied not just on public benchmarks but also on Shell-specific benchmarks covering chemistry, corrosion, and other specialized domains.
“Our evaluation framework uses an LLM-as-judge to filter Shell-created benchmarks, with all results comprehensively logged in our W&B Weave dashboard,” said Sarhan. “Weave has been fully integrated into our evaluation framework.”
Their evaluation consists of two complementary approaches: MCQ analysis, which tests the model’s ability to answer multiple-choice questions derived from Shell’s technical reports, and open-ended analysis, in which an LLM-as-a-judge receives the question, a golden answer, and the model’s generated answer to evaluate more complex responses. The team also focused on robustness testing, rephrasing each question five times and recording consistency scores, ensuring the model’s understanding wasn’t overly sensitive to prompt variations.
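The team’s scoring code isn’t published; as a minimal illustration, a consistency score over a model’s answers to the five rephrasings of one MCQ could be defined as agreement with the modal answer. That majority-agreement definition is an assumption for this sketch, not Shell’s documented metric.

```python
from collections import Counter

def consistency_score(answers):
    """Fraction of answers that agree with the most common answer across
    rephrasings of the same question (1.0 = fully consistent)."""
    if not answers:
        return 0.0
    # Count how often the modal answer appears among the rephrasings
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / len(answers)
```

For example, a model answering `["B", "B", "B", "A", "B"]` across five rephrasings scores 0.8, flagging mild prompt sensitivity even if the majority answer is correct.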
A quality feedback loop to ensure constant iterative improvement is also part of the evaluation framework, as the team works to continuously refine and retrain models using Reinforcement Learning from Human Feedback (RLHF) methods. “We care a lot about the feedback loop, and all collected feedback is logged into W&B Weave,” said Sarhan. “With every question, users are encouraged to rate the response, and we take all that feedback into constantly retraining our model.”
The robust pipeline and evaluation framework have led to strong results: the DAPT + instruction-tuned models achieved a 26% increase in accuracy compared to non-fine-tuned Llama models, and a 30% improvement in domain-specific understanding on Shell benchmarks.
A scalable infrastructure for continuous innovation into the future
Shell’s initial effort to create a future-proof innovation pipeline for domain-adapted models has proven an early success, with the pilot version rolled out to strong adoption. The capability to learn custom Shell knowledge from reports via DAPT + SFT models, along with a scalable synthetic data generation pipeline for benchmarking and fine-tuning, has established a strong foundation to build on.
At the core of this scalable infrastructure is Weights & Biases, which plays a critical role in experiment tracking and logging, custom dashboards for increased visibility into training dynamics, robust hyperparameter optimization, and evaluation and feedback through both W&B Models and W&B Weave.
Sarhan and her team have an ambitious vision for their future roadmap, including complete reasoning models that move beyond knowledge retrieval to perform complex multi-step reasoning grounded in Shell’s domain expertise, multi-modality with Vision-Language Models, and expansion to other Shell domains.