Current best practices for training LLMs from scratch

Introduction

Although we’re only a few years removed from the transformer breakthrough, LLMs have already grown massively in performance, cost, and promise. At W&B, we’ve been fortunate to see more teams try to build LLMs than anyone else. But many of the critical details and key decision points are often passed down by word of mouth.

The goal of this white paper is to distill the best practices for training your own LLM from scratch. We’ll cover everything from scaling and hardware to dataset selection and model training, letting you know which tradeoffs to consider and flagging some potential pitfalls along the way. This is meant to be a fairly exhaustive look at the key steps and considerations you’ll make when training an LLM from scratch.

The first question you should ask yourself is whether training one from scratch is right for your organization. As such, we’ll start there:

Pros

Option 1
Use the API of a commercial LLM
  • Requires the least in-house LLM training expertise.
  • Minimal upfront training/exploration cost, since the main cost is incurred at inference time.
  • The least data-demanding option: only a few examples (or none at all) are needed for models to perform inference.
  • Can leverage the best-performing LLMs on the market and build a superior experience.
  • Reduces time-to-market for your apps and de-risks your project with a working LLM.
Option 2
Use an existing open-sourced LLM
  • A good way to leverage what LLMs have learned from a vast amount of internet data and build on top of it without paying for the IP at inference time.
  • Compared to option one, you are less dependent on the future direction of LLM service providers and thus have more control over your roadmap and backwards compatibility.
  • Compared to option three, you have a much faster time-to-value since you are not building an LLM from scratch, and you need less data, less training time, and a smaller training budget.
Option 3
Pre-train an LLM by yourself or with consultants
  • Compared to options one and two, you have the most control over your LLM’s performance and future direction, giving you lots of flexibility to innovate on techniques and/or customize to your downstream tasks.
  • Gain full control of the training datasets used for pre-training, which directly impacts model quality, bias, and toxicity issues. In comparison, those issues are much less controllable in options one and two.
  • Training your own LLM also gives you a deep moat: superior LLM performance either across horizontal use cases or tailored to your vertical, allowing you to build a sustained advantage, especially if you create a positive data/feedback loop with LLM deployments.

Cons

Option 1
Use the API of a commercial LLM
  • Commercial LLM services can get expensive with a high volume of finetuning or inference tasks. It comes down to the LLM total-cost-of-ownership (TCO) amortized over each inference (a rough cost sketch follows this list).
  • Many industries and use cases forbid the use of commercial LLM services because sensitive or PII data cannot be exposed to the service for compliance reasons (healthcare use cases, for example).
  • If building external apps, you’ll need to find other moats and de-risk your business if you’re highly reliant on external LLM service technology.
  • Less flexible downstream: no support for edge inference, limited ability to customize the model (finetuning gets expensive), and limited ability to make ongoing model improvements.
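
To make the TCO point concrete, here is a rough back-of-the-envelope sketch in Python. Every price and volume below is a hypothetical placeholder, not a quote from any provider; plug in your own numbers.

    # Hypothetical, illustrative prices -- substitute your provider's actual rates.
    PRICE_PER_1K_INPUT_TOKENS = 0.0005   # USD, assumed
    PRICE_PER_1K_OUTPUT_TOKENS = 0.0015  # USD, assumed

    def monthly_api_cost(requests_per_month, avg_input_tokens, avg_output_tokens):
        """Estimate the monthly bill for a commercial LLM API at a given volume."""
        per_request = (avg_input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS \
                    + (avg_output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS
        return per_request * requests_per_month

    # 10M requests/month, ~1,000 tokens in and ~500 tokens out per request:
    print(f"${monthly_api_cost(10_000_000, 1_000, 500):,.0f} per month")
    # At these assumed prices: $12,500/month, every month -- the number to
    # compare against the amortized cost of hosting or training your own model.
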
Option 2
Use an existing open-sourced LLM
  • Not as demanding as building your own, but still requires lots of domain expertise to train, finetune, and host an open-sourced LLM. LLM reproducibility is still a significant issue, so the amount of time and work needed should not be underestimated.
  • Slower time-to-market and less agile if you are building downstream apps, due to a more vertical tech stack.
  • Open-sourced models typically lag behind commercial models in performance by months or years. If your competitor leverages commercial models, they have an advantage on LLM tech and you’ll need to find other competitive advantages.
Option 3
Pre-train an LLM by yourself or with consultants
  • A very expensive endeavor with high risks. You need cross-domain knowledge spanning NLP/ML, subject matter expertise, and software and hardware expertise. If not done well, you could end up in a situation where you’ve spent thousands or even millions of dollars on a suboptimal model. Mistakes, especially at late training stages, are hard to fix or unwind.
  • Less efficient than option two. Option two leverages existing LLMs that have already learned from an entire internet’s worth of data and thus provide a solid starting point. With option three, you start from scratch and need lots of high-quality, diverse datasets for your model to gain generalized capabilities.

When to consider each option

Option 1
Use the API of a commercial LLM
  • Best if you have a less technical team but want to leverage LLM techniques to build downstream apps, or if you want to leverage the best-in-class LLMs for performance reasons (outsourcing the LLM tech).
  • Good if you have very limited training data and want to leverage an LLM’s capability to do zero/few-shot learning (a minimal sketch follows this list).
  • Good for prototyping apps and exploring what is possible with LLMs.
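
As an illustration of the few-shot pattern, here is a minimal sketch using OpenAI’s Python client, one of several commercial options. The model name, labels, and in-context examples are placeholders, not recommendations.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Few-shot classification: the in-context examples stand in for training data.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice; pick per your needs/budget
        messages=[
            {"role": "system", "content": "Classify the support ticket as 'billing' or 'technical'."},
            {"role": "user", "content": "I was charged twice this month."},
            {"role": "assistant", "content": "billing"},
            {"role": "user", "content": "The app crashes when I upload a file."},
            {"role": "assistant", "content": "technical"},
            {"role": "user", "content": "My invoice shows the wrong plan."},
        ],
    )
    print(response.choices[0].message.content)  # expected: "billing"

With a handful of in-context examples like these, no training run is needed at all, which is exactly what makes option one attractive for prototyping.
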
Option 2
Use an existing open-sourced LLM
  • Between options two and three, if you aren’t trying to change the model architecture, it is almost always better to either directly take an existing pre-trained LLM and finetune it, or take the weights of an existing pre-trained LLM as a starting point and continue pre-training. That’s because a good pre-trained LLM like GPT-NeoX has already seen a vast amount of data and has learned general capabilities from it. You can leverage that learning, especially if your training dataset is not huge or diverse (see the finetuning sketch after this list).
  • Another typical scenario is that you operate in a regulated environment or have user or other sensitive data that cannot be fed to commercial LLM services, or you need edge deployment of the model for latency or locational reasons.
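
To ground this, here is a minimal finetuning sketch using Hugging Face transformers. A small Pythia checkpoint stands in for a larger model like GPT-NeoX so the sketch stays runnable; the dataset, hyperparameters, and output path are all illustrative assumptions.

    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)
    from datasets import load_dataset

    # A small open checkpoint keeps the sketch runnable; swap in a larger model.
    model_name = "EleutherAI/pythia-160m"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # Pythia has no pad token by default
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Illustrative public dataset; replace with your own domain corpus.
    raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    train_ds = raw.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True,
        remove_columns=raw.column_names,
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="finetuned-llm",     # hypothetical output path
            per_device_train_batch_size=4,  # illustrative hyperparameters
            num_train_epochs=1,
            learning_rate=2e-5,
        ),
        train_dataset=train_ds,
        # mlm=False gives causal-LM labels (inputs as targets) with padding handled.
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

The point of the pattern is that the model weights already encode general language capabilities; your job is only to adapt them to your domain, which takes a fraction of the data and compute of option three.
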
Option 3
Pre-train an LLM by yourself or with consultants
  • Best if you need to change the model architecture or training dataset from existing pre-trained LLMs. For example, if you want to use a different tokenizer, change the vocabulary size, or change the number of hidden dimensions, attention heads, or layers (see the config sketch at the end of this section).
  • Typically, in this case the LLM is a core part of your business strategy and technological moat. You are taking on some or all of the innovation in LLM training, and have a large investment appetite to train and maintain expensive models on an ongoing basis.
  • Typically, you have or will have lots of proprietary data associated with your LLM to create a continuous model improvement loop for sustainable competitive advantage.
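
As an example of the architectural knobs mentioned above, here is a minimal sketch that instantiates a GPT-NeoX-style model from scratch with a custom configuration via Hugging Face transformers. Every size below is an illustrative choice, not a recommendation; these are exactly the parameters you can only set when pre-training yourself.

    from transformers import GPTNeoXConfig, GPTNeoXForCausalLM

    # All values are illustrative assumptions, not recommendations.
    config = GPTNeoXConfig(
        vocab_size=50_432,             # tied to your custom tokenizer
        hidden_size=2_048,             # width of the hidden dimensions
        num_hidden_layers=24,          # depth
        num_attention_heads=16,        # hidden_size must be divisible by this
        intermediate_size=8_192,       # feed-forward width, often 4x hidden_size
        max_position_embeddings=2_048, # context length
    )

    model = GPTNeoXForCausalLM(config)  # randomly initialized, ready to pre-train
    print(f"{model.num_parameters():,} parameters")

From here, the actual pre-training run is typically driven by a distributed training framework such as GPT-NeoX or Megatron-LM rather than a single script, which is where the cross-domain hardware and software expertise from the cons list comes in.
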