Should You Purchase an LLM or Train Your Own?
An excerpt from our Training LLMs from Scratch piece to help you decide if you should purchase a large language model or train your own
Created on February 27|Last edited on September 25
NB: This piece is excerpted from our recent whitepaper, "Best Practices for Training an LLM from Scratch." To download the complete guide for free, please click the blue button below.
Introduction
Before starting LLM pre-training, the first question you need to ask is whether you should pre-train an LLM by yourself or use an existing one. There are three basic approaches:
- Option 1: Use the API of a commercial LLM, e.g. GPT-3 (OpenAI, 2020), Cohere APIs, AI21 J-1
- Option 2: Use an existing open-sourced LLM, e.g. GPT-J (EleutherAI, 2021), GPT-NeoX (EleutherAI, 2022), Galactica (MetaAI), UL2 (Google, 2022), OPT (MetaAI, 2022), BLOOM (BigScience, 2022), Megatron-LM (NVIDIA, 2021), CodeGen (Salesforce, 2022)
- Option 3: Pre-train an LLM by yourself or with consultants: You can either manage your own training or hire LLM consultants & platforms. For example, Mosaic ML provides training services focusing on LLMs.
That said, there are a lot of details to consider when making your choice. Here are the pros, cons, and applicable scenarios for each option:

Use a Commercial LLM's API
Pros:
- Requires the least LLM training technical skills.
- Minimal upfront training and exploration cost, since the main cost is incurred at inference time.
- The least data-demanding option. Only a few examples (or no examples) are needed for models to perform inference.
- Can leverage the best-performing LLMs in the market and build a superior experience.
- Reduces your apps' time-to-market and de-risks your project by starting from a working LLM.
Cons:
- Commercial LLM services can get expensive with a high volume of fine-tuning or inference tasks. It comes down to LLM total-cost-of-ownership (TCO) amortized to each inference.
- Many industries / use cases forbid the use of commercial LLM services because sensitive data / PII cannot be exposed to the service for compliance reasons (healthcare, for example).
- If building external apps, you’ll need to find other moats and de-risk your business if you’re highly reliant on external LLM service technology.
- Less flexible downstream: doesn’t support edge inference and offers limited ability to customize the model (fine-tuning gets expensive).
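To make the total-cost-of-ownership point concrete, here is a back-of-envelope sketch comparing monthly API spend with self-hosting. Every number here is a made-up assumption for illustration, not a real quote from any provider.

```python
# Hypothetical cost comparison: commercial LLM API vs. self-hosting.
# All prices below are illustrative assumptions, not real vendor quotes.

API_PRICE_PER_1K_TOKENS = 0.02      # assumed $/1K tokens for a hosted API
SELF_HOST_FIXED_MONTHLY = 20_000.0  # assumed monthly GPU + ops cost
SELF_HOST_PER_1K_TOKENS = 0.002     # assumed marginal serving cost

def api_cost(tokens_per_month: float) -> float:
    """Monthly spend if every token goes through the commercial API."""
    return tokens_per_month / 1000 * API_PRICE_PER_1K_TOKENS

def self_host_cost(tokens_per_month: float) -> float:
    """Monthly spend if you host the model yourself: fixed + marginal."""
    return SELF_HOST_FIXED_MONTHLY + tokens_per_month / 1000 * SELF_HOST_PER_1K_TOKENS

def break_even_tokens() -> float:
    """Monthly token volume at which the two options cost the same."""
    return SELF_HOST_FIXED_MONTHLY / (API_PRICE_PER_1K_TOKENS - SELF_HOST_PER_1K_TOKENS) * 1000
```

Below the break-even volume the API is cheaper; above it, the fixed cost of self-hosting amortizes to a lower per-inference price. The crossover point is highly sensitive to the assumed numbers, which is exactly why TCO analysis matters.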
Use an Existing Open-Source LLM
Pros:
- A good way to leverage what LLMs have learned from a vast amount of internet data and build on top of it without paying for the IP at inference.
- Compared to option 1, you are less dependent on the future directions of LLM service providers and thus have more control regarding roadmap & backwards compatibility.
- Compared to option 3, you reach value much faster since you are not building an LLM from scratch, and you need less data, less training time, and a smaller training budget.
Cons:
- Not as demanding as building your own, but training, fine-tuning, and hosting an open-sourced LLM still requires substantial domain expertise. LLM reproducibility is still a significant issue, so the amount of work and time needed should not be underestimated.
- Slower time-to-market and less agile if you are building downstream apps, due to a more vertical tech stack.
- Open-sourced models typically lag commercial models in performance by months or years. If your competitors leverage commercial models, they have an advantage in LLM tech and you’ll need to find other competitive advantages.
Train Your Own LLM
Pros:
- Compared to the prior two options, you have the most control of your LLM’s performance and future direction, giving you lots of flexibility to innovate on techniques and/or customize to your downstream tasks.
- Gain full control of training datasets used for the pre-training, which directly impacts model quality, bias, and toxicity issues. In comparison, those issues are less controllable in option 1 or 2.
- Training your own LLM also gives you a deep moat: superior LLM performance either across horizontal use cases or tailored to your vertical, allowing you to build a sustaining advantage especially if you create a positive data/feedback loop with LLM deployments.
Cons:
- A very expensive endeavor with high risks, requiring cross-domain knowledge spanning NLP/ML, subject matter expertise, and software and hardware engineering. If not done well, you could end up having spent thousands or even millions of dollars on a suboptimal model. Mistakes, especially late in the training stages, are hard to fix or unwind.
- Less efficient than option 2, which leverages existing LLMs that have already learned from data spanning much of the internet and thus provide a solid starting point. With option 3, you start from scratch and need a large, high-quality, and diverse dataset for your model to gain generalized capabilities.
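To put a rough number on how data-hungry pre-training from scratch is, the compute-optimal heuristic popularized by the Chinchilla work (roughly 20 training tokens per model parameter) is a common rule of thumb. The sketch below applies that heuristic; treat it as an order-of-magnitude estimate, not a guarantee.

```python
# Rough pre-training data requirement using the compute-optimal
# "Chinchilla" heuristic of ~20 training tokens per parameter.
# This is a rule of thumb for planning, not an exact requirement.

TOKENS_PER_PARAM = 20  # compute-optimal ratio from the Chinchilla heuristic

def compute_optimal_tokens(n_params: float) -> float:
    """Approximate training tokens needed for a model of n_params parameters."""
    return TOKENS_PER_PARAM * n_params

# A 7B-parameter model would want on the order of 140B training tokens;
# a 70B model, on the order of 1.4T tokens.
tokens_for_7b = compute_optimal_tokens(7e9)
```

The takeaway: even a modestly sized model needs hundreds of billions of high-quality tokens, which is the data gap option 2 sidesteps by reusing an already-trained model.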
Things to Consider Before Deciding
Below, we'll walk through a few additional factors that will help you decide if training an LLM from scratch is right for your business.
Reasons You May Want to Use a Commercial LLM:
- Best if you either have a less technical team but want to leverage LLM techniques to build downstream apps, or you want to leverage the best-in-class LLMs for performance reasons (outsourcing the LLM tech).
- Good if you have very limited training datasets and want to leverage an LLM’s capability to do zero/few-shot learning.
- Good for prototyping apps and exploring what is possible with LLMs.
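The zero/few-shot point above can be sketched concretely: instead of a training dataset, you embed a handful of labeled examples directly in the prompt and let the hosted model infer the task. The sentiment task and examples below are made up for illustration.

```python
# A minimal sketch of few-shot prompting: a few labeled examples stand in
# for training data, so a commercial LLM can infer the task from the
# prompt alone. The task and examples are illustrative, not from the text.

def build_few_shot_prompt(examples, query):
    """Format (text, label) pairs plus an unlabeled query into one prompt."""
    lines = []
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    # The final line is left unlabeled; the model completes it.
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

examples = [
    ("The battery lasts all day.", "positive"),
    ("It broke after a week.", "negative"),
]
prompt = build_few_shot_prompt(examples, "Setup was quick and painless.")
# `prompt` would then be sent as-is to a commercial completion API.
```

This is the entire "training pipeline" for option 1, which is why it is the least data-demanding approach.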
Reasons You May Want to Use an Open-Source LLM:
- Between using an open-source LLM and building your own, if you aren’t trying to change the model architecture, it is almost always better to either fine-tune an existing pre-trained LLM directly or take its weights as a starting point and continue pre-training. This is because a good pre-trained LLM like GPT-NeoX has already seen a vast amount of data and learned general capabilities from it. You can leverage that learning, especially if your training dataset is not huge or diverse.
- Another typical scenario is that you operate in a regulated environment or handle user / sensitive data that cannot be fed to commercial LLM services, or you need edge deployment of the model for latency or locational reasons.
Reasons You May Want to Build Your Own LLM:
- Best if you need to change model architecture or training dataset from existing pre-trained LLMs. For example, if you want to use a different tokenizer, change the vocabulary size, or change the number of hidden dimensions, attention heads, or layers.
- Typically, in this case the LLM is a core part of your business strategy and technological moat. You are taking on significant innovation in LLM training and have a large investment appetite to train and maintain expensive models on an ongoing basis.
- Typically, you have or will have lots of proprietary data generated associated with the LLM service to create a continuous model improvement loop for sustainable competitive advantage.
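When architecture changes like those above are on the table, it helps to see how the knobs interact. A standard approximation for a decoder-only transformer is that the attention and MLP blocks contribute about 12 × n_layers × d_model² parameters, plus a vocab_size × d_model embedding matrix. This formula is a well-known rule of thumb, not taken from the text above.

```python
# Rough parameter-count estimate for a decoder-only transformer, useful
# when weighing changes to layers, hidden size, or vocabulary size.
# 12 * n_layers * d_model^2 approximates the attention + MLP blocks;
# vocab_size * d_model is the token-embedding matrix.

def approx_param_count(n_layers: int, d_model: int, vocab_size: int) -> int:
    return 12 * n_layers * d_model**2 + vocab_size * d_model

# Sanity check with GPT-3-scale settings (96 layers, d_model=12288,
# ~50K vocabulary): the estimate lands near the reported 175B parameters.
gpt3_like = approx_param_count(96, 12288, 50257)
```

Note that because the hidden dimension enters quadratically, doubling d_model roughly quadruples the parameter count, while doubling the layer count only doubles it: a useful intuition when customizing an architecture.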
It is also worth mentioning that if you only have a very targeted set of use cases and don’t need the general-purpose or generative capabilities of LLMs, you might want to consider training or fine-tuning a much smaller transformer or another, much simpler deep learning model. That could mean much less complexity, less training time, and lower ongoing costs.
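As a toy illustration of that last point: for a narrow classification task, even a classical model trained on a handful of labeled examples can be enough, with no GPU or LLM in sight. The support-ticket routing task and data below are invented for the sketch; a real project would use a small fine-tuned transformer or classical ML on real data.

```python
# Toy Naive Bayes text classifier: an example of the "much simpler model"
# alternative to an LLM for a narrow, targeted task. Data is made up.
from collections import Counter, defaultdict
import math

def train_nb(docs):
    """docs: list of (text, label). Returns per-label word counts and priors."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict_nb(model, text):
    """Pick the label with the highest (log) posterior for `text`."""
    word_counts, label_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label])
        for w in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [
    ("refund my order", "billing"),
    ("charge on my card", "billing"),
    ("app crashes on launch", "technical"),
    ("cannot log in to the app", "technical"),
]
model = train_nb(docs)
```

A model like this trains in microseconds on a laptop, which is the cost profile the paragraph above is pointing at.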
Conclusion
Our whitepaper covers everything from determining your optimal dataset size to configuring your hardware, parallelizing your training, collecting and processing data, experiment and hyperparameter search suggestions, evaluation frameworks, mitigating bias in your LLM, and a whole lot more. If you'd like to download the entire technical guide for free, click here.
Related Reading
The Art and Science of Prompt Engineering
Darek Kleczek takes us through the art and science of how to prompt engineer like a pro.
What Do LLMs Say When You Tell Them What They Can't Say?
An exploration of token banning on GPT's vocabulary.
Understanding LLMOps: Large Language Model Operations
This article explores how large language models (LLMs) are changing the way we build AI-powered products and the landscape of machine learning operations (MLOps).
GPT-4o Python quickstart using the OpenAI API
Getting set up and running GPT-4o on your machine in Python using the OpenAI API.