
The Lifecycle of a GenAI Project

A primer on why to build with LLMs, how to build with LLMs, and what goes into building with LLMs
Last month, we released the first in a series of posts where we’ll be building LLM apps together. We took a model from 17% accuracy to over 90% by just scoping our project properly and adjusting our prompt through chain-of-thought and providing clear examples.
One thing we didn’t cover is the lifecycle of a GenAI project and why so many companies are prioritizing projects like ours. That’s what we’re going to do today. We’ll begin by explaining the benefits of building with LLMs, then walking through the common steps to doing so. This is informed by our own work here at W&B building LLMs for internal processes as well as what we see our customers doing right this moment.
We’ll be back soon with the sequel to that post, but for now, let’s look at the why.

Reasons to Build with LLMs

To put it simply: LLMs let almost anyone build useful applications remarkably quickly. In traditional machine learning, putting a useful model into production (everything from gathering data to training, fine-tuning, and deploying it) took months. With many LLMs, you can write a good prompt and iterate until you find one that works for you in a matter of minutes, and you can often start making API calls to the model and running inference within a couple of hours.
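For instance, here is a minimal sketch of what those first API calls can look like, using the OpenAI Python client as one example; the model name and the ticket-summarization prompt are just placeholders for whatever provider and task you choose:

```python
# A minimal sketch of calling a hosted LLM via the OpenAI Python client.
# The model name and prompt are illustrative; swap in your own provider and task.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize this support ticket in one sentence: ..."},
    ],
)
print(response.choices[0].message.content)
```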

Additionally, getting great results from LLMs doesn’t require knowledge of deep learning models, programming, math, or statistics. You can simply use natural language to interact with them, which means anyone can get massive value from them. And while ChatGPT was well ahead of other models at the beginning of 2024, open-source projects have caught up quickly, making models like Mistral a great choice for building applications that won’t break the bank.
In addition to being faster and requiring less coding skill, there’s something else about LLMs that’s different from traditional software. In software, the output is the code that produces an application. With LLMs, the output is everything you tried along the way: the data, prompts, pipelines, evaluation metrics, and so on. You should think of those experiments and lessons as your company’s IP. While this isn’t exactly a reason people are adopting LLMs, it is an important thing to keep in mind as you build.
Next, let’s look at the typical steps in one of these projects.

The LLM Lifecycle: From Idea to Production

The figure below represents a typical LLM project lifecycle. Keep in mind there's a good amount of iteration in the purple section.


1. Define the scope

We covered this more extensively in our last piece, but it’s worth underlining here: you really want to define your scope narrowly at first. Attacking a big, complex problem right out of the gate opens you up to frustration, while achieving high accuracy on a smaller slice of the problem gives you a baseline you can improve on and validates your data and the general direction of the project.
Defining the scope of your use case also helps you pick what size of model might best fit your needs. This step is where you’ll decide whether you want to start with prompt engineering, fine-tune an off-the-shelf model, or train your own model from scratch. For a simple translation or summarization task, an off-the-shelf model might do the trick. If you want it to respond in a specific tone or answer questions about proprietary data, it might be worth looking into fine-tuning. Being specific about what your model needs to do can save you both time and compute cost.

2. Prompt engineer

We went from 17% to over 90% with just a few smart prompt engineering tweaks in our last post. And while prompt engineering is a bit of an art, it’s an intuitive one. Gauge how your model’s output evolves with each prompt change and double down on the tactics that are working for you. Eventually, as we found, performance will plateau and you’ll move on to more technical improvements, but realize that prompting alone can make a massive difference in overall performance.
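To make that concrete, here is a sketch of the kind of iteration this involves for a ticket-classification task. The baseline and improved prompts, category names, and example ticket are all made up, but the pattern (clear instructions, step-by-step reasoning, and a worked example) mirrors the tweaks we made in the last post:

```python
# A sketch of iterating on a prompt for a ticket-classification task.
# The categories and example ticket are hypothetical.
baseline_prompt = "Classify this support ticket: {ticket}"

improved_prompt = """You are a support triage assistant.
Classify the ticket into one of: Billing, Bug, Feature Request.

Think step by step before answering, then give the final label on its own line.

Example:
Ticket: "I was charged twice this month."
Reasoning: The user mentions an unexpected charge, which is a payment issue.
Label: Billing

Ticket: {ticket}"""
```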

3. Fine-tune

Some tasks may require fine-tuning up front. Typically, these might be tasks where an LLM hasn’t seen enough data like yours. That can be something like proprietary, internal sales data or medical data (due to regulatory guardrails). Fine-tuning is simply a way to train a general LLM on novel data to improve performance on your specific problem.
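As one example of what kicking off such a job can look like, here is a sketch using OpenAI’s hosted fine-tuning API; the training file name and base model are placeholders, and other providers or open-source stacks have their own equivalents:

```python
# A sketch of launching a fine-tuning job on a hosted model via the OpenAI client.
# The JSONL file and base model name are illustrative.
from openai import OpenAI

client = OpenAI()

# Upload a JSONL file of {"messages": [...]} training examples
training_file = client.files.create(
    file=open("sales_tickets.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job on top of a base model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```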

4. Evaluate

Without a good way to evaluate model performance, we lose valuable learnings and any experiments we run are in vain. We want to evaluate our models both while building our LLM app and when the app is running in production. We might also look into model limitations, like the tendency to hallucinate, and build in fail-safes to prevent them.
Simply put: evaluating model performance against custom metrics or benchmarks helps you compare the techniques you experiment with and makes sure you’re actually building something that will be useful in production.
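Here is a minimal sketch of what such an evaluation harness could look like, scoring a classifier against a small labeled set and logging accuracy to W&B; the example tickets, labels, and the predict stand-in are hypothetical:

```python
# A sketch of a tiny evaluation harness: run a labeled set through the model,
# compute accuracy, and log the result to W&B. The examples are made up.
import wandb

eval_set = [
    {"ticket": "I was charged twice this month.", "label": "Billing"},
    {"ticket": "The export button crashes the app.", "label": "Bug"},
]

def predict(ticket: str) -> str:
    # Stand-in for your real prompt + LLM call; replace with an API request
    # and whatever parsing you use to extract the label.
    return "Billing" if "charged" in ticket else "Bug"

run = wandb.init(project="ticket-classifier", job_type="eval")
correct = sum(predict(ex["ticket"]) == ex["label"] for ex in eval_set)
wandb.log({"accuracy": correct / len(eval_set)})
run.finish()
```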

5. Iterate

You almost certainly won’t get acceptable performance the first time through the last three steps. Expect to tweak your prompt, fine-tune, and re-evaluate against your original baseline. This is a big part of what we mentioned earlier in this piece: the artifacts and learnings from these steps are effectively your company’s IP, and you can use them to build more complex LLM applications or start on new ones.

6. Deploy

Once we have a model with acceptable performance, we can optimize it for deployment to make sure it’s making the best use of compute resources and deploy it into our app.
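As one illustration, here is a minimal sketch of exposing the finished app behind an HTTP endpoint with FastAPI; the endpoint name and the classify_ticket stand-in are hypothetical, and a real deployment would add things like batching, caching, and monitoring:

```python
# A minimal sketch of serving an LLM app behind an HTTP endpoint with FastAPI.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Ticket(BaseModel):
    text: str

def classify_ticket(text: str) -> str:
    # Stand-in for the prompt + LLM call you settled on during evaluation.
    return "Billing"

@app.post("/classify")
def classify(ticket: Ticket) -> dict:
    return {"label": classify_ticket(ticket.text)}

# Run with: uvicorn app:app --reload
```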

The Basic Components of an LLM App

Now that we understand the why and the how, let’s look at the what: our LLM app. Seasoned builders will likely know a lot of these concepts, but if you’re newly inspired to build, we recommend familiarizing yourself with these ideas. We’ll be coming back to these a lot!

Prompts

A prompt is the instructions we give to the LLM to produce predictions (the output). Like in cooking where better ingredients create better food, clearer instructions to the LLM lead to better outputs.

Typical prompts have three main components:

System message

This sets the tone, personality and overall behavior of the assistant. You can also use it to provide specific instructions about how it should behave throughout the conversation (e.g. only answer questions about your specific problem as opposed to general knowledge the LLM was trained on). The system message is never visible to the user but it allows us to direct the LLM’s behavior to our unique use case. E.g. “You are an expert Go player.”

User message

The specific instruction to carry out, given the higher-level behavior set in the system message; the user’s question goes here. E.g. “Given this Go board, make a move. … <Go board>”

Assistant

You can use this to feed the bot things it has said previously, so it can continue a conversation. If you’ve interacted with ChatGPT, you’ll be familiar with this. E.g. examples of the Go gameplay so far.
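Taken together, these three components map directly onto the message list you pass to a chat-style API, like the call shown earlier; the Go-themed conversation below is purely illustrative:

```python
# A sketch of how the three prompt components map onto a chat-style message list.
# The conversation content is made up.
messages = [
    {"role": "system", "content": "You are an expert Go player."},                   # system message
    {"role": "user", "content": "What's a solid opening move for Black?"},           # earlier user turn
    {"role": "assistant", "content": "A common choice is the 4-4 star point."},      # prior reply fed back in
    {"role": "user", "content": "Given this Go board, make a move. <Go board>"},     # current request
]
```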

Tokens

Context windows are typically measured in tokens. Tokens are the chunks of text that the model reads or generates. A token is usually not a whole word but a smaller unit like a character or part of a word, or, less frequently, a larger one like a whole phrase. The image below shows some examples of how the text you pass in gets chunked into tokens.
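If you want to inspect this yourself, here is a small sketch using the tiktoken library; the encoding name is just one example, since different models use different tokenizers:

```python
# A sketch of inspecting tokenization with tiktoken.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Weights & Biases makes LLM experiments reproducible.")
print(len(tokens))                        # number of tokens the model would see
print([enc.decode([t]) for t in tokens])  # the individual text chunks
```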


Model Parameters

Parameters are the weights a model learns during training; you can think of them as the model's memory. More parameters generally means more capacity, which lets the model perform more sophisticated tasks.
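As a quick illustration, here is a sketch of counting the parameters of a small open model with the Hugging Face transformers library; the GPT-2 checkpoint is just an example:

```python
# A sketch of counting a model's parameters with Hugging Face transformers.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params:,} parameters")  # roughly 124 million for GPT-2 small
```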

Completion and Inference

Finally: the output of the model is a completion. Using the model to generate text is inference.

Conclusion

Now that you understand the reasons people are building LLM apps and the general components and lifecycle of these apps, we invite you to dive into a real project! We’ve started building a customer success ticket classifier and will be diving into each of the steps above while we do so. Our first post focuses on the prompt engineering aspects, as well as defining a scope and building a simple evaluation harness.
If, on the other hand, you’re eager to jump right in and start building, we recommend checking out our free course on Building LLM Apps.
Till next time!
