Claypot.ai Co-founder Outlines Challenges of LLMs in Production
Chip Huyen goes in depth on the challenges of LLMs in production
The surge of interest in Large Language Models (LLMs) in machine learning workflows reflects their seemingly limitless potential. However, Chip Huyen, a veteran in the field and co-founder of Claypot.ai, observes that while building something intriguing with an LLM is relatively easy, turning these models into production-ready tools presents considerable hurdles. Below, we walk through some of the key challenges Huyen has identified.
Handling Ambiguity
Ambiguity in the generated outputs can be a significant issue for downstream applications, which depend on receiving outputs in a specific format for parsing. Crafting prompts that explicitly define the output format can help, but the inherent uncertainty of natural language means there is no absolute guarantee that outputs will consistently adhere to that format.
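As a minimal sketch of that practice (in Python, with illustrative prompt wording and a hypothetical fallback policy), a format-constrained prompt is best paired with defensive parsing, since the model may still stray from the requested format:

```python
import json

def build_prompt(review: str) -> str:
    # Spell out the exact output format in the prompt itself.
    return (
        "Classify the sentiment of the review below.\n"
        'Respond with ONLY a JSON object: {"sentiment": "positive"} '
        'or {"sentiment": "negative"}.\n\n'
        f"Review: {review}"
    )

def parse_sentiment(raw_output: str):
    # Parse defensively: downstream code cannot assume the model
    # always honored the requested format.
    try:
        sentiment = json.loads(raw_output)["sentiment"]
        if sentiment in ("positive", "negative"):
            return sentiment
    except (json.JSONDecodeError, KeyError, TypeError):
        pass
    return None  # caller decides: retry, fall back, or flag for review
```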
Another challenging aspect is inconsistency in user experience. Users expect a certain level of consistency when interacting with an application (an insurance company, for instance, should quote the same price for the same details), but the stochastic nature of LLMs introduces unpredictability.
For the same input, LLMs could generate varying outputs. This problem can be partially mitigated by setting the temperature parameter to zero in the LLM, enforcing more consistency in the responses. However, it doesn't inspire much confidence in the system, akin to a teacher who provides consistent scores only if they're situated in a particular room.
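As a sketch of that mitigation, assuming the official openai Python package (v1-style client) and an illustrative model choice, setting the temperature to zero looks like this:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_quote(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # greedy decoding: more repeatable, still not guaranteed identical
    )
    return response.choices[0].message.content
```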
As part of the solution to ambiguity issues, a common practice in prompt engineering is to provide a few examples within the prompt, hoping that the LLM will generalize from them. This approach is referred to as 'few-shot learning.'

When employing few-shot learning, Huyen proposes two key considerations: understanding whether the LLM comprehends the examples provided in the prompt, and identifying whether the LLM overfits to these few-shot examples.
One way to evaluate the former is to input the same examples and see if the model outputs the expected scores. If the model fails to do so, it could indicate that the prompt isn't clear enough, necessitating a rewrite of the prompt or decomposition of the task into smaller tasks. As for the latter, the model can be evaluated using separate examples. Additionally, Huyen found it useful to have the model generate examples of texts for a specific label, such as a score of 4, and then input these examples into the LLM to verify if it indeed outputs the expected score.
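A rough sketch of both checks might look like the following, where score_text is a hypothetical wrapper that embeds a text in the few-shot scoring prompt, calls the LLM, and parses out the returned score:

```python
def score_text(text: str) -> int:
    """Hypothetical wrapper: embed `text` in the few-shot scoring prompt,
    call the LLM, and parse out the integer score."""
    raise NotImplementedError  # wire this up to your LLM client

# 1) Feed the few-shot examples themselves back in; the model should
#    reproduce the labels it was shown. Failures suggest an unclear prompt.
fewshot_examples = [("The delivery was fast and painless.", 4),
                    ("Support never answered my ticket.", 1)]
for text, expected in fewshot_examples:
    assert score_text(text) == expected, f"prompt may be unclear for: {text!r}"

# 2) Check for overfitting on held-out examples the prompt has never seen.
holdout = [("Decent product, slow shipping.", 3)]
accuracy = sum(score_text(t) == y for t, y in holdout) / len(holdout)
print(f"held-out accuracy: {accuracy:.0%}")
```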
Organization and Optimization
Prompt versioning, optimization, and the understanding of cost and latency are critical aspects of large language model implementation. Here's more on these issues:
Prompt Versioning
Prompts play a crucial role in the behavior of LLMs. Even minor changes to a prompt can significantly alter the output. Consequently, it is vital to version each prompt and track its performance over time, similar to how we version software or data in machine learning projects.
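One lightweight way to do this, sketched below with illustrative names, is to record every prompt in a registry with a content hash and timestamp, so each output and metric can be traced back to the exact prompt version that produced it:

```python
import hashlib
import json
import time

def register_prompt(registry_path: str, name: str, template: str) -> str:
    """Append a prompt to a JSONL registry and return a short version id."""
    version = hashlib.sha256(template.encode()).hexdigest()[:8]
    record = {"name": name, "version": version,
              "template": template, "registered_at": time.time()}
    with open(registry_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return version  # log this id alongside every model call and metric

version = register_prompt("prompts.jsonl", "review_scorer",
                          "Score the following review from 1 to 5: ...")
```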
Prompt Optimization
Numerous strategies have been proposed to optimize prompts. Some methods suggested by OpenAI and others include:
1. Using a Chain-of-Thought (CoT) technique, where the model is prompted to explain or describe its reasoning step by step.
2. Generating multiple outputs for the same input and selecting the final output through majority vote (the self-consistency technique, sketched after this list) or by asking the LLM to pick the best one.
3. Breaking a large, complex prompt into smaller, simpler ones.
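As an illustration of the second technique, here is a minimal self-consistency sketch, where ask is a hypothetical callable that sends a prompt to the model and returns its final answer as a string:

```python
from collections import Counter

def self_consistent_answer(ask, prompt: str, n: int = 5) -> str:
    # Sample the model several times (temperature > 0 so samples differ),
    # then take a majority vote over the answers.
    answers = [ask(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```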
Cost and Latency
Cost and latency are key considerations in LLM usage. More detailed prompts often lead to better model performance, but they also increase the cost of inference. OpenAI charges for both input and output tokens, which means longer prompts can quickly inflate costs. For instance, a simple prompt might run between 300 and 1,000 tokens, but adding more context or detail can easily push that to 10,000. While experimenting with prompt engineering is relatively cheap and quick, the real cost lies in inference, so making prompts shorter while maintaining performance can be a lucrative effort and is one of the interesting challenges of running LLMs in production.
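To get a feel for these numbers, token counts (and thus input-side costs) can be estimated with OpenAI's tiktoken library; the price constant below is illustrative and changes over time:

```python
import tiktoken

def estimate_input_cost(prompt: str, model: str = "gpt-3.5-turbo",
                        usd_per_1k_tokens: float = 0.0015) -> float:
    # Output tokens are billed separately and not counted here.
    encoding = tiktoken.encoding_for_model(model)
    n_tokens = len(encoding.encode(prompt))
    return n_tokens / 1000 * usd_per_1k_tokens

print(estimate_input_cost("Summarize the following document: ..."))
```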
Fine-tuning vs. Prompting vs. Distillation
In the context of using Large Language Models like GPT-3 and GPT-4, the three main methods used are prompting, fine-tuning, and distillation. Each method varies in terms of data availability, performance, and cost.

Prompting
Prompting involves giving the LLM a specific request each time you want a response. It's the most straightforward way to use an LLM and can be effective when you have a small number of examples. However, the effectiveness of prompting is constrained by the maximum input token length, which limits the number of examples you can provide. From a cost perspective, prompting can be expensive if the prompts are lengthy or if a large number of predictions are needed. Nevertheless, it's an accessible method that doesn't require additional training.
Fine-Tuning
Fine-tuning involves training the LLM on your task-specific data. The benefit is that the model learns to respond appropriately without needing explicit examples in each prompt. If you have hundreds or thousands of examples, fine-tuning can lead to better model performance than prompting. A study by Scao and Rush (2021) found that a prompt is worth roughly 100 examples, meaning that after about 100 examples, fine-tuning starts to outperform prompting, though performance can vary significantly depending on the task and the model. In terms of cost, fine-tuning can be cheaper in the long run because it reduces the need for lengthy prompts: the more task-specific information the model has been fine-tuned on, the less instruction it needs at runtime, which can significantly reduce prediction costs.
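As a sketch of what the data side of fine-tuning looks like, assuming OpenAI's chat fine-tuning JSONL format and the v1 Python client (the file name and examples here are illustrative):

```python
import json

examples = [
    {"messages": [
        {"role": "user", "content": "Rate this review from 1 to 5: 'Great product.'"},
        {"role": "assistant", "content": "5"},
    ]},
    # ... hundreds or thousands more task-specific examples
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# The file is then uploaded and a fine-tuning job created, e.g.:
#   file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
#   client.fine_tuning.jobs.create(training_file=file.id, model="gpt-3.5-turbo")
```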
Distillation
Distillation is a method of training a smaller model to imitate the behavior of a larger, more complex one. This technique can be beneficial if you want a model that behaves like a more complex one but with lower computational costs. For example, a group of Stanford researchers (the Alpaca project, March 2023) successfully fine-tuned a smaller, open-source LLM (LLaMA-7B, with 7 billion parameters) on examples generated by a larger LLM (text-davinci-003, with 175 billion parameters). The resulting model performed similarly to the larger model but was significantly cheaper to run. The distillation process involved generating a set of instruction-following examples with the larger model and then fine-tuning the smaller model on these outputs. Generating the instructions cost under $500, and the fine-tuning itself cost under $100, making it a cost-effective approach.
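The data-generation step of that pipeline might look like the following sketch, where ask_teacher is a hypothetical wrapper around the larger model's API and the file name is illustrative:

```python
import json

def build_distillation_set(instructions, ask_teacher, path="distill.jsonl"):
    # Query the large "teacher" model and save instruction/response pairs;
    # a smaller "student" model (e.g. LLaMA-7B, as in Alpaca) is then
    # fine-tuned on this file.
    with open(path, "w") as f:
        for instruction in instructions:
            response = ask_teacher(instruction)  # e.g. text-davinci-003
            f.write(json.dumps({"instruction": instruction,
                                "response": response}) + "\n")
```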
Conclusion
In the age of Large Language Models, Chip Huyen's insights offer an indispensable look into their real-world implementation. While the potential of LLMs is considerable, Huyen's work sheds light on significant challenges: ambiguity in outputs, the complexity of prompt versioning and optimization, cost and latency considerations, and the tradeoffs between prompting, fine-tuning, and distillation. These insights underscore the importance of thorough planning and strategic decision-making when bringing these powerful models into practical applications.