
A Gentle Introduction to LLM APIs

In this article, we dive into how large language models (LLMs) work, starting with tokenization and sampling, before exploring how to use them in your applications.
This article is based on the free W&B online course Building LLM-Powered Applications and is illustrated with slides and code snippets discussed in the course. Sign up today and start learning!

What Is an LLM?

Large language models (LLMs), such as GPT-4, represent the forefront of natural language processing and machine learning technology. As detailed in the GPT-4 paper, these models are essentially transformers trained to predict the next token (i.e., word or subword) in a given piece of text.
This kind of model architecture is known as autoregressive, and it allows the model to generate coherent and contextually relevant text.

The "large" in large language models refers to the massive number of parameters these models have. For instance, GPT-3, the predecessor of GPT-4, has 175 billion parameters! Essentially, each parameter is a component of the model learned from training data and the large number of these parameters allows LLMs to capture a vast array of linguistic patterns and nuances, contributing to their impressive capabilities.
One of the key characteristics of LLMs is that their output is a probability distribution over the next token in a text. This means that, given a piece of text, the model can provide a set of potential next tokens, each with an associated probability. The model's prediction is typically the token with the highest probability, but different sampling strategies can be used to introduce randomness and creativity in the output.
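To make this concrete, here's a minimal sketch of what such a distribution looks like and how picking the most likely token differs from sampling from it. The probabilities below are made up for illustration, not real model output:
import random

# Hypothetical next-token probabilities for the prompt "The cat sat on the"
next_token_probs = {" mat": 0.55, " floor": 0.20, " couch": 0.15, " moon": 0.10}

# Greedy choice: always take the most probable token
greedy = max(next_token_probs, key=next_token_probs.get)

# Sampled choice: draw a token according to its probability
sampled = random.choices(list(next_token_probs), weights=list(next_token_probs.values()))[0]

print(f"greedy: {greedy!r}, sampled: {sampled!r}")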

LLMs are trained on massive datasets comprising a wide variety of internet text. This vast amount of training data enables them to generate human-like text across a diverse range of topics and styles.
Today, these powerful models are accessible through APIs like the ones provided by OpenAI. This allows developers to integrate advanced natural language processing capabilities into their applications without needing to train these large and complex models themselves.
The potential use cases for LLMs are vast and varied. They're used for tasks such as text generation, translation, summarization, question-answering, and much more. Additionally, they can be used to create chatbots, write code, generate creative content like poems and stories, assist in learning and education, and even for more advanced applications such as drafting emails or writing articles. The ability of these models to understand and generate human-like text opens up a world of possibilities for natural language understanding and generation tasks.

What Is Tokenization?

Tokenization is a fundamental step in natural language processing (NLP) and machine learning that involves breaking down human language into smaller parts or "tokens." These tokens can be as small as individual characters or as large as entire sentences.
More commonly, they correspond to words or subwords, as they allow for a more balanced trade-off between preserving linguistic information and maintaining a manageable number of tokens.
The reason tokenization matters so much in NLP and ML is that computers do not understand human language the way we do. For a computer to process text, it needs to convert that text into numerical representations that it can understand and manipulate. Each unique token in the text is assigned a unique numerical identifier. Thus, the process of tokenization transforms a string of human-readable text into a sequence of machine-readable numbers.
OpenAI's tiktoken is a Python library that performs tokenization in a manner consistent with how OpenAI's models (like GPT-3) do it. It provides an easy-to-use API for encoding human-readable text into tokens and decoding tokens back into human-readable text.
Let's illustrate this with a code snippet:
import tiktoken

# Load the tokenizer for OpenAI's text-davinci-003 model
tokenizer = tiktoken.encoding_for_model("text-davinci-003")

# Encode a string of text
enc = tokenizer.encode("Weights & Biases is awesome!")

# Print the encoded tokens
print(enc)

# Decode the tokens back into text
print(tokenizer.decode(enc))

# Decode tokens one by one
for token_id in enc:
print(f"{token_id}\t{tokenizer.decode([token_id])}")
In this code, we first load the tokenizer model that corresponds to the "text-davinci-003" model from OpenAI. We then encode a string of text ("Weights & Biases is awesome!") into tokens, represented as a list of numbers. We then print these encoded tokens.
Here's the output:
[1135, 2337, 1222, 8436, 1386, 318, 7427, 0]
Weights & Biases is awesome!
Following that, we decode the entire list of tokens back into the original text to verify the encoding and decoding process. Then, we iterate over each token in the encoded list and decode them one by one. This serves to illustrate that each token in the list corresponds to a specific part of the original text. In each iteration, we print the numerical identifier of the token and the part of the original text that it represents.
1135 We
2337 ights
1222 &
8436 Bi
1386 ases
318 is
7427 awesome
0 !
We can see here that, for example, the token "!" corresponds to number 0 and is the first token in this model's vocabulary. You can also note that some tokens contain a leading space.
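Since API usage and context limits are both measured in tokens, a handy side effect of having the tokenizer locally is that you can count tokens before sending a prompt. A quick sketch, reusing the tokenizer loaded above:
# Count how many tokens a prompt would use with this encoding
prompt = "Weights & Biases is awesome!"
print(len(tokenizer.encode(prompt)), "tokens")

# The total vocabulary size of this encoding
print(tokenizer.n_vocab, "tokens in the vocabulary")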

What Is Sampling?

LLMs are autoregressive models, which means they generate sequences one element at a time. Starting with an initial prompt, these models predict the next word by sampling from the probability distribution of possible words. This process is then repeated, with the model consuming its previous output as input for the next step, until it generates the desired length of text.
In other words, to get any output from an autoregressive LLM, we must engage in sampling, as it's inherent to how these models operate. The different sampling strategies—such as greedy decoding, beam search, temperature sampling, and top-p sampling—represent the variety of methods we can use to draw from the model's predicted word distributions. These strategies directly impact the trade-off between randomness and predictability in the model's output.
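The autoregressive loop itself is easy to sketch. In the snippet below, next_token_distribution is a hypothetical stand-in for the model (it returns made-up probabilities), but the structure of feeding each output token back in as input is the same one real LLMs use:
import random

def next_token_distribution(tokens):
    # Stand-in for the model: returns made-up next-token probabilities
    return {" and": 0.4, " models": 0.3, " experiments": 0.2, ".": 0.1}

tokens = ["Weights", " &", " Biases", " tracks", " experiments"]
for _ in range(5):  # generate five more tokens
    probs = next_token_distribution(tokens)
    next_token = random.choices(list(probs), weights=list(probs.values()))[0]
    tokens.append(next_token)  # the output becomes part of the next input

print("".join(tokens))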
Greedy decoding involves choosing the word with the highest probability as the next word. While this method is simple and computationally efficient, it can lead to repetitiveness and a lack of diversity in the generated text. Beam search, on the other hand, keeps track of multiple sequences (the "beam" refers to the number of sequences kept) and expands all of them at each step. Although beam search can alleviate some of the issues with greedy decoding, it can still lead to suboptimal results due to its deterministic nature.

To introduce more randomness and creativity in the generated text, we can use stochastic sampling methods, such as sampling with temperature and top-p sampling.
Sampling with temperature involves adjusting the "sharpness" of the probability distribution before drawing a sample from it. A temperature close to zero makes the distribution sharper, meaning the model is more likely to choose the word with the highest probability. As the temperature increases, the distribution becomes flatter, and the model is more likely to choose a less probable word. In other words, higher temperatures lead to more randomness and diversity in the generated text.
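A small sketch makes this concrete. The logits below are made-up scores for four candidate tokens; dividing them by the temperature before applying the softmax sharpens the distribution at low temperatures and flattens it at high ones:
import math

def softmax_with_temperature(logits, temperature):
    # Scale the logits by 1/temperature, then apply a numerically stable softmax
    scaled = [l / temperature for l in logits]
    exps = [math.exp(s - max(scaled)) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical scores for four candidate tokens
for t in [0.2, 1.0, 2.0]:
    print(t, [round(p, 2) for p in softmax_with_temperature(logits, t)])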

Top-p sampling, also known as nucleus sampling, introduces randomness in another way. Instead of considering all possible next words, top-p sampling only considers the smallest set of top words whose cumulative probability exceeds a certain threshold, p. A higher value of p means more words are considered, leading to more randomness in the generated text.
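Here's a sketch of that filtering step, again with made-up probabilities: sort the candidates by probability, keep the smallest set whose cumulative probability reaches p, renormalize, and sample only from that set:
def top_p_filter(probs, p):
    # Keep the smallest set of most-likely tokens whose cumulative probability reaches p
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}  # renormalize

probs = {" mat": 0.55, " floor": 0.20, " couch": 0.15, " moon": 0.10}
print(top_p_filter(probs, p=0.7))  # only " mat" and " floor" survive
print(top_p_filter(probs, p=1.0))  # all candidates are kept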

We can experiment with different temperature and top-p values using the code snippet below.
import openai

def generate_text(temp=None, topp=None):
    "Generate text with a given temperature or top-p, but not both"
    if temp is not None and topp is not None:
        raise ValueError("Only one of temperature or top-p should be set")
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt="Say something about Weights & Biases",
        max_tokens=50,
        temperature=temp,
        top_p=topp,
    )
    return response.choices[0].text.strip()

# Generate text with varying temperatures
for temp in [0, 0.5, 1, 1.5, 2]:
    print(f'TEMP: {temp}, GENERATION: {generate_text(temp=temp)}')

# Generate text with varying top-p values
for topp in [0.01, 0.1, 0.5, 1]:
    print(f'TOP_P: {topp}, GENERATION: {generate_text(topp=topp)}')
If you compare the outputs at low and high temperature values, you can see how a higher temperature may lead to text that is less reliable but definitely more diverse.
TEMP: 0, GENERATION: Weights & Biases is an amazing tool for tracking and
analyzing machine learning experiments. It provides powerful visualizations
and insights into model performance, enabling data scientists to quickly
identify areas of improvement and optimize their models.

TEMP: 1.5, GENERATION: Weights & Biases is a powerful tool that assists with
early stage project showcase and analysis, helping designers, data
scientists, and product/ decision teams discover their highest potential
creative insights. With automated experiment tracking, visual presentation,
and team plotting
It's important to note that OpenAI recommends against adjusting both temperature and top-p at the same time. This is because they both control the randomness of the output in different ways and adjusting them simultaneously can lead to unpredictable results. Instead, you should choose one of the two parameters to adjust based on the characteristics you want in your generated text.

Chat API

For more powerful models like GPT-3.5 and GPT-4, OpenAI provides a different API: the Chat API. Instead of working with a single string prompt, this API works with a sequence of messages. Each message in the sequence has a role and content. The role can be "system," "user," or "assistant," while the content is the text of the message from that role.
The "system" role is typically used to set the behavior of the assistant at the beginning of the conversation. Messages with the "user" role are instructions or queries from the user to the assistant. Messages with the "assistant" role are responses from the assistant. The model considers all the messages in the sequence when generating a response.
Here's an example of how to use the Chat API:
import openai

MODEL = "gpt-3.5-turbo"
response = openai.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say something about Weights & Biases"},
    ],
    temperature=0,
)

response
In this code, we first define the model we want to use, in this case, gpt-3.5-turbo. We then call openai.ChatCompletion.create to generate a response. The messages parameter is a list of message objects. Each object has a "role" and "content". We start with a system message that instructs the assistant to be helpful, followed by a user message asking the assistant to say something about Weights & Biases. The temperature parameter is set to 0, which means the model will deterministically choose the most likely next word at each step.
The response from the model is in the form of a ChatCompletion object, which includes the model's message amongst other information.
<OpenAIObject chat.completion id=chatcmpl-2534523452345 at 0x12ca7cf90> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Weights & Biases is a machine learning platform that helps data scientists and machine learning engineers track and visualize their experiments. It provides tools for experiment management, hyperparameter tuning, and model visualization, making it easier to iterate and improve machine learning models. Weights & Biases also offers integrations with popular machine learning frameworks like TensorFlow, PyTorch, and Keras.",
        "role": "assistant"
      }
    }
  ],
  "created": 1686671355,
  "id": "chatcmpl-2534523452345",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 74,
    "prompt_tokens": 27,
    "total_tokens": 101
  }
}
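If you just want the assistant's reply as a string, you can pull it out of the first choice:
# Extract the assistant's reply text from the response
print(response.choices[0].message.content)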


Conclusions

In summary, LLMs like GPT-4 represent a significant advancement in artificial intelligence. These models, with their large number of parameters and autoregressive architecture, are capable of understanding and generating human-like text across a diverse range of topics and styles. By leveraging APIs like those provided by OpenAI, developers can integrate these advanced capabilities into their applications, opening up a world of possibilities for natural language understanding and generation tasks.
Interested in diving deeper into the world of LLMs? Weights & Biases is offering a free online course: Building LLM-powered Applications. This course will guide you through the entire process of designing, experimenting, and evaluating LLM-based apps.
Sign up for the course today and start your journey towards mastering the art of building applications powered by Large Language Models.