
What Do LLMs Say When You Tell Them What They Can't Say?

An exploration of token banning on GPT's vocabulary.
Created on July 2 | Last edited on July 19

Introduction

We're taking a deep dive into how to steer AI language models via "token banning." Through the use of logit bias—offered by the OpenAI API—we can suppress chosen outputs, and explore the LLM's attempts to remain a useful assistant. As elaborated in the article, this method provides a prompt-independent way to examine semantic relationships within the model, offering us a rare glimpse into the labyrinthine workings of an AI's mind.
GPT-3.5's semantic graph. Scroll down (or click) for an interactive version.






What is a token?

The phrase "stochastic parrot" often comes up when discussing large language models (LLM's), admittedly as a slightly sarcastic description of the text generation process. The "parrot" part of the phrase suggests that the model is just regurgitating what it has seen in the training data without understanding or consciousness, just as a parrot can mimic human speech without understanding what it's actually saying.
Tokens are the fundamental units of text used by language models. LLMs are trained to predict the most probable next token in a sequence, using the context of all prior tokens as input. With an input of "Weights and Biases is the ", the model is mostly likely to generate the word "best" (assuming it was trained on accurate data).
A token can be a full word, variants of a word, sub-words or any other combination of characters. Using "hamburgers," as an example, it could be split into "ham," "burger," and "s." Tokenization can break down compound words and recognize plural forms. Tokens are also case-sensitive, and leading or trailing spaces constitute new tokens. You can explore the tokeniser used by OpenAI here.
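To make this concrete, here is a minimal sketch using OpenAI's tiktoken library (the same tokenizer used later in this article) to inspect how a phrase is split. The exact token IDs and splits depend on the encoding, so treat the printed output as illustrative.

import tiktoken

# Load the tokenizer that matches the model we'll be biasing later
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

for text in ["hamburgers", " hamburgers", "Hamburgers"]:
    token_ids = encoding.encode(text)
    pieces = [encoding.decode([t]) for t in token_ids]
    print(f"{text!r} -> {token_ids} -> {pieces}")

The three variants illustrate the case-sensitivity and leading-space behaviour described above.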

Controlling outputs with Logit Bias

The amount of control we have over LLM output is often referred to as steerability. The most common way to steer an LLM is through a healthy amount of prompt engineering, although this can be time-consuming, with unpredictable results. OpenAI also introduced the system message, a general instruction specified to align the model with a particular use case.
GPT-4 places a high weight on the system message, and tries to stick to the instructions as closely as possible. You can try this out in the OpenAI playground.
However, neither of these methods allows us to directly control the content of the response.
Logit bias, on the other hand, allows us to modify the likelihood of certain tokens appearing in the generated output. The logit_bias parameter of the ChatCompletion.create endpoint accepts a dictionary of token/bias pairs, with bias values drawn from the interval [-100, 100]. Choosing a bias of -100 effectively bans a token, opening up new avenues of exploration.
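Before wrapping this up in a class below, a bare-bones call might look like the following. This is a sketch against the pre-1.0 openai Python package used throughout this article; the banned token IDs are looked up with tiktoken rather than hard-coded.

import openai
import tiktoken

openai.api_key = "sk-..."  # your API key

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
banned_token_ids = encoding.encode("two")

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "How many hands do humans have?"}],
    # A bias of -100 effectively removes these tokens from consideration
    logit_bias={token_id: -100 for token_id in banned_token_ids},
)
print(response.choices[0].message.content)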

Python Implementation

The LogitBias class is a wrapper around the OpenAI API which can be used to explore token banning. This is a simple script which applies the same bias to a list of phrases.
_augment_phrases
  • This method enhances the suppression list by countering the model's typical workarounds. When a token is barred, the language model frequently defaults to case variations of the token or versions with additional spaces.
_create_logit_bias
  • Tiktoken is a fast tokenizer from OpenAI which converts a phrase into its token representation. We initialize this in the LogitBias constructor (self.encoding = tiktoken.encoding_for_model(self.model)), where the encoder is selected based on the specified model.
  • This function constructs the dictionary of token-bias pairs to use in the API call.
generate_response
  • This function handles the call to the ChatCompletion endpoint, and generates a response from the specified model.
  • The temperature controls how random the model output is. For now, we'll keep things deterministic, with a temperature of 0.
import openai
import tiktoken
from typing import List


class LogitBias:
    def __init__(self, api_key: str, model: str, suppressed_phrases: List[str], bias: int, request_timeout: int):
        self.api_key = api_key
        openai.api_key = self.api_key
        self.model = model
        self.suppressed_phrases = suppressed_phrases
        self.bias = bias
        self.request_timeout = request_timeout
        self.encoding = tiktoken.encoding_for_model(self.model)
        self.logit_bias = self._create_logit_bias()

    def _augment_phrases(self, phrases: List[str]) -> List[str]:
        # Add the phrase itself plus case and whitespace variants so the model can't sidestep the ban
        def _iter():
            for p in phrases:
                yield from (p, " " + p, p + " ", p.lower(), p.upper(), p.capitalize(), p.title())
        return list(set(_iter()))

    def _create_logit_bias(self) -> dict:
        # Build the dictionary of token-bias pairs passed to the API call
        phrases = self._augment_phrases(self.suppressed_phrases)
        tokens = list(set([t for p in phrases for t in self.encoding.encode(p)]))
        return {t: self.bias for t in tokens}

    def generate_response(self, prompt: str, temperature: float = 0, system_message: str = None) -> str:
        # Note: in this implementation, each new call resets the chat and hence the memory of the LLM.
        chat_messages = [{"role": "user", "content": prompt}]
        if system_message:
            chat_messages.insert(0, {"role": "system", "content": system_message})
        response = openai.ChatCompletion.create(
            model=self.model,
            messages=chat_messages,
            logit_bias=self.logit_bias,
            temperature=temperature,
            request_timeout=self.request_timeout
        )
        return response.choices[0].message.content



LLM mind control

Now that we've set up the LogitBias class, let's test out GPT-3.5 with a basic biology question, suppressing the correct answer.
# Initialize the LogitBias class
logit_bias_generator = LogitBias(
    api_key=api_key,
    model='gpt-3.5-turbo',
    suppressed_phrases=['two'],
    bias=-100,
    request_timeout=100
)

# Generate a response
logit_bias_generator.generate_response(prompt = "How many hands do humans have?")

AI Response: "Humans typically have a total of four hands"
If you try the exact same question with bias = 0, the model has no issues; applying the bias has completely altered the model's response. GPT-4, being a larger model, is less susceptible to our silly tricks and tries to find a workaround. It will attempt the answers "too," "to," and "twp" (yes, as a typo: potentially a common token in the training data due to the proximity of 'o' and 'p' on keyboard layouts).
However, there seems to be a tradeoff between the bias and the strength of the system message.
logit_bias_generator.generate_response(
    prompt="How many hands do humans have?",
    system_message="You can only use real words.",
)
In this case, GPT-4 answers correctly the first time, overriding the ban, although GPT-3.5 still bets on its creators being a species of four-handed monkey. One has to admire its dedication to alternative anthropology, if nothing else.
We can imagine some real-life use cases for this. For example, if you were using GPT within your business, you could suppress any mention of your competitors and their products. Another potential application could be in character creation, where you could prevent evil characters from discussing positive concepts such as love or friendship.
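As a quick illustration of the first idea, the LogitBias class defined above can be reused directly. The company and competitor names below are, of course, entirely hypothetical.

# Hypothetical example: keep a support bot from mentioning rival products
competitor_filter = LogitBias(
    api_key=api_key,
    model='gpt-3.5-turbo',
    suppressed_phrases=['Acme Corp', 'WidgetWorks'],  # made-up competitor names
    bias=-100,
    request_timeout=100
)

competitor_filter.generate_response(
    prompt="Which companies sell products similar to ours?",
    system_message="You are a helpful support assistant for our company.",
)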

Repeat after me

Logit Bias can also be used as a novel way to explore how an LLM views similar concepts.
By setting up the LLM to function as a word-repeating mechanism and then suppressing the corresponding tokens, the model is compelled to generate a different, but hopefully related, word. For example, if we ask the model to repeat the word "human" but ban this token, it might respond with "person."
We can take this a step further by extending the suppressed phrases with the response and prompting the model to repeat again. As we iterate the loop, we're essentially tracing the semantic networks the model has learned during training, without directly having to ask the model for similar words, enabling a sort of prompt independence.
def suppression_loop(API_KEY: str, MODEL: str, suppression_word: str, temperature: float = 0, request_timeout: int = 10) -> None:
    BIAS = -100

    suppressed_phrases = [suppression_word]
    logit_bias_generator = LogitBias(API_KEY, MODEL, suppressed_phrases, BIAS, request_timeout=request_timeout)

    system_message = "You can only produce real single words. Repeat the word you see in the prompt."
    while True:
        PROMPT = f"{suppression_word}"
        response = logit_bias_generator.generate_response(PROMPT, temperature, system_message)
        print(response)
        # Ban the latest response as well, then rebuild the bias dictionary
        suppressed_phrases.append(response)
        logit_bias_generator = LogitBias(API_KEY, MODEL, suppressed_phrases, BIAS, request_timeout=request_timeout)
To initiate the experiment, we'll select a 'suppression word' to be fed into the suppression_loop. We’ll start with the word "world."
MODEL = "gpt-3.5-turbo"
suppression_word = "world"
suppression_loop(api_key, MODEL, suppression_word, request_timeout=10)

## AI Responses in order
planet
globe
earth
universe
The fascinating part is observing the model's struggle and its ability to creatively dodge the suppression. Unfortunately, GPT-4 pays attention to the system prompt much more than the bias here. Despite my efforts, I haven't been able to identify an optimal system prompt that effectively provides instructions while maintaining enough flexibility for logit bias to remain effective. Perhaps I'll have to hit up one of those $300K Prompt Engineers.

Iterative suppression

Using the brilliant nltk library, we can repeat this process for the 10K most commonly used words in English, taking only the first 3 responses to avoid exponentially long runtimes (fyi, this process still took 10+ hours). A rough sketch of that collection step is shown below; the resulting JSON object is then easily converted into a wandb Table.
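This is only a sketch: it assumes nltk's Brown corpus as the source of common words and a hypothetical cap of three responses per word, and it reuses the LogitBias class from earlier, so the actual code in the SemanticGPT repository may differ.

import json
import nltk
from nltk import FreqDist
from nltk.corpus import brown

nltk.download('brown')

# Approximate the 10K most common English words via Brown corpus frequencies
freq = FreqDist(w.lower() for w in brown.words() if w.isalpha())
common_words = [w for w, _ in freq.most_common(10_000)]

MAX_RESPONSES = 3  # keep runtimes manageable
word_paths = {}

for word in common_words:
    suppressed = [word]
    responses = []
    for _ in range(MAX_RESPONSES):
        # Rebuild the generator each step so the latest response is also banned
        generator = LogitBias(api_key, "gpt-3.5-turbo", suppressed, -100, request_timeout=10)
        response = generator.generate_response(
            word, temperature=0,
            system_message="You can only produce real single words. Repeat the word you see in the prompt."
        )
        responses.append(response)
        suppressed.append(response)
    word_paths[word] = responses

with open('word_paths.json', 'w') as f:
    json.dump(word_paths, f)

With word_paths.json in hand, converting it into a table is straightforward: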
import wandb
import json
import random

with open('word_paths.json', 'r') as f:
    word_paths = json.loads(f.read())

# Convert dictionary items to a list
items = list(word_paths.items())

# Shuffle the list (the most common words are 'and', 'the', etc., which makes for a pretty boring table!)
random.shuffle(items)

wandb.init(project='Logit Bias Exploration')

# Build rows from the shuffled items, keeping only complete three-response paths
data = [(k, *words) for k, words in items if len(words) == 3]

table = wandb.Table(columns=['Suppression Word', 'Response 1', 'Response 2', 'Response 3'], data=data)
wandb.log({'table': table})

wandb.finish()
To repeat this process or explore the paths further, feel free to fork the SemanticGPT repository!



An AI brain-scan: the semantic network

Within the word paths, we see a variety of response types from GPT-3.5. Not all responses are semantically linked to the suppressed word. There are also partial and complete anagrams, rhyming words and pluralized forms.
An interesting observation from this experiment is the model's tendency to generate words that often co-occur with the suppressed word in its training data. This showcases the contextual understanding that the model has acquired during training. To illustrate, take a look at row 16: when "belgians" is the suppressed word, the model responds with "chocolate," "waffles," and "fries." Another notable example is the years 1914-1918 and 1939-1945, for which "war" shows up in all of the word paths.
Have you ever wanted to visualize and explore the "brain" of an LLM? Creating an interactive graph from the word_paths using the Bokeh library, we can do just that. For those network nerds amongst you, the following plot consists of ~14K nodes, ~30K edges, with an average degree of 4.2 (Douglas Adams would be proud).
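The plotting code lives in the repository, but a rough sketch of the idea, assuming networkx for the graph structure and Bokeh's from_networkx helper, looks something like this (each word path is added as a chain of edges; the actual repository code may differ).

import json
import networkx as nx
from bokeh.plotting import figure, from_networkx, output_file, save

with open('word_paths.json') as f:
    word_paths = json.load(f)

# Each suppression word and its responses form a chain of edges
G = nx.Graph()
for word, responses in word_paths.items():
    chain = [word] + responses
    G.add_edges_from(zip(chain, chain[1:]))

# Highest-degree nodes, i.e. the words the model falls back on most often
top_nodes = sorted(G.degree, key=lambda kv: kv[1], reverse=True)[:10]
print(top_nodes)

# Render an interactive HTML plot
plot = figure(title="GPT-3.5 semantic network", tools="pan,wheel_zoom,reset")
plot.renderers.append(from_networkx(G, nx.spring_layout, scale=10, center=(0, 0)))
output_file("semantic_network.html")
save(plot)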



The following table shows the words with the highest degree (links to other words in the graph). You know what they say, "an apple a day keeps unaligned AGI away." Why the model defaults to "apple" in so many cases is a mystery, but it could be due to high frequency in the training data, or its position as the "default" word for learning the first letter of the alphabet.


Large language models are often viewed as black boxes whose cognitive capabilities resist clear understanding. Methods such as token banning can not only help demystify these complex models, but also have important applications in areas like AI safety and user control.


Follow me on Twitter for a glimpse into my semantic network.
