
OpenAI To Watermark GPT Models, Mitigating Potential Misuse

OpenAI is working to implement a watermarking scheme for GPT model output, using a custom pseudorandom function to choose among the model's candidate output tokens.
Created on December 12 | Last edited on December 12
The ever-present question of language generation model misuse has come to a head with ChatGPT's surging mainstream popularity. Because GPT models are built to produce realistic content, the potential for misuse is immense, whether through impersonation, spreading misinformation, or some other avenue of deceit.
There aren't many great solutions to this problem yet, but OpenAI is looking into a watermarking scheme to help determine whether a given piece of content was generated with its GPT models. The information comes from Scott Aaronson, a guest researcher at OpenAI, who described the scheme in a lecture hosted by the Effective Altruist club at UT Austin around a month ago.

Watermarking GPT models with guided randomness

Text content cannot be watermarked the way images can. Where an image can be stamped with a translucent logo, the closest analog for text would be injecting a short string, which could easily be spotted and removed. Instead of taking a traditional approach to watermarking, OpenAI is going a level deeper: straight into the content generation process itself.
GPT models work with language in the form of tokens: input text is read as tokens, and output is produced as tokens. At each step, the model emits a list of candidate tokens, each with an associated score, and an external algorithm chooses a winner from that list to produce as the next output token.
That external algorithm runs on OpenAI's servers. It picks a single winner from the list, weighted by each candidate's score, and incorporates some amount of randomness to keep the output unique and varied. This injection of randomness is exactly where OpenAI plans to implement the watermarking scheme.
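To make the mechanics concrete, here is a minimal sketch of ordinary, non-watermarked sampling: the candidates' scores are turned into a probability distribution and a random draw picks the winner. The token strings, scores, and temperature parameter below are made-up illustrations, not OpenAI's actual serving code.

```python
import numpy as np

def sample_next_token(candidate_tokens, scores, temperature=1.0, rng=None):
    """Ordinary (non-watermarked) decoding: weighted random choice among scored candidates."""
    rng = rng or np.random.default_rng()
    # Softmax over the scores; temperature controls how adventurous the sampling is.
    logits = np.array(scores, dtype=np.float64) / temperature
    logits -= logits.max()              # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    # This random draw is the step the watermark replaces.
    return rng.choice(candidate_tokens, p=probs)

# Hypothetical example: the model scores three candidate continuations.
print(sample_next_token(["cat", "dog", "fox"], scores=[2.1, 1.7, 0.3]))
```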
Instead of drawing from an ordinary source of randomness, OpenAI replaces it with a pseudorandom function keyed by a secret that only OpenAI can access. When selecting from the list of candidate tokens, the one that maximizes this keyed function (weighted by the candidates' scores, so the overall distribution of outputs is unchanged) is chosen as the next output. Because the function is deterministic given the key, any text string can later be analyzed to check whether its tokens maximize the function the way genuine GPT output would. Even if GPT output were lightly modified by a human, the average score across the remaining tokens would still indicate a GPT creation; heavy modification, however, could still slip by.
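The sketch below shows one way such a scheme could look, following the public description: a keyed pseudorandom value is computed for each candidate, the candidate that maximizes that value (scaled by its probability) is selected, and detection averages a statistic over a text's tokens. The secret key, the SHA-256-based construction, the 32-character context window, and the detection statistic are all assumptions made for illustration; OpenAI's actual implementation has not been published.

```python
import hashlib
import math

import numpy as np

SECRET_KEY = b"illustrative-key"  # hypothetical stand-in for OpenAI's private key


def keyed_prf(key, context, token):
    """Pseudorandom value in (0, 1) derived from the key, recent context, and a candidate token."""
    digest = hashlib.sha256(key + context.encode() + token.encode()).digest()
    # Map the first 8 bytes of the hash to a float strictly between 0 and 1.
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 2)


def watermarked_choice(candidate_tokens, probs, context, key=SECRET_KEY, window=32):
    """Pick the candidate that maximizes r ** (1 / p), with r drawn from the keyed PRF.

    The choice is deterministic given the key and context, yet across varied
    contexts it reproduces the model's probabilities, so output quality is
    preserved while the key holder can verify the choices afterwards.
    """
    context = context[-window:]  # the detector must reconstruct this same window
    values = [keyed_prf(key, context, tok) ** (1.0 / p)
              for tok, p in zip(candidate_tokens, probs)]
    return candidate_tokens[int(np.argmax(values))]


def detection_score(tokens, key=SECRET_KEY, window=32):
    """Average of -ln(1 - r) over a token sequence.

    Ordinary text averages out around 1; text produced by watermarked_choice
    tends to score markedly higher, suggesting a GPT origin.
    """
    total, context = 0.0, ""
    for tok in tokens:
        total += -math.log(1.0 - keyed_prf(key, context, tok))
        context = (context + tok)[-window:]
    return total / len(tokens)


# Tiny demo with a made-up vocabulary; the probabilities are rotated each step
# to imitate a model whose distribution changes with context.
vocab = ["the", " cat", " sat", " on", " a", " mat"]
base = [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]
text, generated = "", []
for step in range(60):
    probs = base[step % 6:] + base[:step % 6]
    tok = watermarked_choice(vocab, probs, text)
    generated.append(tok)
    text += tok

print("watermarked text score:", detection_score(generated))
print("ordinary text score:   ", detection_score([f"word{i} " for i in range(60)]))
# Scores near 1 look like ordinary text; markedly higher scores point to watermarked output.
```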
The nice thing about this method of watermarking is that, from a human's perspective, the output is still just as random and high quality as before (unlike, for instance, the visual watermarking common on images, which degrades the image itself). Applying this idea of guided-randomness watermarking to DALL·E 2 is also being investigated, though it's a lot trickier there.

Find out more

Read or watch Scott Aaronson's lecture, which delves deeper into the theory behind this watermarking scheme, in the blog post on his website.
Tags: ML News