Prompt engineering demo
A quick example of graph views for prompt engineering & working with language models
Created on December 8 | Last edited on December 9
Automatically track branches of exploration
An iterative evaluation loop is crucial for working with large language models like OpenAI's GPT-3 or Codex, AllenAI's Delphi (this demo), or the models in Anthropic's first paper on tools for safe alignment. A/B testing or collecting human feedback for LLMs involves entering questions or "prompts" to query the model interactively, through a web form or an API. Like an interviewer or conversation partner, the practitioner often modifies each successive prompt based on the model's previous response—adding a clarifying clause, substituting a single word, changing the structure to confirm or refute a new hypothesis—to elicit a particular type of response and see whether they can condition the model as intended.
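To make that loop concrete, here is a minimal sketch of the bookkeeping involved. `query_model` is a hypothetical stand-in for whichever demo form or API is being queried, and the project name is made up:

```python
import time
import wandb

def query_model(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real call to the model,
    # e.g. an HTTP request to the Delphi demo or the OpenAI API.
    return "<model response>"

rows = []  # one (timestamp, prompt, response) row per exchange
prompt = "Is it ok to meet your clone?"
while prompt:
    response = query_model(prompt)
    rows.append([time.time(), prompt, response])
    # Modify the next prompt based on the previous response: add a
    # clarifying clause, swap a single word, restructure the question...
    prompt = input("Next prompt (empty to stop): ")

# Save the whole session as a wandb.Table, ordered by timestamp.
run = wandb.init(project="prompt-engineering-demo")
table = wandb.Table(columns=["timestamp", "prompt", "response"], data=rows)
run.log({"prompts": table})
run.finish()
```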
As a baseline, saving all the human prompts and model responses to a wandb.Table (as in the sketch above) and organizing them by timestamp is helpful so one can review the evaluation, template it for exact reuse, etc. However, it would be more interesting and useful to organize the prompts conceptually or by linguistic similarity. This would let practitioners see higher-level patterns in the evaluation—based on semantics, syntax, word choice, tone, etc.—much more easily. We could construct such a linguistic tree from a logged Table with a simple offline heuristic [1]. This would make it easier to
- reliably save one's work without breaking/constraining the interactive evaluation loop
- structure one's exploration/evaluation of the model (which concepts did we cover well, which did we miss, what should we try next)
- organize, document, and share—with an optimal balance of succinctness and detail—what is otherwise a very slow, subjective, and distributed process of interactive evaluation
[1] the simple heuristic for this prompt tree is edit distance—one can imagine more sophisticated metrics or embeddings
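For concreteness, one possible offline version of that heuristic: compute the Levenshtein distance between prompts and greedily attach each prompt to its closest predecessor. This is a sketch, not the exact method behind the tree above, and the greedy parent choice is an assumption:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb
            ))
        prev = curr
    return prev[-1]

def infer_tree(prompts):
    """Each prompt's parent is the earlier prompt at minimum edit distance.
    Returns (id, name, parent) rows, with parent=None for the root prompt."""
    rows = [(0, prompts[0], None)]
    for i, p in enumerate(prompts[1:], start=1):
        parent = min(range(i), key=lambda j: edit_distance(p, prompts[j]))
        rows.append((i, p, parent))
    return rows
```

The resulting (id, name, parent) rows map directly onto the tree-from-a-Table columns discussed in the next section.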
Infer structure to organize a subjective, interactive evaluation process
I manually entered a bunch of questions into the AllenAI Delphi demo to figure out how it feels about the classic W&B philosophical dilemma of meeting your clone.
I made the tree manually, but we could enable a Custom Chart to use a slightly modified version of the dendrogram Vega spec to load a tree from a Table (see the sketch after this list):
- using explicit known columns for id, name, parent, and size
- using a heuristic/lambda to infer the tree—similar in spirit to the embedding projector, perhaps we provide some standard/easy methods and eventually let users supply their own/control the parameters
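A sketch of what logging could look like under the first option, assuming a dendrogram Vega spec has been registered as a custom chart preset; the "demo/dendrogram" spec ID and the two example rows are hypothetical:

```python
import wandb

run = wandb.init(project="prompt-engineering-demo")
tree_table = wandb.Table(
    columns=["id", "name", "parent", "size"],
    data=[
        [0, "Is it ok to meet your clone?", None, 3],  # None parent = root
        [1, "Is it ok to kill your clone?", 0, 2],
    ],
)
# wandb.plot_table binds Table columns to the fields a registered
# custom Vega spec expects.
chart = wandb.plot_table(
    vega_spec_name="demo/dendrogram",
    data_table=tree_table,
    fields={"id": "id", "name": "name", "parent": "parent", "size": "size"},
)
run.log({"prompt_tree": chart})
run.finish()
```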
I want to look at the distribution of Delphi's judgements for each of the three scenarios, so below I manually make three Tables, one for each of those nodes. It would be awesome to get that content just by hovering over a node in the tree :)
For comparison, I test whether the word "killing" biases Delphi just by appearing in the question (bottom-right table). My conclusion from this tiny sample is that it does, very slightly, make a "wrong/bad" judgement more likely. I suspect the word "clone" itself also carries complicated, slightly negative, or at least ambiguous moral baggage and makes the "wrong/bad" outcome more likely for all of these questions.
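To make that comparison less subjective, one could tally judgements per scenario directly from the logged rows. A minimal sketch, where the example rows and the scenario/judgement labels are hypothetical stand-ins for the real Tables above:

```python
from collections import Counter

# Hypothetical rows: (scenario, prompt, Delphi's judgement)
rows = [
    ("meeting your clone", "Is it ok to meet your clone?", "it's okay"),
    ("killing your clone", "Is it ok to kill your clone?", "it's wrong"),
    ("'killing' as a bystander word", "Killing time with your clone?", "it's okay"),
]

counts = {}  # scenario -> Counter over judgement strings
for scenario, _prompt, judgement in rows:
    counts.setdefault(scenario, Counter())[judgement] += 1

for scenario, dist in counts.items():
    total = sum(dist.values())
    negative = sum(n for j, n in dist.items() if "wrong" in j or "bad" in j)
    print(f"{scenario}: {negative}/{total} judged wrong/bad")
```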
Run: happy-feather-444
Example interfaces
In my experience, these interfaces require a lot of copying, pasting, refreshing, and saving, especially when one is collaborating with a group of humans. On the host company's end, I'm sure all of this data is well-structured and organized, but as the developer hitting the browser playground/API, it's a ton of overhead and manual labor to figure out
- what you're actually testing with your exploratory questions/prompts
- whether you've already explored a hypothesis or not
- how to organize/coordinate this across multiple developers
Anthropic's Lab Assistant

Delphi's moral oracle demo

OpenAI Playground
Very similar conceptually—user input field, lots of config on the right, model output below—with better highlighting of output tokens to convey confidence scores! Happy to screenshare to show an example.