
Prompt Engineering Adventure

A general workflow for exploring LLMs and prompt versions with W&B
Created on December 5 | Last edited on February 15
TL;DR: This Report showcases a general workflow for tracking, visually comparing, and evaluating different variants of generative models (base models and various branching/fine-tuning scenarios), datasets, training recipes (e.g. few-shot templates), prompts, and sample outputs. I fine-tune toy models from GPT2 in different scenarios using poems, songs, and articles.
Content note: this Report explores and links to random LLM-generated text, which may contain nonsense/adult/NSFW language.


Overall workflow

This generic prompt engineering project leverages W&B Tables, Artifacts, and Weave in a flexible workflow. The order of stages is fluid, and many more instantiations and supporting templates are possible from this initial sketch.

Explore by tuning, understand by prompting, refine, repeat

Base LLMs

I use two GPT2 versions as base models: small and medium.
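As a rough illustration of what the base-model selection could look like in code, here is a minimal sketch. It assumes the models come from the Hugging Face Hub under the ids `gpt2` and `gpt2-medium`; the Report does not say which library or checkpoints were used, and `BaseModelConfig`, `BASE_MODELS`, and `load_base_model` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseModelConfig:
    """Describes one candidate base model (hypothetical helper type)."""
    label: str    # short name used in this sketch
    hub_id: str   # assumed Hugging Face Hub checkpoint id
    n_layer: int  # number of transformer blocks
    n_embd: int   # hidden size


# Assumption: the "small" and "medium" GPT2 variants map to these Hub checkpoints.
BASE_MODELS = {
    "small": BaseModelConfig("small", "gpt2", n_layer=12, n_embd=768),
    "medium": BaseModelConfig("medium", "gpt2-medium", n_layer=24, n_embd=1024),
}


def load_base_model(label):
    """Load tokenizer and model for a base variant (downloads weights when called)."""
    # Imported lazily so the config above can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    cfg = BASE_MODELS[label]
    tokenizer = AutoTokenizer.from_pretrained(cfg.hub_id)
    model = AutoModelForCausalLM.from_pretrained(cfg.hub_id)
    return tokenizer, model
```

Keeping the choice of base model in a small config object like this makes it easy to record the same fields alongside each fine-tuning run.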

Starting model selection & config

Finetuning data

I use three Kaggle text datasets for fine-tuning: 15.6K poems, 32K songs, and 337 Medium posts on AI/ML.

Dataset overview & config

15.6K poems

32K song lyrics

Only 337 Medium posts on AI/ML/data science

Finetuning recipes

I explore model variants mainly by changing the composition of the training data.

Training regime & configuration

Initial tests: Poems & songs, proof of concept

Even balance: Poems & songs, balanced split (10K each)

Full "3-compose": Poems, songs, & articles
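The recipes above differ mainly in how the source datasets are mixed. A minimal sketch of one way to build a balanced mixture (for example, an equal number of poems and songs) follows. The function name `balanced_mix`, the record layout, and the toy data are assumptions made for illustration, not the code used in this Report.

```python
import random


def balanced_mix(sources, per_source, seed=0):
    """Sample up to `per_source` texts from each source and shuffle them together.

    `sources` maps a style label (e.g. "poem") to a list of texts.
    A source with fewer than `per_source` texts contributes all of them.
    """
    rng = random.Random(seed)  # fixed seed so the mixture is reproducible
    mixed = []
    for style, texts in sources.items():
        k = min(per_source, len(texts))
        for text in rng.sample(texts, k):
            mixed.append({"style": style, "text": text})
    rng.shuffle(mixed)  # avoid long same-style runs during training
    return mixed


# Toy stand-ins for the real datasets (the balanced recipe would use per_source=10_000).
poems = [f"poem {i}" for i in range(50)]
songs = [f"song {i}" for i in range(80)]
mix = balanced_mix({"poem": poems, "song": songs}, per_source=10)
```

Because each source is capped at its own size, the same helper would also cover a mixture that includes a much smaller source, such as the 337 articles, which would simply contribute everything it has.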

Finetuned models

I save several model variants; the long-compose (generic_template) and 3-compose (style_specific_template) versions are the two most promising to evaluate further.

Exploring model versions

Sample prompts by text format

The initial prompt set consists of fairly canonical titles, chosen to match the style/format of the text.
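To make the difference between a generic and a style-specific prompt concrete, here is a minimal sketch. The template strings, the `build_prompt` and `prompt_grid` helpers, and the example titles are all hypothetical; the Report names `generic_template` and `style_specific_template` but does not show their contents.

```python
# Assumed template wording, for illustration only.
GENERIC_TEMPLATE = "Title: {title}\n\n"
STYLE_SPECIFIC_TEMPLATE = "Style: {style}\nTitle: {title}\n\n"


def build_prompt(title, style=None):
    """Return a generic prompt, or a style-specific one when `style` is given."""
    if style is None:
        return GENERIC_TEMPLATE.format(title=title)
    return STYLE_SPECIFIC_TEMPLATE.format(style=style, title=title)


def prompt_grid(titles_by_style):
    """Build one row per title holding both prompt variants, ready for side-by-side comparison."""
    rows = []
    for style, titles in titles_by_style.items():
        for title in titles:
            rows.append({
                "style": style,
                "title": title,
                "generic_prompt": build_prompt(title),
                "style_prompt": build_prompt(title, style),
            })
    return rows


# Hypothetical titles, one list per text format.
rows = prompt_grid({
    "poem": ["Ode to Autumn"],
    "song": ["Midnight Highway"],
    "article": ["A Gentle Introduction to Transformers"],
})
```

Rows in this shape could be logged as a table so that outputs for the same title under both templates sit next to each other.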

Prompt templates & initial titles

Evaluation strategies: Subjective for now; quantified soon

Detailed results comparison: Effect of style-specific prompt

Model registry: All notable models from this exploration

Artifacts lineage: Trace the full workflow

P.S. A few more interesting responses