Prompt Engineering Adventure
A general workflow for exploring LLMs and prompt versions with W&B
TL;DR: This report showcases a general workflow for tracking, visually comparing, and evaluating different variants of generative models (base models and various branching/fine-tuning scenarios), datasets, training recipes (e.g. few-shot templates), prompts, and of course sample output. I fine-tune toy models from GPT-2 in several scenarios using poems, songs, and articles.
Content note: this Report explores and links to random LLM-generated text, which may contain nonsense/adult/NSFW language.
Overall workflow
This generic prompt engineering project leverages W&B Tables, Artifacts, and Weave in a flexible workflow. The order of the stages is fluid, and many more instantiations and supporting templates are possible beyond this initial sketch.
Explore by tuning, understand by prompting, refine, repeat
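As a minimal sketch of the pattern repeated throughout this workflow (project name and titles below are placeholders): each stage runs under `wandb.init()` and logs its prompt/output pairs to a `wandb.Table` for visual comparison.

```python
import wandb
from transformers import pipeline

# One run per exploration stage; each logs prompt/output pairs to a Table
run = wandb.init(project="prompt-engineering-adventure", job_type="explore")
generator = pipeline("text-generation", model="gpt2")

table = wandb.Table(columns=["prompt", "model", "sample_output"])
for prompt in ["Ode to a Cloud", "Gradient Descent Blues"]:
    text = generator(prompt, max_new_tokens=50)[0]["generated_text"]
    table.add_data(prompt, "gpt2", text)

run.log({"samples": table})
run.finish()
```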
Base LLMs
I leverage two GPT-2 sizes as base models: small and medium.
Starting model selection & config
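Loading the two starting checkpoints is standard Hugging Face `transformers` usage; the names below are the Hub identifiers corresponding to the small and medium sizes.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hugging Face checkpoints for the two sizes used as starting points
for checkpoint in ["gpt2", "gpt2-medium"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)
    print(f"{checkpoint}: {model.num_parameters() / 1e6:.0f}M parameters")
```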
Finetuning data
I use three Kaggle text datasets for fine-tuning: 15.6K poems, 32K songs, and 337 Medium posts on AI/ML.
Dataset overview & config
15.6K poems
32K song lyrics
Only 337 Medium posts on AI/ML/data science
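A minimal sketch of versioning these datasets as W&B Artifacts, assuming hypothetical local file paths for the three Kaggle downloads:

```python
import wandb

run = wandb.init(project="prompt-engineering-adventure", job_type="upload-data")

# File paths are hypothetical stand-ins for the three Kaggle downloads
for name, path in [("poems", "data/poems.csv"),
                   ("song-lyrics", "data/song_lyrics.csv"),
                   ("medium-articles", "data/medium_posts.csv")]:
    artifact = wandb.Artifact(name, type="dataset")
    artifact.add_file(path)
    run.log_artifact(artifact)

run.finish()
```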
Finetuning recipes
Exploring model variants, mostly by changing the composition of the training data
Training regime & configuration
Initial tests: Poems & songs, proof of concept
Even balance: Poems & songs, balanced split (10K each)
Full "3-compose": Poems, songs, & articles
Finetuned models
I save several model variants, with the long-compose (generic_template) and 3-compose (style_specific_template) models as the two most promising versions to evaluate further.
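A sketch of saving one such variant as a model Artifact, with the paired prompt template recorded in metadata (the checkpoint directory is hypothetical):

```python
import wandb

run = wandb.init(project="prompt-engineering-adventure", job_type="finetune")

# Metadata records which prompt template the variant is paired with,
# so later evaluation runs can look it up alongside the weights
model_art = wandb.Artifact("3-compose", type="model",
                           metadata={"template": "style_specific_template"})
model_art.add_dir("checkpoints/3-compose")  # hypothetical checkpoint path
run.log_artifact(model_art)
run.finish()
```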
Exploring model versions
Sample prompts by text format
The initial prompt set contains mostly canonical titles, chosen to match the style/format of the text.
Prompt templates & initial titles
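For illustration, hypothetical reconstructions of the two templates (the exact wording lives in the panel above):

```python
# Hypothetical reconstructions of the two templates compared below
GENERIC_TEMPLATE = "Write about: {title}\n"
STYLE_SPECIFIC_TEMPLATE = "Write a {style} titled '{title}':\n"

print(GENERIC_TEMPLATE.format(title="Ode to a Cloud"))
print(STYLE_SPECIFIC_TEMPLATE.format(style="poem", title="Ode to a Cloud"))
```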
Evaluation strategies: Subjective for now; quantified soon
Detailed results comparison: Effect of style-specific prompt
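A sketch of how such a side-by-side comparison can be logged, pairing the two templates per title in one Table row (titles and template wording are placeholders; swap the base `gpt2` checkpoint for a fine-tuned one):

```python
import wandb
from transformers import pipeline

run = wandb.init(project="prompt-engineering-adventure", job_type="evaluate")
generator = pipeline("text-generation", model="gpt2")  # or a fine-tuned checkpoint

def sample(prompt):
    return generator(prompt, max_new_tokens=80)[0]["generated_text"]

# One row per title; columns pair the two templates for side-by-side reading
table = wandb.Table(columns=["title", "generic", "style_specific"])
for title in ["Ode to a Cloud", "Gradient Descent Blues"]:
    table.add_data(title,
                   sample(f"Write about: {title}\n"),
                   sample(f"Write a poem titled '{title}':\n"))

run.log({"template_comparison": table})
run.finish()
```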
Model registry: All notable models from this exploration
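Promoting the keepers is one `link_artifact` call per checkpoint; the registered-model collection name below is a placeholder:

```python
import wandb

run = wandb.init(project="prompt-engineering-adventure", job_type="promote")

# Fetch the chosen checkpoint and link it into a registered-model collection
model_art = run.use_artifact("3-compose:latest", type="model")
run.link_artifact(model_art, "model-registry/Prompt Adventure Models")
run.finish()
```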
Artifacts lineage: Trace the full workflow
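The lineage graph falls out of the logging calls themselves: `use_artifact` declares an input edge and `log_artifact` an output edge. A sketch with hypothetical names:

```python
import wandb

run = wandb.init(project="prompt-engineering-adventure", job_type="finetune")

# use_artifact declares an input edge, log_artifact an output edge;
# together they draw the lineage graph traced in this panel
data = run.use_artifact("poems:latest", type="dataset")
data_dir = data.download()
# ... fine-tune on the files in data_dir ...

model_art = wandb.Artifact("poems-model", type="model")  # hypothetical name
model_art.add_dir("checkpoints/poems-model")
run.log_artifact(model_art)
run.finish()
```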
P.S. A few more interesting responses