
Building a GenAI-assisted automatic story illustrator

Until September 27th, Weights & Biases users can illustrate their own stories for free using Flux and GPT-4
Created on August 27|Last edited on September 21

Introduction

Recent advances in diffusion models have driven drastic improvements in text-to-image generation. Open-source text-to-image models such as FLUX from Black Forest Labs, Stable Diffusion 3 from Stability AI, and PixArt-Σ not only generate higher-quality images but also show better spatial consistency and a stronger ability to follow text prompts accurately.
In this report, we'll exploit the improved prompt-following abilities of these diffusion models to build an LLM-assisted workflow that automatically illustrates paragraphs from short stories. We'll use OpenAI's GPT-4 as our LLM, passing it both the story and a paragraph from that story to generate a summary that captures the visual essence of the scene described in that paragraph. We'll then use this summary as a prompt for FLUX.1-dev to generate an image depicting the scene. We'll also use Weave, a lightweight toolkit built by Weights & Biases, to track and evaluate our LLM application.
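At its core, the workflow chains two calls: an LLM call that turns a paragraph (with the full story as context) into a visual prompt, and a diffusion call that renders that prompt. The sketch below is a minimal, simplified version of that idea rather than the actual service code: the project name, prompt wording, and generation parameters are placeholders we chose, and it assumes you have the openai, weave, torch, and diffusers packages installed along with access to GPT-4 and the FLUX.1-dev weights.

import torch
import weave
from diffusers import FluxPipeline
from openai import OpenAI

weave.init("story-illustrator")  # every @weave.op call below is logged to this project

client = OpenAI()  # reads OPENAI_API_KEY from the environment
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")


@weave.op()
def summarize_scene(story: str, paragraph: str) -> str:
    """Ask GPT-4 for a single visual prompt describing the paragraph's scene."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You write concise, richly visual prompts for a text-to-image model.",
            },
            {
                "role": "user",
                "content": (
                    f"Full story (for context):\n{story}\n\n"
                    f"Paragraph to illustrate:\n{paragraph}\n\n"
                    "Describe this scene as one detailed image-generation prompt."
                ),
            },
        ],
    )
    return response.choices[0].message.content


@weave.op()
def illustrate_paragraph(story: str, paragraph: str):
    """Generate an illustration for one paragraph with FLUX.1-dev."""
    prompt = summarize_scene(story, paragraph)
    return pipe(prompt, num_inference_steps=28, guidance_scale=3.5).images[0]

The real architecture described later in this report adds more LLM steps around this core, but the summarize-then-render loop stays the same.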
Click here to Illustrate Your Own Stories for Free!


An illustrated version of "The Gift of the Magi"

We used the automatic illustration workflow to illustrate the first ten paragraphs of the story The Gift of the Magi by O. Henry. Here are a few samples:

Illustrated edition of "The Gift of the Magi"


Illustrate your own stories for FREE with just a few lines of Python code

All you need is a W&B API key, which you can get from wandb.ai/authorize. When you run this code, you will receive a Weave trace, which you can use to explore the images and the detailed trace of the entire workflow.
Until September 27th, 2024 Weights & Biases users can illustrate their own stories for free using Flux and GPT-4
import requests

# Public endpoint of the story-illustration service
URL = "http://195.242.25.198:8020/illustrate/story"
# Get your API key from https://wandb.ai/authorize
WANDB_API_KEY = "<YOUR-WANDB-API-KEY>"


# The story to illustrate, plus optional metadata (title, author, setting) that gives the LLM extra context
illustrator_payload = dict(
story="Once upon a time, in a land far, far away, there was a little girl named Alice. Alice loved to explore the world around her. One day, she decided to go on an adventure to find the most beautiful flower in the world. Alice traveled through forests, over mountains, and across rivers. She met many interesting creatures along the way, like a friendly dragon and a wise old owl. After a long journey, Alice finally found the most beautiful flower in the world. She picked it and brought it back to her home. From that day on, Alice knew that with courage and determination, she could achieve anything.",
story_title="Alice's Flower Adventure",
story_author="A.A. Milne",
story_setting="A magical land filled with talking animals and enchanted creatures",
)

# Send the request; the response includes a link to the Weave trace with the generated illustrations
response = requests.post(URL, headers={"wandb-api-key": WANDB_API_KEY}, json=illustrator_payload)
print(response.json())

Alternatively, you can run this Colab notebook.



Acknowledgement

The idea of building a purely GenAI-assisted graphic novel generator comes from this tweet by Andrej Karpathy.

The code for the project is available here.
Architecture of the story illustrator

Given that we have access to quite powerful large language models, as well as text-to-image generation models that are quite accurate at following detailed prompts, writing an auto-illustration workflow might seem simple at first glance. At the highest level, the problem to solve is generating a summary of a given paragraph that captures the clues necessary to depict the scene visually. This summary can then be used as a prompt for a powerful text-to-image generation model like FLUX.1-dev.
However, illustrating a story (especially individual sequences from the story) is an extremely nuanced problem. While an individual paragraph can sometimes contain enough detail on its own, more often the necessary details are found in preceding (or subsequent) passages. For example, an author might describe how a character looks early in the story but won't belabor that every time the character is referenced. The story could also be set in a fantastical or fictional reality whose look and feel the LLM might not be familiar with, and a paragraph of dialog may not offer much direction on its own.
In this report, we propose the following architecture for effectively generating illustrations from individual paragraphs or segments of a short story:

The architecture of the story illustrator
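In code, the pipeline in the figure roughly amounts to chaining a few LLM steps ahead of the image model, each wrapped as a Weave op so one run shows up as a single trace. The sketch below is only a schematic with placeholder bodies, not the project's actual implementation; the step names are paraphrased from the components discussed later in this report (entity recognition, character profiling, and the InContextTextToImagePromptGenerator).

import weave


@weave.op()
def recognize_entities(story: str, paragraph: str) -> list:
    """Placeholder: ask the LLM which characters appear in this paragraph."""
    ...


@weave.op()
def build_character_profiles(story: str, characters: list) -> dict:
    """Placeholder: ask the LLM how each character looks, using the full story as context."""
    ...


@weave.op()
def generate_image_prompt(story: str, paragraph: str, profiles: dict, style: str) -> str:
    """Placeholder: summarize the paragraph into a visual prompt in the requested style."""
    ...


@weave.op()
def generate_image(prompt: str):
    """Placeholder: render the prompt with FLUX.1-dev (see the sketch in the introduction)."""
    ...


@weave.op()
def illustrate_segment(story: str, paragraph: str, style: str):
    characters = recognize_entities(story, paragraph)
    profiles = build_character_profiles(story, characters)
    prompt = generate_image_prompt(story, paragraph, profiles, style)
    return generate_image(prompt)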



Results

We used the automatic illustration workflow to illustrate the first ten paragraphs of the story The Gift of the Magi by O. Henry in two different styles. Using GPT-4 for the LLM workflow costs about $2.50 for ten illustrations using a single style.

"The Gift of the Magi" illustrated in hyper realistic photograph, photography of a, 50 mm, film grain, Kodak portra 800

And here, in the style we showcased above:

"The Gift of the Magi" illustrated in surreal style, artstation, digital art, illustration


Shortcomings and potential improvements

Automatically generating illustrations corresponding to story segments is an extremely nuanced problem, especially if we want to solve it in a model-agnostic manner. In this section, we list several shortcomings of the existing architecture and possible ways to mitigate them.
  • Currently, the character profiles are created on a per-paragraph/story-segment basis. This is not optimal, as it wastes redundant LLM calls on characters that appear at multiple points in the story across several segments. An easy way to solve this is to search the story once for all the characters and store their profiles in memory (see the first sketch after this list). This not only reduces the number of LLM calls but also keeps the character profiles coherent across the entire story. Note that a character's dress and looks might change over the course of the story, which means we still need to look for such changes in each paragraph/story-segment.
  • For generating both the character profiles and the summary, we currently pass the entire story as context to the LLM call. This approach is suboptimal and makes the entire workflow slow and expensive. It could be optimized using a retrieval-augmented generation (RAG) workflow, where only the segments of the story that are relevant to the current character profile or summary are retrieved and used as context (see the second sketch after this list). An advanced retrieval technique such as Recursive Abstractive Processing for Tree-Organized Retrieval (RAPTOR) could potentially be used to capture both high-level and detailed aspects of the text, which is particularly useful for the complex thematic queries used in this workflow. This would also let us reliably scale the auto-illustration architecture to longer stories and novels that would not fit into the context length of the LLM.
If you want to learn how to create and evaluate a simple RAG pipeline, check out the other RAG-focused reports from Weights & Biases.

  • The current architecture often fails to include important clues from the paragraph/story-segment, such as the story being set during Christmas (despite it being explicitly mentioned in the text). This could be solved by explicitly prompting the LLM to look for specific visual details, such as the season, the time of year, or the time of day, so that details like these are not overlooked in the summary.
  • Assigning a role or persona in the system prompt of an LLM is an important technique for increasing the accuracy of its responses by asking it to respond in a certain style. For example, for the summarization step in the InContextTextToImagePromptGenerator, we ask the LLM to play the role of a "helpful assistant to a visionary film director". Similar role assignment could be used in other steps, such as the entity recognition or character profiling steps, to make their responses more accurate.
  • Generating consistent faces for the characters across all the illustrations is another challenge that needs to be tackled. The cheapest way to achieve this is to "cast" the characters in the story as existing famous celebrities or actors whose likenesses are known to the text-to-image generation model. This would require some clever post-processing of the summary generated by the InContextTextToImagePromptGenerator so that all references to a particular character are replaced with the corresponding cast member. Another way to generate consistent faces more reliably is to use DreamBooth to personalize your text-to-image generation model using 3-5 images of a subject (check this script for DreamBooth with FLUX). If you're using a text-to-image generation service like MidJourney, you can use the character reference feature.
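To make the first of these improvements concrete, here is one way the character-profiling fix could look: extract every character's profile once per story and cache it, use a persona in the system prompt, and explicitly ask for visual details such as season and time of day. This is a hypothetical sketch rather than code from the project; the prompts and function names are our own choices, and it assumes the openai and weave packages.

import weave
from openai import OpenAI

client = OpenAI()


@weave.op()
def build_story_casting_sheet(story: str) -> str:
    """Scan the whole story once and return a reusable description of every character."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                # Persona-style system prompt, as suggested above.
                "content": (
                    "You are a helpful assistant to a visionary film director, "
                    "preparing a visual casting sheet for a story."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Story:\n{story}\n\n"
                    "List every character and describe their age, dress, and looks. "
                    "Also note explicit visual context such as the season, the time of "
                    "year, and the time of day in which the story is set."
                ),
            },
        ],
    )
    return response.choices[0].message.content


# Build the casting sheet once per story, cache it, and prepend it to each per-paragraph
# summarization prompt instead of re-profiling the characters for every story segment.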
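And here is a minimal sketch of the retrieval idea from the second bullet: instead of passing the whole story, embed its paragraphs once and retrieve only the most relevant ones as context for a given query (for example, a character name or the paragraph being illustrated). It assumes OpenAI's embedding API and plain cosine similarity; a production version could swap in a vector store or a hierarchical method like RAPTOR.

import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list) -> np.ndarray:
    """Embed a batch of texts with OpenAI's embedding API."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])


def retrieve_context(story: str, query: str, top_k: int = 3) -> str:
    """Return only the story paragraphs most relevant to the query."""
    paragraphs = [p.strip() for p in story.split("\n\n") if p.strip()]
    para_vectors = embed(paragraphs)
    query_vector = embed([query])[0]

    # Cosine similarity between the query and every paragraph.
    scores = para_vectors @ query_vector / (
        np.linalg.norm(para_vectors, axis=1) * np.linalg.norm(query_vector)
    )
    best = np.argsort(scores)[::-1][:top_k]
    return "\n\n".join(paragraphs[i] for i in sorted(best))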

Conclusion

In this report, we discuss the architecture of a simple story-illustration workflow that can automatically illustrate segments of a story using a combination of an LLM workflow and a text-to-image generation model. Although we use OpenAI's GPT-4 for the LLM workflow and FLUX.1-dev to generate the illustrations, the architecture is meant to be model-agnostic, i.e., the LLMs and image generation models can be replaced with other models or services with similar capabilities. We also discuss possible shortcomings in the current architecture and how they can be addressed to build a more robust and scalable system for generating graphic novels from stories and novels.
