
Building a GenAI-assisted automatic story illustrator

Until September 27th, Weights & Biases users can illustrate their own stories for free using Flux and GPT-4
Created on August 27|Last edited on September 21

Introduction

Recent advances in diffusion models have driven drastic improvements in text-to-image generation. Open-source text-to-image models such as FLUX from Black Forest Labs, Stable Diffusion 3 from Stability AI, and PixArt-Σ not only generate higher-quality images but also show better spatial consistency and a stronger ability to follow text prompts accurately.
In this report, we'll exploit the improved prompt-following abilities of these diffusion models to build an LLM-assisted workflow that automatically illustrates paragraphs from short stories. We'll use OpenAI's GPT-4 as our LLM, passing it both the story and a paragraph from that story to generate a summary that captures the visual essence of the scene described in that paragraph. We'll then use this summary as a prompt for FLUX.1-dev to generate an image depicting the scene. We'll also use Weave, a lightweight toolkit built by Weights & Biases, to track and evaluate our LLM application.
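At its core, the workflow chains two calls: an LLM call that turns a paragraph (with the full story as context) into a visual prompt, and a diffusion call that renders that prompt. The sketch below is a minimal, simplified version of that idea rather than the actual service code: the project name, prompt wording, and generation parameters are placeholders we chose, and it assumes you have the openai, weave, torch, and diffusers packages installed along with access to GPT-4 and the FLUX.1-dev weights.

import torch
import weave
from diffusers import FluxPipeline
from openai import OpenAI

weave.init("story-illustrator")  # every @weave.op call below is logged to this project

client = OpenAI()  # reads OPENAI_API_KEY from the environment
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")


@weave.op()
def summarize_scene(story: str, paragraph: str) -> str:
    """Ask GPT-4 for a single visual prompt describing the paragraph's scene."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You write concise, richly visual prompts for a text-to-image model.",
            },
            {
                "role": "user",
                "content": (
                    f"Full story (for context):\n{story}\n\n"
                    f"Paragraph to illustrate:\n{paragraph}\n\n"
                    "Describe this scene as one detailed image-generation prompt."
                ),
            },
        ],
    )
    return response.choices[0].message.content


@weave.op()
def illustrate_paragraph(story: str, paragraph: str):
    """Generate an illustration for one paragraph with FLUX.1-dev."""
    prompt = summarize_scene(story, paragraph)
    return pipe(prompt, num_inference_steps=28, guidance_scale=3.5).images[0]

The real architecture described later in this report adds more LLM steps around this core, but the summarize-then-render loop stays the same.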
Click here to Illustrate Your Own Stories for Free!


An illustrated version of "The Gift of the Magi"

We used the automatic illustration workflow to illustrate the first ten paragraphs of the story The Gift of the Magi by O. Henry. Here are a few samples:

Illustrated edition of "The Gift of the Magi"


Illustrate your own stories for FREE with just a few lines of Python code

All you need is a W&B API key, which you can get from wandb.ai/authorize. When you run this code, you will receive a Weave trace, which you can use to explore the images and the detailed trace of the entire workflow.
Until September 27th, 2024 Weights & Biases users can illustrate their own stories for free using Flux and GPT-4
import requests

# Public endpoint of the story-illustration service
URL = "http://195.242.25.198:8020/illustrate/story"
# Get your API key from https://wandb.ai/authorize
WANDB_API_KEY = "<YOUR-WANDB-API-KEY>"


# The story to illustrate, plus optional metadata (title, author, setting) that gives the LLM extra context
illustrator_payload = dict(
story="Once upon a time, in a land far, far away, there was a little girl named Alice. Alice loved to explore the world around her. One day, she decided to go on an adventure to find the most beautiful flower in the world. Alice traveled through forests, over mountains, and across rivers. She met many interesting creatures along the way, like a friendly dragon and a wise old owl. After a long journey, Alice finally found the most beautiful flower in the world. She picked it and brought it back to her home. From that day on, Alice knew that with courage and determination, she could achieve anything.",
story_title="Alice's Flower Adventure",
story_author="A.A. Milne",
story_setting="A magical land filled with talking animals and enchanted creatures",
)

# Send the request; the response includes a link to the Weave trace with the generated illustrations
response = requests.post(URL, headers={"wandb-api-key": WANDB_API_KEY}, json=illustrator_payload)
print(response.json())

Alternatively, you can run this Colab notebook.



Acknowledgement

The idea of building a purely GenAI-assisted graphic novel generator comes from this tweet by Andrej Karpathy.

The code for the project is available here.
Architecture of the story illustrator

Given that we have access to quite powerful large language models, as well as text-to-image generation models that are quite accurate at following detailed prompts, writing an auto-illustration workflow might seem simple at first glance. At the highest level, the problem to solve is generating a summary of a given paragraph that captures the clues necessary to depict the scene visually. This summary can then be used as a prompt for a powerful text-to-image generation model like FLUX.1-dev.
However, illustrating a story (especially individual sequences from the story) is an extremely nuanced problem. While an individual paragraph can sometimes contain enough detail on its own, more often the necessary details are found in preceding (or subsequent) passages. For example, an author might describe how a character looks early in the story but won't belabor that every time the character is referenced. The story could also be set in a fantastical or fictional reality whose look and feel the LLM might not be familiar with, and a paragraph of dialog may not offer much direction on its own.
In this report, we propose the following architecture for effectively generating illustrations from individual paragraphs or segments of a short story:

The architecture of the story illustrator
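In code, the pipeline in the figure roughly amounts to chaining a few LLM steps ahead of the image model, each wrapped as a Weave op so one run shows up as a single trace. The sketch below is only a schematic with placeholder bodies, not the project's actual implementation; the step names are paraphrased from the components discussed later in this report (entity recognition, character profiling, and the InContextTextToImagePromptGenerator).

import weave


@weave.op()
def recognize_entities(story: str, paragraph: str) -> list:
    """Placeholder: ask the LLM which characters appear in this paragraph."""
    ...


@weave.op()
def build_character_profiles(story: str, characters: list) -> dict:
    """Placeholder: ask the LLM how each character looks, using the full story as context."""
    ...


@weave.op()
def generate_image_prompt(story: str, paragraph: str, profiles: dict, style: str) -> str:
    """Placeholder: summarize the paragraph into a visual prompt in the requested style."""
    ...


@weave.op()
def generate_image(prompt: str):
    """Placeholder: render the prompt with FLUX.1-dev (see the sketch in the introduction)."""
    ...


@weave.op()
def illustrate_segment(story: str, paragraph: str, style: str):
    characters = recognize_entities(story, paragraph)
    profiles = build_character_profiles(story, characters)
    prompt = generate_image_prompt(story, paragraph, profiles, style)
    return generate_image(prompt)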



Results

We used the automatic illustration workflow to illustrate the first ten paragraphs of the story The Gift of the Magi by O. Henry in two different styles. Using GPT-4 for the LLM workflow costs about $2.50 for ten illustrations using a single style.

"The Gift of the Magi" illustrated in hyper realistic photograph, photography of a, 50 mm, film grain, Kodak portra 800

And here, in the style we showcased above:

"The Gift of the Magi" illustrated in surreal style, artstation, digital art, illustration


Shortcomings and potential improvements

Automatically generating illustrations corresponding to story segments is an extremely nuanced problem, especially if we want to solve it in a model-agnostic manner. In this section, we list several shortcomings of the existing architecture and possible ways to mitigate them.
  • Currently, the character profiles are created on a per-paragraph/story-segment basis. This is not optimal, as it wastes redundant LLM calls on characters that appear at multiple points in the story across several segments. An easy way to solve this is to search the story once for all the characters and store their profiles in memory (see the first sketch after this list). This not only reduces the number of LLM calls but also keeps the character profiles coherent across the entire story. Note that a character's dress and looks might change over the course of the story, which means we still need to look for such changes in each paragraph/story-segment.
  • For generating both the character profiles and the summary, we currently pass the entire story as context to the LLM call. This approach is suboptimal and makes the entire workflow slow and expensive. It could be optimized using a retrieval-augmented generation (RAG) workflow, where only the segments of the story that are relevant to the current character profile or summary are retrieved and used as context (see the second sketch after this list). An advanced retrieval technique such as Recursive Abstractive Processing for Tree-Organized Retrieval (RAPTOR) could potentially be used to capture both high-level and detailed aspects of the text, which is particularly useful for the complex thematic queries used in this workflow. This would also let us reliably scale the auto-illustration architecture to longer stories and novels that would not fit into the context length of the LLM.
If you want to learn how to create and evaluate a simple RAG pipeline, check out the other RAG-focused reports from Weights & Biases.

  • The current architecture often fails to include important clues from the paragraph/story-segment, such as the story being set during Christmas (despite it being explicitly mentioned in the text). This could be solved by explicitly prompting the LLM to look for specific visual details, such as the season, the time of year, or the time of day, so that details like these are not overlooked in the summary.
  • Assigning a role or persona in the system prompt of an LLM is an important technique for increasing the accuracy of its responses by asking it to respond in a certain style. For example, for the summarization step in the InContextTextToImagePromptGenerator, we ask the LLM to play the role of a "helpful assistant to a visionary film director". Similar role assignment could be used in other steps, such as the entity recognition or character profiling steps, to make their responses more accurate.
  • Generating consistent faces for the characters across all the illustrations is another challenge that needs to be tackled. The cheapest way to achieve this is to "cast" the characters in the story as existing famous celebrities or actors whose likenesses are known to the text-to-image generation model. This would require some clever post-processing of the summary generated by the InContextTextToImagePromptGenerator so that all references to a particular character are replaced with the corresponding cast member. Another way to generate consistent faces more reliably is to use DreamBooth to personalize your text-to-image generation model using 3-5 images of a subject (check this script for DreamBooth with FLUX). If you're using a text-to-image generation service like MidJourney, you can use the character reference feature.
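To make the first of these improvements concrete, here is one way the character-profiling fix could look: extract every character's profile once per story and cache it, use a persona in the system prompt, and explicitly ask for visual details such as season and time of day. This is a hypothetical sketch rather than code from the project; the prompts and function names are our own choices, and it assumes the openai and weave packages.

import weave
from openai import OpenAI

client = OpenAI()


@weave.op()
def build_story_casting_sheet(story: str) -> str:
    """Scan the whole story once and return a reusable description of every character."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                # Persona-style system prompt, as suggested above.
                "content": (
                    "You are a helpful assistant to a visionary film director, "
                    "preparing a visual casting sheet for a story."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Story:\n{story}\n\n"
                    "List every character and describe their age, dress, and looks. "
                    "Also note explicit visual context such as the season, the time of "
                    "year, and the time of day in which the story is set."
                ),
            },
        ],
    )
    return response.choices[0].message.content


# Build the casting sheet once per story, cache it, and prepend it to each per-paragraph
# summarization prompt instead of re-profiling the characters for every story segment.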
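And here is a minimal sketch of the retrieval idea from the second bullet: instead of passing the whole story, embed its paragraphs once and retrieve only the most relevant ones as context for a given query (for example, a character name or the paragraph being illustrated). It assumes OpenAI's embedding API and plain cosine similarity; a production version could swap in a vector store or a hierarchical method like RAPTOR.

import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list) -> np.ndarray:
    """Embed a batch of texts with OpenAI's embedding API."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])


def retrieve_context(story: str, query: str, top_k: int = 3) -> str:
    """Return only the story paragraphs most relevant to the query."""
    paragraphs = [p.strip() for p in story.split("\n\n") if p.strip()]
    para_vectors = embed(paragraphs)
    query_vector = embed([query])[0]

    # Cosine similarity between the query and every paragraph.
    scores = para_vectors @ query_vector / (
        np.linalg.norm(para_vectors, axis=1) * np.linalg.norm(query_vector)
    )
    best = np.argsort(scores)[::-1][:top_k]
    return "\n\n".join(paragraphs[i] for i in sorted(best))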

Conclusion

In this report, we discuss the architecture of a simple story-illustration workflow that can automatically illustrate segments of a story using a combination of an LLM workflow and a text-to-image generation model. Although we use OpenAI's GPT-4 for the LLM workflow and FLUX.1-dev to generate the illustrations, the architecture is meant to be model-agnostic, i.e., the LLMs and image generation models can be replaced with other models or services with similar capabilities. We also discuss possible shortcomings in the current architecture and how they can be addressed to build a more robust and scalable system for generating graphic novels from stories and novels.
