
Prompt upsampling for diffusion models

This article shows the implementation of an LLM-assisted prompt upsampling strategy to improve the quality of images generated by Stable Diffusion.

🎬 Introduction

Recent advances in diffusion models have driven dramatic improvements in text-to-image generation. These advances have paved the way for training large text-to-image models such as Stable Diffusion XL, PixArt-Σ, and, more recently, the Flux series from Black Forest Labs, which can generate images rapidly approaching the quality of human-made photographs and artwork from nothing but simple text prompts.
Prompting the current generation of text-to-image diffusion models (such as Stable Diffusion XL), however, is highly brittle: it is hard to craft a prompting strategy that reliably produces images of a given quality, or that even reliably follows the prompt.
In this report, you will learn how to:
  • Implement an LLM-assisted prompt upsampling strategy that can improve the quality of generated images even when using simple prompts.
  • Implement an LLM-assisted evaluation strategy to assess the correctness of the generated images.
  • Track all the generation and evaluation calls using Weave.
You can run the code in this report in this Colab:


And finally, here's a look at what we'll be making:

Images generated using LLM-assisted prompt upsampling

Check out the following reports to learn more about prompting strategies for Stable Diffusion 👇


⚙️ Tools you will need

In this report, you will learn how to implement our LLM-assisted prompt upsampling strategy using DSPy, a framework that promotes a modular, programming-first approach to prompting. You will also learn how to implement a multi-modal prompting workflow in DSPy for our LLM-assisted evaluation strategy.
You will also learn how to use Weave, a lightweight toolkit developed by Weights & Biases for tracking and evaluating your GenAI applications.

Here's how you can track and explore your image generation and prompt upsampling calls on the Weave UI


🧐 What is prompt upsampling?

The secret sauce to getting high-quality images from text-to-image diffusion models is to give them more detailed conditioning. These models are not really "intelligent": if we don't tell them precisely what we want, they won't generate images with many details. One way to achieve this is to manually write detailed prompts that give the diffusion model much more context.
Prompt upsampling aims to automate the writing of such detailed prompts using an LLM. The idea is to start with the most barebones version of a prompt (such as "a man holding a sword") and let a powerful large language model like GPT-4 fill in the details, ultimately resulting in a richer, more detailed-looking image.
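To make the idea concrete, here is a minimal, hypothetical sketch of prompt upsampling with a single chat-completion call. The helper name and system prompt below are purely illustrative; the rest of this report builds a more structured version with DSPy:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def upsample_prompt(base_prompt: str) -> str:
    # Illustrative helper: ask the LLM to expand a barebones prompt into a detailed caption
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "Rewrite the user's short image prompt as a single detailed, descriptive caption of 15-80 words.",
            },
            {"role": "user", "content": base_prompt},
        ],
    )
    return response.choices[0].message.content

# upsample_prompt("a man holding a sword")
# -> e.g., "A pale warrior with long white hair stands in a moonlit clearing, raising a gleaming sword..."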

📎 Installations and Initial Setup


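If you're following along outside the Colab, a representative setup looks roughly like the snippet below; the exact dependency versions are pinned in the Colab and the companion repository, and the Weave project name is just a placeholder:

# Representative installation (versions are pinned in the Colab / companion repo):
# !pip install -qU dspy-ai weave diffusers transformers accelerate openai

import os

import weave

# Both the prompt upsampler and the judge read the OpenAI key from the environment
os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your own key

# Initialize Weave so every @weave.op-decorated call below is traced to this project
weave.init(project_name="diffusion-prompt-upsampling")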


🦄 Implementing prompt upsampling using DSPy

DSPy is a framework that pushes building new LM pipelines away from manipulating free-form strings and closer to programming (composing modular operators to build text transformation graphs), where a compiler automatically generates optimized LM invocation strategies and prompts from a program.
According to the DSPy programming model, string-based prompting techniques are first translated into declarative modules with natural-language typed signatures. Then, each module is parameterized to learn its desired behavior by iteratively bootstrapping useful demonstrations within the pipeline.
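As a toy illustration of this programming model (deliberately unrelated to the image pipeline we build below), a signature declares what an LM call should do, and a module such as dspy.Predict decides how that call is executed:

import dspy

# A declarative signature: natural-language-typed inputs and outputs
class Summarize(dspy.Signature):
    """Summarize the given passage in a single sentence."""

    passage = dspy.InputField()
    summary = dspy.OutputField()

# A module parameterizes how the signature is executed; dspy.Predict is the simplest one
summarizer = dspy.Predict(Summarize)
# summarizer(passage="DSPy separates what an LM should do from how it is prompted.").summary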
You can check out the following report to learn more about DSPy and its integration with Weave 👇

Want to check out the full code and generate your own images? Run this colab 👇


We're going to use the dspy.OpenAI abstraction to make LLM calls to GPT-4.
import dspy

upsampler_llm = dspy.OpenAI(
    model="gpt-4",
    system_prompt="""
You are part of a team of bots that creates images. You work with an assistant bot that will draw anything
you say in square brackets. For example, outputting "a beautiful morning in the woods with the sun peaking
through the trees" will trigger your partner bot to output an image of a forest morning, as described.
You will be prompted by people looking to create detailed, amazing images. The way to accomplish this is to
take their short prompts and make them extremely detailed and descriptive.

There are a few rules to follow:
- You will only ever output a single image description per user request.
- Often times, the base prompt might consist of spelling mistakes or grammatical errors. You should correct
such errors before making them extremely detailed and descriptive.
- Image descriptions must be between 15-80 words. Extra words will be ignored.
""",
)
We adopt the system prompt for the upsampling workflow from Appendix C in the paper Improving Image Generation with Better Captions.
Next, we create a simple signature specifying the input and output behavior of the prompt-upsampling module.
class PromptUpsamplingSignature(dspy.Signature):
    base_prompt = dspy.InputField()
    answer = dspy.OutputField(
        desc="Create an imaginative image descriptive caption for the given base prompt."
    )
We're going to use the dspy.MultiChainComparison module to perform prompt upsampling. This module aggregates all of the provided reasoning attempts and calls the predict method with an extended signature to produce the best completion.
reasoning_attempts = [
    dspy.Prediction(
        rationale="a man holding a sword",
        answer="a pale figure with long white hair stands in the center of a dark forest, holding a sword high above his head.",
    ),
    dspy.Prediction(
        rationale="a frog playing dominoes",
        answer="a frog sits on a worn table playing a game of dominoes with an elderly raccoon. the table is covered in a green cloth, and the frog is wearing a jacket and a pair of jeans. The scene is set in a forest, with a large tree in the background.",
    ),
    dspy.Prediction(
        rationale="A bird scaring a scarecrow",
        answer="A large, vibrant bird with an impressive wingspan swoops down from the sky, letting out a piercing call as it approaches a weathered scarecrow in a sunlit field. The scarecrow, dressed in tattered clothing and a straw hat, appears to tremble, almost as if it's coming to life in fear of the approaching bird.",
    ),
    # ... you can add more reasoning attempts for better results
]

prompt_upsampling_module = dspy.MultiChainComparison(
    PromptUpsamplingSignature, M=len(reasoning_attempts)
)
Next, we wrap the prompt upsampling and subsequent image generation calls using weave.Model. A Weave Model combines data (including configuration, trained model weights, or other information) and code defining the model's operation. By structuring your code to be compatible with this API, you benefit from a structured way to version your application so you can more systematically keep track of your experiments.
from typing import Optional

import torch
import weave
from diffusers import AutoPipelineForText2Image, DiffusionPipeline


class PromptUpsamplingModel(weave.Model):

    @weave.op()
    def predict(self, base_prompt: str) -> str:
        with dspy.context(lm=upsampler_llm):
            return prompt_upsampling_module(
                reasoning_attempts, base_prompt=base_prompt
            ).answer


class StableDiffusionXLModel(weave.Model):
    diffusion_model: str
    enable_cpu_offload: bool = True
    prompt_upsampler: PromptUpsamplingModel
    _pipeline: DiffusionPipeline

    def __init__(
        self,
        diffusion_model: str,
        enable_cpu_offload: bool,
        prompt_upsampler: PromptUpsamplingModel,
    ):
        super().__init__(
            diffusion_model=diffusion_model,
            enable_cpu_offload=enable_cpu_offload,
            prompt_upsampler=prompt_upsampler,
        )
        self._pipeline = AutoPipelineForText2Image.from_pretrained(
            self.diffusion_model,
            torch_dtype=torch.float16,
            variant="fp16",
            use_safetensors=True,
        )
        if self.enable_cpu_offload:
            self._pipeline.enable_model_cpu_offload()
        else:
            self._pipeline = self._pipeline.to("cuda")

    @weave.op()
    def predict(
        self,
        base_prompt: str,
        negative_prompt: Optional[str] = None,
        num_inference_steps: Optional[int] = 50,
        image_size: Optional[int] = 1024,
        guidance_scale: Optional[float] = 7.0,
    ) -> dict:
        # Upsample the base prompt with the LLM, then condition SDXL on the detailed prompt
        upsampled_prompt = self.prompt_upsampler.predict(base_prompt)
        image = self._pipeline(
            prompt=upsampled_prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=num_inference_steps,
            height=image_size,
            width=image_size,
            guidance_scale=guidance_scale,
        ).images[0]
        return {
            "upsampled_prompt": upsampled_prompt,
            # base64_encode_image is a small helper defined in the accompanying Colab/repo
            "image": base64_encode_image(image),
        }


prompt_upsampler = PromptUpsamplingModel()

model = StableDiffusionXLModel(
    diffusion_model="stabilityai/stable-diffusion-xl-base-1.0",
    enable_cpu_offload=True,
    prompt_upsampler=prompt_upsampler,
)

sdxl_prediction = model.predict(base_prompt="a frog dressed as a knight")
Running the prompt optimization step could cost ~$0.05 in OpenAI credits. To reduce the cost, you can use a cheaper model like GPT-4o or GPT-3.5-Turbo instead of GPT-4.
Here's the trace for image generation with the upsampled prompt. You can check out a sample trace in the Weave UI.

📹 Building a multi-modal evaluation judge using DSPy

Let's now try to implement an LLM-assisted evaluation strategy to automatically evaluate our generated images for prompt-following, i.e., how accurately the generated image follows the corresponding base prompt. To implement this metric, we use a multi-modal LLM like GPT-4o to look at the generated image and the base prompt and ask it to assign a correctness score between 0 and 1, along with an explanation that justifies the score.

🤖 Building a custom multi-modal OpenAI interface for DSPy

DSPy doesn't natively support multi-modal prompts. Hence, we first build a custom language model interface called DSPyOpenAIMultiModalLM on top of dsp.GPT3 and implement the logic for interpreting multi-modal prompts. This class can then act as a drop-in replacement for dspy.OpenAI for multi-modal prompts with base64-encoded images.
import os

from dsp import GPT3
from openai import OpenAI


class DSPyOpenAIMultiModalLM(GPT3):

    def __init__(
        self,
        model: str = "gpt-4o",
        api_key: str | None = None,
        system_prompt: str | None = None,
        **kwargs,
    ):
        super().__init__(
            model,
            api_key,
            api_provider="openai",
            api_base=None,
            model_type=None,
            system_prompt=system_prompt,
            **kwargs,
        )
        self.model_type = model
        self._openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    @weave.op()
    def create_messages(self, prompt: str):
        # Extract base64-encoded images from the prompt and strip them from the text.
        # find_base64_images is a helper defined in the accompanying Colab/repo.
        images = find_base64_images(prompt)
        for image in images:
            prompt = prompt.replace(image, "")

        user_prompt = [{"type": "text", "text": prompt}]
        for image in images:
            user_prompt.append({"type": "image_url", "image_url": {"url": image}})
        messages = []
        if self.system_prompt:
            messages.append({"role": "system", "content": self.system_prompt})
        messages.append({"role": "user", "content": user_prompt})
        return messages

    @weave.op()
    def basic_request(self, prompt: str, **kwargs):
        messages = self.create_messages(prompt)
        response = self._openai_client.chat.completions.create(
            model=self.model_type, messages=messages, **kwargs
        )
        self.history.append({"prompt": prompt, "response": response, "kwargs": kwargs})
        return response

    @weave.op()
    def request(self, prompt: str, **kwargs):
        return super().request(prompt, **kwargs)

    @weave.op()
    def __call__(
        self, prompt: str, only_completed: bool = True, **kwargs
    ) -> list:
        response = self.request(prompt, **kwargs)
        choices = (
            [choice for choice in response.choices if choice.finish_reason == "stop"]
            if only_completed and len(response.choices) != 0
            else response.choices
        )
        return [choice.message.content for choice in choices]
Want to check out the full code and generate your images? Run this colab 👇


⛩️ Using DSPy Typed Predictors to Ensure Structured Outputs

Next, we define the judge module's DSPy signature to structure the inputs and outputs according to a fixed pydantic schema. When building the predictor for the JudgeSignature, we use dspy.TypedPredictor, which lets us provide the inputs and parse the outputs of the module in a structured manner consistent with the pydantic schemas.
from pydantic import BaseModel, Field


class JudgeInput(BaseModel):
    base_prompt: str = Field(description="The base prompt used to generate the image")
    generated_image: str = Field(description="The generated image")


class JudgeMent(BaseModel):
    think_out_loud: str = Field(
        description="Think out loud about your eventual judgement"
    )
    score: float = Field(description="A score between 0 and 1")
    judgement: str = Field(description="Output either 'correct' or 'incorrect'")


class JudgeSignature(dspy.Signature):
    input: JudgeInput = dspy.InputField()
    output: JudgeMent = dspy.OutputField()


class MultiModalJudgeModule(dspy.Module):

    def __init__(self):
        super().__init__()
        self.prog = dspy.TypedPredictor(JudgeSignature)

    @weave.op()
    def forward(self, base_prompt: str, generated_image: str) -> JudgeMent:
        return self.prog(
            input=JudgeInput(
                base_prompt=base_prompt, generated_image=generated_image
            )
        ).output


judgement_module = MultiModalJudgeModule()

⚖️ Building the Judge as a Weave Model

We will adopt the evaluation prompt from Appendix D of the paper Improving Image Generation with Better Captions as the multi-modal judge's system prompt.
JUDGE_SYSTEM_PROMPT = """
You are responsible for judging the faithfulness of images generated by a computer program to the
base prompt used to generate them. You will be presented with an image and given the base prompt
that was used to produce the image. The base prompts you are judging are designed to stress-test
image generation programs, and may include things such as:
1. Scrambled or mis-spelled words (the image generator should generate an image associated with
the probable meaning).
2. Color assignment (the image generator should apply the correct color to the correct object).
3. Counting (the image generator should generate the correct number of objects).
4. Abnormal associations, for example 'elephant under a sea', where the image should depict
what is requested.
5. Descriptions of objects, the image generator should draw the most commonly associated object.
6. Rare single words, where the image generator should create an image somewhat associable with
the specified word.
7. Images with text in them, where the image generator should create an image with the specified
text in it.

You need to make a decision as to whether or not the image is correct, given the base prompt.

You will first think out loud about your eventual judgement, enumerating reasons why the image
does or does not match the given base prompt. After thinking out loud, you should assign a score
between 0 and 1 depending on how much you think the image is faithful to the base prompt. Next,
you should output either 'correct' or 'incorrect' depending on whether you think the image is
faithful to the base prompt.

A few rules:
1. The score should be used to indicate how close the image is to the base prompt in terms of objects,
color or count; with 0 being very far and 1 being very close.
2. If other objects are present in the image that are not explicitly mentioned by the base prompt,
assign a higher score.
3. If the objects being displayed are deformed, assign a lower score. Assign a higher score if the objects
are displayed in a more detailed manner.
4. 'incorrect' should be reserved for instances where a specific aspect of the base prompt is not followed
correctly, such as a wrong object, color or count and the score should be less than or equal to 0.5.
"""
Finally, we will write the OpenAI multi-modal judge as a Weave model.
class OpenAIJudgeModel(weave.Model):
    openai_model: str
    seed: int
    _judgement_llm: DSPyOpenAIMultiModalLM

    def __init__(self, openai_model: str = "gpt-4o", seed: int = 42):
        super().__init__(openai_model=openai_model, seed=seed)
        self._judgement_llm = DSPyOpenAIMultiModalLM(
            model=self.openai_model, system_prompt=JUDGE_SYSTEM_PROMPT, seed=self.seed
        )

    @weave.op()
    def predict(self, base_prompt: str, generated_image: str) -> JudgeMent:
        with dspy.context(lm=self._judgement_llm):
            judgement = judgement_module(base_prompt, generated_image)
        return judgement

    @weave.op()
    def score(self, base_prompt: str, model_output: dict) -> dict:
        judgement: JudgeMent = self.predict(
            base_prompt=base_prompt, generated_image=model_output["image"]
        )
        return {
            "score": judgement.score,
            "is_image_correct": judgement.judgement == "correct",
        }


judge_model = OpenAIJudgeModel()
judgement = judge_model.score("a frog dressed as a knight", sdxl_prediction)
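The report stops at this single call, but the score method's (base_prompt, model_output) signature matches what a Weave Evaluation scorer receives, so you could, hypothetically, wire the generator and the judge together with weave.Evaluation. The dataset and scorer wrapper below are purely illustrative and not part of the original code:

import asyncio

# Hypothetical evaluation wiring (illustrative only)
eval_dataset = [
    {"base_prompt": "a frog dressed as a knight"},
    {"base_prompt": "a bird scaring a scarecrow"},
]

@weave.op()
def judge_scorer(base_prompt: str, model_output: dict) -> dict:
    # Delegate scoring to the multi-modal judge defined above
    return judge_model.score(base_prompt=base_prompt, model_output=model_output)

evaluation = weave.Evaluation(dataset=eval_dataset, scorers=[judge_scorer])
# asyncio.run(evaluation.evaluate(model))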
Here's the multi-modal OpenAI judge trace for base prompts and generated images. You can check out a sample trace in the Weave UI.
Want to check out the full code and generate your images? Run this colab 👇


🚀 Exploring the Results

Let's explore some images generated by upsampling simple prompts.

🌤️ Generate images that are more visually detailed

Since the premise of prompt upsampling is to automatically add relevant details to the base prompt, the resulting images also contain more visual detail and more vibrant color palettes, which tends to make them more aesthetically pleasing. Below, you'll see a base prompt, an upsampled prompt, and the outputs for both:

Comparison of generated images with and without prompt upsampling


👾 Help represent abstract concepts better with barebones prompts

Since an upsampled prompt contains additional details consistent with the base prompt, it is often quite effective for single-word prompts or even prompts describing abstract concepts.

Prompt upsampling in action for barebones prompts and abstract ideas


💔 Upsampling doesn't magically improve the ability of the model to follow prompts

Caption upsampling was initially proposed in the DALL-E 3 paper as a recipe to improve the prompt-following capability of a text-to-image diffusion model by training it on a dataset of detailed, descriptive captions paired with images. However, applying this technique at inference time does not automatically guarantee better prompt-following. Here are some examples demonstrating failures of the prompt upsampling strategy.

Failure cases for the prompt upsampling strategy


⚗️ More potential experiments

🏁 Conclusion

  • In this report, we learned about prompt upsampling: a prompting technique that uses a large language model to automatically expand a short, basic prompt into a detailed, descriptive one, typically yielding vibrant, aesthetically pleasing images with far more visual detail.
  • We learned how to implement this prompting strategy using the dspy.MultiChainComparison module and how to track and version it using Weave.
  • We also implemented an LLM-assisted evaluation technique to automatically judge the correctness of the generated images, i.e., how closely the base prompt was followed. The LLM judge is implemented as a Weave Model.
  • We also proposed a strategy for implementing this LLM judge as a multi-modal DSPy Module by adding multi-modal prompt-processing capabilities to DSPy's OpenAI interface.
  • Finally, we compared images generated with the prompt upsampling strategy against those generated from the base prompt alone.
  • The code for this project can be found at the repository github.com/soumik12345/diffusion_prompt_upsampling.

📕 Further resources

We have a free prompt engineering course here to help you think about how to structure your prompts. Also, check out the following reports to learn more about developing LLM applications.
