Prompt upsampling for diffusion models
This article shows the implementation of an LLM-assisted prompt upsampling strategy to improve the quality of images generated by Stable Diffusion.
🎬 Introduction
Recent advances in diffusion models have driven dramatic improvements in text-to-image generation. They have paved the way for large text-to-image models such as Stable Diffusion XL, PixArt-Σ, and, more recently, the Flux series from Black Forest Labs, which can generate images from nothing but simple text prompts that rapidly approach the quality of human-made photographs and artwork.
Prompting the current generation of text-to-image diffusion models (such as Stable Diffusion XL) is notoriously brittle. It is hard to craft a prompting strategy that reliably produces images of a given quality, or that even reliably makes the model follow the prompt.
In this report, you will learn how to:
- Implement an LLM-assisted prompt upsampling strategy that can improve the quality of generated images even when using simple prompts.
- Implement an LLM-assisted evaluation strategy to assess the correctness of the generated images.
You can run the code in this report in this Colab:
And finally, here's a look at what we'll be making:
Images generated using LLM-assisted prompt upsampling
Check out the following reports to learn more about prompting strategies for Stable Diffusion 👇
A Guide to Using Stable Diffusion XL with HuggingFace Diffusers and W&B
A comprehensive guide to using Stable Diffusion XL (SDXL) for generating high-quality images using HuggingFace Diffusers and managing experiments with Weights & Biases
A Guide to Prompt Engineering for Stable Diffusion
A comprehensive guide to prompt engineering for generating images using Stable Diffusion, HuggingFace Diffusers and Weights & Biases.
📋 Table of contents
- 🎬 Introduction
- 📋 Table of contents
- ⚙️ Tools you will need
- 🧐 What is prompt upsampling?
- 📎 Installations and Initial Setup
- 🦄 Implementing prompt upsampling using DSPy
- 📹 Building a multi-modal evaluation judge using DSPy
- 🤖 Building a custom multi-modal OpenAI interface for DSPy
- ⛩️ Using DSPy Typed Predictors to Ensure Structured Outputs
- ⚖️ Building the Judge as a Weave Model
- 🚀 Exploring the Results
- 🌤️ Generate images that are more visually detailed
- 👾 Help represent abstract concepts better with barebones prompts
- 💔 Upsampling doesn't magically improve the ability of the model to follow prompts
- ⚗️ More potential experiments
- 🏁 Conclusion
- 📕 Further resources
⚙️ Tools you will need
In this report, you will learn how to implement our prompt upsampling strategy using DSPy, a framework that promotes a modular, programming-oriented approach to prompting. You will also learn how to build a multi-modal prompting workflow with DSPy for our LLM-assisted evaluation strategy.
You will also learn how to use Weave, a lightweight toolkit developed by Weights & Biases for tracking and evaluating your GenAI applications.
Here's how you can track and explore your image generation and prompt upsampling calls on the Weave UI
🧐 What is prompt upsampling?
The secret sauce to getting high quality images from text-to-image diffusion models is to provide more control conditions. These models are not really "intelligent": if we don't tell them precisely what we want, they won't be able to generate images with many details. One way to achieve this is to manually write detailed prompts that give the diffusion model much more context.
Prompt upsampling uses an LLM to automate the writing of a detailed prompt. The idea is to start with the most barebones version of a prompt (such as "a man holding a sword") and let a powerful large language model like GPT-4 fill in the details, ultimately resulting in a better, more detailed-looking image.
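For instance, the barebones prompt "a man holding a sword" might be expanded into something like "a pale figure with long white hair stands in the center of a dark forest, holding a sword high above his head" (the exact wording will vary from run to run and model to model), giving the diffusion model far more to work with.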
📎 Installations and Initial Setup
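The setup boils down to installing the Python dependencies and initializing Weave. The package list and project name below are reasonable guesses based on the code in this report (DSPy, Weave, Diffusers, and the OpenAI client); check the Colab for the authoritative versions.

```python
# Install the libraries used in this report (the Colab may pin different versions).
# !pip install -qU dspy-ai weave diffusers transformers accelerate openai

import os

import weave

# Authenticate with OpenAI and start a Weave project so every call below is traced.
os.environ["OPENAI_API_KEY"] = "sk-..."  # or load it from a secret manager
weave.init(project_name="diffusion-prompt-upsampling")  # hypothetical project name
```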
🦄 Implementing prompt upsampling using DSPy
DSPy is a framework that moves building LM pipelines away from manipulating free-form strings and toward programming: you compose modular operators into text transformation graphs, and a compiler automatically generates optimized LM invocation strategies and prompts from your program.
According to the DSPy programming model, string-based prompting techniques are first translated into declarative modules with natural-language typed signatures. Then, each module is parameterized to learn its desired behavior by iteratively bootstrapping useful demonstrations within the pipeline.
You can check out the following report to learn more about DSPy and its integration with Weave 👇
Want to check out the full code and generate your own images? Run this colab 👇
```python
import dspy

upsampler_llm = dspy.OpenAI(
    model="gpt-4",
    system_prompt="""You are part of a team of bots that creates images. You work with an assistant bot that will draw anything
you say in square brackets. For example, outputting "a beautiful morning in the woods with the sun peaking
through the trees" will trigger your partner bot to output an image of a forest morning, as described.
You will be prompted by people looking to create detailed, amazing images. The way to accomplish this is to
take their short prompts and make them extremely detailed and descriptive.

There are a few rules to follow:
- You will only ever output a single image description per user request.
- Often times, the base prompt might consist of spelling mistakes or grammatical errors. You should correct
such errors before making them extremely detailed and descriptive.
- Image descriptions must be between 15-80 words. Extra words will be ignored.
""",
)
```
We adopt the system prompt for the upsampling workflow from Appendix C in the paper Improving Image Generation with Better Captions.
Next, we create a simple signature specifying the input and output behavior of the prompt-upsampling module.
```python
class PromptUpsamplingSignature(dspy.Signature):
    base_prompt = dspy.InputField()
    answer = dspy.OutputField(
        desc="Create an imaginative image descriptive caption for the given base prompt."
    )
```
We're going to use the dspy.MultiChainComparison module to execute prompt upsampling. This method aggregates all the reasoning attempts and calls the predict method with extended signatures to get the best reasoning.
```python
reasoning_attempts = [
    dspy.Prediction(
        rationale="a man holding a sword",
        answer="a pale figure with long white hair stands in the center of a dark forest, holding a sword high above his head.",
    ),
    dspy.Prediction(
        rationale="a frog playing dominoes",
        answer="a frog sits on a worn table playing a game of dominoes with an elderly raccoon. the table is covered in a green cloth, and the frog is wearing a jacket and a pair of jeans. The scene is set in a forest, with a large tree in the background.",
    ),
    dspy.Prediction(
        rationale="A bird scaring a scarecrow",
        answer="A large, vibrant bird with an impressive wingspan swoops down from the sky, letting out a piercing call as it approaches a weathered scarecrow in a sunlit field. The scarecrow, dressed in tattered clothing and a straw hat, appears to tremble, almost as if it's coming to life in fear of the approaching bird.",
    ),
    # ... you can add more reasoning attempts for better results
]

prompt_upsampling_module = dspy.MultiChainComparison(
    PromptUpsamplingSignature, M=len(reasoning_attempts)
)
```
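Before wrapping the module in a Weave Model, you can sanity check it directly. The snippet below is an illustrative invocation (the base prompt is just an example, not taken from the Colab) that runs the comparison module against the GPT-4 upsampler defined earlier:

```python
# Quick sanity check of the upsampling module against the GPT-4 upsampler.
with dspy.context(lm=upsampler_llm):
    upsampled = prompt_upsampling_module(
        reasoning_attempts, base_prompt="a frog dressed as a knight"
    ).answer
print(upsampled)
```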
Next, we wrap the prompt upsampling and subsequent image generation calls using weave.Model. A Weave Model combines data (including configuration, trained model weights, or other information) and code defining the model's operation. By structuring your code to be compatible with this API, you benefit from a structured way to version your application so you can more systematically keep track of your experiments.
```python
from typing import Optional

import torch
import weave
from diffusers import AutoPipelineForText2Image, DiffusionPipeline


class PromptUpsamplingModel(weave.Model):
    @weave.op()
    def predict(self, base_prompt) -> str:
        with dspy.context(lm=upsampler_llm):
            return prompt_upsampling_module(
                reasoning_attempts, base_prompt=base_prompt
            ).answer


class StableDiffusionXLModel(weave.Model):
    diffusion_model: str
    enable_cpu_offload: bool = True
    prompt_upsampler: PromptUpsamplingModel
    _pipeline: DiffusionPipeline

    def __init__(
        self,
        diffusion_model: str,
        enable_cpu_offload: bool,
        prompt_upsampler: PromptUpsamplingModel,
    ):
        super().__init__(
            diffusion_model=diffusion_model,
            enable_cpu_offload=enable_cpu_offload,
            prompt_upsampler=prompt_upsampler,
        )
        self._pipeline = AutoPipelineForText2Image.from_pretrained(
            self.diffusion_model,
            torch_dtype=torch.float16,
            variant="fp16",
            use_safetensors=True,
        )
        if self.enable_cpu_offload:
            self._pipeline.enable_model_cpu_offload()
        else:
            self._pipeline = self._pipeline.to("cuda")

    @weave.op()
    def predict(
        self,
        base_prompt: str,
        negative_prompt: Optional[str] = None,
        num_inference_steps: Optional[int] = 50,
        image_size: Optional[int] = 1024,
        guidance_scale: Optional[float] = 7.0,
    ) -> dict:
        # Upsample the base prompt first, then condition the diffusion pipeline on it.
        upsampled_prompt = self.prompt_upsampler.predict(base_prompt)
        image = self._pipeline(
            prompt=upsampled_prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=num_inference_steps,
            height=image_size,
            width=image_size,
            guidance_scale=guidance_scale,
        ).images[0]
        return {
            "upsampled_prompt": upsampled_prompt,
            "image": base64_encode_image(image),
        }


prompt_upsampler = PromptUpsamplingModel()
model = StableDiffusionXLModel(
    diffusion_model="stabilityai/stable-diffusion-xl-base-1.0",
    enable_cpu_offload=True,
    prompt_upsampler=prompt_upsampler,
)
sdxl_prediction = model.predict(base_prompt="a frog dressed as a knight")
```
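The base64_encode_image helper used above is defined in the accompanying repository rather than in this report. A minimal sketch of such a helper, assuming the pipeline returns a PIL image and that a base64 data URL is the desired encoding, might look like this:

```python
import base64
import io

from PIL import Image


def base64_encode_image(image: Image.Image, image_format: str = "PNG") -> str:
    """Serialize a PIL image and encode it as a base64 data URL."""
    buffer = io.BytesIO()
    image.save(buffer, format=image_format)
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/{image_format.lower()};base64,{encoded}"
```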
Running the prompt optimization step could cost ~$0.05 in OpenAI credits. To reduce the cost, you can use a cheaper model like GPT-4o or GPT-3.5 Turbo instead of GPT-4.
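Swapping in a cheaper model only requires changing the model name when constructing the DSPy language model. The sketch below is illustrative rather than taken from the Colab; it assumes the system prompt can be reused from the attribute that dsp's GPT3 base class stores on the upsampler instance:

```python
# Hypothetical cost-saving variant: same system prompt, cheaper model.
cheaper_upsampler_llm = dspy.OpenAI(
    model="gpt-3.5-turbo",  # or "gpt-4o"
    system_prompt=upsampler_llm.system_prompt,
)

# Use it exactly like the GPT-4 upsampler.
with dspy.context(lm=cheaper_upsampler_llm):
    print(prompt_upsampling_module(reasoning_attempts, base_prompt="a man holding a sword").answer)
```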

Here's the trace for image generation with the upsampled prompt. You can check out a sample trace in the Weave UI.
📹 Building a multi-modal evaluation judge using DSPy
Let's now implement an LLM-assisted evaluation strategy to automatically evaluate our generated images for prompt-following, i.e., how accurately the generated image follows the corresponding base prompt. To implement this metric, we use a multi-modal LLM like GPT-4o to look at the generated image and the base prompt, and ask it to assign a correctness score between 0 and 1 and justify the score with an explanation.
🤖 Building a custom multi-modal OpenAI interface for DSPy
DSPy doesn't natively support multi-modal prompts. Hence, we first build a custom language model interface called DSPyOpenAIMultiModalLM on top of dsp.GPT3 and implement the logic for interpreting multi-modal prompts. This class can now act as a drop-in replacement for dspy.OpenAI for multi-modal prompts with base64 encoded images.
```python
import os

import weave
from dsp import GPT3
from openai import OpenAI


class DSPyOpenAIMultiModalLM(GPT3):
    def __init__(
        self,
        model: str = "gpt-4o",
        api_key: str | None = None,
        system_prompt: str | None = None,
        **kwargs,
    ):
        super().__init__(
            model,
            api_key,
            api_provider="openai",
            api_base=None,
            model_type=None,
            system_prompt=system_prompt,
            **kwargs,
        )
        self.model_type = model
        self._openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    @weave.op()
    def create_messages(self, prompt: str):
        # Separate the base64-encoded images from the text portion of the prompt.
        images = find_base64_images(prompt)
        for image in images:
            prompt = prompt.replace(image, "")
        user_prompt = [{"type": "text", "text": prompt}]
        for image in images:
            user_prompt.append({"type": "image_url", "image_url": {"url": image}})
        messages = []
        if self.system_prompt:
            messages.append({"role": "system", "content": self.system_prompt})
        messages.append({"role": "user", "content": user_prompt})
        return messages

    @weave.op()
    def basic_request(self, prompt: str, **kwargs):
        messages = self.create_messages(prompt)
        response = self._openai_client.chat.completions.create(
            model=self.model_type, messages=messages, **kwargs
        )
        self.history.append({"prompt": prompt, "response": response, "kwargs": kwargs})
        return response

    @weave.op()
    def request(self, prompt: str, **kwargs):
        return super().request(prompt, **kwargs)

    @weave.op()
    def __call__(self, prompt: str, only_completed: bool = True, **kwargs) -> list:
        response = self.request(prompt, **kwargs)
        choices = (
            [choice for choice in response.choices if choice.finish_reason == "stop"]
            if only_completed and len(response.choices) != 0
            else response.choices
        )
        return [choice.message.content for choice in choices]
```
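Like base64_encode_image, the find_base64_images helper comes from the accompanying repository. A minimal sketch, assuming images are embedded in the prompt as base64 data URLs, could be a simple regular expression:

```python
import re


def find_base64_images(text: str) -> list[str]:
    """Return all base64 image data URLs embedded in a prompt string."""
    return re.findall(r"data:image/[A-Za-z]+;base64,[A-Za-z0-9+/=]+", text)
```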
Want to check out the full code and generate your images? Run this colab 👇
⛩️ Using DSPy Typed Predictors to Ensure Structured Outputs
Next, we define the judge module's DSPy signature to structure the inputs and outputs according to a fixed Pydantic schema. When building the predictor for the JudgeSignature, we use dspy.TypedPredictor, which lets us provide the input and parse the output of the module in a structured manner consistent with the Pydantic schema.
```python
from pydantic import BaseModel, Field


class JudgeInput(BaseModel):
    base_prompt: str = Field(description="The base prompt used to generate the image")
    generated_image: str = Field(description="The generated image")


class JudgeMent(BaseModel):
    think_out_loud: str = Field(description="Think out loud about your eventual judgement")
    score: float = Field(description="A score between 0 and 1")
    judgement: str = Field(description="Output either 'correct' or 'incorrect'")


class JudgeSignature(dspy.Signature):
    input: JudgeInput = dspy.InputField()
    output: JudgeMent = dspy.OutputField()


class MultiModalJudgeModule(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.TypedPredictor(JudgeSignature)

    @weave.op()
    def forward(self, base_prompt: str, generated_image: str) -> dict:
        return self.prog(
            input=JudgeInput(base_prompt=base_prompt, generated_image=generated_image)
        ).output


judgement_module = MultiModalJudgeModule()
```
⚖️ Building the Judge as a Weave Model
We will adopt the evaluation prompt from Appendix D of the paper Improving Image Generation with Better Captions as the multi-modal judge's system prompt.
```python
JUDGE_SYSTEM_PROMPT = """You are responsible for judging the faithfulness of images generated by a computer program to the
base prompt used to generate them. You will be presented with an image and given the base prompt
that was used to produce the image. The base prompts you are judging are designed to stress-test
image generation programs, and may include things such as:
1. Scrambled or mis-spelled words (the image generator should produce an image associated with
the probable meaning).
2. Color assignment (the image generator should apply the correct color to the correct object).
3. Abnormal associations, for example 'elephant under a sea', where the image should depict
what is requested.
4. Descriptions of objects, where the image generator should draw the most commonly associated object.
5. Rare single words, where the image generator should create an image somewhat associable with
the specified image.
6. Images with text in them, where the image generator should create an image with the specified
text in it.

You need to make a decision as to whether or not the image is correct, given the base prompt.
You will first think out loud about your eventual judgement, enumerating reasons why the image
does or does not match the given base prompt. After thinking out loud, you should assign a score
between 0 and 1 depending on how much you think the image is faithful to the base prompt. Next,
you should output either 'correct' or 'incorrect' depending on whether you think the image is
faithful to the base prompt.

A few rules:
1. The score should be used to indicate how close the image is to the base prompt in terms of objects,
color or count; with 0 being very far and 1 being very close.
2. If other objects are present in the image that are not explicitly mentioned by the base prompt,
assign a higher score.
3. If the objects being displayed are deformed, assign a lower score. Assign a higher score if the objects
are displayed in a more detailed manner.
4. 'incorrect' should be reserved for instances where a specific aspect of the base prompt is not followed
correctly, such as a wrong object, color or count, and the score should be less than or equal to 0.5.
"""
```
Finally, we will write the OpenAI multi-modal judge as a Weave model.
```python
class OpenAIJudgeModel(weave.Model):
    openai_model: str
    seed: int
    _judgement_llm: dspy.Module

    def __init__(self, openai_model: str = "gpt-4o", seed: int = 42):
        super().__init__(openai_model=openai_model, seed=seed)
        # Pass the configured model name through to the multi-modal LM.
        self._judgement_llm = DSPyOpenAIMultiModalLM(
            model=self.openai_model, system_prompt=JUDGE_SYSTEM_PROMPT, seed=self.seed
        )

    @weave.op()
    def predict(self, base_prompt: str, generated_image: str) -> JudgeMent:
        with dspy.context(lm=self._judgement_llm):
            judgement = judgement_module(base_prompt, generated_image)
        return judgement

    @weave.op()
    def score(self, base_prompt: str, model_output: dict) -> dict:
        judgement: JudgeMent = self.predict(
            base_prompt=base_prompt, generated_image=model_output["image"]
        )
        return {
            "score": judgement.score,
            "is_image_correct": judgement.judgement == "correct",
        }


judge_model = OpenAIJudgeModel()
judgement = judge_model.score("a frog dressed as a knight", sdxl_prediction)
```

Here's the multi-modal OpenAI judge trace for base prompts and generated images. You can check out a sample trace in the Weave UI.
Want to check out the full code and generate your images? Run this colab 👇
🚀 Exploring the Results
Let's explore some images generated by upsampling simple prompts.
🌤️ Generate images that are more visually detailed
Since prompt upsampling's premise is to automatically add relevant details to the base prompt, the resulting images contain more visual detail and more vibrant color palettes, which tend to make them look more aesthetically pleasing. Below, you'll see a base prompt, an upsampled prompt, and the outputs for both:
Comparison of generated images with and without prompt upsampling
👾 Help represent abstract concepts better with barebones prompts
Since an upsampled prompt contains additional details consistent with the base prompt, it is often quite effective for single-word prompts or even prompts about abstract concepts.
Prompt upsampling in action for barebones prompts and abstract ideas
💔 Upsampling doesn't magically improve the ability of the model to follow prompts
Caption upsampling was initially proposed in the DALL-E 3 paper as a recipe to improve the prompt-following capability of a text-to-image diffusion model by training it on a dataset of detailed, descriptive captions paired with images. However, applying this technique at inference time does not automatically guarantee better prompt-following. Here are some examples demonstrating the failure of the prompt upsampling strategy.
Failure cases for the Prompt upsampling strategy
⚗️ More potential experiments
- Run evaluations with the LLM judge using Weave Evaluations on datasets like DrawBench and Parti-Prompts (a minimal sketch follows this list).
- Since the entire prompting strategy is implemented using DSPy, we can attempt to optimize it further using DSPy Teleprompters.
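Here is a minimal sketch of what the first suggestion could look like: it wires the judge's score method into a Weave Evaluation and runs the Stable Diffusion XL model over a tiny, made-up dataset. The prompts below are placeholders (not from DrawBench or Parti-Prompts), weave.init is assumed to have been called during setup, and the scorer argument names (base_prompt, model_output) follow the score method defined above; newer Weave releases may use a different name for the model output argument.

```python
import asyncio

import weave


# A tiny placeholder dataset; swap in DrawBench or Parti-Prompts rows here.
dataset = [
    {"base_prompt": "a frog dressed as a knight"},
    {"base_prompt": "a bird scaring a scarecrow"},
]


@weave.op()
def prompt_following_score(base_prompt: str, model_output: dict) -> dict:
    # Delegate to the OpenAIJudgeModel defined above.
    return judge_model.score(base_prompt=base_prompt, model_output=model_output)


evaluation = weave.Evaluation(dataset=dataset, scorers=[prompt_following_score])

# `model` is the StableDiffusionXLModel defined earlier; its predict method returns
# the dict that the scorer expects as `model_output`.
asyncio.run(evaluation.evaluate(model))
```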
🏁 Conclusion
- In this report, we learned about prompt upsampling. This prompting technique automatically adds details to a short and basic prompt using a large language model to convert it into a descriptive one. Thus, it reliably results in vibrant and aesthetically pleasing images with many more visual details.
- We learned how to implement this prompting strategy using the dspy.MultiChainComparison module, and how to track and version it using Weave.
- We also implemented an LLM-assisted evaluation technique to automatically judge the correctness of the generated images with respect to how closely the base prompt was followed. The LLM judge is implemented as a Weave Model.
- We also proposed a strategy to implement this LLM judge as a multi-modal DSPy module by adding multi-modal prompt processing capabilities to the OpenAI interface for DSPy.
- Finally, we compared the results generated using the prompt upsampling strategy against images generated directly from the base prompts.
- The code for this project can be found at the repository github.com/soumik12345/diffusion_prompt_upsampling.
📕 Further resources
We have a free prompt engineering course here to help you think about how to structure your prompts. Also, check out the following reports to learn more about developing LLM applications.
How to optimize LLM workflows using DSPy and W&B Weave
Learn how to use DSPy teleprompters and Weave to automatically optimize prompting strategies for causal reasoning
Building an AI teacher's assistant using LlamaIndex and Groq
Today, we're going to leverage a RAG pipeline to create an AI TA capable of helping out with grading, questions about a class syllabus, and more
Refactoring Wandbot—our LLM-powered document assistant—for improved efficiency and speed
This report tells the story of how we utilized auto-evaluation-driven development to enhance both the quality and speed of Wandbot.
GPT-4o Python quickstart using the OpenAI API
Getting set up and running GPT-4o on your machine in Python using the OpenAI API.