A Guide to Using Stable Diffusion XL with HuggingFace Diffusers and W&B
A comprehensive guide to using Stable Diffusion XL (SDXL) for generating high-quality images using HuggingFace Diffusers and managing experiments with Weights & Biases
🎬 Introduction
Stable Diffusion XL 1.0 is the latest model in the Stable Diffusion family of text-to-image models from Stability AI. Stable Diffusion XL enables us to create gorgeous images with shorter descriptive prompts, as well as generate words within images. The model is a significant advancement in image generation capabilities, offering enhanced image composition and face generation that results in stunning visuals and realistic aesthetics.
In this report, we'll explore:
- The architecture of Stable Diffusion XL.
- How we can use the 🧨 Diffusers library by 🤗 HuggingFace to generate stunning images.
- How we can use Weights & Biases to manage our experiments and ensure reproducibility for our generations.
- And how we can use Compel to perform prompt-weighting in order to achieve better control over our generated images with Stable Diffusion XL.
As a note, you can run the code in this report via this interactive Colab:
And, since this is a GenAI report, we know what you want up front: some good-looking images to get you started. We got you.
Examples of some stunning images generated by Stable Diffusion XL
Table of Contents
- 🎬 Introduction
- 🦄 A Brief Overview of Stable Diffusion XL
- 💨 Creating the Diffusion Models
- 🎨 Performing Text-Conditional Image Generation
- 🪄 Exploring the Results on Weights & Biases 🐝
- 🦋 Using Compel to Assign Weights to the Tokens of Our Prompts
- 👨🎨 Some Tips to Generate the Best Images Using SDXL
- 🏁 Conclusion
🦄 A Brief Overview of Stable Diffusion XL
Stable Diffusion XL (SDXL) is a latent diffusion model for text-to-image synthesis proposed in the paper SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. Some notable improvements in the model architecture introduced by SDXL are:
- SDXL leverages a larger UNet backbone. Three times the size, in fact. The increase in parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder.
- SDXL leverages multiple novel conditioning schemes and is trained on multiple aspect ratios.
- The image generation pipeline also leverages a specialized high-resolution refiner model which is used to improve the visual fidelity of samples generated by SDXL using an image-to-image diffusion technique.
- The base diffusion model generates initial latent tensors of size 128x128, which can be passed through a Variational Autoencoder (VAE) trained with a KL loss to decode the high-resolution image.
- The latent tensors can also be passed to the refiner model, which applies SDEdit using the same prompt. Although the base SDXL model is capable of generating stunning images with high fidelity, the refiner model is useful in many cases, especially for refining samples with low local quality, such as deformed faces, eyes, and lips.
- SDXL and the refinement model use the same autoencoder.

Visualization of the two-stage pipeline for SDXL. Source: Figure 1 from the paper.
💨 Creating the Diffusion Models
Next, we'll explore how we can create the diffusion models using 🧨 Diffusers and set up Weights & Biases 🐝 for experiment management!
Creating the Diffusion models using 🧨 Diffusers and setting up Weights & Biases 🐝 for experiment management.
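As a rough sketch, the setup might look something like this. The checkpoint names are the official SDXL 1.0 base and refiner releases on the HuggingFace Hub, while the W&B project name and the config values are placeholders to adapt to your own experiments.

```python
import torch
import wandb
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

# Track the experiment configuration with Weights & Biases for reproducibility
wandb.init(project="stable-diffusion-xl", job_type="text-to-image")
config = wandb.config
config.seed = 42
config.num_inference_steps = 40
config.high_noise_fraction = 0.8  # fraction of denoising steps handled by the base model

# Load the SDXL base pipeline in half precision
base_pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")

# Load the refiner, reusing the base model's second text encoder and VAE to save memory
refiner_pipeline = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base_pipeline.text_encoder_2,
    vae=base_pipeline.vae,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")
```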
🎨 Performing Text-Conditional Image Generation
Next, we'll pass the prompts (and the negative prompts) to the base model, then pass the output to the refiner for further refinement.
To learn more about the different refinement techniques that can be used with SDXL, you can check the 🧨 Diffusers docs.
The image-generation and refinement pipelines using 🧨 Diffusers
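Concretely, the two-stage handoff might look like the sketch below, reusing base_pipeline, refiner_pipeline, and the wandb config from the setup above. The prompts and the 0.8 split between base and refiner steps are illustrative.

```python
import torch

prompt = "a photograph of an astronaut riding a horse, highly detailed"
negative_prompt = "low quality, deformed, blurry"

generator = torch.Generator("cuda").manual_seed(config.seed)

# The base model runs the first 80% of the denoising steps and returns latents
latents = base_pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=config.num_inference_steps,
    denoising_end=config.high_noise_fraction,
    generator=generator,
    output_type="latent",
).images

# The refiner picks up the remaining 20% of the steps on the same latents
image = refiner_pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=config.num_inference_steps,
    denoising_start=config.high_noise_fraction,
    image=latents,
).images[0]
```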
🪄 Exploring the Results on Weights & Biases 🐝
🪄 Building Your Own Art Gallery!
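One simple way to build such a gallery is to log every generation as a row of a wandb.Table, alongside the settings that produced it. The column names below are just an example, and the variables come from the earlier sketches.

```python
import wandb

# Log prompts, settings, and the generated image as a row in a W&B Table
table = wandb.Table(columns=["prompt", "negative-prompt", "seed", "generated-image"])
table.add_data(prompt, negative_prompt, config.seed, wandb.Image(image))

wandb.log({"text-to-image": table})
wandb.finish()
```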
🦋 Using Compel to Assign Weights to the Tokens of Our Prompts
Compel is a text prompt weighting and blending library for transformers-type text embedding systems, developed by damian0815. Compel provides a flexible and intuitive syntax that enables us to re-weight different parts of a prompt string, and thus re-weight the different parts of the embedding tensor produced from that string. Compel is compatible with the diffusers.DiffusionPipeline classes we have been using in our image-generation workflow up to this point.
Using Compel, we can weight the importance of certain tokens in our prompts by appending ++ or -- to increase or decrease their influence on the generated image, giving us finer-grained control over our generations.
Using Compel to Enhance Our Prompts
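Because SDXL has two text encoders and also expects pooled text embeddings, Compel has to be wired up slightly differently than for earlier Stable Diffusion models. A minimal sketch, following the pattern documented by Compel and reusing base_pipeline from above (the prompt and weights are illustrative):

```python
from compel import Compel, ReturnedEmbeddingsType

# Build a Compel instance that understands both of SDXL's text encoders
compel = Compel(
    tokenizer=[base_pipeline.tokenizer, base_pipeline.tokenizer_2],
    text_encoder=[base_pipeline.text_encoder, base_pipeline.text_encoder_2],
    returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED,
    requires_pooled=[False, True],
)

# "++" upweights a token, "--" downweights it
prompt = "a majestic lion++ in a misty-- forest, golden hour lighting"
conditioning, pooled = compel(prompt)

image = base_pipeline(
    prompt_embeds=conditioning,
    pooled_prompt_embeds=pooled,
    num_inference_steps=40,
).images[0]
```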
👨🎨 Some Tips to Generate the Best Images Using SDXL
- SDXL is a machine-learning model and doesn't really know how to read our minds. Hence, we need to be extremely specific with our prompts.
- Make use of negative prompts to remove the undesired qualities or features from the generated image.
- Make use of the prompt-weighting mechanics from Compel to gain more fine-grained control over your images.
- Do not expect the generated images to achieve perfect photorealism or render legible text.
- Make use of the refiner whenever possible; in many situations, it can help fix deformed faces and other artifacts in the output of the base model.
- SDXL is not guaranteed to be free of social or regional biases.
- SDXL works especially well with image sizes between 768×768 and 1024×1024 pixels.
- Since SDXL has two text encoders, we can pass a different prompt to each text encoder, or even pass different parts of the same prompt to each; see the sketch after the image below.
- Finally, generating the perfect image that you're imagining is going to be an iterative process, spanning several experiments. Check out how many iterations it took me to render the following image 👇
Finally got the image I was looking for after 12 tries!
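As promised in the tips above, here is a small sketch of the dual-prompt trick, reusing base_pipeline from earlier. SDXL pipelines in 🧨 Diffusers accept a secondary prompt via the prompt_2 and negative_prompt_2 arguments; the prompts below are only examples.

```python
# SDXL has two text encoders: `prompt` is routed to the first and
# `prompt_2` to the second, so different aspects of the scene
# (e.g. subject vs. style) can be described separately.
image = base_pipeline(
    prompt="a cozy cabin in the woods, ultra-detailed",
    prompt_2="watercolor painting, soft pastel colors",
    negative_prompt="low quality, blurry",
    negative_prompt_2="oversaturated",
    num_inference_steps=40,
).images[0]
```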
🏁 Conclusion
- In this report, we discussed and explored the amazing text-conditional image generation capabilities of Stable Diffusion XL 1.0.
- We briefly explored the architecture of SDXL as proposed by the paper SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis.
- We then explored how we can use the 🧨 Diffusers library by 🤗 HuggingFace to generate stunning images using SDXL.
- We also explored how we can use Weights & Biases to manage our experiments and not only ensure reproducibility for our generations but also maintain our own SDXL-generated art gallery!
- We also discussed the different techniques to leverage the refiner model in the SDXL pipeline. For more detailed information on the usage of the refiner module, you can check out the original docs.
- We also learned how we can use Compel to assign weights to the tokens in our prompts in order to achieve better control over the nature of our generations with SDXL.
- I am immensely grateful to Sayak Paul for the guidance he provided while writing this report. I am also grateful to my friends Atanu, Arindam, and Dripto for sharing with me cool prompt ideas that helped me generate the stunning images used in this report.