
Improving Generative Images with Instructions: Prompt-to-Prompt Image Editing with Cross Attention Control

A primer on text-driven image editing for large-scale text-based image synthesis models like Stable Diffusion & Imagen




Introduction

Recently, large-scale text-driven synthesis models (large-scale language-image, or LLI, models like DALL·E 2, Imagen, Craiyon, and Stable Diffusion) have attracted a lot of attention due to their remarkable ability to generate highly diverse images that follow a given text prompt. Such text-based synthesis methods are particularly appealing to us humans because, until now, we've never really been able to verbally describe our intent and have it show up on screen, at least not without the help of a rather skilled designer. The next logical step is not simply generation but editing, which, as luck would have it, is the topic of our report today: text-driven image editing.
Before we jump in too deeply: editing is challenging for these kinds of generative models, because an innate requirement of an editing technique is to preserve most of the original image. In the aforementioned text-based models, however, even a small modification of the text prompt often leads to a completely different outcome. Many state-of-the-art methods require the user to provide a spatial mask to localize the edit, which has several downsides. To name a few:
  • Providing a mask is cumbersome and doesn't ensure a good user experience (often quite the opposite in fact).
  • An edit conditioned on a mask-based localization might not yield good results if it has an inadequate shape.
  • An edit conditioned on a localized input mask would ignore the original structure and content within the masked region.
Our question then is this:
Could there be a simple and intuitive way we could manipulate the images generated by a large-scale language-image model?
This is the question that the authors of the paper Prompt-to-Prompt Image Editing with Cross Attention Control attempt to answer. They propose a simple prompt-to-prompt editing framework for large-scale language-image models in which the edits are controlled by text only. The authors analyze Imagen, a text-conditioned image synthesis model, and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image and each word in the prompt.
In this report, we will take an in-depth look at how this approach to prompt-to-prompt image editing works. We'll also try to reproduce the ideas presented in the paper using Stable Diffusion, a family of open-source large-scale text-driven synthesis models. Before digging into the paper, however, let's take a look at a few results of this approach applied to Stable Diffusion to get you excited about how well it works:

Editing an Image by Replacing a Word in the Prompt



Editing an Image by Adding Multiple Specifications



Editing an Image by Replacing Words

This article was written as a Weights & Biases Report, a project management and collaboration tool for machine learning projects. Reports let you organize and embed visualizations, describe your findings, share updates with collaborators, and more. To learn more about Reports, check out Collaborative Reports.
💡



The Key Idea

In this paper, the authors introduce an intuitive and powerful textual editing method to semantically edit images in pre-trained text-conditioned diffusion models via prompt-to-prompt manipulations.
To do so, the authors dive deep into the cross-attention layers and explore their semantic strength as a handle to control the generated image. Specifically, they analyze internal cross-attention maps, which are high-dimensional tensors that bind pixels and tokens extracted from the prompt text. They find that these maps contain rich semantic relations which critically affect the generated image.
The key idea proposed in this paper is that we can edit images by injecting the cross-attention maps during the diffusion process, controlling which pixels attend to which tokens of the prompt text during which diffusion steps.
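To make the notion of a cross-attention map concrete, here is a minimal, illustrative sketch (not the actual Imagen or Stable Diffusion code) of how a cross-attention layer turns flattened image features and prompt-token embeddings into one spatial map per token. The `cross_attention_maps` function and its `to_q`/`to_k` arguments are our own placeholders standing in for the layer's learned query and key projections.

```python
import torch
import torch.nn.functional as F

def cross_attention_maps(image_features, text_embeddings, to_q, to_k):
    """Illustrative only: compute per-token spatial attention maps M.

    image_features:  (batch, num_pixels, dim)  flattened spatial features
    text_embeddings: (batch, num_tokens, dim)  encoded prompt tokens
    to_q, to_k:      the cross-attention layer's linear projections
    """
    q = to_q(image_features)      # queries come from the image
    k = to_k(text_embeddings)     # keys come from the prompt tokens
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bpd,btd->bpt", q, k) * scale
    # M[b, p, t] = how strongly pixel p attends to token t
    return F.softmax(scores, dim=-1)
```

Inside the UNet these maps exist at several spatial resolutions; prompt-to-prompt reads them out while generating the source image and overwrites them while generating the edited one.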
The proposed pipeline can be broken down into two steps (a sketch of the resulting injection loop follows the figures below):
  1. Visual and textual embeddings are fused using cross-attention layers that produce spatial attention maps for each textual token.
  2. We then control the spatial layout and geometry of the generated image using the attention maps of the source image. This enables various editing tasks by editing the textual prompt only, such as:
    1. swapping words by injecting the source image maps $M_t$, overriding the target image maps $M_t^*$, to preserve the spatial layout
    2. adding a new phrase by injecting only the maps that correspond to the unchanged part of the prompt
    3. amplification or attenuation of the semantic effect of a word achieved by re-weighting the corresponding attention map
Visual and textual embeddings are fused using cross-attention layers that produce spatial attention maps for each textual token. Source: Figure 3 from the paper.
Controlling the spatial layout and geometry of the generated image using the attention maps of a source image, thus enabling various editing tasks through editing the textual prompt only. Source: Figure 3 from the paper.
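Putting the two steps together, the editing procedure can be pictured as two diffusion processes that share their initial noise, with the target pass having its cross-attention maps overridden by an Edit function. The outline below is a hypothetical sketch only: `model.step` and its `attn_override` argument are placeholder interfaces, not a real diffusers or Imagen API.

```python
import torch

def prompt_to_prompt(model, source_prompt, target_prompt, edit_fn, num_steps=50, seed=0):
    """Hypothetical outline of the dual-generation loop; `model.step` and
    `attn_override` are placeholder interfaces, not a real library API."""
    generator = torch.Generator().manual_seed(seed)
    z_src = torch.randn(1, 4, 64, 64, generator=generator)  # latent noise (SD-like shape)
    z_tgt = z_src.clone()  # identical starting noise for the source and the edited image

    for t in reversed(range(num_steps)):
        # Source pass: one denoising step that also returns its cross-attention maps M_t.
        z_src, M_src = model.step(z_src, source_prompt, t)
        # Target pass: its own maps M_t* are replaced by Edit(M_t, M_t*, t) inside the UNet.
        z_tgt, _ = model.step(
            z_tgt, target_prompt, t,
            attn_override=lambda M_tgt: edit_fn(M_src, M_tgt, t),
        )
    return z_src, z_tgt
```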


The Proposed Prompt-to-prompt Image Editing Algorithm




Applications of Prompt-to-Prompt Image Editing

Let us now take a look at some applications of this simple and novel prompt-to-prompt image editing technique. Note that the results were produced using the open-source model Stable Diffusion instead of Imagen (which the authors used in the paper). This might lead to significant differences in the quality of the generated images. You can play with the image editing technique on your own prompts using the following Google Colab notebook!
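For reference, here is what plain Stable Diffusion generation looks like with the Hugging Face `diffusers` library (assuming you have the package installed and access to the model weights). Prompt-to-prompt editing additionally requires hooking the UNet's cross-attention layers, which is what the Colab notebook takes care of.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the publicly released Stable Diffusion weights.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Fix the seed: prompt-to-prompt compares two generations, so the edited
# prompt must reuse the same initial latent as the source prompt.
generator = torch.Generator("cuda").manual_seed(42)
image = pipe("a photo of a house on a mountain", generator=generator).images[0]
image.save("source.png")
```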




Adding a New Phrase

In this setting, we add new tokens to the prompt. To preserve the common details, we apply the attention injection only over the common tokens from both prompts. Formally, we use an alignment function $A$ that receives a token index from the target prompt $\mathcal{P}^*$ and outputs the corresponding token index in the source prompt $\mathcal{P}$, or None if there isn't a match. The editing function is then given by:
$$\left(\operatorname{Edit}\left(M_t, M_t^*, t\right)\right)_{i, j}:= \begin{cases}\left(M_t^*\right)_{i, j} & \text{if } A(j)=\text{None} \\ \left(M_t\right)_{i, A(j)} & \text{otherwise}\end{cases}$$
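A minimal sketch of this editing rule might look as follows. It assumes the maps are stored as (pixels × tokens) tensors and that the alignment $A$ has already been computed from the two tokenized prompts; both the function name and the `alignment` format are our own illustration, not the authors' code.

```python
def edit_add_phrase(M_src, M_tgt, alignment):
    """Sketch of the Edit rule above for an added phrase.

    M_src:     (num_pixels, num_source_tokens)  source maps M_t
    M_tgt:     (num_pixels, num_target_tokens)  target maps M_t*
    alignment: list where alignment[j] is the source index A(j), or None
               if token j only exists in the target prompt.
    """
    M_edit = M_tgt.clone()
    for j, a in enumerate(alignment):
        if a is not None:
            # Common token: keep the source map to preserve the original layout.
            M_edit[:, j] = M_src[:, a]
        # New token (A(j) = None): keep the target's own map, already in M_edit.
    return M_edit
```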


The following are some examples of editing global features of a generated image by inserting additional phrases into the original prompt. Some of these edits make use of amplification or attenuation of the semantic effect of a word, achieved by re-weighting the corresponding attention map. These weights can be found in the Run table below the visualization table, denoted by a list of tuples: in each tuple, the first number is the index of the token in the edit prompt to be re-weighted and the second number is its new relative weight.
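Here is a sketch of that re-weighting, under the same illustrative (pixels × tokens) layout as above; the tuple format mirrors the one shown in the Run table, but the function itself is our own simplification rather than the official implementation.

```python
def reweight_attention(M, token_weights):
    """Sketch: amplify or attenuate the semantic effect of specific tokens.

    M:             (num_pixels, num_tokens) cross-attention maps for the edit prompt
    token_weights: list of (token_index, weight) tuples, e.g. [(5, 2.0)]
                   to amplify the effect of token 5.
    """
    M = M.clone()
    for idx, w in token_weights:
        M[:, idx] = M[:, idx] * w  # scale the spatial map of that token
    return M
```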


Editing global features of a generated image by inserting additional phrases into the original prompt

Here are some examples of editing specific local features of a generated image by inserting additional phrases into the original prompt.

Editing local features of a generated image by inserting additional phrases into the original prompt



Word Swap

In this setting, the user swaps tokens of the original prompt with others. The main challenge is to preserve the original composition while also addressing the content of the new prompt. To achieve this, the attention maps of the source image are injected into the generation with the modified prompt. However, this attention injection may over-constrain the geometry, especially when the swap implies a large structural modification (for example, changing a "car" to a "bicycle"). The authors address this with a softer attention constraint:
$$\operatorname{Edit}\left(M_t, M_t^*, t\right):= \begin{cases}M_t^* & \text{if } t<\tau \\ M_t & \text{otherwise}\end{cases}$$

where $\tau$ is a timestep parameter that determines until which diffusion step the attention maps of the source image are injected into the generation with the modified prompt.
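As a sketch, the rule translates almost directly into code; here `t` is assumed to count down during sampling, matching the loop sketched earlier, and the function name is our own.

```python
def edit_word_swap(M_src, M_tgt, t, tau):
    """Sketch of the softened word-swap rule: while t >= tau (the early,
    high-noise steps) the source maps M_t are injected to preserve the
    composition; once t < tau the target's own maps M_t* take over, giving
    the new word room to change the geometry."""
    return M_tgt if t < tau else M_src
```

Plugged into the earlier loop, this would be passed as, for example, `edit_fn=lambda ms, mt, t: edit_word_swap(ms, mt, t, tau=25)`, with $\tau$ chosen as a fraction of the total number of diffusion steps.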
Note that the composition is determined in the early steps of the diffusion process. Therefore, by limiting the number of injection steps, we can guide the composition of the newly generated image while allowing the necessary geometric freedom for adapting to the new prompt.
💡
Here are some examples of editing a generated image by swapping words or phrases in the original prompt.

Editing a generated image by swapping words or phrases in the original prompt




Conclusion

  • In this report, we discuss a simple and intuitive prompt-to-prompt image editing technique for large-scale pre-trained text-driven image synthesis models as proposed in the paper Prompt-to-Prompt Image Editing with Cross Attention Control.
  • We attempt to reproduce the results from the paper using the open-source Stable Diffusion instead of Imagen, which was used by the authors.
  • We saw that the cross-attention layers of text-driven image synthesis models produce interpretable spatial maps that play a key role in tying the words in the text prompt to the spatial layout of the synthesized image.
  • Building on this observation, we explored how various manipulations of the prompt can directly control attributes of the synthesized image, paving the way for applications including local and global editing.
  • The authors speculate that this work is a first step towards providing users with simple and intuitive means to edit images, leveraging textual semantic power.

If you'd like to try out editing some of your own generations, this Colab will get you going. Feel free to share any particularly notable generations with us, or tag us on Twitter at @weights_biases.
Thanks for reading!





