
InstructPix2Pix on HuggingFace

InstructPix2Pix edits images via text instructions and boasts fast editing.
InstructPix2Pix trains a conditional diffusion model to edit images given text instructions. How does it work?
The project webpage, with links to the paper, GitHub repo, and HuggingFace Spaces demo, is here: https://www.timothybrooks.com/instruct-pix2pix.
The authors of the paper split their approach into two steps: generating a custom paired dataset and training the InstructPix2Pix model itself.
They opted for a supervised training approach where each example pairs an image and its caption with an edited image and an edited caption. The example they use is "photograph of a girl riding a horse" and "photograph of a girl riding a dragon", with a corresponding image for each caption. This dataset was generated with the combined help of GPT-3 and Stable Diffusion.

Generating the Dataset

The authors took a small subset of 700 image captions from LAION-Aesthetics V2 6.5+ and manually wrote an edit instruction and an output (edited) caption for each. GPT-3, which already comes with a wealth of background knowledge, was then finetuned on these examples so it could edit captions in a sensible way, such that the corresponding edited image would make sense. With this finetuned GPT-3, they generated instructions and edited captions for all the captions in the LAION-Aesthetics dataset. The resulting dataset is diverse in the sense that it covers different mediums and popular culture.
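To make this concrete, here is a rough sketch of what one of the 700 human-written finetuning examples could look like. The field names and exact wording are illustrative, not the released format.

```python
# One hand-written finetuning example for GPT-3 (field names and wording
# are illustrative, not the authors' exact data format).
example = {
    # original LAION caption: the model's input
    "input_caption": "photograph of a girl riding a horse",
    # human-written edit instruction: part of the target completion
    "instruction": "have her ride a dragon",
    # human-written edited caption: rest of the target completion
    "edited_caption": "photograph of a girl riding a dragon",
}

# After finetuning on ~700 such examples, the model is given a new caption
# and asked to produce a plausible instruction plus an edited caption.
```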
With a set of captions and their edited counterparts, these were passed to Stable Diffusion to generate images. Of course, feeding Stable Diffusion an edited caption doesn't guarantee that only that particular aspect of the image changes. The authors leveraged a method called Prompt-to-Prompt, which controls how similar the two generated images are to each other, as sketched below.
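As a very rough sketch of the pair generation step with the diffusers library (not the authors' actual pipeline), one could naively generate both images from the same seed. The model id is illustrative, and in the paper Prompt-to-Prompt is used so the two generations stay structurally similar beyond what a shared seed gives you.

```python
import torch
from diffusers import StableDiffusionPipeline

# Naive before/after generation; the paper additionally uses Prompt-to-Prompt
# so both images come from a shared generation process.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

caption = "photograph of a girl riding a horse"
edited_caption = "photograph of a girl riding a dragon"

# Reusing the same seed helps, but without Prompt-to-Prompt the two images
# can still differ in ways unrelated to the intended edit.
generator = torch.Generator("cuda").manual_seed(0)
before = pipe(caption, generator=generator).images[0]

generator = torch.Generator("cuda").manual_seed(0)
after = pipe(edited_caption, generator=generator).images[0]
```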

Training InstructPix2Pix

They began with a pretrained Stable Diffusion checkpoint and added a few changes to support image conditioning. They also incorporated a technique called classifier-free diffusion guidance. Classifier guidance is a method that gives you more control over how strongly the model follows the conditioning signal; classifier-free guidance achieves the same effect without a separate classifier. In InstructPix2Pix this is applied with two guidance scales, one for the input image and one for the text instruction, so you can trade off staying faithful to the original image against following the edit. Check out this W&B article on Classifier Guidance for more background.
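Roughly, the paper combines three noise predictions from the UNet, one fully unconditional, one conditioned only on the input image, and one conditioned on both the image and the instruction, using the two guidance scales. A minimal sketch of that combination (variable names are mine, not from the paper):

```python
# Sketch of two-scale classifier-free guidance for InstructPix2Pix.
# eps_* are UNet noise predictions under different conditionings.
def guided_noise(eps_uncond, eps_image, eps_full, image_scale, text_scale):
    # eps_uncond: no conditioning
    # eps_image:  conditioned on the input image only
    # eps_full:   conditioned on the input image and the text instruction
    return (
        eps_uncond
        + image_scale * (eps_image - eps_uncond)
        + text_scale * (eps_full - eps_image)
    )
```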
Image generation itself is nothing new; Stable Diffusion and DALL-E both generate images from text. InstructPix2Pix specializes in image editing rather than generation, and the authors state that their model is much faster than prior editing approaches because the entire edit happens in a single forward pass. The image below is from their paper and shows an example of their results!
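Since the checkpoint lives on the HuggingFace Hub, trying it out takes only a few lines with diffusers. This is a minimal sketch: the guidance values and image URL are illustrative, and the model id is assumed to be the released "timbrooks/instruct-pix2pix" checkpoint.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

# Load the InstructPix2Pix checkpoint from the HuggingFace Hub.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

# Any input image works; this URL is just a placeholder.
image = load_image("https://example.com/girl_riding_horse.png")

# guidance_scale steers how strongly the instruction is followed,
# image_guidance_scale how closely the result sticks to the input image.
edited = pipe(
    "have her ride a dragon",
    image=image,
    num_inference_steps=20,
    guidance_scale=7.5,
    image_guidance_scale=1.5,
).images[0]
edited.save("girl_riding_dragon.png")
```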

