Paella: Fast Text-Conditional Image Generation
In this article, we explore the paper "Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces" which introduces Paella, a novel text-to-image model.
Conditional text-to-image generation has seen countless recent improvements in quality, diversity, and fidelity. Nevertheless, most state-of-the-art models require numerous inference steps to produce faithful generations, resulting in performance bottlenecks for end-user applications. The question we want to answer today:
Is it possible to reduce the number of inference steps for text-conditional image generation models to make them more feasible for end-user applications?
This is the problem that the authors of Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces attempt to solve. In this paper, the authors introduce Paella, a novel text-to-image model that requires fewer than ten steps to sample high-fidelity images, using a speed-optimized architecture that allows a single image to be sampled in less than 500 ms with only 573M parameters.
The model operates on a compressed and quantized latent space. It is conditioned on CLIP embeddings and uses an improved sampling function over previous works. Aside from text-conditional image generation, the model is able to perform latent space interpolation and image manipulations such as inpainting, outpainting, and structural editing.
Text-conditional Image Generation Results
The authors have released all of their code and pretrained models at https://github.com/dome272/Paella.
Table of Contents
- A Deep Dive Into Paella
- Disadvantages of Existing Models
- What Paella Brings to the Table
- Downstream Applications of Paella
- Latent Interpolation
- Multi-Conditioning
- Inpainting and Outpainting
- Structural Morphing
- Image Variations
- Conclusion
A Deep Dive Into Paella
Recent research in text-to-image generation (such as Make-A-Scene, Imagen, and Stable Diffusion) has yielded stunning progress regarding the diversity, quality, and variation of generated images. However, this impressive output quality comes with a tradeoff: these models require many sampling steps, leading to slow inference speeds for end-user applications. Most such models are based on diffusion or transformer architectures, and both approaches have drawbacks. Let's take a look.
Disadvantages of Existing Models
- Transformers usually employ a spatial compression to a low-dimensional space before learning, which is necessary due to the self-attention mechanism growing quadratically with latent space dimensions.
- Transformers treat images as one-dimensional sequences by flattening the encoded image tokens, which is an unnatural projection of images and requires a significantly higher model complexity to learn the 2D structure of images.
- The auto-regressive nature of transformers requires the sampling of one token at a time, resulting in long sampling times and high computational costs.
- Diffusion models, unlike transformers, can effectively learn at the pixel level but by default require a tremendously high number of sampling steps as well.
What Paella Brings to the Table
The authors propose a novel technique for text-conditional image generation that is neither transformer nor diffusion-based but utilizes a fully convolutional neural network architecture. With the proposed model, we can sample images with as few as eight steps while still achieving high-fidelity results, making the model attractive for use cases that are limited by requirements on latency, memory, or computational complexity.
Paella operates on a quantized latent space and employs a VQGAN for the encoding and decoding process with a moderate compression rate. Theoretically, a much lower compression rate could be used, because the convolutional nature of the model means it is not constrained by typical transformer limitations such as quadratic memory growth. A low compression rate enables the preservation of fine details that are usually lost when working with higher compression. During training, images are quantized into tokens using the VQGAN, and the image tokens are randomly noised. The model is then tasked with reconstructing the original image tokens given the noised version and a conditional label.
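To make this objective concrete, here is a minimal sketch of what a single training step could look like. This is not the authors' code: vqgan_encode, model, and codebook_size are placeholders, and the exact noising schedule used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def training_step(model, images, clip_text_emb, vqgan_encode, codebook_size):
    # Encode images into a grid of discrete VQGAN token indices: (B, H, W)
    tokens = vqgan_encode(images)
    # Draw a random noise level per sample
    t = torch.rand(tokens.shape[0], device=tokens.device)
    # Randomly replace a t-fraction of tokens with random codebook indices
    noise_mask = torch.rand(tokens.shape, device=tokens.device) < t[:, None, None]
    random_tokens = torch.randint_like(tokens, codebook_size)
    noised_tokens = torch.where(noise_mask, random_tokens, tokens)
    # The convolutional token predictor outputs logits over the codebook
    # for every spatial position: (B, codebook_size, H, W)
    logits = model(noised_tokens, clip_text_emb, t)
    # The model is optimized to reconstruct the original (un-noised) tokens
    return F.cross_entropy(logits, tokens)
```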

A visual depiction of the overall architecture in the proposed method. The training of Paella operates on a compressed latent space. Latent images are noised, and the model is optimized to predict the denoised version of the image. Source: Figure 2 from the paper.
Sampling New Images
In Paella, sampling new images happens in an iterative fashion inspired by MaskGIT, but with a few significant changes.
- In order to give the model more flexibility, tokens are randomly noised in Paella instead of being masked. This gives the model the opportunity to refine its predictions for certain tokens over the course of sampling.
- The sampling process is also improved through classifier-free guidance, which is made possible by randomly training without conditioning, as illustrated in the sketch below.
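Below is a minimal sketch of how such an iterative sampling loop with classifier-free guidance could look. All names and values (model, the latent shape, the codebook size, the guidance scale) are illustrative placeholders rather than the exact code or hyperparameters from the paper.

```python
import torch

@torch.no_grad()
def sample(model, clip_emb, shape=(1, 32, 32), steps=8, codebook_size=8192, cfg_scale=4.0):
    # Start from a grid of completely random tokens
    tokens = torch.randint(codebook_size, shape)
    uncond = torch.zeros_like(clip_emb)  # "empty" embedding for the unconditional pass
    for i, t in enumerate(torch.linspace(1.0, 0.0, steps)):
        t_batch = torch.full((shape[0],), float(t))
        # Classifier-free guidance: combine conditional and unconditional predictions
        logits_cond = model(tokens, clip_emb, t_batch)
        logits_uncond = model(tokens, uncond, t_batch)
        logits = logits_uncond + cfg_scale * (logits_cond - logits_uncond)
        # Sample new tokens for every position from the predicted distribution
        probs = logits.softmax(dim=1).permute(0, 2, 3, 1).reshape(-1, codebook_size)
        tokens = torch.multinomial(probs, 1).reshape(shape)
        if i < steps - 1:
            # Instead of masking, re-noise a shrinking fraction of the tokens so the
            # model can still revise earlier predictions in later iterations
            renoise = torch.rand(shape) < t
            tokens = torch.where(renoise, torch.randint(codebook_size, shape), tokens)
    return tokens  # decode with the VQGAN decoder to obtain the final image
```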

The training and sampling mechanism for the token predictor. Source: Figure 3 from the paper.
Conditioning the Image Generation
- Paella enables text conditioning using Contrastive Language-Image Pretraining (CLIP) embeddings, similar to CLIP-GEN. However, instead of training solely on image embeddings and subsequently learning a prior network to map text embeddings to image embeddings, the model is trained only on text embeddings (see the sketch after this list). This decouples Paella from the dependency on an explicit prior (similar to High-Resolution Image Synthesis with Latent Diffusion Models) and reduces computational complexity.
- Because Paella is fully convolutional, it can in principle generate images at any size. This property can be used for outpainting images while still only requiring a single sampling pass. Besides outpainting, Paella can also perform text-guided inpainting.
- Variations of images can be generated by fine-tuning the model on image embeddings.
- The usage of CLIP embeddings enables us to do latent space interpolations.
- The flexibility of Paella also enables us to perform multi-conditioning and structural editing of base images.
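As a concrete illustration, CLIP text embeddings for conditioning could be obtained along these lines. This is a sketch using the OpenAI clip package with an arbitrary ViT-B/32 checkpoint; the CLIP variant that the released Paella checkpoints actually expect may differ.

```python
import torch
import clip  # OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)  # illustrative choice

with torch.no_grad():
    text_tokens = clip.tokenize(["a painting of a fishing boat at sunset"]).to(device)
    text_emb = clip_model.encode_text(text_tokens)             # (1, 512) text embedding
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)  # normalize the embedding

# `text_emb` is then passed to the token predictor, e.g. sample(model, text_emb)
```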

A visualization of the model architecture of the token predictor. Source: Figure 4 from the paper.
Downstream Applications of Paella
The design choices of Paella's architecture and its specific way of conditioning open up many possibilities for image synthesis tasks. Conditioning on CLIP enables us to generate variations of images, interpolate in the latent space, and perform structural edits between images. Moreover, since the model is fully convolutional, it naturally allows sampling at any resolution, as well as inpainting and outpainting of images. Last but not least, the conditioning procedure makes it possible to condition different tokens on different embeddings. We have already seen a few results for text-conditional image generation. Let's dive into some of the other downstream tasks in detail.
Latent Interpolation
Because the CLIP space is continuous, it is possible to interpolate between points and sample along their trajectory. This results in smooth transitions between the concepts and semantics of images. Many methods can be used to generate the trajectory, including simple linear interpolation (lerp) and spherical linear interpolation (slerp).
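Here is a minimal sketch of both interpolation schemes between two CLIP embeddings; emb_a, emb_b, model, and the sample function are placeholders carried over from the earlier sketches.

```python
import torch

def lerp(a, b, t):
    """Plain linear interpolation between two CLIP embeddings."""
    return (1 - t) * a + t * b

def slerp(a, b, t, eps=1e-7):
    """Spherical linear interpolation along the hypersphere of normalized embeddings."""
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = b / b.norm(dim=-1, keepdim=True)
    omega = torch.acos((a_n * b_n).sum(-1, keepdim=True).clamp(-1 + eps, 1 - eps))
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

# Sample along the trajectory between the embeddings of two prompts:
# for t in torch.linspace(0, 1, 8):
#     tokens = sample(model, slerp(emb_a, emb_b, float(t)))
```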
Results for Latent Interpolation using lerp
Results for Latent Interpolation using spherical-lerp
Storing Latent Interpolation Results as Weights & Biases Tables
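A minimal sketch of how the interpolation outputs could be logged as a Weights & Biases Table follows. It is not the report's original code: the project name and the prompt_a, prompt_b, ts, and images variables are placeholders assumed to come from the interpolation loop sketched above.

```python
import wandb

run = wandb.init(project="paella-exploration", job_type="latent-interpolation")

# One row per interpolation step: the two prompts, the interpolation factor, and the image
table = wandb.Table(columns=["prompt_a", "prompt_b", "t", "image"])
for t, image in zip(ts, images):
    table.add_data(prompt_a, prompt_b, float(t), wandb.Image(image))

run.log({"latent_interpolation": table})
run.finish()
```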
Multi-Conditioning
Paella is conditioned through modulated layer normalization: the normalized activations are subsequently modulated (scaled and shifted) using a projection of the CLIP embedding. The processed embedding must have the same spatial dimensionality as the latent image. Conditioning on a single embedding is achieved by expanding the projection to the same latent size as the image. However, we can also choose different embeddings for different parts of the image; in theory, every token can be conditioned on an embedding distinct from all the others. This freedom of conditioning lets us generate interesting story-telling images, as sketched below.
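The sketch below illustrates the idea, assuming the conditioning is passed to the model as a per-position embedding map; the names and shapes are illustrative, not the paper's exact interface.

```python
import torch

def build_conditioning_map(emb_left, emb_right, height=32, width=32):
    """Condition the left and right halves of the latent grid on two different
    CLIP embeddings instead of expanding a single embedding everywhere."""
    emb_dim = emb_left.shape[-1]
    cond = torch.empty(1, emb_dim, height, width)
    cond[..., : width // 2] = emb_left.reshape(1, emb_dim, 1, 1)
    cond[..., width // 2 :] = emb_right.reshape(1, emb_dim, 1, 1)
    return cond  # fed to the modulated layer normalization in place of one embedding
```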
Results for Multi-Conditioning Paella
Storing Multi-Conditioning Results as Weights & Biases Tables
Inpainting and Outpainting
The convolutional nature of Paella allows us to generate latent images at any size. Unlike transformer models, we are not restricted to a context window and hence do not need to use shifting context windows to sample larger images. This property can be used for outpainting images, which refers to extending an existing image in any direction and filling in the semantically correct context. The steps involved in outpainting are as follows:
- An existing image is encoded to the token space using the VQGAN encoder; the image's height and width are reduced to the smaller spatial resolution of the latent image.
- The latent image is then extended in the desired directions (e.g., widened horizontally). The newly added tokens are initialized randomly.
- Sampling proceeds as usual, with the only difference being that after each iteration the original image tokens are reset to their initial values and only the newly added tokens are kept.
In addition to semantically extending an image, we can also perform inpainting, i.e., replacing and filling in existing content based on a textual prompt. The overall procedure is as follows:
- A mask is defined in the latent space for tokens that should be resampled.
- Then we start sampling and, after each iteration, only keep the newly sampled tokens at the inpainted positions, as shown in the sketch below.
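The following minimal sketch covers both procedures, reusing the placeholder model and vqgan_encode names from the earlier sketches. The key difference from plain sampling is that a subset of tokens is reset to its original values after every iteration.

```python
import torch

def sample_with_reset(model, tokens, clip_emb, original, resample_mask, steps=8):
    """Same loop as plain sampling, but positions outside `resample_mask`
    are reset to the original tokens after every iteration."""
    for t in torch.linspace(1.0, 0.0, steps):
        t_batch = torch.full((tokens.shape[0],), float(t))
        logits = model(tokens, clip_emb, t_batch)           # (B, codebook, H, W)
        tokens = logits.argmax(dim=1)                       # greedy choice for brevity
        tokens = torch.where(resample_mask, tokens, original)
    return tokens

# Outpainting: append random tokens to the right and resample only the new columns.
# tokens = vqgan_encode(image)                              # (1, H, W)
# new_cols = torch.randint(codebook_size, (1, tokens.shape[1], 16))
# extended = torch.cat([tokens, new_cols], dim=-1)
# mask = torch.zeros_like(extended, dtype=torch.bool)
# mask[..., tokens.shape[2]:] = True
# result = sample_with_reset(model, extended, clip_emb, extended.clone(), mask)

# Inpainting: define `mask` over the latent region to be replaced instead.
```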
Results for Inpainting and Outpainting
Storing Inpainting and Outpainting Results as Weights & Biases Tables
Structural Morphing
Structural morphing is the act of changing the semantic content of an existing image while keeping its overall structure the same. This is achieved by encoding the base image to its latent representation and then noising a certain fraction of the tokens (e.g., 75%). This representation is fed as the initial input to the sampling process together with the conditional embedding. Another important change is to use a later starting point for the time embedding, to indicate that the image is not fully noised.
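A minimal sketch of this procedure, again with placeholder names and an assumed schedule, could look as follows.

```python
import torch

def structural_morph(model, vqgan_encode, image, clip_emb,
                     noise_frac=0.75, steps=8, codebook_size=8192):
    """Noise a fraction of the base image's tokens and start the sampling
    schedule from an intermediate timestep instead of pure noise."""
    tokens = vqgan_encode(image)                                    # (1, H, W)
    noise_mask = torch.rand(tokens.shape) < noise_frac              # ~75% of positions
    tokens = torch.where(noise_mask, torch.randint_like(tokens, codebook_size), tokens)
    # Later starting point for the time embedding signals a partially noised input
    for t in torch.linspace(noise_frac, 0.0, steps):
        t_batch = torch.full((tokens.shape[0],), float(t))
        logits = model(tokens, clip_emb, t_batch)
        tokens = logits.argmax(dim=1)
        renoise = torch.rand(tokens.shape) < float(t)
        tokens = torch.where(renoise, torch.randint_like(tokens, codebook_size), tokens)
    return tokens
```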
Results for Structural Morphing
Storing Structural Morphing Results as Weights & Biases Tables
Image Variations
Since Paella was trained solely on text embeddings, conditioning the model on CLIP image embeddings does not work well out of the box and results in simple repeating patterns of the conditioning image. To still use Paella for this task, the model can be fine-tuned for 40k steps on image embeddings extracted from the visual encoder of CLIP. After fine-tuning, we can extract the image embedding from a given image and feed it to Paella for sampling. The resulting outputs are semantically very similar to the original image, but naturally show some differences in position, alignment, scale, etc.
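Sampling variations after fine-tuning could then look roughly like the sketch below, again using the OpenAI clip package as an illustrative stand-in; source_image (a PIL image), finetuned_model, and the sample function from the earlier sketch are assumptions.

```python
import torch
import clip  # illustrative; the CLIP variant used for the released checkpoints may differ

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    # Extract the image embedding from the visual encoder of CLIP
    image_emb = clip_model.encode_image(preprocess(source_image).unsqueeze(0).to(device))
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

# Feed the image embedding to the fine-tuned model exactly like a text embedding:
# variation_tokens = sample(finetuned_model, image_emb)
```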
Results for Image Variations
Storing Image Variations Results as Weights & Biases Tables
Conclusion
In this article, we explored Paella, a fast text-conditional image generation model proposed in the paper Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces. The main contributions of the paper are:
- A novel training objective for text-to-image generation based on discrete (de)noising in a quantized vector space, using a parameter-efficient, fully convolutional network.
- A simplified and improved sampling scheme over previous work, which is capable of sampling high-quality images using a small number of steps.
- An exploration of the capabilities of Paella for downstream applications like image variations, inpainting & outpainting, latent space interpolation, multi-conditioning, and structural editing.
We also dived into the pseudocode behind the downstream applications in detail along with their results. Then, we discussed how these results can be stored and reproduced using Weights & Biases. If you want to learn more, we recommend checking out the following reports.
Unconditional Image Generation Using HuggingFace Diffusers
In this article, we explore how to train unconditional image generation models using HuggingFace Diffusers, and we track these experiments and compare the results using Weights & Biases.
How To Train a Conditional Diffusion Model From Scratch
In this article, we look at how to train a conditional diffusion model and find out what you can learn by doing so, using W&B to log and track our experiments.
Improving Generative Images with Instructions: Prompt-to-Prompt Image Editing with Cross Attention Control
A primer on text-driven image editing for large-scale text-based image synthesis models like Stable Diffusion & Imagen
Stable Diffusion Settings and Storing Your Images
In this article, we explore the impact of different settings used for the Stable Diffusion model and how you can store your generated images for quick reference.
A Technical Guide to Diffusion Models for Audio Generation
Diffusion models are jumping from images to audio. Here's a look at their history, their architecture, and how they're being applied to this new domain