
DeepFloydAI: A New Breakthrough in Text-Guided Image Generation

In this article, we explore DeepFloydAI, an AI Research Band that is working with StabilityAI to make AI open again.
Text-to-image generation models have taken the world by storm in recent times. Never before have machine learning models enjoyed such mass appeal!
The fact that you can type in some text and get a strikingly realistic rendition of exactly what you described feels almost magical. This gives text-to-image generation models immense potential: not only can they be picked up and used by anyone without an artistic background, but they also give artists an immensely powerful paintbrush for creating breathtakingly complex art without much effort.
Several text-guided image-generation models are currently popular, including Stable Diffusion, DALL-E 2, Imagen, Parti, and Muse.
Here are some examples of images generated by IF, a new open-source text-to-image model from DeepFloydAI:


Examples of images generated by contemporary text-to-image generation models available publicly

However, if we observe carefully, we notice that none of these models can generate accurate renditions of text from the input prompt. This is a major problem holding back the adoption of such models for commercial use cases like generating advertisement posters, designing brand logos, and so on.
If you wish to know how you can log your Stable Diffusion generations into Weights & Biases Tables for ease of reproducibility and visualization, you can check out these reports:
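For instance, a minimal sketch of what that logging could look like (the checkpoint and project names below are just placeholders) is:

```python
# Hedged sketch: generate images with Stable Diffusion (via diffusers) and log
# prompt/image pairs to a Weights & Biases Table for later inspection.
import torch
import wandb
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompts = ["a watercolor painting of a lighthouse", "a robot reading a newspaper"]

run = wandb.init(project="stable-diffusion-generations")  # hypothetical project name
table = wandb.Table(columns=["prompt", "image"])
for prompt in prompts:
    image = pipe(prompt).images[0]
    table.add_data(prompt, wandb.Image(image))
wandb.log({"generations": table})
run.finish()
```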


Where's the Parti Tonight?

Google introduced Parti (Pathways Autoregressive Text-to-Image) in the paper Scaling Autoregressive Models for Content-Rich Text-to-Image Generation in June 2022, and it aims to solve this issue. Parti achieves high-fidelity, photorealistic image generation and supports content-rich synthesis involving complex compositions and world knowledge, covering everything from text and maps to abstract concepts like "infinity".
Parti treats text-to-image generation as a sequence-to-sequence modeling problem, analogous to machine translation – this allows it to benefit from advances in large language models, especially capabilities that are unlocked by scaling data and model sizes. In this case, the target outputs are sequences of image tokens instead of text tokens in another language.
Parti uses a powerful image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens, and takes advantage of the tokenizer's ability to reconstruct such token sequences as high-quality, visually diverse images.
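To make the idea concrete, here is a minimal, purely illustrative sketch of autoregressive decoding over discrete image tokens; the vocabulary sizes, toy model, and greedy decoding are simplifying assumptions, not Parti's actual implementation:

```python
# Minimal sketch (not Parti's code) of the seq2seq idea: text tokens go into an
# encoder, and the decoder autoregressively predicts discrete image tokens that
# a VQ-style tokenizer could decode back into pixels.
import torch
import torch.nn as nn

VOCAB_TEXT, VOCAB_IMAGE, SEQ_LEN = 1000, 8192, 256  # hypothetical sizes

model = nn.Transformer(d_model=512, num_encoder_layers=2, num_decoder_layers=2, batch_first=True)
text_emb = nn.Embedding(VOCAB_TEXT, 512)
image_emb = nn.Embedding(VOCAB_IMAGE, 512)
to_logits = nn.Linear(512, VOCAB_IMAGE)

text_tokens = torch.randint(0, VOCAB_TEXT, (1, 64))   # stand-in for a tokenized prompt
image_tokens = torch.zeros(1, 1, dtype=torch.long)    # start-of-image token

with torch.no_grad():
    for _ in range(SEQ_LEN):
        out = model(text_emb(text_tokens), image_emb(image_tokens))
        next_token = to_logits(out[:, -1]).argmax(-1, keepdim=True)  # greedy decoding for brevity
        image_tokens = torch.cat([image_tokens, next_token], dim=1)

# `image_tokens` would then be handed to a ViT-VQGAN decoder to reconstruct pixels.
```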



Here are some examples of images generated by Parti (results were collected from parti.research.google)



More A-MUSE-ment...

Google recently introduced another model, Muse: Text-To-Image Generation via Masked Generative Transformers, which aims to tackle the same issue.
Muse is trained on a masked modeling task in discrete token space:
given the text embedding extracted from a pre-trained large language model (LLM), predict randomly masked image tokens.
Compared to pixel-space diffusion models such as Imagen and DALL-E 2, Muse is significantly more efficient because it works with discrete tokens and requires fewer sampling iterations. Muse is also more efficient than autoregressive models like Parti thanks to its use of parallel decoding.
The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality, etc.
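Here is a minimal, purely illustrative sketch of the masked image-token modeling idea; the sizes, masking ratio, and toy transformer are assumptions for demonstration, not Muse's actual code:

```python
# Sketch of masked image-token modeling: mask a random subset of discrete image
# tokens and train a transformer to predict them, conditioned on a text
# embedding from a frozen LLM.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, SEQ_LEN, DIM = 8192, 8192, 256, 512  # hypothetical sizes

embed = nn.Embedding(VOCAB + 1, DIM)                  # +1 for the [MASK] token
backbone = nn.TransformerEncoder(nn.TransformerEncoderLayer(DIM, 8, batch_first=True), 4)
head = nn.Linear(DIM, VOCAB)

image_tokens = torch.randint(0, VOCAB, (1, SEQ_LEN))  # stand-in for VQ tokens of an image
text_embedding = torch.randn(1, 1, DIM)               # stand-in for a frozen-LLM prompt embedding

mask = torch.rand(1, SEQ_LEN) < 0.5                   # mask ~50% of the tokens
inputs = image_tokens.masked_fill(mask, MASK_ID)

hidden = backbone(torch.cat([text_embedding, embed(inputs)], dim=1))[:, 1:]
logits = head(hidden)

# Loss is computed only on masked positions; at inference, all masked tokens
# can be predicted in parallel and re-masked over a few refinement steps.
loss = F.cross_entropy(logits[mask], image_tokens[mask])
```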



Here are some images generated by Muse (results were collected from muse-model.github.io)


However, like Imagen, another large-scale text-to-image generation model developed by Google, neither Parti nor Muse has been open-sourced or made available to the public. This means that, right now, we don't have a publicly available model that can create not only accurate renditions of images from prompts but also accurate text within them.

IF I Were A Swan...

DeepFloydAI, a company that works with StabilityAI (the creators of Stable Diffusion), has recently announced a new text-to-image generation model called IF. As we can see from their announcement on Twitter, not only does the model produce coherent, accurate renditions of text, but the model is also going to be open source!




Images generated by DeepFloydIF containing accurate renditions of text


A Deep Dive Into DeepFloydAI's Journey

Ever since OpenAI introduced GLIDE in late 2021, it has been clear that diffusion-based models hold huge potential for high-fidelity image generation.
Architectures such as Imagen, DALL-E 2, and Stable Diffusion have since proved that diffusion-based models could achieve zero-shot text-to-image generation with high fidelity. Let us look at how a text-guided diffusion model for image generation works at a high level.

A Gentle Introduction to Text-Guided Diffusion

A text-guided image generation model such as Imagen or Stable Diffusion is made up of a couple of basic components. These typically include a text encoder which encodes the text to produce token embeddings and a diffusion-based image generator which generates the image conditioned on these embeddings.


The text encoder in this case is a Transformer-based model used as the language understanding component of the architecture. Different architectures might use different text encoders. For example, Imagen uses T5, a generic large language model trained on text-only corpora. In contrast, Stable Diffusion uses CLIPText, the text encoder of a multi-modal vision and language model.
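As a concrete illustration, extracting token embeddings with CLIP's text encoder (the checkpoint shown is the one commonly used with Stable Diffusion; an Imagen-style setup would swap in a T5 encoder) might look like this:

```python
# Hedged sketch of the text-encoder stage: turn a prompt into per-token
# embeddings that the diffusion UNet will be conditioned on.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a corgi wearing a red bowtie, studio lighting"
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")

with torch.no_grad():
    token_embeddings = text_encoder(**tokens).last_hidden_state  # shape (1, 77, 768)
```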

The Image Generator

The Image Generator consists of two main components (a minimal sketch of how they fit together follows this list):
  1. The Diffusion model iteratively generates a low-level latent representation of the image from a random latent tensor. This usually refers to a UNet model coupled with a denoising diffusion scheduler.
  2. The Image Decoder creates the final image from the latent representation generated by the Diffusion model. This is usually a combination of a variational autoencoder that generates a low-resolution image and a super-resolution model that enhances it.
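Here is a hedged sketch of how these two components fit together, using the diffusers library and a Stable Diffusion checkpoint purely as an example (classifier-free guidance is omitted for brevity):

```python
# Sketch of the image generator: a UNet + scheduler loop denoises random
# latents, then the VAE decoder turns the final latents into pixels.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler

repo = "runwayml/stable-diffusion-v1-5"  # example checkpoint
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
scheduler = DDIMScheduler.from_pretrained(repo, subfolder="scheduler")

scheduler.set_timesteps(50)
latents = torch.randn(1, 4, 64, 64)        # random starting latent
text_embeddings = torch.randn(1, 77, 768)  # stand-in for real prompt embeddings

with torch.no_grad():
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    image = vae.decode(latents / vae.config.scaling_factor).sample  # (1, 3, 512, 512)
```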


Note that unCLIP, which is the underlying architecture of DALL-E 2, is a little different from Imagen and Stable Diffusion. Both Imagen and Stable Diffusion make use of frozen, pre-trained text encoders instead of training that component from scratch the way unCLIP does, which makes them more modular compared to unCLIP.
💡


What IF Does For the Text Encoder

The researchers at DeepFloydAI initially drew inspiration from eDiff-I, a text-to-image diffusion model from NVIDIA that uses a combination of T5 and CLIPText for text encoding. However, it turns out that this approach, despite being computationally more expensive, doesn't yield significantly better results in terms of Fréchet Inception Distance (FID), CLIP score, or qualitative analysis.
The researchers also experimented with the CLIPText and CLIPImage encoders and note that this approach has a huge advantage: since CLIP is a vision-language model (unlike T5, which is trained on purely text-based corpora), it can act as an additional knowledge base for the model.
For example, if the dataset that IF is trained on doesn't contain the works of a particular painter, the model can still generate in that style if CLIP has already been trained on it. However, this approach is not very good for achieving strong FID and CLIP scores.
In the end, the researchers use a purely text-based model for the text encoder, one that is larger than Imagen's text encoder.
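As an illustration, extracting conditioning embeddings from a large text-only encoder might look like the sketch below; the T5 checkpoint shown is just an example and not necessarily the exact encoder IF uses:

```python
# Hedged sketch: per-token embeddings from a large text-only T5 encoder,
# used as conditioning for the diffusion UNet.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xl")
encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xl")

tokens = tokenizer('a neon sign that reads "DEEP FLOYD"', return_tensors="pt")
with torch.no_grad():
    text_embeddings = encoder(**tokens).last_hidden_state
```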
If you want to know more about the FID score, you can check out the following report:
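For quick reference, FID compares the Gaussian statistics of Inception-network features computed over real and generated images. A minimal sketch of the computation, assuming the features have already been extracted, looks like this:

```python
# Hedged sketch: FID between two sets of pooled Inception-v3 activations,
# assumed to be (N, 2048) numpy arrays.
import numpy as np
from scipy import linalg

def frechet_inception_distance(real_feats, fake_feats):
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_f = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(sigma_r @ sigma_f)
    if np.iscomplexobj(covmean):  # numerical noise can create tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(sigma_r + sigma_f - 2 * covmean))
```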


What IF Does For the Image Generator

The researchers implement a specific variant of the UNet model for diffusion inspired by that of Imagen, where the image is progressively upscaled after the text-conditional diffusion stage using diffusion-based super-resolution models. The researchers note that this kind of image generation architecture is not only faster to converge but also cheaper in terms of video memory footprint.



Note that IF's image generator architecture is not exactly the same as Imagen's in spite of all the similarities; for example, the DeepFloydAI researchers use a lot more cross-attention, which improves IF's results significantly.
💡


Progressive Generation Using Pixel-Cascading

As noted earlier, the researchers from DeepFloydAI follow the pixel-cascading approach of progressively upscaling the generated image to higher resolutions (64x64 -> 256x256 -> 1024x1024).
Similar to Imagen, noise conditioning augmentation is used between these upscaling stages, and the researchers note that this is crucial to achieving good FID and CLIP scores because it cannot be guaranteed that, during the upscaling stages, the generated images will come from the same domain as the training data. Hence, Gaussian noise or blur is applied to the low-resolution images to corrupt them.
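A minimal sketch of what noise conditioning augmentation can look like in general (this is an assumption about the technique, not IF's exact implementation):

```python
# Sketch: before a super-resolution stage, corrupt the low-resolution
# conditioning image with Gaussian noise and pass the noise level to the model
# as an extra conditioning signal.
import torch

def noise_condition(low_res, aug_level):
    """Corrupt a (B, C, H, W) low-res image with noise of strength aug_level in [0, 1]."""
    noise = torch.randn_like(low_res)
    corrupted = (1 - aug_level) ** 0.5 * low_res + aug_level ** 0.5 * noise
    return corrupted, aug_level  # aug_level is fed to the SR model alongside the image

low_res = torch.rand(1, 3, 64, 64)
corrupted, level = noise_condition(low_res, aug_level=0.25)  # hypothetical strength
# `corrupted` (upsampled to 256x256) and `level` would condition the 64->256 SR stage.
```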

Size of the UNet

The authors of Imagen propose that the fidelity of generated images depends less on the number of parameters of the UNet than on the size of the text-encoder model. However, the researchers from DeepFloydAI found during their experiments that increasing the size of the UNet to 2x that of Imagen was necessary to achieve better results than both Imagen and Parti.


Some more goodness from DeepFloydAI



How was IF trained and evaluated?

The researchers tracked all kinds of metrics during the training and evaluation of IF using Weights & Biases, which enabled them to make a lot of crucial decisions during the development of the model and their various experiments. Let us look at how the researchers evaluated their model in a way that let them make crucial architectural decisions and achieve better FID and CLIP scores than SoTA models like Imagen and Parti.
As we discussed earlier, FID and CLIP scores are two objective metrics for evaluating text-guided image generation models. However, it's not really feasible to compute them very frequently because doing so is computationally expensive and time-consuming (a single FID evaluation might take as long as a day!) at the scale of the datasets used to train IF. This led the researchers from DeepFloydAI to investigate alternate metrics that could be computed cheaply and still give some insight into training.
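For reference, the CLIP score measures how well a generated image matches its prompt via the cosine similarity of CLIP's image and text embeddings. A hedged sketch using a standard CLIP checkpoint:

```python
# Sketch of a CLIP score: cosine similarity between CLIP embeddings of a
# generated image and its prompt (higher means better text-image alignment).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated.png")  # hypothetical generated image
inputs = processor(text=["a panda holding a sign that says hello"],
                   images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    clip_score = (image_emb * text_emb).sum().item()
```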
Note that evaluating diffusion models on the fly is actually easier than evaluating GANs, because the diffusion loss actually goes down over training. However, evaluating diffusion models comes with its own set of challenges. Whether you're performing continuous or discrete diffusion, several timesteps need to be selected at which inverse diffusion is performed.
Depending on the training step, the loss can be quite different even between nearby steps. Even with a large batch size of 3,072 (which is 50% more than that of Imagen), it's difficult to draw real insights due to the stochasticity of the process.
The researchers solve this problem by selecting specific timesteps on the evaluation dataset and trying to predict the next step (or the previous step, in the case of inverse diffusion). This way, they can calculate a less stochastic form of the evaluation loss that is both more insightful and computationally cheaper than the raw evaluation metrics.
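The sketch below illustrates the general idea of such a distilled, fixed-timestep evaluation loss; the timesteps, model handles, and W&B project name are hypothetical, not DeepFloydAI's actual setup:

```python
# Sketch: evaluate the denoising loss at a small, fixed set of timesteps so the
# number is comparable across checkpoints, then log it to Weights & Biases.
# `unet` and `scheduler` are assumed to follow the earlier diffusers sketch.
import torch
import torch.nn.functional as F
import wandb

EVAL_TIMESTEPS = [50, 250, 500, 750, 950]  # hypothetical fixed timesteps

def fixed_timestep_eval_loss(unet, scheduler, latents, text_embeddings):
    losses = {}
    for t in EVAL_TIMESTEPS:
        noise = torch.randn_like(latents)
        t_batch = torch.full((latents.shape[0],), t, dtype=torch.long)
        noisy = scheduler.add_noise(latents, noise, t_batch)
        with torch.no_grad():
            pred = unet(noisy, t_batch, encoder_hidden_states=text_embeddings).sample
        losses[f"eval/loss_t{t}"] = F.mse_loss(pred, noise).item()
    return losses

# wandb.init(project="if-evaluation")  # hypothetical project name
# wandb.log(fixed_timestep_eval_loss(unet, scheduler, eval_latents, eval_text_embeddings))
```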
The authors used Weights & Biases for tracking and visualizing all of their experiments, which helped them make a lot of crucial decisions by observing the behavior of both the training loss and the distilled evaluation losses.

Weights & Biases Posters generated using DeepFloydIF


Why Does IF Render Text Better?

As we discussed earlier, the researchers use a large pre-trained language model as the text encoder in IF's architecture. This is one of the features that makes IF's renditions of text in generated images much better than those of similar models. The large LLM not only gives IF a better understanding of text in the generated images but also of the cardinality and spatial composition of objects with different textures and materials.
Like models such as Imagen and Stable Diffusion, IF is trained on the LAION datasets, but the researchers also used the CLEVR dataset, a synthetically generated diagnostic dataset for compositional language and elementary visual reasoning, which played a big role in the model's understanding of the cardinality and composition of different objects.
In order to know more about how the researchers from DeepFloydAI collected and processed their data for training IF, you can check out


Conclusion

  • In this article, we discussed primarily IF, a state-of-the-art text-to-image generation model from DeepFloydAI, an AI research company working with StabilityAI, with the goal of making AI open again!
  • We also discuss contemporary text-to-image generation models such as Imagen, Stable Diffusion, Parti, Muse, and DALL-E 2, and how these existing works influenced the researchers from DeepFloydAI in the development of IF.
  • We briefly discuss some of the details of IF's architecture, how the researchers drew inspiration from recent breakthroughs in diffusion models, and how, after lots of experimentation, they arrived at this particular architecture.
  • We also discuss how the researchers from DeepFloydAI utilized Weights & Biases for experiment tracking and visualizations, which enabled them to make a lot of crucial decisions during the development of the model and various experiments.
  • We discuss how these architectural discoveries, along with data collection and processing choices such as the use of CLEVR, enable IF to achieve state-of-the-art results with respect to both quantitative and qualitative metrics.
  • We also explore results generated by different text-to-image models (including IF) by logging them into Weights & Biases tables.
In order to learn more about image generation techniques, you can check out more such reports on our blog Fully Connected. 

Iterate on AI agents and models faster. Try Weights & Biases today.