A Gentle Introduction to DeepFloydAI's New Diffusion Model IF
In this article, we dive into DeepFloydAI's IF, a new diffusion model that can write cogently instead of devolving into nonce words like most existing models do.
On April 28, Stability AI and DeepFloyd released DeepFloyd IF, a state-of-the-art text-to-image diffusion model. This piece is meant as a less technical introduction; you can find its more technical companion piece here:
What is DeepFloydAI’s IF?
DeepFloydAI falls under the umbrella of Stability AI, which most readers will know for the Stable Diffusion text-to-image model we've covered a few times in this space, as well as for Harmonai, its lab focused on generative models for high-fidelity audio. DeepFloyd has just released IF, a new text-to-image model that's notable for a few important reasons.
The first thing worth knowing? IF is open source, whereas recent successes in generative modeling, like DALL-E 2 and Imagen, notably are not. This alone is laudable.
Past that, what's incredibly cool is that IF can spell. Really. Most existing diffusion models are capable of producing high-quality images, but they're pretty bad at rendering readable text. The letters and fonts often look fine, but the words read like they're from Lewis Carroll's Jabberwocky. Here's DALL-E 2 trying to make an album cover for "Night Panther":

Not ideal.
DeepFloyd’s output here is markedly better:
Part of the reason IF can write cogently instead of devolving into nonce words is that there's a powerful language model inside the architecture. There are other, more technical details practitioners are likely curious about, but we cover those in our companion piece.
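If you want just a taste before reading the companion piece, here's a minimal, illustrative sketch of what "a language model inside the architecture" means in practice: the prompt is embedded with a frozen T5-style text encoder, and those embeddings are what the diffusion stages condition on. This is not DeepFloyd's actual code, and the small checkpoint name below is only a stand-in.

```python
# Illustrative sketch, not DeepFloyd's code: embed a prompt with a frozen
# T5 text encoder, the kind of language model IF conditions on.
import torch
from transformers import T5Tokenizer, T5EncoderModel

model_name = "t5-small"  # stand-in checkpoint; IF uses a much larger T5
tokenizer = T5Tokenizer.from_pretrained(model_name)
encoder = T5EncoderModel.from_pretrained(model_name).eval()

prompt = 'an album cover for the band "Night Panther"'
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    text_embeddings = encoder(**tokens).last_hidden_state  # (1, seq_len, hidden_dim)

# A text-to-image diffusion model cross-attends to embeddings like these
# while denoising, which is how the prompt (including its exact spelling)
# steers the generated image.
```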
What Else Makes IF Different?
Beyond the fact that it's open source and wouldn't fail the first round of an elementary school spelling bee, IF boasts better performance on nuances that other generative models typically struggle with.
Here, we're talking about spatial awareness and composition. If you prompt some diffusion models with specific instructions about which objects sit in front of which others, or what materials they're made of, they often struggle. This is especially true for complex prompts with multiple objects described by multiple adjectives: those details often get mixed up or ignored altogether.
The researchers who trained IF used less style data than some other generative models, so if you’re looking to create an anime Abe Lincoln, you’ll want to look elsewhere.
Lastly, a lot of attention was paid when training IF to make it safe. Generative models have significant potential to create harmful or explicit content. The researchers here walked us through the laudable steps they took to remove racy or violent imagery from their training data. This is never a bad idea, and doubly so for open-source efforts.
Speaking of data, part of what makes IF different comes down to how it was trained:
How Was IF Trained?
The researchers at DeepFloyd were really clever and thoughtful in how they trained their model. We want to talk about the two underlying datasets they used and what they removed from those datasets to help make this model so powerful.
First up: LAION. LAION is a massive dataset of more than 5 billion image-text pairs. But like any massive dataset, it isn't perfect. The researchers used aesthetic scoring to grade images and removed the ones with lower scores (read: the ugly ones). They also found that some images in LAION appear thousands of times with little to no meaningful variation, things like logos or stock t-shirt models. Images like those can be effectively memorized by a model, leading to generations that are functionally identical to the training data. And as we mentioned in the prior section, they removed a whole host of explicit images to dramatically decrease the model's likelihood of toxic or unsavory generations. All told, this filtering reduced the training dataset from 5 billion images to roughly 1 billion.
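To make that concrete, here's a minimal sketch of what this kind of curation can look like. It is not DeepFloyd's actual pipeline, and the file name, column names, and threshold below are all assumptions, but the three steps mirror what's described above: keep high-aesthetic images, collapse near-duplicates, and drop anything flagged as explicit.

```python
# Illustrative sketch of dataset curation, not DeepFloyd's pipeline.
# Assumes a metadata table with a precomputed aesthetic score, a
# perceptual hash, and an NSFW flag per image (all hypothetical columns).
import pandas as pd

df = pd.read_parquet("laion_metadata.parquet")  # hypothetical file

# Keep only images above an aesthetic-score threshold ("remove the ugly ones").
AESTHETIC_THRESHOLD = 5.0  # assumed value, not from the IF release
df = df[df["aesthetic_score"] >= AESTHETIC_THRESHOLD]

# Collapse near-duplicates (logos, stock photos repeated thousands of times)
# by keeping one row per perceptual hash.
df = df.drop_duplicates(subset="phash", keep="first")

# Drop anything flagged as explicit.
df = df[~df["nsfw_flag"]]

print(f"{len(df):,} images remain after filtering")
```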
The second dataset worth mentioning here is CLEVR. CLEVR is much smaller and very different from LAION. It contains exclusively images like the one below:

Why is this important? Images like these help with the spatial and compositional nuances we talked about previously. There's a ton of information packed in here, from the material of the shapes to how each is oriented relative to the others and to our vantage point as an observer. Understanding that the shiny sphere is furthest away from us, that there's a blue, matte cube in the foreground, or that two cylinders on opposite sides are made of different materials really helps the model learn to follow prompts with specific spatial and material details.
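Part of what makes CLEVR handy is that it ships machine-readable scene annotations alongside its images, listing every object's size, color, material, and shape. The sketch below shows one plausible way to turn those annotations into training captions; we don't cover DeepFloyd's exact recipe in this piece, so treat it as illustrative.

```python
# Illustrative only: build a caption from CLEVR's scene annotations.
import json

with open("CLEVR_train_scenes.json") as f:  # file name from the CLEVR release
    scenes = json.load(f)["scenes"]

def caption(scene):
    parts = [
        f"a {obj['size']} {obj['color']} {obj['material']} {obj['shape']}"
        for obj in scene["objects"]
    ]
    return "A scene containing " + ", ".join(parts) + "."

print(caption(scenes[0]))
# e.g. "A scene containing a large gray metal sphere, a small blue rubber cube."
```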
That careful dataset recipe is a big part of the reason IF performs better on the exact tasks we mentioned in our introduction.
What Else You Should Know About IF
If you'd like more technical info, check out our companion piece about the training regimen, architecture, and more.
You'll also find links there to the team's various social accounts, their community Discord, and their GitHub codebase.