Stability AI Releases Deep Floyd IF
Stability AI's newest innovation
Created on April 28|Last edited on April 30
Stability AI has just released a text-to-image model called Deep Floyd IF, which takes inspiration from Imagen.
Imagen, described last year by the Google Brain team, emphasized using a more advanced language model trained solely on text to improve image generation.
Stability AI, known for reconstructing and improving on closed-source AI technology, has stepped up to the plate and implemented the research for the public, along with a few improvements.
Based On Imagen
Imagen is a text-to-image synthesis model that combines the strengths of pretrained text encoders, such as the T5 LLM, and robust cascaded diffusion models to generate photorealistic images based on text descriptions.
By leveraging the deep language understanding of large language models and the high-fidelity image generation of diffusion models, Imagen achieves an unprecedented degree of photorealism and image-text alignment. It improves over models like Stable Diffusion by taking a modular approach that combines a powerful text encoder with a pipeline of base and super-resolution diffusion models. In practice, a base diffusion model first generates a very small image, and the super-resolution diffusion models later in the pipeline scale it up in stages. This combination enables Imagen to generate images with better coherence and detail.
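The cascade described above can be sketched as a toy data flow. This is only an illustration, not the Imagen or Deep Floyd IF implementation: the function names (`base_stage`, `super_resolution_stage`, `run_cascade`) are hypothetical, the 64 to 256 to 1024 pixel schedule is an assumption used for the demo, and the learned, text-conditioned denoising is replaced by placeholders so the example stays self-contained.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]  # toy grayscale image; a real pipeline would use tensors


def base_stage(text_embedding: Sequence[float], size: int) -> Image:
    """Stand-in for the base diffusion model: returns a small size x size image.

    A real base model would run an iterative denoising loop conditioned on the
    text embedding; here the image is just filled with the embedding's mean.
    """
    fill = sum(text_embedding) / len(text_embedding)
    return [[fill for _ in range(size)] for _ in range(size)]


def super_resolution_stage(
    image: Image, text_embedding: Sequence[float], factor: int
) -> Image:
    """Stand-in for a super-resolution diffusion model.

    Nearest-neighbour upsampling replaces the learned denoising a real stage
    would perform; `text_embedding` is accepted only to show that each stage
    is also conditioned on the text, and is unused in this placeholder.
    """
    upsampled: Image = []
    for row in image:
        wide_row = [pixel for pixel in row for _ in range(factor)]
        upsampled.extend(list(wide_row) for _ in range(factor))
    return upsampled


def run_cascade(
    text_embedding: Sequence[float], base_size: int, factors: Sequence[int]
) -> Tuple[Image, List[int]]:
    """Run the base stage, then each super-resolution stage in order."""
    image = base_stage(text_embedding, base_size)
    sizes = [len(image)]
    for factor in factors:
        image = super_resolution_stage(image, text_embedding, factor)
        sizes.append(len(image))
    return image, sizes


if __name__ == "__main__":
    embedding = [0.1, 0.2, 0.3]  # hypothetical output of a frozen text encoder
    _, sizes = run_cascade(embedding, base_size=64, factors=[4, 4])
    print(sizes)  # [64, 256, 1024]
```

The point of the modular design is visible even in the toy version: each stage only has to solve a smaller problem (a coarse image, or one upscaling step), and all of them receive the same text embedding.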

Enhanced Performance on Text
Deep Floyd IF excels at generating images that require a more advanced understanding of language. Typically, image generation models use text encoders trained on image-text pair datasets. Deep Floyd IF and Imagen instead use text encoders trained solely on text data, which makes the resulting embedding vectors more expressive.
This improvement results in images with higher levels of detail, particularly when rendering text within the image. Overall, the IF-4.3B model achieves a state-of-the-art zero-shot FID score of 6.66, outperforming both Imagen and eDiff-I, a diffusion model with expert denoisers.
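For context, FID (Fréchet Inception Distance) compares the statistics of Inception-network features extracted from real and generated images, and lower scores are better. The announcement does not spell out the metric; the standard definition, given here as background, is:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Here $\mu_r, \Sigma_r$ are the mean and covariance of the features of real images, and $\mu_g, \Sigma_g$ are those of generated images. "Zero-shot" means the model was not trained on the dataset used for evaluation.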
Tags: ML News