aMUSing Images with WandB & Diffusers
See how you can get started with comparing and fine-tuning the aMUSEd model from Hugging Face Diffusers
Created on February 10|Last edited on May 15
Introduction
Text-to-image generation is all the rage in the current generative AI landscape, with new models coming out every week. In this environment, diffusion models stand out as a staple thanks to their high-quality generations and the wide range of use cases they can handle. That also makes it an interesting research challenge to develop models that depart from the traditional diffusion approach.
The MUSE Model
The paper Muse: Text-To-Image Generation via Masked Generative Transformers was introduced in 2023 to demonstrate how Masked Generative Transformers can perform image generation, super-resolution, and mask-free editing at the same time!
The model's novelty lies in its use of masked modelling over discrete tokens, which means far fewer sampling iterations are required. It was able to perform on par with state-of-the-art image generation models at a significantly reduced inference time.

The text embeddings are generated using a frozen T5-XXL model from Google, while a Base Transformer and a SuperRes Transformer trained by the team for this task predict the masked tokens.
Note, however, that the model was never open-sourced or released directly. Taking inspiration from the original idea, a joint effort between Stability AI and Hugging Face created an open-source equivalent named aMUSEd.
Text-to-image generation with aMUSEd
The aMUSEd model takes inspiration from MUSE and is likewise trained on the masked-image-modelling task. However, it replaces the T5-XXL text encoder with the CLIP-L/14 text model and uses a single U-ViT model instead of separate Transformers for base masked-token prediction and super-resolution.
At inference time, the model unmasks more tokens at each step, refining its predictions as more of the image becomes available. Finally, the token latents are passed into the VQ-GAN decoder to produce the final generated image.
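aMUSEd ships in Diffusers as a regular pipeline, so trying it out takes only a few lines. Below is a minimal sketch assuming the publicly released amused/amused-256 checkpoint; the prompt and sampling settings are placeholders rather than the exact ones used in this report.

```python
import torch
from diffusers import AmusedPipeline

# Load the 256x256 aMUSEd checkpoint (a 512px variant is also available).
pipe = AmusedPipeline.from_pretrained(
    "amused/amused-256",
    variant="fp16",
    torch_dtype=torch.float16,
).to("cuda")

# CLIP-L/14 encodes the prompt, the U-ViT iteratively unmasks image tokens,
# and the VQ-GAN decoder turns the final tokens into pixels.
image = pipe(
    "a cozy cabin in a snowy forest at dusk, warm light in the windows",
    num_inference_steps=12,
    guidance_scale=10.0,
).images[0]
image.save("amused_sample.png")
```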

Salient Features and Differences Compared to MUSE
- Uses CLIP-L/14 instead of T5-XXL
- Uses U-ViT instead of the Transformer
- Gets rid of the Super-resolution Transformer
Interesting Experiments with aMUSEd
We can do quite a few things with aMUSEd. Let's explore a few of these options here and see how it fares against other similar models.
Conditional Image Generation
This task makes use of a single detailed prompt that is passed to the model to generate an image as described. For our experiments, we create a small dataset of interesting prompts to generate complex compositions of visual objects together in one setting.
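Here is a rough sketch of how such a prompt set can be run while logging every generation to Weights & Biases via the Diffusers autologger. The prompts, project name, and sampling settings below are illustrative assumptions, not the exact configuration behind the panels that follow.

```python
import torch
import wandb
from diffusers import AmusedPipeline
from wandb.integration.diffusers import autolog

# A couple of hypothetical prompts standing in for the small prompt dataset.
prompts = [
    "a watercolor painting of a fox reading a newspaper on a park bench",
    "an astronaut playing chess with a robot on the surface of Mars",
]

# autolog() patches Diffusers pipeline calls so that prompts, settings,
# and generated images are logged to a W&B run automatically.
autolog(init=dict(project="amused-experiments"))

pipe = AmusedPipeline.from_pretrained(
    "amused/amused-256", variant="fp16", torch_dtype=torch.float16
).to("cuda")

for prompt in prompts:
    pipe(prompt, num_inference_steps=12, guidance_scale=10.0)

# In a notebook, finish the run explicitly so the next experiment
# logs to its own fresh run.
wandb.finish()
```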
As a single model
Enough talk; let's look at the images.
Run set: 1
If we look at the images carefully, we can see that the model follows the larger context of the prompt and tries to generate coherent images even from abstract descriptions, as in samples 5 and 28. However, when given highly specific prompts, it is unable to model the information well, as seen in samples 8 and 29. One could also say that its human generations are not particularly good, although it handles abstract concepts well.
In comparison with Stable Diffusion-XL
SDXL is generally considered a strong baseline for text-to-image generation, so it gives us a quick sense of what typical performance on the given prompts should look like.
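For reference, this is roughly how the SDXL baseline can be run on the same prompts with Diffusers; the checkpoint ID and sampling settings are assumptions rather than the exact configuration behind these runs.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# The standard SDXL base checkpoint serves as our strong baseline.
sdxl = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")

image = sdxl(
    "an astronaut playing chess with a robot on the surface of Mars",
    num_inference_steps=30,
).images[0]
```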
Run set: 1
In comparison with Latent Consistency Models
Latent Consistency Models were introduced by a team from Tsinghua University and are built around modeling the reverse diffusion process as an augmented probability flow ODE. This allows the model to predict the solution of the ODE directly in latent space, delivering results in far fewer steps than the general diffusion process.
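The sketch below shows how an LCM can be run through Diffusers with only a handful of denoising steps. The checkpoint name (SimianLuo/LCM_Dreamshaper_v7) is an assumption, since the report does not name the exact LCM used.

```python
import torch
from diffusers import DiffusionPipeline

# A commonly used Latent Consistency Model checkpoint from the Hub.
lcm = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7",
    torch_dtype=torch.float16,
).to("cuda")

# LCMs converge in very few steps since they predict the PF-ODE solution directly.
image = lcm(
    "a cozy cabin in a snowy forest at dusk, warm light in the windows",
    num_inference_steps=4,
    guidance_scale=8.0,
).images[0]
```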
Let's compare aMUSEd to this and see how we fare.
Run set: 1
We can see that the LCM delivered far more consistent results on abstract scenery, but the aMUSEd model did generally better at modeling human faces. LCMs also seem to generate smoother images with more muted colors compared to the aMUSEd model.
In comparison with SSD-1B
The Segmind Stable Diffusion-1B (SSD-1B) model is heavily distilled from the original Stable Diffusion XL model. It is 50% smaller and up to 60% faster than the original, yet yields results close to those of SDXL. This was achieved through mixed-precision training and progressive distillation of the U-Net and other blocks, pruning out layers that contribute little to the output.
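Because SSD-1B is distilled from SDXL, it loads through the same pipeline class. The snippet below is a sketch assuming the segmind/SSD-1B checkpoint on the Hub, with a placeholder prompt and step count.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# SSD-1B keeps the SDXL interface while being roughly 50% smaller.
ssd = StableDiffusionXLPipeline.from_pretrained(
    "segmind/SSD-1B",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")

image = ssd(
    "a watercolor painting of a fox reading a newspaper on a park bench",
    num_inference_steps=30,
).images[0]
```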
Run set: 1
We can see that the aMUSEd model does better when it comes to composing different objects around each other in different positions (like the helmet on the person's head), but SSD-1B has far crisper image quality along with hyper-realistic generations that follow the text prompts very well.
Fine-tuning with some Pokemon
Now, let's see how we can make the aMUSEd model better at generating a specific type of object through full fine-tuning. For our experiments, let's see if we can get it to create new Pokemon based on our prompts.
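Before kicking off training, it helps to sanity-check the data and log a few samples to W&B. The sketch below assumes the widely used lambdalabs/pokemon-blip-captions dataset and a hypothetical project name; the report only refers to "the Pokemon Dataset", so the exact source is an assumption.

```python
import wandb
from datasets import load_dataset

# BLIP-captioned Pokemon images with "image" and "text" columns.
dataset = load_dataset("lambdalabs/pokemon-blip-captions", split="train")

run = wandb.init(project="amused-pokemon-finetune", job_type="data-preview")
table = wandb.Table(columns=["image", "caption"])
for sample in dataset.select(range(8)):
    table.add_data(wandb.Image(sample["image"]), sample["text"])
run.log({"pokemon_preview": table})
wandb.finish()
```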
With no changes
We see no major improvement in performance; if anything, the model objectively does worse at generating the Pokemon figures now. Random abstract artifacts appear in the images, making the generations look awry. This is corroborated by the training results, which show no significant drop in loss.
Run set: 1
With Cosine Learning-Rate Scheduler
We now try the cosine LR scheduler to see if varying the learning rate brings any benefits. The generated images show significantly fewer of those artifacts, which signals an improvement, but they are not completely gone yet. We can also see that the model adheres fairly closely to the original instructions in the prompt.
Run set: 1
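For reference, here is a minimal sketch of how a cosine schedule with warmup can be wired up using the utilities in diffusers.optimization. The model, optimizer, and step counts below are toy placeholders, not the exact hyperparameters of this run.

```python
import torch
from diffusers.optimization import get_cosine_schedule_with_warmup

# Toy stand-in for the U-ViT being fine-tuned; only the scheduler wiring matters here.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Linear warmup followed by cosine decay over the full training horizon.
lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=100,
    num_training_steps=2000,
)

for step in range(2000):
    loss = model(torch.randn(4, 16)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    lr_scheduler.step()  # move the learning rate along the cosine curve
    optimizer.zero_grad()
```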
With a longer training duration
Lastly, we check whether a stable fine-tuning regime run for a longer duration yields any benefits. The images now have even fewer artifacts than the previous attempts, signaling that training for more steps can objectively improve the model's generation capabilities. This run also reached the lowest step_loss value.
Run set: 1
Comparing all runs together
If we take a comparative look across all runs, we can see that the training loss trends similarly despite the differences in hyperparameters between training regimes. This suggests the model settles into a similar learning curve regardless of regime, and further optimization may be necessary to push past it.
Run set: 3
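These metrics can also be pulled programmatically through the W&B API, for example to compare the final step_loss of each fine-tuning run. The project path and metric key below are assumptions; substitute the ones from your own workspace.

```python
import wandb

api = wandb.Api()
# Fetch every run in the (hypothetical) fine-tuning project.
runs = api.runs("my-entity/amused-pokemon-finetune")

for run in runs:
    # history() returns a pandas DataFrame of the logged metrics.
    history = run.history(keys=["train/step_loss"])
    if not history.empty:
        print(run.name, "final step_loss:", history["train/step_loss"].iloc[-1])
```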
Conclusion
We did quite a few things in this short and crisp report:
- Took a quick look at the aMUSEd model, an open-source reproduction of MUSE available from Hugging Face Diffusers.
- Compared it with currently available open models such as SDXL, SSD-1B, and LCMs.
- Identified where the model excels and where it fails at replicating and following the prompts, and where the other models do better.
- Fine-tuned it on the Pokemon dataset with three different regimes to see which yields interesting results.