Dreambooth fgreeneruins-ruins

Quick training report for the Hugging Face DreamBooth fine-tuning hackathon. Concept: fgreeneruins (forest ruins, greenery). Theme: Landscape.
DreamBooth is a technique for teaching new concepts to Stable Diffusion using a specialized form of fine-tuning: the model learns a new word from a set of corresponding images. To help the model understand this new 'concept' (here "fgreeneruins"), you also give it the class of the concept (here "ruins").


Intro (from the HF DreamBooth report)

  • With DreamBooth, Stable Diffusion overfits quickly. It's important to find the right learning rate (LR) and number of training steps for your dataset. Training with a higher learning rate for fewer steps and training with a lower learning rate for more steps give very similar results, so we have to find the 'sweet spot' number of training steps for a given learning rate to get reasonable images.
  • If the generated images are noisy or the quality is degraded, the model is likely overfitting. First, try the steps above to avoid it. If the generated images are still noisy, use the DDIM scheduler or run more inference steps (~100 worked well in our experiments); see the sketch after this list.
  • From the HF DreamBooth experiments: using EMA doesn't seem to make a difference.
  • Getting good results when training DreamBooth requires a lot of tweaking.
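
A minimal sketch of the two mitigations above, swapping in the DDIM scheduler and raising the number of inference steps. The checkpoint path is a placeholder for a fine-tuned pipeline:

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Placeholder path: a DreamBooth checkpoint saved as a full pipeline.
pipe = StableDiffusionPipeline.from_pretrained(
    "./fgreeneruins-model", torch_dtype=torch.float16
).to("cuda")

# Swap in the DDIM scheduler, reusing the existing scheduler's config.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# Run more inference steps (~100 worked well in the HF experiments).
image = pipe("a photo of fgreeneruins ruins", num_inference_steps=100).images[0]
image.save("sample.png")
```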


Experiments

Settings

  • Dataset : CCMat/db-forest-ruins
    • 17 training images
  • Selected pretrained models:
    • prompthero/openjourney
    • nitrosocke/elden-ring-diffusion
  • AdamW optimizer
  • No prior preservation is used.
  • The text encoder is not fine-tuned (due to computational constraints)
  • All hyperparameters were kept equal across runs, except the LR, training steps, and gradient accumulation steps
  • fixed hyperparameters include:
    • lr_scheduler : constant
    • resolution : 512
    • train_batch_size : 1
    • using 8bit optimizer from bitsandbytes
  • Learning rates tested: 2e-6, 1e-6
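
As a concrete reference, this is roughly how the settings above map onto the flags of the diffusers train_dreambooth.py example script. This is a sketch, not the exact command used; the data path and output directory are placeholders:

```python
import subprocess

# Assumes a local copy of the CCMat/db-forest-ruins images and the
# diffusers DreamBooth example script. No prior-preservation or
# text-encoder flags are passed, matching the settings above.
subprocess.run([
    "accelerate", "launch", "train_dreambooth.py",
    "--pretrained_model_name_or_path", "prompthero/openjourney",
    "--instance_data_dir", "./db-forest-ruins",   # 17 training images (placeholder path)
    "--instance_prompt", "a photo of fgreeneruins ruins",
    "--resolution", "512",
    "--train_batch_size", "1",
    "--gradient_accumulation_steps", "1",         # 2 was also tried
    "--learning_rate", "2e-6",                    # 1e-6 also tested
    "--lr_scheduler", "constant",
    "--max_train_steps", "500",
    "--use_8bit_adam",                            # bitsandbytes 8-bit optimizer
    "--output_dir", "./fgreeneruins-model",       # placeholder
], check=True)
```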


Learning Rate 2e-6

  • Images start to get noisy/degraded around step 500 -> the model is overfitting
  • The images don't really assimilate the concept before step 300
  • Decided to log samples every 17 steps (one pass over our training set) between steps 300 and 442

[Run set panel: 4 runs]

  • Tried different prompts:
    • "a photo of fforuins ruins"
    • "a photo of fgreeneruins ruins" -> seems to create "greener" images -> better
=> Most promising steps: 340, 357, and 374 for both pretrained models
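
To keep the prompt comparison fair, it helps to re-seed the generator before each prompt so that only the prompt changes between images. A minimal sketch (checkpoint path and seed are placeholders):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "./fgreeneruins-model",  # placeholder checkpoint path
    torch_dtype=torch.float16,
).to("cuda")

prompts = ["a photo of fforuins ruins", "a photo of fgreeneruins ruins"]
for prompt in prompts:
    # Re-seed before each prompt so the initial latents are identical.
    generator = torch.Generator(device="cuda").manual_seed(42)
    image = pipe(prompt, generator=generator).images[0]
    image.save(prompt.replace(" ", "_") + ".png")
```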


Learning Rate 1e-6

  • Images start to get noisy/degraded after step 800 -> the model is overfitting
  • The images don't really assimilate the concept before step 400

[Run set panel: 3 runs]

  • Tried different prompts:
    • "a photo of fforuins ruins"
    • "a photo of fgreeneruins ruins" -> seems to create "greener" images -> better
  • Increasing gradient accumulation steps to 2 seems to give less clear images and overfit more quickly
    • it seems harder to find the right hyperparameter settings
=> Most promising steps: between 700 and 800
After comparing the outputs of the models at these steps:
  • prompthero/openjourney : step 782
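
A sketch of this kind of checkpoint sweep, assuming samples were again logged every 17 steps and each logged step was saved as a full pipeline under a step-numbered directory (hypothetical layout):

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical layout: one full pipeline saved per logged training step.
for step in range(714, 800, 17):  # multiples of 17 between 700 and 800
    pipe = StableDiffusionPipeline.from_pretrained(
        f"./fgreeneruins-model/step-{step}",  # placeholder checkpoint path
        torch_dtype=torch.float16,
    ).to("cuda")
    # Fixed seed so differences come from the checkpoint, not the latents.
    generator = torch.Generator(device="cuda").manual_seed(42)
    image = pipe("a photo of fgreeneruins ruins", generator=generator).images[0]
    image.save(f"step_{step}.png")
```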


Comparison of best results

[Image panels: side-by-side outputs of the best checkpoints from both pretrained models]

Best Results

  • 2e-6 :
    • openjourney - step 357
    • elden-ring - step 340
  • 1e-6 : openjourney - step 782



Summary

To get good images that incorporate the concept well without degrading other objects, it's important to:
  • Tune the learning rate and training steps for your dataset.
    • High learning rates and too many training steps will lead to overfitting (in other words, the model can only generate images from your training data, no matter the prompt).
    • Low learning rates and too few steps will lead to underfitting: the model cannot generate the trained concept.
  • The image quality degrades quite a lot if the model overfits and this happens if:
    1. The learning rate is too high
    2. We run too many training steps
  • Increasing gradient accumulation steps doesn't seem to improve the quality of the images