
Dreambooth fluffalpaca-llama

Quick training report for the Hugging Face DreamBooth fine-tuning hackathon. Concept: fluffalpaca (a fluffy alpaca). Theme: Animal
Created on January 23|Last edited on January 27
DreamBooth is a technique for teaching new concepts to Stable Diffusion using a specialized form of fine-tuning: the model learns a new word from a set of corresponding images. To help the model understand this new 'concept' (here "fluffalpaca"), you also give it the class of the concept (here "llama").
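The idea can be made concrete with a minimal sketch of DreamBooth's two-term training objective, not taken from this report: the usual denoising loss on the instance ("fluffalpaca") images, plus a weighted "prior preservation" loss on generic class ("llama") images. Names and shapes below are illustrative assumptions.

```python
import numpy as np

def dreambooth_loss(pred_instance, target_instance,
                    pred_class, target_class, prior_loss_weight=1.0):
    """Two-term DreamBooth objective (sketch): MSE denoising loss on the
    instance images, plus a weighted MSE 'prior preservation' loss on class
    images, which keeps the model from forgetting what a generic 'llama'
    looks like while it learns 'fluffalpaca'."""
    instance_loss = np.mean((pred_instance - target_instance) ** 2)
    prior_loss = np.mean((pred_class - target_class) ** 2)
    return instance_loss + prior_loss_weight * prior_loss

# Toy arrays standing in for predicted / target noise latents.
rng = np.random.default_rng(0)
shape = (1, 4, 8, 8)
loss = dreambooth_loss(rng.normal(size=shape), rng.normal(size=shape),
                       rng.normal(size=shape), rng.normal(size=shape))
```

Without the prior term (`prior_loss_weight=0`), training degenerates into plain fine-tuning on the instance images, which is exactly the setting compared in the "Without Prior Preservation" runs below.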




Intro (heavily inspired by the HF DreamBooth report)

Experiments

Settings

  • Selected pretrained models:
    • prompthero/openjourney
    • runwayml/stable-diffusion-v1-5
    • stabilityai/stable-diffusion-2 (best results)
  • Dataset :
    • the first dataset used was a set of 13 images of alpacas
      • it was not diverse enough to produce good results
    • CCMat/db-aplaca (final)
      • len = 22
  • AdamW optimizer
  • No fine-tuning of the text encoder (due to computational constraints)
  • kept all hyperparameters equal across runs, except the learning rate, training steps and gradient accumulation steps
  • fixed hyperparameters include:
    • lr_scheduler : constant
    • resolution : 512 / 768 (for stable diffusion 2)
    • train_batch_size : 1
    • using 8bit optimizer from bitsandbytes
  • Learning rates tested : 2e-6, 1e-6, 9e-7
  • Class for prior preservation : llama
  • Class dataset for prior preservation : CCMat/llama
    • len = 52
    • I created my own class dataset of llamas because the pretrained models were generating bad images of llamas when doing prior preservation
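The settings above map fairly directly onto the flags of the diffusers `train_dreambooth.py` script. A hedged sketch of the best-performing configuration follows; the flag names come from diffusers, but the exact prompt strings and the use of dataset IDs in place of local image directories are assumptions on my part, not confirmed by the report.

```python
# Sketch of the winning configuration, expressed as the flags of the
# diffusers `train_dreambooth.py` script. Values reflect the settings
# listed above; prompt strings and data paths are illustrative.
config = {
    "pretrained_model_name_or_path": "stabilityai/stable-diffusion-2",
    "instance_data_dir": "CCMat/db-aplaca",    # 22 instance images (dataset ID, not a local dir)
    "instance_prompt": "a photo of fluffalpaca llama",
    "with_prior_preservation": True,
    "class_data_dir": "CCMat/llama",           # 52 class images
    "class_prompt": "a photo of llama",
    "resolution": 768,                         # 512 for the SD 1.x runs
    "train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-6,
    "lr_scheduler": "constant",
    "max_train_steps": 1100,
    "use_8bit_adam": True,                     # bitsandbytes 8-bit optimizer
    "train_text_encoder": False,
}
```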


Without Prior Preservation

  • prompt : "A photo of fluffllama llama"
  • first dataset of 13 images of alpacas (it wasn't varied enough)

Mid Learning Rate 2e-6


[W&B panel — run set: 2 runs]

  • the models have difficulty generating the faces of the alpacas
  • the bodies are a bit distorted / body proportions are off


Low Learning Rates 1e-6 & 9e-7


[W&B panel — run set: 3 runs]

  • the models have difficulty generating the faces of the alpacas
  • decreasing the learning rate to 9e-7 doesn't produce better results


Findings

  • without prior preservation it is hard to generate the faces of the alpacas correctly
  • increasing gradient accumulation steps to 2 doesn't seem to improve image quality
  • decreasing the learning rate to 1e-6 seems to give the best results (at least it's easier to find the 'sweet spot', especially for the faces)


With Prior Preservation (better results)

  • better results especially for the faces of the alpacas
  • prompt : "A photo of fluffalpaca llama"
  • dataset : CCMat/db-aplaca
  • class_dataset : CCMat/llama
  • gradient accumulation steps kept at 1
  • the pretrained models don't generate good images of alpacas or llamas, so I created my own class dataset of llamas for prior preservation

Mid Learning Rate 2e-6


[W&B panel — run set: 3 runs]



[W&B panel — run set: 1 run]

  • Difficult to generate good images of alpacas without overfitting


Low Learning Rate 1e-6


[W&B panel — run set: 1 run]

(ignore the step numbers in the image table above)


[W&B panel — run set: 1 run]



[W&B panel — run set: 1 run]

  • a prior loss of 0.6 seems to give the best results
  • the models seem to overfit after step 1300
  • stabilityai/stable-diffusion-2 produces better images of 'alpacas' than runwayml/stable-diffusion-v1-5
Best model :
- stabilityai/stable-diffusion-2 : steps 1034, 1056, 1078 and 1100


Findings

  • The pretrained stabilityai/stable-diffusion-2 generates the best images for my concept
  • A low learning rate of 1e-6 gives better results than a learning rate of 2e-6
  • With prior preservation, our models generate better faces for the subject
  • For the pretrained stabilityai/stable-diffusion-2, the best settings are the following :
    • learning rate : 1e-6
    • step : 1034 - 1056 - 1078 - 1100
    • prior loss : 0.6
    • class : llama
    • class_dataset : CCMat/llama
      • len : 52


Comparison of best results

  • Note : while training this last run, I added 8 images generated by Stable Diffusion to the class_dataset for prior preservation (I generated 16 and handpicked 8)

[W&B panel — run set: 1 run]




  • At step 1012 the model looks promising, although the subject's faces are still a bit distorted
=> best models :
- stabilityai/stable-diffusion-2 : step 1100
- stabilityai/stable-diffusion-2 : step 1078



Summary (heavily inspired by the HF DreamBooth report)

To get good images that incorporate the concept well without degrading other objects, it's important to:
  • Tune the learning rate and training steps for your dataset.
    • High learning rates and too many training steps will lead to overfitting (in other words, the model can only generate images from your training data, no matter the prompt).
    • Low learning rates and too few steps will lead to underfitting: the model cannot generate the trained concept.
  • 1e-6 with ~1100 steps seems to work well for the faces of our subject
  • If the model has difficulty generating faces without overfitting => use prior preservation
  • The image quality degrades quite a lot if the model overfits and this happens if:
    1. The learning rate is too high
    2. We run too many training steps
    3. In the case of faces, when no prior preservation is used
  • If the image quality is still degraded even after these changes:
    • Try different schedulers
    • Use more inference steps
  • A diverse dataset is important for fine-tuning Stable Diffusion with DreamBooth (especially to generate a concept that belongs to a class Stable Diffusion has difficulty generating).
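The last two remedies (trying a different scheduler, using more inference steps) can be sketched with the diffusers API. This is a minimal illustration, not the exact inference setup used in the report: the local `./fluffalpaca-sd2` weights directory is hypothetical, and the scheduler and step count are just one reasonable choice.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

def generate(prompt: str, model_dir: str = "./fluffalpaca-sd2"):
    # Load the fine-tuned DreamBooth weights (hypothetical local path).
    pipe = StableDiffusionPipeline.from_pretrained(
        model_dir, torch_dtype=torch.float16
    )
    # Swap in a different scheduler; this can help when image quality
    # is degraded with the default one.
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config
    )
    pipe = pipe.to("cuda")
    # More inference steps generally trade speed for quality.
    return pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]

if __name__ == "__main__":
    image = generate("A photo of fluffalpaca llama")
    image.save("fluffalpaca.png")
```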