
Report: SD Fine-tuning

This report describes the process and results of fine-tuning Stable Diffusion 1.5 on a 500-image subset of the Unsplash Lite dataset.
Created on August 7 | Last edited on August 8

Introduction

Components of Stable Diffusion

  • A text encoder that projects the input prompt to a latent space.
  • A variational autoencoder (VAE) that projects an input image to a latent space acting as an image vector space.
  • A diffusion model that refines a latent vector and produces another latent vector, conditioned on the encoded text prompt.
  • A decoder that generates images given a latent vector from the diffusion model.
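As a rough illustration only (assuming the Hugging Face diffusers library and the standard SD 1.5 checkpoint, neither of which this report prescribes), the four components above map onto a StableDiffusionPipeline roughly as follows:

```python
# Minimal sketch: the four Stable Diffusion components as exposed by
# diffusers' StableDiffusionPipeline (library choice is an assumption).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

text_encoder = pipe.text_encoder  # projects the tokenized prompt into an embedding (latent) space
vae = pipe.vae                    # vae.encode maps images into the latent image space
unet = pipe.unet                  # the diffusion model: denoises latents, conditioned on the text embedding
decode = pipe.vae.decode          # the decoder half of the VAE turns the final latent back into an image
```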

  • Used the EveryDream training scripts for this project. Please refer to the shared Google Colab notebook for more details.
  • I got a Colab Pro subscription and increased my Google Drive space for this project. In hindsight, the Pro subscription wasn't necessary.

Dataset

  • Used 500 images from the Unsplash Lite dataset (see the sampling sketch after this list).

  • Most of the images are single objects or landscape photos.
  • The images are in different aspect ratios.
  • Almost all of the objects in these images are concepts already represented in the SD 1.5 base model.
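A minimal sketch of how a 500-image sample could be pulled from the Unsplash Lite TSV export; the file name, column names, and the `?w=1024` resize query follow the public Lite release conventions and are assumptions, not the exact script used here:

```python
# Sketch: sample 500 photos from the Unsplash Lite metadata export (photos.tsv000).
import os
import pandas as pd
import requests

df = pd.read_csv("photos.tsv000", sep="\t")    # Lite dataset metadata export
sample = df.sample(n=500, random_state=42)     # fixed seed for reproducibility

os.makedirs("unsplash_500", exist_ok=True)
for row in sample.itertuples():
    url = f"{row.photo_image_url}?w=1024"      # request a smaller rendition
    data = requests.get(url, timeout=30).content
    with open(f"unsplash_500/{row.photo_id}.jpg", "wb") as f:
        f.write(data)
```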

Captioning

  • As per the given instructions, I used a BLIP model to generate the captions.
  • 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_caption_capfilt_large.pth' is the model checkpoint used.
  • Created a sample of 5 images to find the right captioning parameters.
  • Some examples are listed below: one set with the default parameters, and one with q_factor = 1.4 and min_words = 30.
  • Also appended 'unsplash photography' at the end of each caption.
  • Check out the data captioning notebook for more details; a rough sketch of an equivalent captioning setup follows this list.
  • Finally, compressed the images for faster processing.
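The actual captions came from the data captioning notebook with the checkpoint linked above; the sketch below shows a roughly equivalent setup via the Hugging Face BLIP port. The repo id, the use of min_length in place of min_words, and the beam settings are my assumptions, not the notebook's exact parameters:

```python
# Sketch: BLIP captioning plus the "unsplash photography" suffix used in this project.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

model_id = "Salesforce/blip-image-captioning-base"   # assumed HF equivalent of the linked .pth
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("unsplash_500/example.jpg").convert("RGB")  # hypothetical path
inputs = processor(images=image, return_tensors="pt")

# min_length loosely plays the role of min_words from the captioning notebook.
out = model.generate(**inputs, min_length=30, max_length=75, num_beams=3)
caption = processor.decode(out[0], skip_special_tokens=True)

caption = f"{caption}, unsplash photography"   # style tag appended to every caption
print(caption)
```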

Training

  • Used the EveryDream2trainer framework for fine-tuning the models.
  • It can resize images on the fly (bicubic), and the crop jitter feature handles the dataset's varying aspect ratios consistently.
  • Training was done in two stages: the model was first fine-tuned for 30 epochs, then for another 10 epochs.
  • The second stage was stopped partway through, as the model was clearly overfitting.
Training Parameters
  • Used a learning rate of 1e-6.
Validation Parameters
  • Used 10% of the dataset for validation, evaluated every 300 steps, which is roughly every 4-5 epochs (see the arithmetic sketch after this list).
  • It looks like the model could use a bit more training.
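As a quick sanity check on that cadence, the arithmetic below assumes a batch size of 6, which is not stated in the report and is purely an illustrative guess:

```python
# Back-of-the-envelope check: how many epochs 300 validation steps correspond to.
total_images = 500
val_fraction = 0.10
batch_size = 6                                         # assumption, not from the report

train_images = int(total_images * (1 - val_fraction))  # 450
steps_per_epoch = train_images // batch_size           # 75
epochs_per_validation = 300 / steps_per_epoch          # 4.0, consistent with "4-5 epochs"
print(train_images, steps_per_epoch, epochs_per_validation)
```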

Loss Graph


(W&B panel: loss curves for the 2 runs in the run set.)


Results

  • Evaluate the results of the fine-tuning experiment.


(W&B panel: run set of 2 runs.)

  • Compare the results to those of the base model.
Trained model: a bird standing on the edge of a body of water in the middle of a body of water
Base model: a bird standing on the edge of a body of water in the middle of a body of water
  • Please check the 100 generated samples at the given data location (`base` and `trained` are the image folders inside the main data folder). I manually went through the images to check model performance, looking specifically for burnt features or smoothed-out low-level details.
  • I would like to test these models a bit further, for example with an x/y/z plot over CFG scale and denoising steps. Overall, I think the fine-tuned model is trained quite well. (A sketch of a base-vs-trained generation setup follows this list.)
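A hedged sketch of such a comparison; the checkpoint filename, step count, CFG scale, and seed are placeholders rather than the settings actually used for the 100 samples, and loading the .ckpt via from_single_file assumes a recent diffusers version:

```python
# Sketch: generate the same prompt with the same seed from the base SD 1.5 model
# and from a fine-tuned EveryDream checkpoint, for side-by-side inspection.
import torch
from diffusers import StableDiffusionPipeline

prompt = ("a bird standing on the edge of a body of water "
          "in the middle of a body of water")

base = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
tuned = StableDiffusionPipeline.from_single_file(
    "epoch27.ckpt", torch_dtype=torch.float16           # hypothetical checkpoint path
).to("cuda")

for name, pipe in [("base", base), ("trained", tuned)]:
    generator = torch.Generator("cuda").manual_seed(42)  # same seed for a fair comparison
    image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5,
                 generator=generator).images[0]
    image.save(f"{name}_bird.png")
```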

For example, the checkpoint from epoch 27 gives results like the ones below.
a black and white dog laying on top of a grass covered field with trees in the background
  • Both of the other models got it a bit off.
This one is from epoch 30+4; it looks a bit burnt.
The base model generated a good image, but the contextual understanding of the prompt is missing.
  • I have not come across a metric that can help qualitatively compare the fine-tuned model with the base SD 1.5 model. FID and CLIP score would give a quantitative comparison; I have to investigate this a bit.
  • I have generated a set of 5 sample prompts to validate the model training itself. They are displayed in the previous section.
  • One way to build a metric: run an object detector on all of these images (for each of the trained model TM, the base model BM, and the original dataset OD), get latent embeddings for the detected crops, and cluster these embeddings at a class level (PCA can be used for dimensionality reduction). Then match the clusters of the base-model images (BM) and the original dataset (OD) against those of the trained model (TM), and compute the KL divergence between (BM, TM) and (OD, TM). This gives a metric of how well the model is trained with respect to both the base model and the original dataset. A rough sketch follows this list.
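In the sketch below, the detector, the CLIP crop embedder, the PCA dimensionality, the cluster count, and the folder layout are all placeholder choices, and shared KMeans clusters stand in for the class-level clustering described above:

```python
# Sketch: detect objects, embed the crops, cluster the embeddings, then compare
# cluster distributions with KL divergence between (BM, TM) and (OD, TM).
from glob import glob

import numpy as np
import torch
from PIL import Image
from scipy.stats import entropy
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor
from transformers import CLIPModel, CLIPProcessor

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def crop_embeddings(image_paths, score_thresh=0.7):
    """Detect objects in each image and return CLIP embeddings of the crops."""
    feats = []
    for path in image_paths:
        img = Image.open(path).convert("RGB")
        det = detector([to_tensor(img)])[0]
        for box, score in zip(det["boxes"], det["scores"]):
            if score < score_thresh:
                continue
            crop = img.crop(tuple(int(v) for v in box.tolist()))
            inp = clip_proc(images=crop, return_tensors="pt")
            feats.append(clip.get_image_features(**inp).squeeze(0).numpy())
    return np.stack(feats) if feats else np.zeros((0, 512))

def cluster_histogram(feats, pca, kmeans):
    """Assign crop embeddings to the shared clusters; return a normalised histogram."""
    labels = kmeans.predict(pca.transform(feats))
    hist = np.bincount(labels, minlength=kmeans.n_clusters).astype(float) + 1e-6
    return hist / hist.sum()

# Hypothetical folder layout: original dataset, base-model samples, trained-model samples.
od_paths = glob("data/original/*.jpg")
bm_paths = glob("data/base/*.jpg")
tm_paths = glob("data/trained/*.jpg")

od_f, bm_f, tm_f = (crop_embeddings(p) for p in (od_paths, bm_paths, tm_paths))

all_f = np.concatenate([od_f, bm_f, tm_f])
pca = PCA(n_components=32).fit(all_f)
kmeans = KMeans(n_clusters=20, n_init=10).fit(pca.transform(all_f))

p_od, p_bm, p_tm = (cluster_histogram(f, pca, kmeans) for f in (od_f, bm_f, tm_f))
print("KL(BM || TM):", entropy(p_bm, p_tm))   # lower = trained model closer to the base model
print("KL(OD || TM):", entropy(p_od, p_tm))   # lower = trained model closer to the original data
```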

Notes on improvement

  • Improving captions is the next step. Using weighted tags will help training. In the sampled dataset, most of the images fall into the categories of animals, humans, and landscapes; including these observations as part of the caption hierarchy will help.
  • Using a bigger base model.
  • Check relevant negative embeddings.
