
Bghira's Search for Reliable Multi-Subject Training

Problem: characters in multi-subject generations tend to meld together. Proposal: explore captioning and prompting strategies after an exploration of hyper-parameters.

Background

This training run was executed after extensive testing of:
- PEFT LoRA at ranks 16 through 128, covering the qkv and feedforward projections of the single and joint DiT blocks
- LoKr sized from 50M to 150M parameters, at depths ranging from just the Attention/FeedForward modules to covering the entire model (see the sketch after this list)
- Optimisers such as AdamWBF16 and StableAdamW, with tricks such as Kahan summation and ScheduleFree learning
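To make the LoKr axis concrete, here is a minimal sketch of how a LyCORIS LoKr of varying depth can be attached to a model with the lycoris-lora package. The toy FeedForward module and the factor value are illustrative, not the exact configuration from these tests; the target_module preset is what controls how much of the model the adapter covers.

```python
import torch.nn as nn
from lycoris import create_lycoris, LycorisNetwork

class FeedForward(nn.Module):
    """Toy stand-in for a DiT feedforward block."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.proj(x)

model = nn.Sequential(FeedForward(), FeedForward())

# Restrict the adapter to modules matched by class name. Narrowing this
# list to Attention/FeedForward is the shallow end; widening it is what
# "covering the entire model" means.
LycorisNetwork.apply_preset({"target_module": ["FeedForward"]})

lycoris_net = create_lycoris(
    model,
    1.0,               # multiplier
    linear_dim=10000,  # "full dimension" trick commonly used with LoKr
    linear_alpha=1,
    algo="lokr",
    factor=16,         # smaller factor -> more trainable parameters
)
lycoris_net.apply_to()
print(sum(p.numel() for p in lycoris_net.parameters()))
```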



Parameters

The selected parameters for this run ended up being:
  • Lion optimiser (OPTIMIZER=optimi-lion)
  • 5e-6 learning rate
  • Effective batch size of 6 across 3x 4090 GPUs
    • batch size of 2 per GPU
  • Base model parameters in int8-quanto precision
  • Trained LoKr parameters in pure bf16 precision, relying on Kahan summation to avoid rounding bias (see the optimiser sketch below)

With this, the training runs at ~6.27 seconds per step, which is slower than running on just 1x 4090 due to P2P overhead.
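On the optimiser side, OPTIMIZER=optimi-lion appears to correspond to the Lion implementation from the torch-optimi package, which supports Kahan summation for low-precision parameters. A minimal sketch of the pure-bf16 update described above, with an illustrative stand-in for the LoKr parameters (whether this matches SimpleTuner's internal wiring exactly is an assumption):

```python
import torch
import torch.nn as nn
import optimi  # pip install torch-optimi

# Stand-in for the trained LoKr parameters, held in pure bf16.
adapter = nn.Linear(64, 64).to(dtype=torch.bfloat16)

# Kahan summation compensates for bf16 rounding error in the weight
# update, avoiding the rounding bias mentioned above.
optimizer = optimi.Lion(adapter.parameters(), lr=5e-6, kahan_sum=True)

x = torch.randn(2, 64, dtype=torch.bfloat16)
loss = adapter(x).float().pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```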



Dataset

Before collecting the dataset, run several experiments on the base model to determine its knowledge of each subject:
  • Subjects that are already well-known will require fewer training samples
  • Subjects that are heavily baked-in will likely require diverse training samples showing them interacting with their environment in various ways
  • If a subject is "kinda there", it will need fewer training samples
You want more images of whichever subject in your multi-subject concept the model knows the least.
For my model, 22 images were hand-selected from a Google search for cheech & chong.
While saving each image, I typed the caption as the filename, which is a supported caption strategy for SimpleTuner.
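In other words, the caption is simply the file's stem. A minimal sketch of the idea (the path below is illustrative):

```python
from pathlib import Path

def caption_from_filename(image_path: str) -> str:
    # "a classic cheech and chong photograph.png"
    # -> "a classic cheech and chong photograph"
    return Path(image_path).stem

print(caption_from_filename(
    "datasets/cheech-and-chong/a classic cheech and chong photograph.png"
))
```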
During the captioning process, I focused on positional information from the images:
  • Who is in the scene
  • Where they are in the scene
  • If they are holding anything
  • If they are showing emotion
  • If anything annoying is in the image, describing this as well (e.g., against a white background)



Positional information

When describing where two subjects are in relation to the scene, consider the following AI-generated sample
(base model on left, checkpoint on right):
Validation prompt: a photo with cheech marin sitting to the right of tommy chong
This prompt works very well, but note something peculiar: the prompt requests that the subject sit on the right, yet the image generates the subject on the left.
Now, let's swap the prompt around:

Validation prompt: a photo with tommy chong sitting to the left of cheech marin
The bias of the pretraining data is coming through. My training samples purposely showed a diverse representation of the order in which these subjects appear, yet despite not really knowing exactly who these subjects are, the base model has a strong idea of how they should appear once it does know them.


Here's the result when we don't give any positional information: it's the luck of the draw. Typically, we will see one subject cloned:

Validation prompt: cheech and chong together in a photograph
But if you change the resolution or seed, the outcome can flip for the same prompt:


I found this really interesting!



Age bias

One problem with pretrained celebrities is that the model has an opinion about their age. To control this, tags such as elderly and younger were added to samples where the ages of the subjects noticeably diverge from their "classic era" look.
This had somewhat of the desired effect, but I think not enough samples like this were included:

Validation prompt: young cheech and chong in a black and white photograph

Validation prompt: old tommy chong on a sitcom in the 1990s

Diversity / generalisation capabilities

I didn't include any regularisation data, so my expectation, based on previous experience, was that this model would bleed heavily and fail to generalise in the end. However, I was pleasantly surprised.

Validation prompt: anime cheech marin

Validation prompt: anime tommy chong



Sample captions

The captions I created look like this:
  1. a 3d video game character avatar resembling tommy chong. the character is wearing a red bandana, his signature glasses, and a denim shirt. he is holding a comically large joint
  2. a black and white photograph of tommy chong standing to the left of cheech marin. they are both smiling and wearing traditional hindustani clothing. the background wall is shiny and reflective for some reason
  3. a classic cheech and chong photograph. cheech marin sits on the left, looking directly at the camera. tommy chong is on the right side, smiling vacantly and staring past the camera
  4. a pair of crude cheech and chong slippers. the left slipper features the face of cheech marin wearing a blue fedora and the right slipper has the face of tommy chong wearing a red bandana
  5. a simpsons version of cheech marin is seen from behind, looking at tommy chong, who is angry and yelling with his arms outstretched and fists clenched


Section 1

[W&B run panels: simpletuner-lokr-cheechandchong-Tue Aug 27 05:42:31 AM BST 2024]