(WIP) Transformer-VAE Performance Overview

How valid are the VAE interpolations? Made by Fraser Greenlee using Weights & Biases

Number of latent tokens

How does eval loss vary?

Compress Latent Tokens

Compressing latent tokens works better than using full-size ones.
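As a rough sketch of the idea (not the actual Transformer-VAE code; d_model = 768 and a compressed size of d_model/2 are assumptions here), the latent token can be squeezed through smaller fully-connected layers around the VAE bottleneck:

```python
import torch
import torch.nn as nn

class CompressedLatentBottleneck(nn.Module):
    """Illustrative sketch: compress a latent token before the VAE bottleneck,
    then decompress it back to d_model for the decoder."""
    def __init__(self, d_model=768, d_latent=384):  # d_latent = d_model // 2 (assumed)
        super().__init__()
        self.compress = nn.Linear(d_model, d_latent)
        self.to_mu = nn.Linear(d_latent, d_latent)
        self.to_logvar = nn.Linear(d_latent, d_latent)
        self.decompress = nn.Linear(d_latent, d_model)

    def forward(self, token_hidden):                 # (batch, d_model)
        h = torch.tanh(self.compress(token_hidden))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        return self.decompress(z), mu, logvar
```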

Extra VAE Regularisation Losses

The critic seems to give some improvement to interpolations; I think it would help a model during fine-tuning. The critic loss must be used without the 0.1 coefficient.
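A minimal sketch of how the losses might be combined (the names and the KL coefficient are placeholders, not the repo's); the point is that the critic term is used at full weight rather than scaled by 0.1:

```python
def vae_loss(recon_loss, kl_loss, critic_loss, kl_coef=1.0):
    # Critic loss added at full weight (no 0.1 coefficient).
    return recon_loss + kl_coef * kl_loss + critic_loss
```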

Larger Batch Size

Larger batches only seem to slow training down for some reason.

Smaller latent compression

Compressing the latent to d_model/2 gives the fastest learning.

All models have similar interpolation performance.

Same run but with different latent token counts.

Only the single-token model improves over time, likely because the other models have too many latent units.

Looking at the decoder/ratio samples graph, it's unclear whether a better-performing model would have worsened in its latent samples.

Variations on the 1-token model.

It seems that large batch sizes lower regularisation too much for the latent map to be effective.

This also shows that upsampling the latent encoding is important.

Currently the model upsamples to 30 tokens, but perhaps upsampling to the original 30/4 tokens would be better?

Removing upsampling worsens random samples. Delaying regularisation also doesn't help.
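A minimal sketch of upsampling a single latent vector back into a token sequence (layer names and sizes are assumptions, not the actual implementation; n_tokens=30 matches the current setup, with 30/4 being the alternative above):

```python
import torch.nn as nn

class LatentUpsampler(nn.Module):
    """Illustrative sketch: expand one latent vector into n_tokens decoder inputs."""
    def __init__(self, d_latent=384, d_model=768, n_tokens=30):
        super().__init__()
        self.n_tokens, self.d_model = n_tokens, d_model
        self.up = nn.Linear(d_latent, n_tokens * d_model)

    def forward(self, z):                            # z: (batch, d_latent)
        h = self.up(z)                               # (batch, n_tokens * d_model)
        return h.view(-1, self.n_tokens, self.d_model)
```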

How about compressing each token using a shared fc layer?
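As a rough sketch of what this could look like (the ~5x compression is a guess at what "Cmp5" means, so treat the sizes as placeholders):

```python
import torch.nn as nn

# One fc layer shared across all encoder tokens: every token is compressed
# independently with the same weights.
shared_compress = nn.Linear(768, 768 // 5)

def compress_tokens(encoder_hidden):        # (batch, seq_len, 768)
    return shared_compress(encoder_hidden)  # (batch, seq_len, 153)
```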

The Cmp5 run has the same number of latent units as the full 1st-token model.

Using dropout on a schedule could allow learning a high-dimensional representation that gradually gets compressed down.
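A minimal sketch of such a schedule (the ramp shape and max_p are assumptions, not what these runs used):

```python
import torch.nn.functional as F

def scheduled_dropout_p(step, total_steps, max_p=0.9):
    # Dropout on the latent units ramps up over training, so the representation
    # starts high-dimensional and is gradually compressed down.
    return max_p * min(step / total_steps, 1.0)

def drop_latent_units(latent, step, total_steps):
    p = scheduled_dropout_p(step, total_steps)
    return F.dropout(latent, p=p, training=True)
```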

Note: sampling did not account for dropout, so the comparison is likely unfair.

Here both runs have the same number of latent tokens (on average). Clearly this dropout method isn't ready yet.

Maybe worth trying to consistently zero units in each latent token?

What if I instead learned several latent codes at once... would there be a way to gradually change those? Would they together take better advantage of the weights?

Could a skip-connection that's gradually reduced improve performance? I wonder if it would put less "pressure" on the latent code?

Interestingly, the model with only a short-lived skip connection gave the best random samples. Note, however, that it has the worst interpolations I've ever seen.

Maybe worth trying a skip connection with a very slight multiplier?
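A minimal sketch of a decaying skip connection (the schedule and final_scale are assumptions; a small non-zero final_scale would give the "very slight multiplier" above):

```python
def decode_input_with_skip(latent_decoded, encoder_hidden, step,
                           decay_steps=10_000, final_scale=0.0):
    # Blend the encoder hidden states into the decoder input, with a multiplier
    # that decays from 1.0 towards final_scale over decay_steps.
    alpha = max(final_scale, 1.0 - step / decay_steps)
    return latent_decoded + alpha * encoder_hidden
```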

Batch size vs. training time

Note that using a batch size larger than 10 just seems to slow things down.

The model learns at the same rate per step while getting through each epoch more slowly.

Interpolation quality vs. auto-encoding performance

In general, interpolation performance first improves along with auto-encoding performance and then plateaus.

Interpolation quality is measured by learning to auto-encode individual lines of Python code, then measuring the % of interpolation points that are valid Python (including the start & end points).

One thing to note is that if interpolation is done towards an incorrectly reconstructed sequence, the result will very likely not be valid Python. What's odd about the results is that models with < 50% accuracy have > 50% valid interpolation points. This is generally due to the model producing short, repetitive sequences that happen to auto-encode well.
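For reference, a minimal sketch of how the validity check could be done with Python's ast module (an illustration of the metric described above, not necessarily the exact implementation used):

```python
import ast

def frac_valid_python(lines):
    """Fraction of interpolated lines that parse as valid Python."""
    if not lines:
        return 0.0
    valid = 0
    for line in lines:
        try:
            ast.parse(line)
            valid += 1
        except SyntaxError:
            pass
    return valid / len(lines)

# e.g. frac_valid_python(["x = 1", "def f(:"]) -> 0.5
```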