
pretraining efforts

Created on March 27|Last edited on March 27
Across a variety of configurations, there is no phasic difference in the achieved test loss. Val is best for smaller ratios, which suggests that some trials are harder than others. On the other hand, val saturates before eval does, so we might be overfitting?

So the ball is in adaptation's court, if we really believe in pretraining.

Section 1


[Plots: two panels over trainer/global_step (200 to 90k, log scale) and one panel over epoch (1 to 200, log scale)]
Run set: 5 runs