898 Tootsie Soft-Raccoon
The idea was that maybe our model was not sufficiently cooled down. Our final LR from monumental-jellyfish (aka tootsie-phase3) was 1.7e-4, which is closer to OLMo 2 7B's peak LR of 3e-4 than to its final LR of 3e-5. We also saw evidence that our model had lower confidence in general: for example, we needed a lower temperature to get AlpacaEval to work well.
Soft-Raccoon
So we continued training on the same mix, with the LR starting from (approximately) the same place, and annealed it from 1.7e-4 to 1.7e-5. This is tootsie-8b-soft-raccoon3 (gray as of this writing). (The first two soft-raccoons had uninteresting config problems.)
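For concreteness, here's a minimal sketch of this kind of linear LR anneal using optax, assuming a JAX/optax-style setup; the step count and weight decay value are placeholders, not the actual soft-raccoon config:

```python
import optax

# Placeholder anneal length; the real run's step count differs.
NUM_ANNEAL_STEPS = 10_000

# Linearly decay the LR from the phase-3 final LR down to a tenth of it,
# mirroring the 1.7e-4 -> 1.7e-5 cooldown described above.
lr_schedule = optax.linear_schedule(
    init_value=1.7e-4,
    end_value=1.7e-5,
    transition_steps=NUM_ANNEAL_STEPS,
)

# AdamW driven by the annealed schedule (weight decay value is illustrative).
optimizer = optax.adamw(learning_rate=lr_schedule, weight_decay=0.1)
```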
Loss decreased (on both held-out Tulu SFT data and the training mix) for most of the run, but ominously, toward the end the loss started increasing... This began around LR 2.2e-5.
XXX link to SFT on soft raccoon. Overall it performed better on Tulu during SFT, and AlpacaEval went up, but was still quite bad (~4).
Softer-Raccoon
We decided to lower the LR even more since things had gotten better, but that ominous increasing training-loss (and validation-loss!) trend only got worse! Oh no!
We tried to diagnose:
- Was it weight decay taking over when the gradients got small? No, zeroing out weight decay had no effect (tootsie-8b-softer-raccoon-no-decay); see the sketch below.
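For intuition on why we suspected weight decay: with decoupled (AdamW-style) decay the update is roughly θ ← θ − lr·(adam_update + wd·θ), so once gradients get tiny the wd·θ term can dominate and push the loss back up. A minimal optax sketch of the ablation, with placeholder hyperparameter values (the actual run just zeroed weight decay in the trainer config):

```python
import optax

# Placeholder LR near the bottom of the anneal; the real value differs.
LOW_LR = 1.7e-5

# Baseline: decoupled weight decay. Near the end of the anneal, when gradients
# are small, the lr * wd * theta term can dominate the parameter update.
baseline_opt = optax.adamw(learning_rate=LOW_LR, weight_decay=0.1)

# Ablation (a la tootsie-8b-softer-raccoon-no-decay): identical except weight
# decay is zeroed, isolating whether the decay term drives the rising loss.
no_decay_opt = optax.adamw(learning_rate=LOW_LR, weight_decay=0.0)
```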
So who knows. We're moving on to hypnotic-spoonbill (TODO), which includes some Tulu data and FLAN (and a higher min LR) but otherwise looks like soft-raccoon. Fingers crossed.