898 Tootsie Soft-Raccoon
The idea was that maybe our model was not sufficiently cooled down. Our final LR from monumental-jellyfish (aka tootsie-phase3) was 1.7e-4, which is closer to OLMo 2 7B's peak LR of 3e-4 than to its final LR of 3e-5. We also saw evidence that our model had lower confidence in general: for example, we needed a lower temperature to get AlpacaEval to work well.
Soft-Raccoon
So we continued training on the same mix, with the LR starting from (approximately) the same place, and annealed it from 1.7e-4 to 1.7e-5. This is tootsie-8b-soft-raccoon3 (gray as of this writing). (The first two soft-raccoons had uninteresting config problems.)
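For concreteness, here's a minimal sketch of this kind of linear LR anneal using optax, assuming a JAX/optax-style setup; the step count and weight decay value are placeholders, not the actual soft-raccoon config:

```python
import optax

# Placeholder anneal length; the real run's step count differs.
NUM_ANNEAL_STEPS = 10_000

# Linearly decay the LR from the phase-3 final LR down to a tenth of it,
# mirroring the 1.7e-4 -> 1.7e-5 cooldown described above.
lr_schedule = optax.linear_schedule(
    init_value=1.7e-4,
    end_value=1.7e-5,
    transition_steps=NUM_ANNEAL_STEPS,
)

# AdamW driven by the annealed schedule (weight decay value is illustrative).
optimizer = optax.adamw(learning_rate=lr_schedule, weight_decay=0.1)
```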
Loss decreased (on both held-out Tulu SFT data and the training mix) for most of the run, but ominously, toward the end the loss started increasing... This began around LR 2.2e-5.
XXX link to SFT on soft raccoon. Overall it performed better on Tulu during SFT, and AlpacaEval went up, but was still quite bad (~4).
Softer-Raccoon
We decided to lower the LR even more since things had gotten better, but that ominous increasing training-loss (and validation-loss!) trend only got worse! Oh no!
We tried to diagnose:
- Was it weight decay taking over when the gradients got small? No, zeroing out weight decay had no effect (tootsie-8b-softer-raccoon-no-decay); see the sketch below.
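For intuition on why we suspected weight decay: with decoupled (AdamW-style) decay the update is roughly θ ← θ − lr·(adam_update + wd·θ), so once gradients get tiny the wd·θ term can dominate and push the loss back up. A minimal optax sketch of the ablation, with placeholder hyperparameter values (the actual run just zeroed weight decay in the trainer config):

```python
import optax

# Placeholder LR near the bottom of the anneal; the real value differs.
LOW_LR = 1.7e-5

# Baseline: decoupled weight decay. Near the end of the anneal, when gradients
# are small, the lr * wd * theta term can dominate the parameter update.
baseline_opt = optax.adamw(learning_rate=LOW_LR, weight_decay=0.1)

# Ablation (a la tootsie-8b-softer-raccoon-no-decay): identical except weight
# decay is zeroed, isolating whether the decay term drives the rising loss.
no_decay_opt = optax.adamw(learning_rate=LOW_LR, weight_decay=0.0)
```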
So who knows. We're moving on to hypnotic-spoonbill (TODO), which includes some Tulu data and FLAN (and a higher min LR) but otherwise looks like soft-raccoon. Fingers crossed.