916 Tootsie Hypnotic Spoonbill

Created on March 31 | Last edited on May 14

  • GH Issue 916
Similar to the raccoon runs, but we wanted to use a higher final learning rate, the hypothesis being that the LR in those runs was simply too low. We set the min LR just above the point where Soft-Raccoon's loss started going up. Otherwise the cooldown was the same as Soft-Raccoon.

So #898 was moderately successful: cooling the model down further resulted in better SFT performance. However, we hit what we thought was an LR floor below which the loss increased, and various attempts to save the run didn't work.
We tried another cooldown (starting from the same point: Tootsie 8b Monumental-Jellyfish) that wasn't quite as deep as Deep-Raccoon. We decided to treat 3e-5 as an LR floor (this is roughly where the loss started going up) and repeated Soft-Raccoon with a decay to 3e-5 over 200B tokens, mixing in ~0.3% Tulu 3 and ~1% FLAN. In addition to the LR changes, we hoped that adding some SFT-ish data while cooling down would make the model more task-oriented.
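For concreteness, a minimal sketch of the schedule and mixture described above; the linear decay shape, the starting LR, and all names here are assumptions for illustration, not the actual run config:

```python
# Sketch of the spoonbill cooldown, assuming a linear decay shape.
# START_LR is hypothetical; the report only pins down the floor and horizon.

COOLDOWN_TOKENS = 200e9   # decay over ~200B tokens
MIN_LR = 3e-5             # treated as the LR floor (loss rose below this)
START_LR = 1.7e-4         # hypothetical LR at the start of the cooldown

def cooldown_lr(tokens_seen: float) -> float:
    """Linearly decay from START_LR to MIN_LR over COOLDOWN_TOKENS,
    then hold at the floor."""
    frac = min(tokens_seen / COOLDOWN_TOKENS, 1.0)
    return START_LR + frac * (MIN_LR - START_LR)

# Data mixture for the cooldown: mostly pretraining data, plus a small
# amount of SFT-ish data (~0.3% Tulu 3, ~1% FLAN).
MIXTURE_WEIGHTS = {
    "pretraining": 0.987,
    "tulu_3": 0.003,
    "flan": 0.010,
}
```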


Training Results


Surprisingly, we observed the exact same increase in the loss at about the same point, if anything a little earlier. Tulu loss did go down a lot, as expected.
We kicked off another run (spoonbill-norms-2) that logs gradient histograms. We will update this report when it comes back.
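This is not the actual logging code from that run; a minimal sketch of logging per-parameter gradient histograms to W&B in JAX, assuming a pytree of gradients, could look like:

```python
import jax
import wandb

def log_grad_histograms(grads, step: int):
    """Log a histogram of each parameter's gradients to W&B.

    `grads` is a pytree of gradient arrays (e.g. from jax.grad);
    the flattening and naming scheme here is illustrative only.
    """
    flat, _ = jax.tree_util.tree_flatten_with_path(grads)
    logs = {}
    for path, g in flat:
        name = "grad/" + jax.tree_util.keystr(path)
        # Pull the array off-device and flatten it for the histogram.
        logs[name] = wandb.Histogram(jax.device_get(g).ravel())
    wandb.log(logs, step=step)
```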


[W&B run set: 7 runs]