
SmolLM2-135M-Instruct vs SmolLM2-360M-Instruct

Created on December 5|Last edited on December 5

Task

  • Binary hallucination classification based on query, context & output
  • Predict a single token, "True" or "False" (plus EOS token)
  • Briefly tried 135M-base but wasn't having much luck
  • Approx. 45k samples in the training data, avg ~1k tokens per sample, ~450k tokens per epoch
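The single-token setup above can be sketched as a prompt/target pair. This is a minimal illustration, not the report's actual preprocessing; `format_sample` and the field layout are hypothetical.

```python
def format_sample(query, context, output, label):
    """Hypothetical formatter: the model sees query/context/output and must
    generate exactly one word, "True" or "False" (EOS appended at tokenize time)."""
    prompt = (
        "Query: " + query + "\n"
        "Context: " + context + "\n"
        "Output: " + output + "\n"
        "Hallucination: "
    )
    # Target is the single classification token; loss would be computed on it only.
    target = "True" if label else "False"
    return {"prompt": prompt, "completion": target}

# Illustrative example, not a real training sample.
sample = format_sample(
    "Who wrote Hamlet?",
    "Shakespeare wrote Hamlet.",
    "Christopher Marlowe wrote Hamlet.",
    True,
)
```

Keeping the label as a real vocabulary token is what preserves compatibility with plain generation APIs: serving libraries can run the model unmodified and read the first output token.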

"Why don't you use a BERT-based model or add a sequence classification head?"
  • I'd like to maintain easy compatibility with downstream LLM serving libraries, plus I'm curious how lazy I can be.


Hyperparameter search

Searching over
  • Learning rate: 1e-3, 1e-4, 1e-5
  • Epochs: 3, 5, 7, 9, 11
  • Warmup ratio: 0.02, 0.05, 0.01
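Assuming a full grid over the three axes above (the report doesn't state the sweep strategy, so this is a sketch), the search space per model is:

```python
import itertools

# Values taken from the search space listed above.
learning_rates = [1e-3, 1e-4, 1e-5]
epochs = [3, 5, 7, 9, 11]
warmup_ratios = [0.02, 0.05, 0.01]

# A full Cartesian product would give 3 * 5 * 3 = 45 configs per model.
grid = list(itertools.product(learning_rates, epochs, warmup_ratios))
print(len(grid))  # 45
```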

360M - F1 score is struggling

Below are the average metrics grouped by model and learning rate. Variation within each distribution comes from experiments with different warmup ratios and numbers of epochs.
A learning rate of 1e-3 worked best for both models, by F1 score, eval loss, and train loss.
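The grouping behind these panels can be reproduced offline from exported run data. A minimal pandas sketch, with toy values standing in for the real sweep results:

```python
import pandas as pd

# Toy stand-in for exported run metrics; real values come from the sweep.
runs = pd.DataFrame({
    "model": ["135M", "135M", "360M", "360M"],
    "learning_rate": [1e-3, 1e-4, 1e-3, 1e-4],
    "f1": [0.60, 0.55, 0.53, 0.50],
})

# Average each metric across warmup-ratio/epoch variants, grouped by
# model and learning rate, mirroring the panels above.
avg = runs.groupby(["model", "learning_rate"])["f1"].mean()
```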

[W&B panel: Run set, 32 runs]



Ungrouped metrics

Filtered for F1 > 0.52
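The same F1 threshold can be applied to an exported run table. A small sketch with illustrative values (not the actual runs):

```python
import pandas as pd

# Toy stand-in for the exported sweep results.
runs = pd.DataFrame({
    "run": ["a", "b", "c", "d"],
    "f1": [0.61, 0.48, 0.55, 0.52],
})

# Keep only runs clearing the F1 > 0.52 threshold used in the panel;
# note the strict inequality drops a run at exactly 0.52.
good = runs[runs["f1"] > 0.52]
```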

[W&B panel: Run set, 32 runs]