
SmolLM2-135M-Instruct vs SmolLM2-360M-Instruct

Created on December 5|Last edited on December 5

Task

  • Binary hallucination classification based on query, context & output
  • Predict a single token, "True" or "False" (plus EOS token)
  • Briefly tried 135M-base but wasn't having much luck
  • Approx. 45k samples in the training data, avg ~1k tokens per sample, ~450k tokens per epoch
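The single-token setup above can be sketched as a prompt/target pair. This is a minimal illustration, not the report's actual preprocessing; `format_sample` and the field layout are hypothetical.

```python
def format_sample(query, context, output, label):
    """Hypothetical formatter: the model sees query/context/output and must
    generate exactly one word, "True" or "False" (EOS appended at tokenize time)."""
    prompt = (
        "Query: " + query + "\n"
        "Context: " + context + "\n"
        "Output: " + output + "\n"
        "Hallucination: "
    )
    # Target is the single classification token; loss would be computed on it only.
    target = "True" if label else "False"
    return {"prompt": prompt, "completion": target}

# Illustrative example, not a real training sample.
sample = format_sample(
    "Who wrote Hamlet?",
    "Shakespeare wrote Hamlet.",
    "Christopher Marlowe wrote Hamlet.",
    True,
)
```

Keeping the label as a real vocabulary token is what preserves compatibility with plain generation APIs: serving libraries can run the model unmodified and read the first output token.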

"Why don't you use a BERT-based model or add a sequence classification head?"
  • I'd like to maintain easy compatibility with downstream LLM serving libraries, plus I'm curious how lazy I can be.


Hyperparameter search

Searching over
  • Learning rate: 1e-3, 1e-4, 1e-5
  • Epochs: 3, 5, 7, 9, 11
  • Warmup ratio: 0.02, 0.05, 0.01
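Assuming a full grid over the three axes above (the report doesn't state the sweep strategy, so this is a sketch), the search space per model is:

```python
import itertools

# Values taken from the search space listed above.
learning_rates = [1e-3, 1e-4, 1e-5]
epochs = [3, 5, 7, 9, 11]
warmup_ratios = [0.02, 0.05, 0.01]

# A full Cartesian product would give 3 * 5 * 3 = 45 configs per model.
grid = list(itertools.product(learning_rates, epochs, warmup_ratios))
print(len(grid))  # 45
```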

360M - F1 score is struggling

Below are the average metrics grouped by model and learning rate. Variation within each distribution comes from experiments with different warmup ratios and numbers of epochs.
A learning rate of 1e-3 worked best for both models, by F1 score, eval loss, and train loss.
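The grouping behind these panels can be reproduced offline from exported run data. A minimal pandas sketch, with toy values standing in for the real sweep results:

```python
import pandas as pd

# Toy stand-in for exported run metrics; real values come from the sweep.
runs = pd.DataFrame({
    "model": ["135M", "135M", "360M", "360M"],
    "learning_rate": [1e-3, 1e-4, 1e-3, 1e-4],
    "f1": [0.60, 0.55, 0.53, 0.50],
})

# Average each metric across warmup-ratio/epoch variants, grouped by
# model and learning rate, mirroring the panels above.
avg = runs.groupby(["model", "learning_rate"])["f1"].mean()
```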

[W&B panel: Run set, 32 runs]



Ungrouped metrics

Filtered for F1 > 0.52
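The same F1 threshold can be applied to an exported run table. A small sketch with illustrative values (not the actual runs):

```python
import pandas as pd

# Toy stand-in for the exported sweep results.
runs = pd.DataFrame({
    "run": ["a", "b", "c", "d"],
    "f1": [0.61, 0.48, 0.55, 0.52],
})

# Keep only runs clearing the F1 > 0.52 threshold used in the panel;
# note the strict inequality drops a run at exactly 0.52.
good = runs[runs["f1"] > 0.52]
```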

[W&B panel: Run set, 32 runs]