SmolLM2-135M-Instruct vs SmolLM2-360M-Instruct
Created on December 5 | Last edited on December 5
Task
- Binary classification for hallucination based on query, context & output
- Predict a single token, "True" or "False" (plus the EOS token)
- Briefly tried the 135M base model but wasn't having much luck
- Approx. 45k samples in the training data, avg ~1k tokens per sample, so ~45M tokens per epoch
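The single-token setup can be sketched as follows; the prompt layout, field names, and function names here are illustrative assumptions, not the report's actual training code:

```python
# Sketch of the single-token classification setup; prompt layout and
# names are assumptions, not the report's actual code.

def format_sample(query: str, context: str, output: str) -> str:
    """Pack query, context & output into one prompt that the model is
    trained to answer with a single token: "True" or "False"."""
    return (
        f"Query: {query}\n"
        f"Context: {context}\n"
        f"Output: {output}\n"
        "Hallucination:"
    )

def predict_label(true_logit: float, false_logit: float) -> str:
    """At inference, compare the logits of the two label tokens at the
    first generated position instead of free-form sampling."""
    return "True" if true_logit > false_logit else "False"
```

Scoring just the two label logits keeps inference to a single forward pass while staying compatible with standard causal-LM serving.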
"Why don't you use a bert-based model of add a sequence classification head?"
- I'd like to try to maintain easy compatibility with downstream LLM serving libraries, plus I'm curious how lazy I can be.
Hyperparameter search
Searching over
- Learning rate: 1e-3, 1e-4, 1e-5
- Epochs: 3, 5, 7, 9, 11
- Warmup ratio: 0.02, 0.05, 0.01
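Enumerating the grid above gives 3 × 5 × 3 = 45 configurations per model; a minimal sketch (the config keys are illustrative, not actual trainer arguments):

```python
from itertools import product

# Sweep values taken from the list above.
learning_rates = [1e-3, 1e-4, 1e-5]
epoch_counts = [3, 5, 7, 9, 11]
warmup_ratios = [0.02, 0.05, 0.01]

# One config dict per run; keys are illustrative placeholders.
configs = [
    {"learning_rate": lr, "num_epochs": ep, "warmup_ratio": wr}
    for lr, ep, wr in product(learning_rates, epoch_counts, warmup_ratios)
]
# len(configs) == 3 * 5 * 3 == 45 runs per model
```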
360M - F1 score is struggling
Below are the average metrics grouped by model and learning rate. Variation within each distribution comes from runs with different warmup ratios and epoch counts.
A learning rate of 1e-3 worked best for both models by F1 score, eval loss, and train loss.
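The grouping can be reproduced with a small helper; the records and metric values below are placeholders to show the shape of the computation, not the report's results:

```python
from collections import defaultdict
from statistics import mean

def average_by(runs, keys, metric):
    """Average `metric` over runs sharing the same values for `keys`."""
    groups = defaultdict(list)
    for run in runs:
        groups[tuple(run[k] for k in keys)].append(run[metric])
    return {k: mean(v) for k, v in groups.items()}

# Placeholder records (not real results); within each (model, lr)
# group the runs differ by warmup ratio and epoch count.
runs = [
    {"model": "360M", "lr": 1e-3, "f1": 0.50},
    {"model": "360M", "lr": 1e-3, "f1": 0.75},
    {"model": "135M", "lr": 1e-4, "f1": 0.60},
]
avg_f1 = average_by(runs, ("model", "lr"), "f1")
# avg_f1[("360M", 1e-3)] == 0.625
```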
Run set: 32 runs
Ungrouped metrics
Filtered for F1 > 0.52