
neftune for summarization - longt5

Small experiment to see if NEFTune improves summarization model generalization.
Created on November 13 | Last edited on November 13

about

Relevant paper: NEFTune: Noisy Embeddings Improve Instruction Finetuning (abstract quoted below)
We show that language model finetuning can be improved, sometimes dramatically, with a simple augmentation. NEFTune adds noise to the embedding vectors during training. Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement. Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune.

what

  • there is a new training technique called NEFTune that adds noise to the embedding vectors during training, which results in better generalization (of autoregressive LMs on instruct tasks); see the sketch after this list
  • summarization models have a very big generalization problem
  • idea: let's try this technique for summarization fine-tunes and see if it helps!
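
A minimal sketch of the technique (my paraphrase of the paper, assuming a standard PyTorch setup): register a forward hook on the model's input embedding layer that adds uniform noise scaled by alpha / sqrt(L * d), during training only. The alpha value and the stand-in nn.Embedding below are illustrative, not the exact settings from these runs.

```python
import torch
import torch.nn as nn

NEFT_ALPHA = 5.0  # noise scale; illustrative (the paper sweeps 5/10/15 for instruct LMs)

def neftune_hook(module, inputs, output):
    """Add uniform noise U(-mag, mag) with mag = alpha / sqrt(L * d)
    to the embedding output -- only while training, never at eval."""
    if module.training:
        seq_len, dim = output.shape[1], output.shape[2]
        mag = NEFT_ALPHA / (seq_len * dim) ** 0.5
        output = output + torch.empty_like(output).uniform_(-mag, mag)
    return output

# Stand-in embedding layer; for a real model, hook model.get_input_embeddings().
emb = nn.Embedding(32128, 768)
handle = emb.register_forward_hook(neftune_hook)

tokens = torch.randint(0, 32128, (2, 512))
emb.train()
noisy = emb(tokens)  # noise applied
emb.eval()
clean = emb(tokens)  # no noise at eval time
```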

tldr

Hard to tell; it needs more experimentation. It's unclear whether NEFTune a) helps or b) doesn't: either the noise scale needs to be much higher for summarization models, or longt5-base is too small for it to make a difference.


dataset

initial version of my 'summary souffle' mixture; per-subset counts for the train split:
lay_plos 20789
multi_news 11708
big_patent 4164
gov_report 3514
summ_screen_fd 3449
billsum 2541
lay_elife 2528
booksum 2383
cnn_dailymail 1705
stacksmol 450
qmsum 396
squality 200
xlsum_en 118
worldbank 90
narrativeqa 49
dialogsum 3
Name: subset, dtype: int64
split sizes:
DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'subset'],
        num_rows: 54087
    })
    validation: Dataset({
        features: ['text', 'summary', 'subset'],
        num_rows: 4262
    })
    test: Dataset({
        features: ['text', 'summary', 'subset'],
        num_rows: 4202
    })
})
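
For reference, a sketch of how the counts above can be reproduced, assuming the mix is stored on disk as a HF DatasetDict with a 'subset' column (the path is a placeholder):

```python
import pandas as pd
from datasets import load_from_disk

# Placeholder path for the dataset mix; substitute wherever it actually lives.
ds = load_from_disk("summary_souffle")

# Per-subset counts for the train split (the value_counts output above).
print(pd.Series(ds["train"]["subset"]).value_counts())

# Split sizes (the DatasetDict repr above).
print(ds)
```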


eval


[W&B panels: eval metrics for the 13-run set]


train

Note: the paper (for autoregressive LMs) reports that training loss goes up with NEFTune, but I don't think that is really observable in the plots below, even at six times the recommended value (for autoregressive LMs) --> 0.6 ??
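
For context, a sketch of how the noise scale can be plugged in, assuming the runs go through transformers' Seq2SeqTrainer; recent transformers releases expose NEFTune via the neftune_noise_alpha training argument. The checkpoint name and the other hyperparameters are placeholders, not the exact run config; only the 0.6 comes from the note above.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/long-t5-tglobal-base"  # assumed longt5-base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

args = Seq2SeqTrainingArguments(
    output_dir="longt5-neftune",        # placeholder
    learning_rate=1e-4,                 # placeholder
    per_device_train_batch_size=1,      # placeholder
    neftune_noise_alpha=0.6,            # the 6x value mentioned in the note above
    report_to="wandb",
)

# trainer = Seq2SeqTrainer(model=model, args=args, tokenizer=tokenizer,
#                          train_dataset=..., eval_dataset=...)
# trainer.train()
```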

[W&B panels: training loss curves for the 13-run set]


Run Comparer

Aside from the one run without NEFTune, the other runs were me trying every idea I could think of to train faster (torch compile, etc.)... nothing worked.
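
One of those speed attempts, sketched under the same assumptions (checkpoint name and output dir are placeholders): torch.compile can be applied to the model directly, or via the trainer's torch_compile flag.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments

model = AutoModelForSeq2SeqLM.from_pretrained("google/long-t5-tglobal-base")

# Option 1: compile the model directly (PyTorch >= 2.0).
compiled_model = torch.compile(model)

# Option 2: let the HF Trainer wrap the model via the torch_compile flag.
args = Seq2SeqTrainingArguments(output_dir="longt5-compile", torch_compile=True)
```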

[W&B run comparer: 13-run set]