neftune for summarization - longt5
Small experiment to see if NEFTune improves summarization model generalization.
Created on November 13|Last edited on November 13
about
From the NEFTune paper abstract ([arXiv:2310.05914](https://arxiv.org/abs/2310.05914)):
> We show that language model finetuning can be improved, sometimes dramatically, with a simple augmentation. NEFTune adds noise to the embedding vectors during training. Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement. Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune.
what
- there is a new training technique called NEFTune that adds noise to the embedding vectors during finetuning, which results in better generalization (of autoregressive LMs on instruct tasks); see the sketch after this list
- summarization models have a very big generalization problem
- idea: let's try this technique for summarization fine-tunes and see if it helps!
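For reference, NEFTune boils down to a forward hook on the input embedding layer that adds scaled uniform noise only during training. A minimal PyTorch sketch (the function name and default alpha here are mine, not from the paper's code):

```python
import math
import torch

def neftune_hook(module, inputs, output, noise_alpha=5.0):
    # During training, add uniform noise in [-mag, mag] to the embedding output,
    # with mag = alpha / sqrt(seq_len * hidden_dim). Eval/generation is untouched.
    if module.training:
        mag = noise_alpha / math.sqrt(output.size(1) * output.size(2))
        output = output + torch.zeros_like(output).uniform_(-mag, mag)
    return output

# attach it to the model's input embeddings, e.g.:
# model.get_input_embeddings().register_forward_hook(neftune_hook)
```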
tldr
Hard to tell; it needs more experimentation. It's unclear whether a) it helps or b) it doesn't; either the noise scale needs to be much higher for summarization models, or longt5-base is too small for it to make a difference.
dataset
initial version of my 'summary souffle'
Rows per subset (`value_counts()` on the `subset` column):

| subset | rows |
| --- | --- |
| lay_plos | 20789 |
| multi_news | 11708 |
| big_patent | 4164 |
| gov_report | 3514 |
| summ_screen_fd | 3449 |
| billsum | 2541 |
| lay_elife | 2528 |
| booksum | 2383 |
| cnn_dailymail | 1705 |
| stacksmol | 450 |
| qmsum | 396 |
| squality | 200 |
| xlsum_en | 118 |
| worldbank | 90 |
| narrativeqa | 49 |
| dialogsum | 3 |
size:
DatasetDict({
    train: Dataset({features: ['text', 'summary', 'subset'], num_rows: 54087})
    validation: Dataset({features: ['text', 'summary', 'subset'], num_rows: 4262})
    test: Dataset({features: ['text', 'summary', 'subset'], num_rows: 4202})
})
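Roughly how a mixture like this can be put together with HF `datasets` (a sketch only; the subset names, splits, and column handling below are illustrative, not the exact code behind the souffle):

```python
from datasets import concatenate_datasets, load_dataset

def load_subset(name: str, split: str):
    # assumes the source already exposes 'text' and 'summary' columns;
    # in practice each subset usually needs its own column mapping
    ds = load_dataset(name, split=split)
    return ds.add_column("subset", [name] * len(ds))

train = concatenate_datasets(
    [load_subset(n, "train") for n in ["billsum", "qmsum"]]  # ...and so on for each subset
).shuffle(seed=42)

# the per-subset counts in the table above come from pandas:
# train.to_pandas()["subset"].value_counts()
```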
eval
[panel grid: run set of 13 runs]
train
Note: in the paper (for autoregressive LMs) they mention that training loss goes up with NEFTune, but I don't think that's really observable in the charts below, even at six times the recommended value (for autoregressive LMs), i.e. 0.6.
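As a reference point for where the 0.6 gets set: recent `transformers` releases (>= 4.36) expose `neftune_noise_alpha` directly on `TrainingArguments`; on older versions the manual hook sketched earlier does the same thing. A hedged sketch of this kind of setup (checkpoint name and hyperparameters are illustrative, not the exact config behind these runs):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Seq2SeqTrainingArguments

model_name = "google/long-t5-tglobal-base"   # illustrative longt5 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

args = Seq2SeqTrainingArguments(
    output_dir="longt5-neftune",
    neftune_noise_alpha=0.6,        # the noise scale referenced in the note above
    predict_with_generate=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
)
# pass `model`, `args`, and the tokenized train/validation splits to Seq2SeqTrainer
# as usual, then call trainer.train(); neftune_noise_alpha=None disables the noise.
```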
[panel grid: run set of 13 runs]
Run Comparer
Aside from the one run without NEFTune, the other runs were me trying every idea I could think of to train faster (torch.compile, etc.)... nothing worked.
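For reference, the torch.compile attempt mentioned above is essentially a one-liner (sketch, assuming PyTorch 2.x and an illustrative checkpoint name):

```python
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/long-t5-tglobal-base")
model = torch.compile(model)   # PyTorch 2.x graph compilation, one of the speed-up attempts
```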