What's wrong with BPE?
Scaling up the transformer gets BPE training to work.
Original problem: BPE training was dead. I checked the data and nothing seems to be wrong: I can reconstruct the same token-sequence information from the BPE sequence.
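As a quick illustration of that check, here is a minimal round-trip sketch. The names (`expand_token`, a `merges` map from each merged token to the pair it was built from) are hypothetical and not the actual tokenizer code from these runs.

```python
def expand_token(token, merges):
    """Recursively expand a (possibly merged) BPE token into base symbols."""
    if token not in merges:
        return [token]          # base symbol: never produced by a merge
    left, right = merges[token]
    return expand_token(left, merges) + expand_token(right, merges)


def check_roundtrip(original_tokens, bpe_tokens, merges):
    """Verify the BPE sequence still carries the original token sequence."""
    reconstructed = [s for tok in bpe_tokens for s in expand_token(tok, merges)]
    return reconstructed == list(original_tokens)


# Toy merge table: "ab" was merged from ("a", "b"), then "abc" from ("ab", "c").
merges = {"ab": ("a", "b"), "abc": ("ab", "c")}
assert check_roundtrip(["a", "b", "c"], ["abc"], merges)
```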
Solution: I tried a bigger model and it kind of works. It is the same model as the purple one in run set 2 (2 layers, hid_dim=256, heads=8, ff_dim=1024, 2.3M params). But the training trend is also pretty weird: the loss stays flat for a while and then the model suddenly learns.
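For reference, a minimal PyTorch sketch of that configuration is below. The vocabulary size, maximum length, and output head are placeholders rather than values from these runs, so the printed total will not be exactly 2.3M; the encoder stack alone is roughly 1.6M parameters.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 3000  # placeholder, not reported
MAX_LEN = 512      # placeholder, not reported

class TinyTransformer(nn.Module):
    """The 'purple' configuration: 2 layers, hid_dim=256, heads=8, ff_dim=1024."""

    def __init__(self, hid_dim=256, heads=8, ff_dim=1024, layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, hid_dim)
        self.pos_emb = nn.Embedding(MAX_LEN, hid_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hid_dim, nhead=heads, dim_feedforward=ff_dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(hid_dim, VOCAB_SIZE)

    def forward(self, tokens):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        h = self.tok_emb(tokens) + self.pos_emb(positions)[None]
        return self.head(self.encoder(h))

model = TinyTransformer()
print(sum(p.numel() for p in model.parameters()))  # total depends on the placeholder vocab
```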
Run set 2:
The effect of a larger model on the no-BPE experiments: scaling the model up is not always great. Larger models improve faster at the beginning, but they overfit at an earlier stage. (A rough parameter-count comparison is sketched after the list.)
Original-green: 2 layers, hid_dim=128, heads=4, ff_dim=128 ---- only 360K params, a really mini transformer.
lightgreen: 2 layers, hid_dim=256, heads=4, ff_dim=128
purple: 2 layers, hid_dim=256, heads=8, ff_dim=1024. Starts off nicely, but converges to a much worse place than the original.
pink: 6 layers ---- doesn't learn at all. This is similar to the graph case; the number of layers probably needs to be restrained. (But the original Transformer, as well as the one the BPE paper used, is much larger than ours now; not sure whether the difference is due to the data.)
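A rough way to compare the four runs is to count encoder parameters analytically (the number of heads does not change the total for standard multi-head attention). Embeddings and the output head are excluded, which is roughly the gap to the reported 360K and 2.3M totals; the pink run's widths are assumed to match purple's, since only its depth is stated above.

```python
def encoder_params(hid_dim, ff_dim, layers):
    """Approximate size of a stack of standard encoder layers:
    Q/K/V/output projections, two feed-forward linears, two LayerNorms."""
    attn = 4 * (hid_dim * hid_dim + hid_dim)       # four projections + biases
    ffn = 2 * hid_dim * ff_dim + ff_dim + hid_dim  # two linears + biases
    norms = 2 * 2 * hid_dim                        # two LayerNorms (weight + bias)
    return layers * (attn + ffn + norms)

runs = {
    "green (original)":      dict(hid_dim=128, ff_dim=128,  layers=2),
    "lightgreen":            dict(hid_dim=256, ff_dim=128,  layers=2),
    "purple":                dict(hid_dim=256, ff_dim=1024, layers=2),
    "pink (assumed widths)": dict(hid_dim=256, ff_dim=1024, layers=6),
}
for name, cfg in runs.items():
    print(f"{name:>22}: ~{encoder_params(**cfg):,} encoder params")
```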