Extra: Shared/separate QK for the synthetic task
Comparing shared/separate qk transformer lm for the synthetic task
Created on January 28 | Last edited on January 28
During the experiment phase of the synthetic task (Section 3.1), we used a Transformer LM with shared query-key projections. We expected this model to solve the relatively simple synthetic task perfectly, but we found it difficult to train the model to convergence, even when training for 750 epochs. When we instead used a standard Transformer LM (i.e., with separate query and key projections), the model consistently converged within 7-8 epochs.
The figure below shows several runs with an identical setup, changing only the model seed. Out of five runs with shared query-key projections, only one converged, and only after about 200 epochs. All three runs of the standard Transformer LM converged within 7-8 epochs.
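To make the architectural difference concrete, here is a minimal single-head attention sketch contrasting the two variants. This is an illustrative NumPy implementation, not the code used in the experiments; all function and variable names are our own. Note that with a shared projection the pre-softmax score matrix is symmetric, which constrains the attention patterns the model can express and may relate to the slower convergence we observed.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, w_q, w_k, w_v):
    # Standard attention: separate query and key projections.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(k.shape[-1])  # generally asymmetric
    return softmax(scores) @ v

def shared_qk_attention(x, w_qk, w_v):
    # Shared-QK attention: one projection serves as both query and key.
    qk, v = x @ w_qk, x @ w_v
    scores = (qk @ qk.T) / np.sqrt(qk.shape[-1])  # symmetric by construction
    return softmax(scores) @ v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))        # 4 tokens, model dim 8
    w_qk, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
    out_shared = shared_qk_attention(x, w_qk, w_v)
    out_std = attention(x, w_qk, w_k, w_v)
    print(out_shared.shape, out_std.shape)
```

The symmetry claim is easy to check: `(x @ w_qk) @ (x @ w_qk).T` always equals its own transpose, whereas `(x @ w_q) @ (x @ w_k).T` generally does not.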