Extra: Shared/separate QK for the synthetic task
Comparing shared/separate qk transformer lm for the synthetic task
Created on January 28 | Last edited on January 28
During the experiment phase of the synthetic task (Section 3.1), we used a Transformer LM with shared query-key projections. We expected this model to solve the relatively simple synthetic task perfectly, but we found it difficult to train the model to convergence, even when training for 750 epochs. When we instead used a standard Transformer LM (i.e., with separate query and key projections), the model consistently converged within 7-8 epochs.
The figure below shows several runs with an identical setup, changing only the model seed. Out of five runs with shared query-key projections, only one converged, and only after about 200 epochs. All three runs of the standard Transformer LM converged within 7-8 epochs.
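To make the architectural difference concrete, here is a minimal single-head attention sketch contrasting the two variants. This is an illustrative NumPy implementation, not the code used in the experiments; all function and variable names are our own. Note that with a shared projection the pre-softmax score matrix is symmetric, which constrains the attention patterns the model can express and may relate to the slower convergence we observed.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, w_q, w_k, w_v):
    # Standard attention: separate query and key projections.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(k.shape[-1])  # generally asymmetric
    return softmax(scores) @ v

def shared_qk_attention(x, w_qk, w_v):
    # Shared-QK attention: one projection serves as both query and key.
    qk, v = x @ w_qk, x @ w_v
    scores = (qk @ qk.T) / np.sqrt(qk.shape[-1])  # symmetric by construction
    return softmax(scores) @ v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))        # 4 tokens, model dim 8
    w_qk, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
    out_shared = shared_qk_attention(x, w_qk, w_v)
    out_std = attention(x, w_qk, w_k, w_v)
    print(out_shared.shape, out_std.shape)
```

The symmetry claim is easy to check: `(x @ w_qk) @ (x @ w_qk).T` always equals its own transpose, whereas `(x @ w_q) @ (x @ w_k).T` generally does not.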