Vanilla vs simplified (WMT)
Section 1
None of the models use a bias in attention, and all have an additional projection before and after the embedding to match the total model parameter count while keeping the embedding matrix parameter count the same.
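A minimal sketch of this parameter-count matching, assuming the extra projections sit around a shared embedding matrix; the module and dimension names here are illustrative, not the actual fairseq configuration:

```python
import torch.nn as nn

class ProjectedEmbedding(nn.Module):
    """Shared embedding matrix with extra input/output projections.

    The projections add parameters so that the simplified-attention models
    can be brought to the same total parameter count as the reference model
    while the embedding matrix itself stays the same size (assumption on
    where exactly the projections are placed).
    """

    def __init__(self, vocab_size, embed_dim, model_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)              # shared embedding matrix
        self.proj_in = nn.Linear(embed_dim, model_dim, bias=False)    # projection after the input embedding
        self.proj_out = nn.Linear(model_dim, embed_dim, bias=False)   # projection before the (tied) output embedding

    def forward(self, tokens):
        # tokens: (batch, seq_len) -> (batch, seq_len, model_dim)
        return self.proj_in(self.embed(tokens))
```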
Group descriptions (a code sketch of these variants follows the list):
- Reference - reference transformer implementation from fairseq. Has an additional nonlinearity and a projection after self-attention
- only Q - transformer with simple self-attention that uses only a single matrix to compute the output: $\mathrm{softmax}(x Q x^T)\, x$
- QV - transformer with simple self-attention plus a V-matrix: $\mathrm{softmax}(x Q x^T)\, x V$
- KQV - transformer with vanilla self-attention, $\mathrm{softmax}(x Q K^T x^T)\, x V$; describes the same function space as QV because there is no nonlinearity between Q and K
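A minimal sketch of the three attention variants, assuming a single head and batch-first tensors; the class, argument names, and the $1/\sqrt{d}$ scaling are illustrative assumptions, not the fairseq implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSelfAttention(nn.Module):
    """Illustrative single-head self-attention covering the three variants.

    variant="only_q": softmax(x Q x^T) x
    variant="qv":     softmax(x Q x^T) x V
    variant="kqv":    softmax(x Q K^T x^T) x V   (vanilla, no bias)
    """

    def __init__(self, d_model, variant="kqv"):
        super().__init__()
        self.variant = variant
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False) if variant == "kqv" else None
        self.v = nn.Linear(d_model, d_model, bias=False) if variant in ("qv", "kqv") else None
        # Standard 1/sqrt(d) scaling; the formulas above omit it (assumption).
        self.scale = d_model ** -0.5

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        q = self.q(x)                                     # x Q
        keys = self.k(x) if self.k is not None else x     # x K, or plain x for the simplified variants
        scores = torch.bmm(q, keys.transpose(1, 2)) * self.scale   # x Q (x K)^T
        attn = F.softmax(scores, dim=-1)
        out = torch.bmm(attn, x)                          # softmax(...) x
        if self.v is not None:
            out = self.v(out)                             # ... V (associativity: (A x) V = A (x V))
        return out
```

For the "kqv" variant, $Q K^T$ collapses to a single $d \times d$ matrix at inference, which is why it spans the same function space as "qv".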
[Run groups: only Q (3 runs), QV (8), KQV (4), Same KQ (1)]
Speed comparison (Titan X)
We only include runs that use two Titan X GPUs so that they are comparable in terms of speed.
[Run groups: QV (7 runs), KQV (4 runs)]
Heavy hyperparameter dependence
[Run set 2: 3 runs]