Vanilla vs simplified (WMT)
Section 1
None of the models use a bias in attention, and all have an additional projection before and after the embedding to match the total model parameter count while keeping the embedding matrix parameter count the same.
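A minimal sketch of this parameter-count matching, assuming the extra projections sit around a shared embedding matrix; the module and dimension names here are illustrative, not the actual fairseq configuration:

```python
import torch.nn as nn

class ProjectedEmbedding(nn.Module):
    """Shared embedding matrix with extra input/output projections.

    The projections add parameters so that the simplified-attention models
    can be brought to the same total parameter count as the reference model
    while the embedding matrix itself stays the same size (assumption on
    where exactly the projections are placed).
    """

    def __init__(self, vocab_size, embed_dim, model_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)              # shared embedding matrix
        self.proj_in = nn.Linear(embed_dim, model_dim, bias=False)    # projection after the input embedding
        self.proj_out = nn.Linear(model_dim, embed_dim, bias=False)   # projection before the (tied) output embedding

    def forward(self, tokens):
        # tokens: (batch, seq_len) -> (batch, seq_len, model_dim)
        return self.proj_in(self.embed(tokens))
```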
Group descriptions (a code sketch of these variants follows the list):
- Reference - reference transformer implementation from fairseq. Has an additional nonlinearity and a projection after self-attention
- only Q - transformer with simple self-attention that uses only a single matrix to compute the output: $\mathrm{softmax}(x Q x^T)\, x$
- QV - transformer with simple self-attention plus a V-matrix: $\mathrm{softmax}(x Q x^T)\, x V$
- KQV - transformer with vanilla self-attention, $\mathrm{softmax}(x Q K^T x^T)\, x V$; describes the same function space as QV because there is no nonlinearity between Q and K
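A minimal sketch of the three attention variants, assuming a single head and batch-first tensors; the class, argument names, and the $1/\sqrt{d}$ scaling are illustrative assumptions, not the fairseq implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSelfAttention(nn.Module):
    """Illustrative single-head self-attention covering the three variants.

    variant="only_q": softmax(x Q x^T) x
    variant="qv":     softmax(x Q x^T) x V
    variant="kqv":    softmax(x Q K^T x^T) x V   (vanilla, no bias)
    """

    def __init__(self, d_model, variant="kqv"):
        super().__init__()
        self.variant = variant
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False) if variant == "kqv" else None
        self.v = nn.Linear(d_model, d_model, bias=False) if variant in ("qv", "kqv") else None
        # Standard 1/sqrt(d) scaling; the formulas above omit it (assumption).
        self.scale = d_model ** -0.5

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        q = self.q(x)                                     # x Q
        keys = self.k(x) if self.k is not None else x     # x K, or plain x for the simplified variants
        scores = torch.bmm(q, keys.transpose(1, 2)) * self.scale   # x Q (x K)^T
        attn = F.softmax(scores, dim=-1)
        out = torch.bmm(attn, x)                          # softmax(...) x
        if self.v is not None:
            out = self.v(out)                             # ... V (associativity: (A x) V = A (x V))
        return out
```

For the "kqv" variant, $Q K^T$ collapses to a single $d \times d$ matrix at inference, which is why it spans the same function space as "qv".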
[Run groups: only Q (3 runs), QV (8), KQV (4), Same KQ (1)]
Speed comparison (Titan X)
We only include runs that use two Titan X GPUs so that they are comparable in terms of speed.
[Run groups: QV (7 runs), KQV (4 runs)]
Heavy hyperparameter dependence
[Run set 2: 3 runs]