
HINT experimental report



Joint learning of perception, syntax, semantics


[Dataset preview panel: a table of handwritten symbols with their images and names]
All seven runs share these hyperparameters: batch_size=128, cos_sim_margin=0.2, curriculum=no, dec_layers=1, dropout=0.1, emb_dim=128, epochs=10, epochs_eval=1, grad_clip=5, hid_dim=512, input=image, iterations=100000, iterations_eval=1000, layers=1, lr_scheduler=constant, main_dataset_ratio=0, max_rel_pos=15, output_dir=outputs/, pos_emb_type=sin, result_encoding=decimal, save_model=true, seed=2, train_size=-, wandb=HINT, warmup_steps=100. The RNN runs (GRU, GRU_attn, LSTM, LSTM_attn) use early_stop=10, enc_layers=3, lr=0.001, nhead=1; the Transformer runs (TRAN.*) use early_stop=50, enc_layers=6, lr=0.0001, nhead=8.

Test result accuracy (test/result_acc) by evaluation subset and expression length:

| model | n_params | avg | I | LL | LS | SL | SS | length/1 | length/11 | length/13 |
|---|---|---|---|---|---|---|---|---|---|---|
| GRU | 24185549 | 0.3302 | 0.61142 | 0.11754 | 0.30784 | 0.091667 | 0.53077 | 0.986 | 0.4664 | 0.42846 |
| GRU_attn | 26560845 | 0.35441 | 0.65908 | 0.12304 | 0.32742 | 0.093264 | 0.5793 | 0.987 | 0.51275 | 0.46411 |
| LSTM | 28846285 | 0.46878 | 0.7811 | 0.18912 | 0.53706 | 0.10778 | 0.73325 | 0.987 | 0.58846 | 0.55289 |
| LSTM_attn | 31745869 | 0.51352 | 0.83902 | 0.20924 | 0.61336 | 0.11264 | 0.79595 | 0.989 | 0.62257 | 0.58844 |
| TRAN.opennmt | 34486989 | 0.081785 | 0.20876 | 0.02906 | 0.05726 | 0.014398 | 0.092725 | 0.991 | 0.10101 | 0.064012 |
| TRAN.relative | 36329165 | 0.51465 | 0.8617 | 0.19282 | 0.58986 | 0.10812 | 0.829 | 0.995 | 0.63654 | 0.60913 |
| TRAN.relative_universal | 19261645 | 0.53262 | 0.88319 | 0.19576 | 0.62412 | 0.10928 | 0.8592 | 0.989 | 0.64879 | 0.61826 |

Result accuracy by digit (test/result_acc/digit):

| model | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| GRU | 0.99 | 0.98 | 1 | 1 | 1 | 1 | 1 | 0.92 | 1 | 0.97 |
| GRU_attn | 0.98 | 1 | 1 | 1 | 0.99 | 1 | 1 | 0.93 | 1 | 0.97 |
| LSTM | 0.98 | 1 | 1 | 1 | 1 | 1 | 1 | 0.92 | 1 | 0.97 |
| LSTM_attn | 0.98 | 1 | 1 | 1 | 1 | 1 | 1 | 0.94 | 1 | 0.97 |
| TRAN.opennmt | 0.99 | 1 | 1 | 1 | 1 | 1 | 1 | 0.94 | 1 | 0.98 |
| TRAN.relative | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.97 | 1 | 0.98 |
| TRAN.relative_universal | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.92 | 1 | 0.97 |

For the Transformer, relative positional encoding is much better than absolute positional encoding.
Sharing parameters across layers (universal Transformer) improves accuracy further.
The best Transformer also edges out the best RNN: TRAN.relative_universal (a universal Transformer with relative positional encoding) has higher accuracy than LSTM_attn on I, SS, and LS, and similar accuracy on SL and LL.
Relative positional encoding is critical for the Transformer to generalize to longer expressions (LS); see the sketch below.
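
One way to see why relative encodings generalize in length: attention then depends only on pairwise offsets, and offsets are clipped to ±max_rel_pos (15 in the runs above), so a test expression longer than any training example still indexes into the same embedding table learned during training. Below is a minimal sketch of one common variant, a per-head additive bias over clipped offsets (the T5 form); the exact variant used in these runs, e.g. Shaw-style relative key embeddings, may differ.

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Per-head attention bias over clipped pairwise offsets.

    Offsets are clipped to [-max_rel_pos, max_rel_pos], so sequences longer
    than any training example reuse embeddings already learned -- the
    property that matters on the LS (longer expressions) subset.
    """

    def __init__(self, max_rel_pos: int = 15, nhead: int = 8):
        super().__init__()
        self.max_rel_pos = max_rel_pos
        # One learned scalar per head for each of the 2*max_rel_pos+1 offsets.
        self.bias = nn.Embedding(2 * max_rel_pos + 1, nhead)

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        rel = pos[None, :] - pos[:, None]  # rel[i, j] = j - i
        idx = rel.clamp(-self.max_rel_pos, self.max_rel_pos) + self.max_rel_pos
        # (L, L, nhead) -> (nhead, L, L); added to attention logits pre-softmax.
        return self.bias(idx).permute(2, 0, 1)

bias = RelativePositionBias(max_rel_pos=15, nhead=8)
print(bias(20).shape)  # torch.Size([8, 20, 20]) -- works for any length
```

The parameter counts are consistent with the universal variant tying the weights of all encoder layers: TRAN.relative_universal has 19261645 parameters versus 36329165 for TRAN.relative at the same depth.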



Few-shot learning and generalization



[Run sets: fewshot (36), max_op_train_xy (22), max_op_train_abcd (44)]

TRAN.relative_universal performs much better than LSTM_attn in the few-shot learning experiments; the performance gap comes mainly from the test subsets that require generalization over syntax and semantics.

For fine-tuning LSTM_attn, lr=0.001 works best.
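
A minimal, self-contained sketch of that fine-tuning setup; the stand-in model, the dummy data, and the choice of Adam are assumptions for illustration, not details from these runs:

```python
import torch
import torch.nn as nn

# Stand-in for the pretrained LSTM_attn model (hypothetical -- the real
# architecture lives in the HINT training code, not in this report).
model = nn.LSTM(input_size=128, hidden_size=512, batch_first=True)

# The point of the note above: fine-tune with lr=0.001 (a constant rate,
# matching the lr_scheduler=constant setting of the main runs).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative update step on dummy data.
x = torch.randn(8, 5, 128)      # (batch, seq, emb_dim)
out, _ = model(x)
loss = out.pow(2).mean()        # placeholder loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```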

[Panels for sweep jqd53oz9]



Parameter sweeps


For the Transformer, hid_dim is a more important parameter than emb_dim and nhead.

[Panels for sweep g86a824r]


For the Transformer, the number of encoder layers matters more than the number of decoder layers (see the sweep sketch below).

[Panel for sweep xn3zyit4]
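
Both sweep findings above were produced with W&B sweeps; below is a hedged sketch of how a comparable sweep could be defined. Only the metric name (test/result_acc/avg) and the project name (HINT) come from this report; the search-space values and the train() entry point are assumptions, since the actual configurations of sweeps g86a824r and xn3zyit4 are not shown here.

```python
import wandb

# Hypothetical search space over the parameters compared above.
sweep_config = {
    "method": "grid",
    "metric": {"name": "test/result_acc/avg", "goal": "maximize"},
    "parameters": {
        "hid_dim":    {"values": [128, 256, 512]},
        "emb_dim":    {"values": [64, 128, 256]},
        "nhead":      {"values": [4, 8]},
        "enc_layers": {"values": [2, 4, 6]},
        "dec_layers": {"values": [1, 3, 6]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="HINT")
# wandb.agent(sweep_id, function=train)  # train() reads wandb.config
```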