
HINT experimental report



Joint learning of perception, syntax, semantics


[Dataset preview panel: a table of handwritten symbols with their images and names]
All seven runs share these hyperparameters: batch_size=128, cos_sim_margin=0.2, curriculum=no, dec_layers=1, dropout=0.1, emb_dim=128, epochs=10, epochs_eval=1, grad_clip=5, hid_dim=512, input=image, iterations=100000, iterations_eval=1000, layers=1, lr_scheduler=constant, main_dataset_ratio=0, max_rel_pos=15, output_dir=outputs/, pos_emb_type=sin, result_encoding=decimal, save_model=true, seed=2, train_size=-, wandb=HINT, warmup_steps=100. The RNN runs (GRU, GRU_attn, LSTM, LSTM_attn) use early_stop=10, enc_layers=3, lr=0.001, nhead=1; the Transformer runs (TRAN.*) use early_stop=50, enc_layers=6, lr=0.0001, nhead=8.

Test result accuracy (test/result_acc) by evaluation subset and expression length:

| model | n_params | avg | I | LL | LS | SL | SS | length/1 | length/11 | length/13 |
|---|---|---|---|---|---|---|---|---|---|---|
| GRU | 24185549 | 0.3302 | 0.61142 | 0.11754 | 0.30784 | 0.091667 | 0.53077 | 0.986 | 0.4664 | 0.42846 |
| GRU_attn | 26560845 | 0.35441 | 0.65908 | 0.12304 | 0.32742 | 0.093264 | 0.5793 | 0.987 | 0.51275 | 0.46411 |
| LSTM | 28846285 | 0.46878 | 0.7811 | 0.18912 | 0.53706 | 0.10778 | 0.73325 | 0.987 | 0.58846 | 0.55289 |
| LSTM_attn | 31745869 | 0.51352 | 0.83902 | 0.20924 | 0.61336 | 0.11264 | 0.79595 | 0.989 | 0.62257 | 0.58844 |
| TRAN.opennmt | 34486989 | 0.081785 | 0.20876 | 0.02906 | 0.05726 | 0.014398 | 0.092725 | 0.991 | 0.10101 | 0.064012 |
| TRAN.relative | 36329165 | 0.51465 | 0.8617 | 0.19282 | 0.58986 | 0.10812 | 0.829 | 0.995 | 0.63654 | 0.60913 |
| TRAN.relative_universal | 19261645 | 0.53262 | 0.88319 | 0.19576 | 0.62412 | 0.10928 | 0.8592 | 0.989 | 0.64879 | 0.61826 |

Result accuracy by digit (test/result_acc/digit):

| model | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| GRU | 0.99 | 0.98 | 1 | 1 | 1 | 1 | 1 | 0.92 | 1 | 0.97 |
| GRU_attn | 0.98 | 1 | 1 | 1 | 0.99 | 1 | 1 | 0.93 | 1 | 0.97 |
| LSTM | 0.98 | 1 | 1 | 1 | 1 | 1 | 1 | 0.92 | 1 | 0.97 |
| LSTM_attn | 0.98 | 1 | 1 | 1 | 1 | 1 | 1 | 0.94 | 1 | 0.97 |
| TRAN.opennmt | 0.99 | 1 | 1 | 1 | 1 | 1 | 1 | 0.94 | 1 | 0.98 |
| TRAN.relative | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.97 | 1 | 0.98 |
| TRAN.relative_universal | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.92 | 1 | 0.97 |

For the Transformer, relative positional encoding is much better than absolute positional encoding.
Sharing parameters across layers (universal Transformer) improves accuracy further.
The best Transformer also edges out the best RNN: TRAN.relative_universal (a universal Transformer with relative positional encoding) has higher accuracy than LSTM_attn on I, SS, and LS, and similar accuracy on SL and LL.
Relative positional encoding is critical for the Transformer to generalize to longer expressions (LS); see the sketch below.
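
One way to see why relative encodings generalize in length: attention then depends only on pairwise offsets, and offsets are clipped to ±max_rel_pos (15 in the runs above), so a test expression longer than any training example still indexes into the same embedding table learned during training. Below is a minimal sketch of one common variant, a per-head additive bias over clipped offsets (the T5 form); the exact variant used in these runs, e.g. Shaw-style relative key embeddings, may differ.

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Per-head attention bias over clipped pairwise offsets.

    Offsets are clipped to [-max_rel_pos, max_rel_pos], so sequences longer
    than any training example reuse embeddings already learned -- the
    property that matters on the LS (longer expressions) subset.
    """

    def __init__(self, max_rel_pos: int = 15, nhead: int = 8):
        super().__init__()
        self.max_rel_pos = max_rel_pos
        # One learned scalar per head for each of the 2*max_rel_pos+1 offsets.
        self.bias = nn.Embedding(2 * max_rel_pos + 1, nhead)

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        rel = pos[None, :] - pos[:, None]  # rel[i, j] = j - i
        idx = rel.clamp(-self.max_rel_pos, self.max_rel_pos) + self.max_rel_pos
        # (L, L, nhead) -> (nhead, L, L); added to attention logits pre-softmax.
        return self.bias(idx).permute(2, 0, 1)

bias = RelativePositionBias(max_rel_pos=15, nhead=8)
print(bias(20).shape)  # torch.Size([8, 20, 20]) -- works for any length
```

The parameter counts are consistent with the universal variant tying the weights of all encoder layers: TRAN.relative_universal has 19261645 parameters versus 36329165 for TRAN.relative at the same depth.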



Few-shot learning and generalization



[Run sets: fewshot (36), max_op_train_xy (22), max_op_train_abcd (44)]

TRAN.relative_universal performs much better than LSTM_attn in the few-shot learning experiments; the performance gap comes mainly from the test subsets that require generalization over syntax and semantics.

For fine-tuning LSTM_attn, lr=0.001 works best.
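
A minimal, self-contained sketch of that fine-tuning setup; the stand-in model, the dummy data, and the choice of Adam are assumptions for illustration, not details from these runs:

```python
import torch
import torch.nn as nn

# Stand-in for the pretrained LSTM_attn model (hypothetical -- the real
# architecture lives in the HINT training code, not in this report).
model = nn.LSTM(input_size=128, hidden_size=512, batch_first=True)

# The point of the note above: fine-tune with lr=0.001 (a constant rate,
# matching the lr_scheduler=constant setting of the main runs).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative update step on dummy data.
x = torch.randn(8, 5, 128)      # (batch, seq, emb_dim)
out, _ = model(x)
loss = out.pow(2).mean()        # placeholder loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```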

[Panels for sweep jqd53oz9]



Parameter sweeps


For the Transformer, hid_dim is a more important parameter than emb_dim and nhead.

[Panels for sweep g86a824r]


For the Transformer, the number of encoder layers matters more than the number of decoder layers (see the sweep sketch below).

[Panel for sweep xn3zyit4]
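
Both sweep findings above were produced with W&B sweeps; below is a hedged sketch of how a comparable sweep could be defined. Only the metric name (test/result_acc/avg) and the project name (HINT) come from this report; the search-space values and the train() entry point are assumptions, since the actual configurations of sweeps g86a824r and xn3zyit4 are not shown here.

```python
import wandb

# Hypothetical search space over the parameters compared above.
sweep_config = {
    "method": "grid",
    "metric": {"name": "test/result_acc/avg", "goal": "maximize"},
    "parameters": {
        "hid_dim":    {"values": [128, 256, 512]},
        "emb_dim":    {"values": [64, 128, 256]},
        "nhead":      {"values": [4, 8]},
        "enc_layers": {"values": [2, 4, 6]},
        "dec_layers": {"values": [1, 3, 6]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="HINT")
# wandb.agent(sweep_id, function=train)  # train() reads wandb.config
```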