
Mini Diffuser: Main Results




Training Cost for an RLBench-18 multi-task policy


[Chart: training curves for the two run groups, a100 18task and 4090-18task, plotted against training step]


How much can Layer-2 mini-batching accelerate training

Increasing the Layer-2 mini-batch size by several times significantly accelerates training, at nearly no extra time cost per step and only a little more memory.
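
As a rough illustration of why the extra samples are nearly free, the PyTorch sketch below encodes each scene once and then replicates the scene tokens so that several independent (timestep, noise) pairs share that single encoding. The module names `scene_encoder` and `denoise_head`, the tensor shapes, and the DDPM-style noising are assumptions for illustration, not Mini Diffuser's actual API.

```python
import torch
import torch.nn.functional as F

def l2_training_step(scene_encoder, denoise_head, alphas_cumprod,
                     pointcloud, gt_action, l2_batch=8):
    """One training step with a Layer-2 mini-batch (hypothetical modules)."""
    B = gt_action.shape[0]

    # Expensive part: encode the 3D scene once per sample (Layer-1 batch).
    scene_tokens = scene_encoder(pointcloud)                      # (B, N, C)

    # Cheap part: replicate scene tokens L2 times so each scene supervises
    # L2 independent (timestep, noise) pairs in the same forward pass.
    scene_rep = scene_tokens.repeat_interleave(l2_batch, dim=0)   # (B*L2, N, C)
    action_rep = gt_action.repeat_interleave(l2_batch, dim=0)     # (B*L2, A)

    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B * l2_batch,), device=gt_action.device)
    noise = torch.randn_like(action_rep)

    # Standard DDPM forward noising: x_t = sqrt(acp_t) x_0 + sqrt(1 - acp_t) eps
    acp_t = alphas_cumprod[t].unsqueeze(-1)                       # (B*L2, 1)
    noisy_action = acp_t.sqrt() * action_rep + (1.0 - acp_t).sqrt() * noise

    pred_noise = denoise_head(noisy_action, t, scene_rep)
    return F.mse_loss(pred_noise, noise)
```

Under the assumption that the denoising head is much cheaper than the scene encoder, the `l2_batch`-fold increase in supervised noise samples adds little wall-clock time per step, which matches the behaviour described above.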



How do other design choices contribute to performance

1. 3D Relative RoPE

Adding 3D RoPE to the cross-attention layers helps the model learn faster, especially with larger L2 batches. But when L2 batching is disabled and training is given enough time, the final performance is the same, because tokens in PTv3 are already spatially aware through their fractal serialization orderings ('Z', 'inverse-Z', ...).
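
For concreteness, a minimal sketch of 3D rotary embedding for cross-attention is given below; the channel split across the x/y/z axes, the frequency base, and the name `rope_3d` are assumptions rather than the exact implementation.

```python
import torch

def rope_3d(feat, coords, base=100.0):
    """Rotate query/key channels by angles derived from 3D coordinates.

    Assumes `feat` is (B, N, D) with D divisible by 6 (one rotation pair
    per channel group, three spatial axes) and `coords` is (B, N, 3).
    """
    B, N, D = feat.shape
    d_axis = D // 3                                    # channels per spatial axis
    half = d_axis // 2
    freqs = base ** (-torch.arange(half, device=feat.device,
                                   dtype=feat.dtype) / half)       # (half,)
    out = []
    for axis in range(3):
        f = feat[..., axis * d_axis:(axis + 1) * d_axis]           # (B, N, d_axis)
        angles = coords[..., axis:axis + 1] * freqs                # (B, N, half)
        cos, sin = angles.cos(), angles.sin()
        f1, f2 = f[..., :half], f[..., half:]
        # Rotate each (f1, f2) channel pair by a coordinate-dependent angle.
        out.append(torch.cat([f1 * cos - f2 * sin,
                              f1 * sin + f2 * cos], dim=-1))
    return torch.cat(out, dim=-1)

# In cross-attention, e.g. q = rope_3d(q, action_xyz); k = rope_3d(k, point_xyz),
# so the attention logits depend only on relative 3D offsets between tokens.
```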


2. PTv3 Backbone

At a preliminary stage, we tried using exactly the same structure as 3d-diffuser-actor (a vanilla transformer) and achieved approximately the same performance, but with ~25% more time cost and ~50% more memory cost.
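
For reference, the kind of apples-to-apples timing and peak-memory measurement behind numbers like these could look like the sketch below; `profile_backbone`, the forward-only timing, and the batch format are assumptions, not our actual benchmarking script.

```python
import time
import torch

@torch.no_grad()
def profile_backbone(backbone, sample_batch, n_iters=50, device="cuda"):
    """Rough per-step time / peak-memory profile for an encoder backbone.

    `backbone` is any callable taking the batch (e.g. a PTv3 encoder or a
    vanilla transformer); `sample_batch` is assumed to already live on `device`.
    """
    backbone = backbone.to(device).eval()
    torch.cuda.reset_peak_memory_stats(device)

    # Warm-up to exclude one-off kernel compilation and allocator effects.
    for _ in range(5):
        backbone(sample_batch)
    torch.cuda.synchronize(device)

    start = time.perf_counter()
    for _ in range(n_iters):
        backbone(sample_batch)
    torch.cuda.synchronize(device)

    ms_per_iter = (time.perf_counter() - start) / n_iters * 1e3
    peak_mb = torch.cuda.max_memory_allocated(device) / 2**20
    return ms_per_iter, peak_mb
```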



3. Conv Kernel as Extra Query

The local-kernel query mainly helps with training stability in the early stages, and slightly improves prediction accuracy. (Without it, gradients early in training can reach huge values and may cause NaNs.)
We later found that with a carefully tuned learning-rate schedule and gradient clipping, the LocalConv trick could be deprecated, because all the "neighborhood" information it provides can eventually be learned by the attention layers.
We still keep the option 'On' in our official implementation.
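
As a sketch of the alternative stabilization mentioned above (a tuned learning-rate schedule plus gradient clipping), the recipe could look like the following; the optimizer choice, warmup length, and clip norm are placeholder assumptions rather than our tuned values.

```python
import math
import torch

def make_stabilized_step(model, base_lr=1e-4, warmup_steps=1000,
                         total_steps=100_000, max_grad_norm=1.0):
    """Linear warmup + cosine decay, plus gradient-norm clipping."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)

    def lr_lambda(step):
        if step < warmup_steps:                        # linear warmup
            return step / max(1, warmup_steps)
        progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    def step(loss):
        optimizer.zero_grad()
        loss.backward()
        # Clip gradients so early-training spikes cannot blow up to NaN.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()

    return step
```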
