Mini Diffuser: Main Results

Contents:
- Training Cost for an RLBench-18 multi-task policy
- How much can the Layer-2 mini-batch accelerate training?
- How do other design choices contribute to performance?
  1. 3D Relative RoPE
  2. PTv3 Backbone
  3. Conv Kernel as Extra Query
Training Cost for an RLBench-18 multi-task policy
[Chart panel — Run set: 7 runs]
How much can the Layer-2 mini-batch accelerate training?
Scaling the Layer-2 mini-batch size up by several times significantly accelerates training, with nearly no extra time cost per step and only a small increase in memory usage.
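As a rough illustration of where the savings come from, here is a minimal sketch of a Layer-2 mini-batch training step (module names such as `scene_encoder` and `denoiser`, and the diffusers-style `scheduler`, are placeholders rather than the actual code): the heavy scene encoding runs once per sample, and only the lightweight denoising branch sees the enlarged batch of noised actions.

```python
import torch
import torch.nn.functional as F

def training_step(scene_encoder, denoiser, scheduler, batch, l2_size=16):
    """Hypothetical Layer-2 mini-batch step (all names are placeholders)."""
    pts, actions = batch["points"], batch["actions"]   # (B, N, 6), (B, A, D)
    B = actions.shape[0]

    # Level-1: encode each scene ONCE -- this is the expensive branch.
    scene_tokens = scene_encoder(pts)                   # (B, T, C)

    # Level-2: reuse the same scene tokens for `l2_size` noised action samples.
    # Replicating tokens is cheap compared with re-running the encoder, so the
    # extra samples add little time and only a little memory.
    scene_rep = scene_tokens.repeat_interleave(l2_size, dim=0)   # (B*l2, T, C)
    actions_rep = actions.repeat_interleave(l2_size, dim=0)      # (B*l2, A, D)

    # Each Level-2 sample gets its own noise realisation and timestep.
    noise = torch.randn_like(actions_rep)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (B * l2_size,), device=actions.device)
    noisy = scheduler.add_noise(actions_rep, noise, t)

    # Only this lightweight branch runs on the enlarged batch; it cross-attends
    # to the shared scene tokens.
    pred = denoiser(noisy, t, scene_rep)
    return F.mse_loss(pred, noise)
```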
[Chart panel — Run set: 18 runs]
How do other design choices contribute to performance?
1. 3D Relative RoPE
Adding 3D RoPE to the cross-attention layers helps the model learn faster, especially with larger L2-batches. But when L2-batches are disabled and training is given enough time, the final performance is the same, because tokens in PTv3 are already spatially aware through its space-filling-curve serialization (Z-order, inverse Z-order, etc.).
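For intuition, here is a minimal sketch of the idea, assuming an axis-wise rotary embedding over xyz coordinates (the function name, channel split, and frequency base are illustrative, not the exact implementation):

```python
import torch

def rope_3d(x, coords, base=100.0):
    """Axis-wise rotary embedding for 3D positions (illustrative).

    x:      (..., T, C) query or key features, with C divisible by 6
    coords: (..., T, 3) xyz positions of the corresponding tokens
    """
    C = x.shape[-1]
    per_axis = C // 3                       # channels devoted to each axis
    idx = torch.arange(0, per_axis, 2, device=x.device, dtype=x.dtype)
    freqs = base ** (-idx / per_axis)       # per-pair rotation frequencies

    out = []
    for axis in range(3):
        xa = x[..., axis * per_axis:(axis + 1) * per_axis]
        ang = coords[..., axis:axis + 1] * freqs          # (..., T, per_axis/2)
        cos, sin = ang.cos(), ang.sin()
        x1, x2 = xa[..., 0::2], xa[..., 1::2]
        rot = torch.stack([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], dim=-1).flatten(-2)
        out.append(rot)
    return torch.cat(out, dim=-1)

# In cross-attention, rotating both queries (action tokens) and keys (scene
# tokens) with their own coordinates makes the dot product depend only on the
# relative 3D offset between the two tokens:
#   q = rope_3d(q, gripper_xyz);  k = rope_3d(k, point_xyz)
```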
[Chart panels — Run set: 8 runs; Run set: 2 runs]
2. PTv3 Backbone
In a preliminary stage, we tried using exactly the same structure as 3D Diffuser Actor (a vanilla transformer) and achieved approximately the same performance, but with ~25% more time cost and ~50% more memory cost.
[Chart panel — Run set: 6 runs]
3. Conv Kernel as Extra Query
The local-kernel query mainly helps with training stability in the early stages, and slightly improves prediction accuracy. (Without it, gradients early in training can reach huge values and may cause NaNs.)
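To make the idea concrete, here is an illustrative stand-in (layer names, kernel size, and exact placement are assumptions, not the official code): a depthwise convolution over the serialized scene tokens gathers each token's neighborhood, and the result is added as an extra term to that token's attention query.

```python
import torch
import torch.nn as nn

class LocalConvQuery(nn.Module):
    """Hypothetical sketch of a 'conv kernel as extra query'.

    PTv3 already serializes points along a space-filling curve, so a small
    depthwise 1D convolution over the serialized token sequence aggregates
    each token's spatial neighborhood directly.
    """
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dw_conv = nn.Conv1d(dim, dim, kernel_size,
                                 padding=kernel_size // 2, groups=dim)

    def forward(self, tokens):              # tokens: (B, T, C) in serialized order
        local = self.dw_conv(tokens.transpose(1, 2)).transpose(1, 2)
        return local                         # extra query term, same shape as tokens

# Usage inside an attention block (illustrative):
#   q = w_q(tokens) + local_conv_query(tokens)  # neighborhood info from step 0
```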
We later found that with a carefully tuned learning-rate schedule and gradient clipping, the LocalConv trick could probably be deprecated, because all the "neighborhood" information it provides can eventually be learned by the attention layers.
We still keep the option on in our official implementation.
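For completeness, a generic sketch of the safeguards mentioned above (toy model and standard PyTorch utilities; not the actual training script): a short linear warmup plus global gradient-norm clipping keeps early gradients finite even with the LocalConv query turned off.

```python
import torch
import torch.nn as nn

# Toy stand-in model; in practice this is the full policy network.
model = nn.Linear(64, 64)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Short linear warmup so the very first steps use a tiny learning rate.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=2000)

for step in range(10_000):
    x = torch.randn(32, 64)
    loss = (model(x) - x).pow(2).mean()      # placeholder for the diffusion loss
    optimizer.zero_grad()
    loss.backward()
    # Clip the global grad norm so an early spike cannot blow up to NaN.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    warmup.step()
```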
[Chart panel — Run set: 4 runs]