
Mini Diffuser: Main Results




Training Cost for an RLBench-18 multi-task policy


[Chart: training curves for the two run groups, a100 18task and 4090-18task, plotted against training step]


How much can Layer-2 mini-batching accelerate training

Increasing the Layer-2 mini-batch size by several times significantly accelerates training, at nearly no extra time cost per step and only a little more memory.
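
As a rough illustration of why the extra samples are nearly free, the PyTorch sketch below encodes each scene once and then replicates the scene tokens so that several independent (timestep, noise) pairs share that single encoding. The module names `scene_encoder` and `denoise_head`, the tensor shapes, and the DDPM-style noising are assumptions for illustration, not Mini Diffuser's actual API.

```python
import torch
import torch.nn.functional as F

def l2_training_step(scene_encoder, denoise_head, alphas_cumprod,
                     pointcloud, gt_action, l2_batch=8):
    """One training step with a Layer-2 mini-batch (hypothetical modules)."""
    B = gt_action.shape[0]

    # Expensive part: encode the 3D scene once per sample (Layer-1 batch).
    scene_tokens = scene_encoder(pointcloud)                      # (B, N, C)

    # Cheap part: replicate scene tokens L2 times so each scene supervises
    # L2 independent (timestep, noise) pairs in the same forward pass.
    scene_rep = scene_tokens.repeat_interleave(l2_batch, dim=0)   # (B*L2, N, C)
    action_rep = gt_action.repeat_interleave(l2_batch, dim=0)     # (B*L2, A)

    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B * l2_batch,), device=gt_action.device)
    noise = torch.randn_like(action_rep)

    # Standard DDPM forward noising: x_t = sqrt(acp_t) x_0 + sqrt(1 - acp_t) eps
    acp_t = alphas_cumprod[t].unsqueeze(-1)                       # (B*L2, 1)
    noisy_action = acp_t.sqrt() * action_rep + (1.0 - acp_t).sqrt() * noise

    pred_noise = denoise_head(noisy_action, t, scene_rep)
    return F.mse_loss(pred_noise, noise)
```

Under the assumption that the denoising head is much cheaper than the scene encoder, the `l2_batch`-fold increase in supervised noise samples adds little wall-clock time per step, which matches the behaviour described above.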



How do other design choices contribute to performance

1. 3D Relative RoPE

Adding 3D RoPE to the cross-attention layers helps the model learn faster, especially with larger L2 batches. But when L2 batching is disabled and training is given enough time, the final performance is the same, because tokens in PTv3 are already spatially aware through their fractal serialization orderings ('Z', 'inverse-Z', ...).
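
For concreteness, a minimal sketch of 3D rotary embedding for cross-attention is given below; the channel split across the x/y/z axes, the frequency base, and the name `rope_3d` are assumptions rather than the exact implementation.

```python
import torch

def rope_3d(feat, coords, base=100.0):
    """Rotate query/key channels by angles derived from 3D coordinates.

    Assumes `feat` is (B, N, D) with D divisible by 6 (one rotation pair
    per channel group, three spatial axes) and `coords` is (B, N, 3).
    """
    B, N, D = feat.shape
    d_axis = D // 3                                    # channels per spatial axis
    half = d_axis // 2
    freqs = base ** (-torch.arange(half, device=feat.device,
                                   dtype=feat.dtype) / half)       # (half,)
    out = []
    for axis in range(3):
        f = feat[..., axis * d_axis:(axis + 1) * d_axis]           # (B, N, d_axis)
        angles = coords[..., axis:axis + 1] * freqs                # (B, N, half)
        cos, sin = angles.cos(), angles.sin()
        f1, f2 = f[..., :half], f[..., half:]
        # Rotate each (f1, f2) channel pair by a coordinate-dependent angle.
        out.append(torch.cat([f1 * cos - f2 * sin,
                              f1 * sin + f2 * cos], dim=-1))
    return torch.cat(out, dim=-1)

# In cross-attention, e.g. q = rope_3d(q, action_xyz); k = rope_3d(k, point_xyz),
# so the attention logits depend only on relative 3D offsets between tokens.
```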


2. PTv3 Backbone

At a preliminary stage, we tried using exactly the same structure as 3d-diffuser-actor (a vanilla transformer) and achieved approximately the same performance, but with ~25% more time cost and ~50% more memory cost.
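
For reference, the kind of apples-to-apples timing and peak-memory measurement behind numbers like these could look like the sketch below; `profile_backbone`, the forward-only timing, and the batch format are assumptions, not our actual benchmarking script.

```python
import time
import torch

@torch.no_grad()
def profile_backbone(backbone, sample_batch, n_iters=50, device="cuda"):
    """Rough per-step time / peak-memory profile for an encoder backbone.

    `backbone` is any callable taking the batch (e.g. a PTv3 encoder or a
    vanilla transformer); `sample_batch` is assumed to already live on `device`.
    """
    backbone = backbone.to(device).eval()
    torch.cuda.reset_peak_memory_stats(device)

    # Warm-up to exclude one-off kernel compilation and allocator effects.
    for _ in range(5):
        backbone(sample_batch)
    torch.cuda.synchronize(device)

    start = time.perf_counter()
    for _ in range(n_iters):
        backbone(sample_batch)
    torch.cuda.synchronize(device)

    ms_per_iter = (time.perf_counter() - start) / n_iters * 1e3
    peak_mb = torch.cuda.max_memory_allocated(device) / 2**20
    return ms_per_iter, peak_mb
```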



3. Conv Kernel as Extra Query

The local-kernel query mainly helps with training stability in the early stages, and slightly improves prediction accuracy. (Without it, gradients early in training can reach huge values and may cause NaNs.)
We later found that with a carefully tuned learning-rate schedule and gradient clipping, the LocalConv trick could be deprecated, because all the "neighborhood" information it provides can eventually be learned by the attention layers.
We still keep the option 'On' in our official implementation.
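
As a sketch of the alternative stabilization mentioned above (a tuned learning-rate schedule plus gradient clipping), the recipe could look like the following; the optimizer choice, warmup length, and clip norm are placeholder assumptions rather than our tuned values.

```python
import math
import torch

def make_stabilized_step(model, base_lr=1e-4, warmup_steps=1000,
                         total_steps=100_000, max_grad_norm=1.0):
    """Linear warmup + cosine decay, plus gradient-norm clipping."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)

    def lr_lambda(step):
        if step < warmup_steps:                        # linear warmup
            return step / max(1, warmup_steps)
        progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    def step(loss):
        optimizer.zero_grad()
        loss.backward()
        # Clip gradients so early-training spikes cannot blow up to NaN.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()

    return step
```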
