
CPU/GPU and reproducibility

Created on November 21 | Last edited on November 27
I triggered six jobs with exactly the same mouse data and the same random seed for the Trainer, but with different compute resources (GPU, CPU, and RAM).
(-m data.subject_ids=[779531] model.penalties.beta=0.01 model.training.lr=0.005 model.training.n_warmup_steps=1000 model.training.n_steps=3000)


[Figure, 4 panels: run time (minutes); GPU/CPU utilization (%) vs. time (minutes); loss vs. training step; loss vs. time (minutes)]
Run set: 6 runs


Three observations:
  1. GPU is slower than CPU (left panel), yet significantly more expensive ($ in the legend = estimated cost of the pipeline run in CO). In the right panel, the GPU was working hard while CPU utilization was very low (<5%). Why?
  2. A fixed random seed (confirmed by identical split keys) does not yield the same loss trajectory (middle panel). Two of the runs did not even converge.
  3. The "1gpu1cpu16G" run hit a fatal error (NaN in params) during warmup, which had also happened occasionally before.
2025-11-20 18:14:14 Step 271 of 1000. Training Loss: 1.50e-01. Validation Loss: 1.94e-01
2025-11-20 18:14:22 Traceback (most recent call last):
  File "/tmp/nxf.mgkksNKEiA/capsule/code/run_capsule.py", line 57, in <module>
    main()
  File "/tmp/nxf.mgkksNKEiA/capsule/code/run_capsule.py", line 49, in main
    output = model_trainer.fit(dataset_bundle, loggers=loggers)
  File "/tmp/nxf.mgkksNKEiA/capsule/code/model_trainers/disrnn_trainer.py", line 139, in fit
    params, warmup_opt_state, warmup_losses = rnn_utils.train_network(
  File "/opt/conda/lib/python3.12/site-packages/disentangled_rnns/library/rnn_utils.py", line 672, in train_network
    raise ValueError('NaN in params')
ValueError: NaN in params
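For context, the crash comes from a finiteness check on the parameter pytree inside train_network. A minimal NumPy sketch of the same idea (the dict of arrays stands in for the real JAX pytree, and check_finite is a hypothetical name, not the library's API):

```python
import numpy as np

def check_finite(params):
    # Raise, as rnn_utils.train_network does, if any parameter array
    # contains a NaN. `params` is a plain dict of NumPy arrays here;
    # the real code walks a JAX pytree, but the check is the same idea.
    for name, p in params.items():
        if np.isnan(p).any():
            raise ValueError('NaN in params')

check_finite({'w': np.ones((2, 2))})  # passes silently
try:
    check_finite({'w': np.array([1.0, np.nan])})
except ValueError as err:
    print(err)  # the same 'NaN in params' seen in the traceback above
```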

11/26/2025 update

Reproducibility

After fixing the numpy random seed (see the GitHub issue), runs are now reproducible. Same data and model, different numbers of CPUs; 2 CPUs seems to be the sweet spot. (Note: the reported time includes data loading.)
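The fix amounts to seeding NumPy's global RNG (used e.g. for shuffling) in addition to the Trainer's own seed; fixing only the JAX key leaves NumPy-driven randomness free to vary between runs. A minimal sketch, where the seed value and the shuffling step are illustrative, not the actual Trainer code:

```python
import numpy as np

def shuffled_batch_order(n_batches, seed=42):
    # Seed NumPy's global RNG before any np.random-based shuffling;
    # without this call, the batch order differs between otherwise
    # identical runs even when the JAX PRNG key is fixed.
    np.random.seed(seed)
    return np.random.permutation(n_batches)

# With the seed fixed, two "runs" now produce identical batch orders.
assert (shuffled_batch_order(10) == shuffled_batch_order(10)).all()
```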

Run set: 3 runs


Run time vs Model size

Tried different network architectures: latent_size = 4 or 5; (update_net_unit_per_layer, update_net_n_layers) = (16, 8) or (4, 1).
  • Impact on training speed: update_net_unit_per_layer > update_net_n_layers >> latent_size
  • With latent_size = 4, update_net_unit_per_layer = 4, update_net_n_layers = 1 ("4_4_1_4_1"), the run time is around 10 minutes, comparable to Lukasz's tests (with GPU).
  • The likelihood, the bottlenecks, and the rules found by these models are similar (at least with beta = 0.01 and lr = 0.005).
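A back-of-the-envelope parameter count is consistent with the speed ordering above: the update-net width enters quadratically, the depth only linearly, and latent_size mostly just multiplies the number of update nets. A rough sketch, assuming each latent has one MLP update net with input dimension latent_size + 2 observations (the exact disRNN input layout may differ):

```python
def mlp_params(in_dim, hidden, n_layers, out_dim=1):
    # Weights + biases of an MLP with n_layers hidden layers of
    # `hidden` units each, ending in a single output unit.
    dims = [in_dim] + [hidden] * n_layers + [out_dim]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

# latent_size=5 with (16, 8) update nets vs latent_size=4 with (4, 1):
big = 5 * mlp_params(5 + 2, 16, 8)   # 5 * 2049 = 10245 parameters
small = 4 * mlp_params(4 + 2, 4, 1)  # 4 * 33   = 132 parameters
```

The two extremes differ by roughly 80x in update-net parameters, while changing latent_size alone from 4 to 5 only changes the count by about 25%.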

Run set: 4 runs


The NaN-crash issue

  • It is interesting that the loss trajectory is reproducible but the NaN-crash is not (run 1 crashed much earlier than run 2).
  • Clipping with max_grad_norm = 1 indeed solves the issue (run 3)! (This run was done in a capsule; using the same max_grad_norm = 1 in the CO pipeline did not prevent the NaN-crash... see below.)
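For reference, clipping by global norm rescales the entire gradient pytree whenever its joint L2 norm exceeds the threshold; in an optax-based trainer this is typically optax.clip_by_global_norm(1.0) chained before the optimizer. A NumPy sketch of the operation itself (not the library implementation):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Scale every gradient array by the same factor so that the joint
    # (global) L2 norm across all arrays is at most max_norm; gradient
    # directions are preserved, only the magnitude is capped.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / max(global_norm, 1e-12))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=1.0)
# The joint norm of `clipped` is now 1, so a single exploding step
# can no longer push the params to NaN on its own.
```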

Run set: 3 runs

  • max_grad_norm is effective at preventing gradient runaway, but its effect depends on whether the run is executed in a CO capsule or a CO pipeline (still very confusing to me; I don't want to dig deeper at this point).

Run set: 6 runs