TPU Sebulba 200k FPS experiment
TPUv4. Close enough to the 200k target: 48928 * 4 = 195712 FPS. While we can reach this level of FPS, sample efficiency could suffer significantly.

Process 0 (TPU chips 0,1):

TPU_CHIPS_PER_PROCESS_BOUNDS=1,2,1 \
TPU_PROCESS_BOUNDS=2,1,1 \
TPU_PROCESS_ADDRESSES=localhost:8479,localhost:8480 \
TPU_VISIBLE_DEVICES=0,1 \
TPU_PROCESS_PORT=8479 \
SLURM_JOB_ID=26017 SLURM_STEP_NODELIST=localhost SLURM_NTASKS=2 SLURM_PROCID=0 SLURM_LOCALID=0 SLURM_STEP_NUM_NODES=2 \
CLOUD_TPU_TASK_ID=0 python cleanba/cleanba_impala_envpool_machado_atari_wrapper_threads.py --distributed --learner-device-ids 1 --num-steps 60 --local-num-envs 96 --num-actor-threads 3 --track --seed 1

Process 1 (TPU chips 2,3):

TPU_CHIPS_PER_PROCESS_BOUNDS=1,2,1 \
TPU_PROCESS_BOUNDS=2,1,1 \
TPU_PROCESS_ADDRESSES=localhost:8479,localhost:8480 \
TPU_VISIBLE_DEVICES=2,3 \
TPU_PROCESS_PORT=8480 \
SLURM_JOB_ID=26017 SLURM_STEP_NODELIST=localhost SLURM_NTASKS=2 SLURM_PROCID=1 SLURM_LOCALID=0 SLURM_STEP_NUM_NODES=2 \
CLOUD_TPU_TASK_ID=1 python cleanba/cleanba_impala_envpool_machado_atari_wrapper_threads.py --distributed --learner-device-ids 1 --num-steps 60 --local-num-envs 96 --num-actor-threads 3 --seed 1
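For convenience, the same two-process launch can be driven from a small Python script. The sketch below is illustrative only (not part of cleanba); it just reproduces the environment variables and flags from the commands above and starts both processes with subprocess.

```python
# Illustrative launcher for the two-process TPU run above (not part of cleanba).
import os
import subprocess

COMMON = dict(
    TPU_CHIPS_PER_PROCESS_BOUNDS="1,2,1",
    TPU_PROCESS_BOUNDS="2,1,1",
    TPU_PROCESS_ADDRESSES="localhost:8479,localhost:8480",
    SLURM_JOB_ID="26017", SLURM_STEP_NODELIST="localhost",
    SLURM_NTASKS="2", SLURM_LOCALID="0", SLURM_STEP_NUM_NODES="2",
)
CMD = [
    "python", "cleanba/cleanba_impala_envpool_machado_atari_wrapper_threads.py",
    "--distributed", "--learner-device-ids", "1", "--num-steps", "60",
    "--local-num-envs", "96", "--num-actor-threads", "3", "--seed", "1",
]

procs = []
for rank, (devices, port) in enumerate([("0,1", "8479"), ("2,3", "8480")]):
    env = {**os.environ, **COMMON,
           "TPU_VISIBLE_DEVICES": devices,
           "TPU_PROCESS_PORT": port,
           "SLURM_PROCID": str(rank),
           "CLOUD_TPU_TASK_ID": str(rank)}
    # Only rank 0 logs to the tracker, matching the commands above.
    procs.append(subprocess.Popen(CMD + (["--track"] if rank == 0 else []), env=env))

for p in procs:
    p.wait()
```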
The label a0_l1_d2_t3_n96 means: the actor uses device 0, the learner uses device 1, the run is distributed across 2 processes, each process has 3 actor threads, and each thread runs 96 environments. In this case there are 96 * 2 * 3 = 576 environments in total.
In PPO, n192_b96 means num_envs=192, with EnvPool's batch_size=96.
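To make the naming convention concrete, here is a tiny hypothetical helper (not part of the cleanba codebase) that decodes such a label and recovers the total environment count:

```python
# Hypothetical helper for decoding run labels like "a0_l1_d2_t3_n96".
KEYS = {"a": "actor_device", "l": "learner_devices", "d": "num_processes",
        "t": "actor_threads", "n": "envs_per_thread", "b": "envpool_batch_size"}

def decode_label(label: str) -> dict:
    fields = {KEYS[part[0]]: part[1:] for part in label.split("_")}
    # Total environments = envs per thread * processes * actor threads per process.
    fields["total_envs"] = (int(fields["envs_per_thread"])
                            * int(fields["num_processes"])
                            * int(fields.get("actor_threads", 1)))
    return fields

print(decode_label("a0_l1_d2_t3_n96"))  # total_envs == 96 * 2 * 3 == 576
```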
Runs compared (one run per configuration):
- 4 TPUv4 chips / 8 cores (a0_l1_d2_t3_n96, IMPALA)
- 8 A100 (a0_l1+2+3_d2_t3_n96, IMPALA)
- IMPALA baseline (a0_l1_d4_t1_n30)
- 8 A100 (a0_l1+2+3_d2_t1_n192_b96, PPO)
8 A100 (tuned): has lower SPS, but is well-tuned.
Stability.ai's HPC:

Accelerators are not well utilized, despite the high SPS.


PPO:
CUDA_VISIBLE_DEVICES="0,1,2,3" \SLURM_JOB_ID=26017 \SLURM_STEP_NODELIST=localhost \SLURM_NTASKS=2 \SLURM_PROCID=0 \SLURM_LOCALID=0 \SLURM_STEP_NUM_NODES=2 python cleanba/cleanba_ppo_envpool_impala_atari_wrapper.py --distributed --local-num-envs 192 --async-batch-size 96 --num-steps 64 --learner-device-ids 1 2 3 --num-minibatches 1 --update-epochs 1 --trackCUDA_VISIBLE_DEVICES="4,5,6,7" \SLURM_JOB_ID=26017 \SLURM_STEP_NODELIST=localhost \SLURM_NTASKS=2 \SLURM_PROCID=1 \SLURM_LOCALID=0 \SLURM_STEP_NUM_NODES=2 python cleanba/cleanba_ppo_envpool_impala_atari_wrapper.py --distributed --local-num-envs 192 --async-batch-size 96 --num-steps 64 --learner-device-ids 1 2 3 --num-minibatches 1 --update-epochs 1 --track
