Group: test_cumulative_time

Author

apiche

State

Crashed

Start time

July 29th, 2025 11:25:26 PM

Runtime

6d 16h 1m 47s

Run path

apiche/pipeline-rl/test_cumulative_time_finetune

OS

Linux-5.15.0-1067-nvidia-x86_64-with-glibc2.39

Python version

CPython 3.11.11

Git repository

git clone https://github.com/ServiceNow/PipelineRL-SWE

Git state

git checkout -b "test_cumulative_time/finetune" 6ff6542654d2417495552005f0b5d26a23675792

Command

pipelinerl/entrypoints/run_finetune.py --config-dir results/test_cumulative_time/conf --config-name exp_config output_dir=results/test_cumulative_time hydra.run.dir=results/test_cumulative_time/finetune +me.weight_update_group_init_method=tcp://localhost:9000 +me.weight_update_group_world_size=3 +me.llm_urls=http://localhost:8080+http://localhost:8081
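
The command above passes several Hydra-style `key=value` overrides. As a rough sketch of how they break down, the snippet below splits the command into overrides with plain string handling. It is not Hydra's own parser (the real override grammar is richer), the helper name `split_overrides` is hypothetical, and reading the `+` inside `me.llm_urls` as a URL separator is an assumption based only on the shape of the value.

```python
import shlex

# The command exactly as recorded for this run.
COMMAND = (
    "pipelinerl/entrypoints/run_finetune.py "
    "--config-dir results/test_cumulative_time/conf --config-name exp_config "
    "output_dir=results/test_cumulative_time "
    "hydra.run.dir=results/test_cumulative_time/finetune "
    "+me.weight_update_group_init_method=tcp://localhost:9000 "
    "+me.weight_update_group_world_size=3 "
    "+me.llm_urls=http://localhost:8080+http://localhost:8081"
)


def split_overrides(command: str) -> dict[str, str]:
    """Collect key=value tokens; a leading '+' (Hydra's add-new-key prefix) is dropped."""
    overrides: dict[str, str] = {}
    for token in shlex.split(command):
        if token.startswith("--") or "=" not in token:
            continue  # script path, flags, and flag values
        key, _, value = token.partition("=")
        overrides[key.lstrip("+")] = value
    return overrides


overrides = split_overrides(COMMAND)

# Assumption: '+' joins multiple inference-server URLs in me.llm_urls.
llm_urls = overrides["me.llm_urls"].split("+")

for key, value in overrides.items():
    print(f"{key} = {value}")
print("LLM URLs:", llm_urls)
```

All values stay strings here; a real config loader would convert `3` to an integer.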

System Hardware

CPU count	112
Logical CPU count	224
GPU count	4
GPU type	NVIDIA H100 80GB HBM3

W&B CLI Version

0.19.11

Group

test_cumulative_time

Config parameters are your model's inputs.

Config parameters: 199 keys
- _cpu:
  "False"
- _mixed_precision:
  "no"
- accelerate_config:
  null
- actor.discount_factor:
  1
- actor.llm_max_rollouts:
  64
- actor.log_each_n_secs:
  10
- actor.problem_queue_size:
  64
- actor.result_queue_size:
  64
- actor.rollout_policy:
  "pipelinerl.swe.rollouts.generate_unified_swe_rollout"
- actor.rollout_workers:
  1
- actor.shared_memory_entry_size:
  50,000,000
- actor.system_prompt:
  "Please reason step by step, and put your final answer within \boxed{}."
- actor.task_template:
  "{task}"
- actor.throughput_window_size:
  50
- agent.max_prompt_length:
  15,000
- attempts:
  1
- backend:
  "nccl"
- dataset_loader:
  "pipelinerl.swe.load_datasets.load_local_swe_dataset"
- dataset_loader_params.dataset_path:
  "/mnt/llmd/data/swegym/ds"
- dataset_loader_params.test_dataset_path:
  "/mnt/llmd/data/swebench_lite/ds"
- debug:
  "False"
- debug.mode:
  ""
- debug.place_inference_workers:
  true
- debug.streams_from:
  null
- debug.use_existing_llms:
  false
- deepspeed_config:
  "deepspeed_stage3_bf16"
- deepspeed_plugins:
  "DeepSpeedPlugin(hf_ds_config=<accelerate.utils.deepspeed.HfDeepSpeedConfig object at 0x7ffc76f6f090>, gradient_accumulation_steps=1, gradient_clipping='auto', zero_stage=3, is_train_batch_min=True, offload_optimizer_device='none', offload_param_device='none', offload_optimizer_nvme_path='none', offload_param_nvme_path='none', zero3_init_flag=True, zero3_save_16bit_model=True, transformer_moe_cls_names=None, enable_msamp=False, msamp_opt_level='O1')"
- device:
  "cuda:0"
- distributed_type:
  "DistributedType.DEEPSPEED"
- dynamo_plugin:
  "TorchDynamoPlugin(backend=<DynamoBackend.NO: 'NO'>, mode='default', fullgraph=False, dynamic=False, options=None, disable=False)"
- environment:
  null
- environment._target_:
  "pipelinerl.domains.math.MathEnvironment"
- eval_every_n_versions:
  1,000
- finetune.also_save_steps:
  []
- finetune.attempts:
  1
- finetune.attn_implementation:
  "flash_attention_2"
- finetune.auto_device_map:
  false
- finetune.config_name:
  "Qwen/Qwen2.5-1.5B-Instruct"
- finetune.cuda_empty_cache:
  true
- finetune.data:
  null
- finetune.eval_callback._target_:
  "tapeagents.finetune.eval.dummy_eval_callback"
- finetune.eval_callback.config_name:
  ""
- finetune.force_restart:
  true
- finetune.gradient_accumulation_passes:
  512
- finetune.gradient_checkpointing:
  true
- finetune.gradient_clipping_threshold:
  0.3
- world.environment_start_port:
  7,777
- world.finetune_fraction:
  4
- world.preprocessor_fraction:
  0
- world.replicas:
  1
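
The listing above stores the config as flat dotted keys. Some names, such as `debug` and `environment`, appear both as a value and as a prefix of other keys. A minimal sketch of rebuilding a nested dictionary from such keys follows. The function name `unflatten` and the `_value` slot used to resolve those collisions are hypothetical choices for illustration, not part of W&B or PipelineRL, and only a handful of entries are copied in.

```python
from typing import Any

# A few entries copied from the listing above (values as displayed).
FLAT = {
    "actor.llm_max_rollouts": 64,
    "actor.rollout_workers": 1,
    "debug": "False",
    "debug.mode": "",
    "debug.use_existing_llms": False,
    "finetune.eval_callback._target_": "tapeagents.finetune.eval.dummy_eval_callback",
    "finetune.gradient_clipping_threshold": 0.3,
    "world.replicas": 1,
}

LEAF = "_value"  # hypothetical slot for a key that is both a leaf and a prefix


def unflatten(flat: dict[str, Any]) -> dict[str, Any]:
    """Turn {'a.b': 1} into {'a': {'b': 1}}, keeping colliding scalars under LEAF."""
    nested: dict[str, Any] = {}
    for dotted, value in flat.items():
        *parents, last = dotted.split(".")
        node = nested
        for part in parents:
            child = node.get(part)
            if not isinstance(child, dict):
                # Either the key is missing, or a scalar was stored under it earlier.
                child = {} if part not in node else {LEAF: child}
                node[part] = child
            node = child
        if isinstance(node.get(last), dict):
            node[last][LEAF] = value  # prefix was seen first, scalar arrives later
        else:
            node[last] = value
    return nested


nested = unflatten(FLAT)
print(nested["debug"])
print(nested["finetune"]["eval_callback"]["_target_"])
```

Keeping the colliding scalar under a separate slot avoids silently dropping either `debug: "False"` or the `debug.*` entries.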

Summary metrics are your model's outputs.

No summary metrics saved for this run.
