
Why is validation spiky for Nikla's MoE with the new tokenizer?

Created on August 1 | Last edited on August 1

Section 1


[Figure: five panels plotted against Step (5k to 30k)]
Run: olmoe-8x1b-newhp-newds-cx5-fine1-newtok