How to use wandb for LLM pre-training (for partners)
LLM Pre-training Frameworks:
Megatron-LM
Shoeybi+ (September 2019) and Narayanan+ (April 2021)
→ Introduction of Tensor Parallelism + Efficiency improvements in Pipeline Parallelism
DeepSpeed
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Rajbhandari+ (October 2019) → Progressively partitions optimizer state, gradients, and parameters (weights) across data-parallel workers, eliminating memory redundancy.
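To make the stages concrete, here is a minimal, illustrative sketch of enabling ZeRO through a DeepSpeed config dict, not a recommended configuration: the model, batch size, and learning rate are placeholders, and the script is assumed to be started with the deepspeed launcher. Stage 1 shards optimizer state, stage 2 additionally shards gradients, and stage 3 also shards the parameters.
```python
# Minimal sketch: enabling ZeRO via a DeepSpeed config dict (values are illustrative).
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # placeholder model

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        # stage 1: shard optimizer state; stage 2: + gradients; stage 3: + parameters
        "stage": 2,
    },
}

# deepspeed.initialize wraps the model and optimizer with ZeRO partitioning.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```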
Megatron-DeepSpeed
By combining ZeRO optimization, model parallelism, and pipeline parallelism, it distributes model state in a way that eliminates memory redundancy. This approach was used for the pre-training of BLOOM 176B. (September 2020)
Pathways
Pathways: Asynchronous Distributed Dataflow for ML, Barham+ (Google, March 2022)
NeMo-Megatron
Megatron-LM integrated into NVIDIA's cloud-based generative AI framework NeMo. Optimized for Hopper-generation GPUs via Transformer Engine. (November 2021)
GPT-NeoX
GPT-NeoX-20B: An Open-Source Autoregressive Language Model, Black+ (April 2022)
- A library that makes pre-training on multiple nodes more accessible, incorporating custom extensions to Megatron-LM and DeepSpeed (the latter as the DeeperSpeed fork).
- Supports new technologies such as Rotary Positional Embedding, ALiBi, and Flash Attention.
- Supported in Hugging Face Transformers as of v4.22 (2022).
- Many companies in Japan have conducted pre-training using GPT-NeoX.
Many integrations with wandb!
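The exact wandb settings live in each framework's own config (GPT-NeoX and NeMo both expose them), so the sketch below only shows a generic pattern: pointing whichever integration you use at your W&B project through environment variables before launching training. The project, entity, and group names are placeholders.
```python
# Minimal sketch: configure W&B through environment variables so that a framework's
# built-in integration (GPT-NeoX, NeMo, etc.) logs to the right project.
# The project, entity, and group names are placeholders.
import os

os.environ["WANDB_PROJECT"] = "llm-pretraining"     # target W&B project
os.environ["WANDB_ENTITY"] = "my-team"              # W&B team/entity
os.environ["WANDB_RUN_GROUP"] = "pretrain-7b-exp1"  # group runs from the same job
# os.environ["WANDB_MODE"] = "offline"              # optional: log offline, sync later

# ...then launch the framework's training entry point as usual.
```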


NeMo

Value of wandb
Detect "exploding gradients" with real time monitoring
Users can easily detect "exploding gradients" and restart from checkpoints.
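As a minimal sketch of what this looks like in a hand-rolled training loop (the model, data, threshold, and checkpoint path below are all placeholders, not part of any framework), one can log the global gradient norm every step, fire a W&B alert when it spikes, and restart from the last saved checkpoint:
```python
# Minimal sketch: monitor the global gradient norm and alert on spikes.
# The model, data, threshold, and checkpoint path are placeholders.
import torch
import wandb

run = wandb.init(project="llm-pretraining", name="grad-monitoring-demo")

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
GRAD_NORM_THRESHOLD = 100.0  # illustrative threshold

for step in range(10_000):
    x = torch.randn(8, 1024)
    loss = model(x).pow(2).mean()  # dummy loss
    optimizer.zero_grad()
    loss.backward()

    # clip_grad_norm_ returns the total gradient norm before clipping.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    wandb.log({"train/loss": loss.item(), "train/grad_norm": grad_norm.item()}, step=step)

    if grad_norm.item() > GRAD_NORM_THRESHOLD:
        # Fire a W&B alert so the team knows to restart from the last checkpoint.
        run.alert(title="Exploding gradient",
                  text=f"grad_norm={grad_norm.item():.1f} at step {step}")

    optimizer.step()

    if step % 1_000 == 0:
        torch.save({"step": step, "model": model.state_dict()}, f"ckpt_{step}.pt")

run.finish()
```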


Visualization of real-time evaluation
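For example, evaluation metrics and even sample generations can be streamed into the same run while training continues. The sketch below is hypothetical: the validation loss is a placeholder value, and in practice it would be computed on a held-out set.
```python
# Minimal sketch: log evaluation metrics and sample generations during pre-training.
# The validation loss and the prompt/completion strings are placeholders.
import math
import wandb

run = wandb.init(project="llm-pretraining", name="eval-demo")

for step in range(0, 10_000, 1_000):  # pretend evaluation runs every 1,000 steps
    val_loss = 2.5 - step / 10_000    # placeholder; compute on a held-out set in practice
    wandb.log(
        {"eval/loss": val_loss, "eval/perplexity": math.exp(val_loss)},
        step=step,
    )

    # A small table of sample generations makes qualitative progress visible in the UI.
    table = wandb.Table(columns=["step", "prompt", "completion"])
    table.add_data(step, "Once upon a time", "...model output here...")
    wandb.log({"eval/samples": table}, step=step)

run.finish()
```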

Log multi-node training with grouped runs
An example of a project using grouped runs
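A minimal sketch of grouping, assuming one W&B run per process: every rank calls wandb.init with the same group name, so the UI can aggregate the runs into a single multi-node job. The RANK/WORLD_SIZE environment variables follow the usual torchrun convention and are an assumption here.
```python
# Minimal sketch: one W&B run per process, grouped into a single multi-node job.
# RANK/WORLD_SIZE are assumed to be set by the launcher (e.g. torchrun).
import os
import wandb

rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

run = wandb.init(
    project="llm-pretraining",
    group="pretrain-7b-exp1",   # same group name on every node/rank
    name=f"rank-{rank}",        # one run per process
    job_type="train",
)

wandb.log({"rank": rank, "world_size": world_size})
run.finish()
```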

...There are also cases where only the main node (rank 0) is tracked during pre-training.
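A common pattern for that case (again a sketch, with the rank assumed to come from the launcher) is to initialize W&B only on rank 0 and disable it everywhere else:
```python
# Minimal sketch: initialize W&B only on the main process (rank 0).
# RANK is assumed to be set by the launcher.
import os
import wandb

rank = int(os.environ.get("RANK", "0"))

run = wandb.init(
    project="llm-pretraining",
    name="pretrain-7b",
    mode="online" if rank == 0 else "disabled",  # non-main ranks become no-ops
)

if rank == 0:
    wandb.log({"train/loss": 0.0})  # placeholder metric; log real training metrics here

run.finish()
```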