
How to use wandb for pre-training of LLMs for partners


LLM Pre-training Frameworks:

Megatron-LM

Shoeybi+ (September 2019) and Narayanan+ (April 2021)
→ Introduced tensor parallelism (Shoeybi+) and improved the efficiency of pipeline parallelism (Narayanan+)

DeepSpeed

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Rajbhandari+ (October 2019) → Stage-wise partitioning of optimizer states, gradients, and parameters across data-parallel workers.

Megatron-DeepSpeed

By combining ZeRO optimization, tensor (model) parallelism, and pipeline parallelism, it distributes training state across devices and eliminates memory redundancy. This approach was used for the pre-training of BLOOM (176B). (September 2020)

Pathways

Google: Pathways: Asynchronous Distributed Dataflow for ML, Barham+ (March 2022)

NeMo-Megatron

Megatron-LM integrated into NVIDIA's cloud-based generative AI framework NeMo. Optimized for Hopper-generation GPUs via Transformer Engine. (November 2021)

GPT-NeoX

GPT-NeoX-20B: An Open-Source Autoregressive Language Model, Black+ (April 2022)
  • A library that makes multi-node pre-training more accessible, built on custom extensions to Megatron-LM and DeepSpeed ⇒ DeeperSpeed.
  • Supports newer techniques such as Rotary Positional Embedding, ALiBi, and Flash Attention.
  • Supported in Hugging Face Transformers since v4.22 (June 2022).
  • Many companies in Japan have conducted pre-training with GPT-NeoX.


Many integrations with wandb!



NeMo



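Regardless of the framework, the underlying integration boils down to a handful of wandb calls. Below is a minimal, framework-agnostic sketch (not NeMo's or Megatron's actual integration code; the project name, config keys, and values are illustrative):

```python
import wandb

# Minimal, framework-agnostic sketch of the logging a pre-training
# integration performs. All names and values here are illustrative.
run = wandb.init(
    project="llm-pretraining",                       # hypothetical project name
    config={"model_size": "7B", "global_batch_size": 1536, "lr": 3e-4},
)

for step in range(10):                               # stands in for the real training loop
    loss = 10.0 / (step + 1)                         # placeholder loss value
    wandb.log({"train/loss": loss, "train/lr": run.config["lr"]}, step=step)

run.finish()
```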
Value of wandb

Detect "exploding gradients" with real time monitoring

Users can quickly spot exploding gradients during training and restart from a recent checkpoint.
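A minimal sketch of how this can be done in a PyTorch training loop: log the global gradient norm each step and raise a wandb alert when it crosses a threshold. The threshold, metric name, and model are placeholders, not part of any framework's built-in integration:

```python
import torch
import wandb

GRAD_NORM_THRESHOLD = 100.0                      # hypothetical threshold
run = wandb.init(project="llm-pretraining")
model = torch.nn.Linear(16, 16)                  # placeholder for the real model

for step in range(1000):
    # ... forward / backward pass would run here ...
    # clip_grad_norm_ returns the (pre-clipping) global gradient norm
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1e9)
    wandb.log({"train/grad_norm": grad_norm.item()}, step=step)

    if grad_norm.item() > GRAD_NORM_THRESHOLD:
        # Sends a notification (Slack / e-mail, depending on workspace settings)
        wandb.alert(
            title="Exploding gradient",
            text=f"grad_norm={grad_norm.item():.1f} at step {step}",
        )
        break                                    # resume from the last good checkpoint
```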





Visualization of real-time evaluation
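A sketch of logging periodic evaluation results during pre-training so they can be inspected live in the wandb UI, including a wandb.Table of generated samples; the eval interval, metric names, and prompts are assumed for illustration:

```python
import wandb

run = wandb.init(project="llm-pretraining")
EVAL_INTERVAL = 500                                   # hypothetical interval

for step in range(0, 2000, EVAL_INTERVAL):
    eval_loss = 2.5 - step * 1e-4                     # placeholder metric
    samples = wandb.Table(columns=["prompt", "completion"])
    samples.add_data("The capital of France is", " Paris, which ...")  # dummy generation
    wandb.log({"eval/loss": eval_loss, "eval/samples": samples}, step=step)

run.finish()
```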



Log multi-node training with grouped runs

An example of a project using run groups
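The typical pattern is one wandb run per node, tied together with `group` so the project page shows them as a single training job. A sketch; the environment variable and naming scheme are assumptions about the launcher (e.g. torchrun / SLURM):

```python
import os
import wandb

node_rank = int(os.environ.get("NODE_RANK", "0"))     # set by the launcher (assumption)

wandb.init(
    project="llm-pretraining",
    group="pretrain-7b-run-01",                       # hypothetical: one group per training job
    name=f"node-{node_rank}",                         # one run per node
    job_type="train",
)

wandb.log({"node/gpu_mem_gb": 0.0})                   # placeholder per-node metric
wandb.finish()
```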


In some pre-training setups, only the main node (rank 0) is tracked.
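In that case the wandb calls are simply guarded on the global rank; a sketch assuming a torchrun-style RANK environment variable:

```python
import os
import wandb

rank = int(os.environ.get("RANK", "0"))          # global rank (launcher-dependent assumption)

if rank == 0:
    wandb.init(project="llm-pretraining", name="pretrain-main-node")

for step in range(100):
    loss = 1.0 / (step + 1)                      # placeholder loss
    if rank == 0:                                # only the main node logs
        wandb.log({"train/loss": loss}, step=step)

if rank == 0:
    wandb.finish()
```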