How to use wandb for LLM pre-training (for partners)
LLM Pre-training Frameworks:
Megatron-LM
Shoeybi+ (September 2019) and Narayanan+ (April 2021)
→ Introduction of Tensor Parallelism + Efficiency improvements in Pipeline Parallelism
DeepSpeed
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Rajbhandari+ (October 2019) → Progressively partitions optimizer state, gradients, and parameters (weights) across data-parallel workers, eliminating memory redundancy.
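To make the stages concrete, here is a minimal, illustrative sketch of enabling ZeRO through a DeepSpeed config dict, not a recommended configuration: the model, batch size, and learning rate are placeholders, and the script is assumed to be started with the deepspeed launcher. Stage 1 shards optimizer state, stage 2 additionally shards gradients, and stage 3 also shards the parameters.
```python
# Minimal sketch: enabling ZeRO via a DeepSpeed config dict (values are illustrative).
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # placeholder model

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        # stage 1: shard optimizer state; stage 2: + gradients; stage 3: + parameters
        "stage": 2,
    },
}

# deepspeed.initialize wraps the model and optimizer with ZeRO partitioning.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```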
Megatron-DeepSpeed
By combining ZeRO optimization, model parallelism, and pipeline parallelism, it distributes model state in a way that eliminates memory redundancy. This approach was used for the pre-training of BLOOM 176B. (September 2020)
Pathways
Pathways: Asynchronous Distributed Dataflow for ML, Barham+ (Google, March 2022)
NeMo-Megatron
Megatron-LM integrated into NVIDIA's cloud-based generative AI framework NeMo. Optimized for Hopper-generation GPUs via Transformer Engine. (November 2021)
GPT-NeoX
GPT-NeoX-20B: An Open-Source Autoregressive Language Model, Black+ (April 2022)
- A library that makes pre-training on multiple nodes more accessible, incorporating custom extensions to Megatron-LM and DeepSpeed (the latter as the DeeperSpeed fork).
- Supports new technologies such as Rotary Positional Embedding, ALiBi, and Flash Attention.
- Supported in Hugging Face Transformers as of v4.22 (2022).
- Many companies in Japan have conducted pre-training using GPT-NeoX.
Many integrations with wandb!
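The exact wandb settings live in each framework's own config (GPT-NeoX and NeMo both expose them), so the sketch below only shows a generic pattern: pointing whichever integration you use at your W&B project through environment variables before launching training. The project, entity, and group names are placeholders.
```python
# Minimal sketch: configure W&B through environment variables so that a framework's
# built-in integration (GPT-NeoX, NeMo, etc.) logs to the right project.
# The project, entity, and group names are placeholders.
import os

os.environ["WANDB_PROJECT"] = "llm-pretraining"     # target W&B project
os.environ["WANDB_ENTITY"] = "my-team"              # W&B team/entity
os.environ["WANDB_RUN_GROUP"] = "pretrain-7b-exp1"  # group runs from the same job
# os.environ["WANDB_MODE"] = "offline"              # optional: log offline, sync later

# ...then launch the framework's training entry point as usual.
```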


NeMo

Value of wandb
Detect "exploding gradients" with real time monitoring
Users can easily detect "exploding gradients" and restart from checkpoints.
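As a minimal sketch of what this looks like in a hand-rolled training loop (the model, data, threshold, and checkpoint path below are all placeholders, not part of any framework), one can log the global gradient norm every step, fire a W&B alert when it spikes, and restart from the last saved checkpoint:
```python
# Minimal sketch: monitor the global gradient norm and alert on spikes.
# The model, data, threshold, and checkpoint path are placeholders.
import torch
import wandb

run = wandb.init(project="llm-pretraining", name="grad-monitoring-demo")

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
GRAD_NORM_THRESHOLD = 100.0  # illustrative threshold

for step in range(10_000):
    x = torch.randn(8, 1024)
    loss = model(x).pow(2).mean()  # dummy loss
    optimizer.zero_grad()
    loss.backward()

    # clip_grad_norm_ returns the total gradient norm before clipping.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    wandb.log({"train/loss": loss.item(), "train/grad_norm": grad_norm.item()}, step=step)

    if grad_norm.item() > GRAD_NORM_THRESHOLD:
        # Fire a W&B alert so the team knows to restart from the last checkpoint.
        run.alert(title="Exploding gradient",
                  text=f"grad_norm={grad_norm.item():.1f} at step {step}")

    optimizer.step()

    if step % 1_000 == 0:
        torch.save({"step": step, "model": model.state_dict()}, f"ckpt_{step}.pt")

run.finish()
```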


Visualization of real-time evaluation
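For example, evaluation metrics and even sample generations can be streamed into the same run while training continues. The sketch below is hypothetical: the validation loss is a placeholder value, and in practice it would be computed on a held-out set.
```python
# Minimal sketch: log evaluation metrics and sample generations during pre-training.
# The validation loss and the prompt/completion strings are placeholders.
import math
import wandb

run = wandb.init(project="llm-pretraining", name="eval-demo")

for step in range(0, 10_000, 1_000):  # pretend evaluation runs every 1,000 steps
    val_loss = 2.5 - step / 10_000    # placeholder; compute on a held-out set in practice
    wandb.log(
        {"eval/loss": val_loss, "eval/perplexity": math.exp(val_loss)},
        step=step,
    )

    # A small table of sample generations makes qualitative progress visible in the UI.
    table = wandb.Table(columns=["step", "prompt", "completion"])
    table.add_data(step, "Once upon a time", "...model output here...")
    wandb.log({"eval/samples": table}, step=step)

run.finish()
```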

Log multi-node training with grouped runs
An example of a project using grouped runs
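A minimal sketch of grouping, assuming one W&B run per process: every rank calls wandb.init with the same group name, so the UI can aggregate the runs into a single multi-node job. The RANK/WORLD_SIZE environment variables follow the usual torchrun convention and are an assumption here.
```python
# Minimal sketch: one W&B run per process, grouped into a single multi-node job.
# RANK/WORLD_SIZE are assumed to be set by the launcher (e.g. torchrun).
import os
import wandb

rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

run = wandb.init(
    project="llm-pretraining",
    group="pretrain-7b-exp1",   # same group name on every node/rank
    name=f"rank-{rank}",        # one run per process
    job_type="train",
)

wandb.log({"rank": rank, "world_size": world_size})
run.finish()
```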

...There are also cases where only the main node (rank 0) is tracked during pre-training.
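A common pattern for that case (again a sketch, with the rank assumed to come from the launcher) is to initialize W&B only on rank 0 and disable it everywhere else:
```python
# Minimal sketch: initialize W&B only on the main process (rank 0).
# RANK is assumed to be set by the launcher.
import os
import wandb

rank = int(os.environ.get("RANK", "0"))

run = wandb.init(
    project="llm-pretraining",
    name="pretrain-7b",
    mode="online" if rank == 0 else "disabled",  # non-main ranks become no-ops
)

if rank == 0:
    wandb.log({"train/loss": 0.0})  # placeholder metric; log real training metrics here

run.finish()
```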