Reports
Created by
Created On
Last edited
0
2022-02-04
0
2021-10-15
0
2021-05-20
0
2021-05-20
GMLP vs. Baseline
GMLP (blue) has slightly more params (200M vs. ~150M) than the baseline (pink) and a larger batch size, so loss is plotted by tokens rather than steps to compensate. GMLP: 9x GMLP blocks + 3x local attention blocks. Baseline: 12x dense attention blocks.
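A minimal sketch of why plotting by tokens compensates for the batch size difference. The batch sizes and sequence length below are hypothetical placeholders, not the settings of the actual runs.

```python
def tokens_seen(step: int, batch_size: int, seq_len: int) -> int:
    """Tokens consumed after `step` optimizer steps at a fixed batch size and seq length."""
    return step * batch_size * seq_len


# Hypothetical settings for illustration only (not taken from the runs).
baseline = {"batch_size": 32, "seq_len": 2048}
gmlp = {"batch_size": 64, "seq_len": 2048}

# With double the batch size, step 5 of the GMLP run has consumed as many
# tokens as step 10 of the baseline, so the two points share an x-position
# on a loss-by-tokens plot.
print(tokens_seen(10, **baseline))
print(tokens_seen(5, **gmlp))
```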
0
2021-05-19
Staged Seqlen training 2
pink = baseline; blue = seqlen 512 for 90%, 2048 for 10%; green = seqlen 128 for 50%, 2048 for 50%
0
2021-05-19
Staged Seq Length Training Test
Shortformer (https://arxiv.org/abs/2012.15832) style. 2 models trained on an equal number of tokens: one at seqlen = 512 for 90% of training and seqlen = 2048 for the remaining 10%, and one at seqlen = 2048 for all of training. N.B.: for some reason there's a bug in how wandb displays the runs; go full screen and then minimize the loss by tokens/wallclock time panels and they should display correctly. Also, loss by tokens isn't 100% accurate since it scales by the initial seq length, but both runs end at an equal number of tokens, so the final losses are comparable.
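A rough sketch of the logging inaccuracy mentioned above: if every step is weighted by the initial seq length, the logged token count drifts from the true one once the seq length changes. The step counts and batch size are hypothetical, and the sketch assumes a fixed number of sequences per step, which may not match the actual runs.

```python
def true_tokens(stages, batch_size):
    """stages: list of (num_steps, seq_len). Counts each step at its own seq length."""
    return sum(steps * batch_size * seq_len for steps, seq_len in stages)


def logged_tokens(stages, batch_size):
    """Token count when every step is weighted by the initial seq length."""
    initial_seq_len = stages[0][1]
    total_steps = sum(steps for steps, _ in stages)
    return total_steps * batch_size * initial_seq_len


# Hypothetical staged run: 900 steps at seqlen 512, then 100 steps at seqlen 2048.
staged = [(900, 512), (100, 2048)]
print(true_tokens(staged, batch_size=8))
print(logged_tokens(staged, batch_size=8))
```

For a run that never changes seq length the two counts agree, which is why only the staged run's x-axis is affected.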
0
2021-05-18
Staged Seq Length Training
Shortformer (https://arxiv.org/abs/2012.15832) style. 2 models trained on an equal number of tokens: one at seqlen = 512 for 90% of training and seqlen = 2048 for the remaining 10%, and one at seqlen = 2048 for the whole training run. (The loss by tokens logging is somewhat inaccurate because it weights everything according to the initial seq length, but the final number of tokens should be equal and hence the final losses are comparable.)
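One way the staged schedule could be expressed, as a sketch only: pick the seq length from the share of the token budget already consumed. The function name and the stage format are assumptions for illustration, not the training script's actual interface.

```python
def seq_len_at(tokens_done, total_tokens, stages):
    """Return the seq length for the current point in training.

    stages: list of (percent_of_token_budget, seq_len), percents summing to 100.
    Integer percents keep the stage boundaries exact.
    """
    boundary_pct = 0
    for percent, seq_len in stages:
        boundary_pct += percent
        if tokens_done * 100 < boundary_pct * total_tokens:
            return seq_len
    return stages[-1][1]  # at or past the end of the budget


# The staged run described above: 512 for 90% of training, then 2048 for 10%.
STAGED = [(90, 512), (10, 2048)]
# The comparison run: 2048 throughout.
FULL = [(100, 2048)]

total = 1_000_000  # hypothetical token budget
for done in (0, 500_000, 900_000, 999_999):
    print(done, seq_len_at(done, total, STAGED), seq_len_at(done, total, FULL))
```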
0
2021-05-18
Adam v. Madgrad
Small (150M) param model
0
2021-05-17
0
2021-05-16
bf16 vs fp16
on A100 x4 box
0
2021-05-14
0
2021-05-02
Global rotary v. all local rotary
green = global; blue = local (sliding window); pink = local (stepped)
0
2021-05-02
0
2021-05-01
0
2021-05-01
0
2021-04-27
0
2021-04-20
Partial Rotary Tests v2
Results for rotary embeddings applied to only part of q/k (dim per head = 64). Pink: Learned Abs Baseline; Brown: Rotary applied to 25% (16/64); Green: Rotary applied to 50% (32/64); Blue: Rotary applied to 100% (64/64); Other Pink: Rotary applied to 25% (16/64) every other layer
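A minimal pure-Python sketch of what "rotary applied to part of q/k" could look like for a single head vector. The interleaved pairing of dimensions, the base of 10000, and computing frequencies over the rotated slice only are assumptions; the implementations compared in these reports may differ in all three.

```python
import math


def apply_partial_rotary(vec, position, rot_dim, base=10000.0):
    """Rotate the first `rot_dim` entries of one head vector; pass the rest through."""
    if rot_dim % 2 != 0 or not 0 <= rot_dim <= len(vec):
        raise ValueError("rot_dim must be even and no larger than the head dim")
    out = list(vec)
    for i in range(rot_dim // 2):
        # Assumed convention: adjacent pairs (2i, 2i+1) share one frequency.
        theta = position * base ** (-2 * i / rot_dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * cos_t - y * sin_t
        out[2 * i + 1] = x * sin_t + y * cos_t
    return out


head_dim = 64
q = [0.1 * (j % 7) for j in range(head_dim)]  # arbitrary example vector
for rot_dim in (16, 32, 64):  # 25%, 50%, 100% of the head dim
    rotated = apply_partial_rotary(q, position=5, rot_dim=rot_dim)
    untouched = sum(1 for a, b in zip(q[rot_dim:], rotated[rot_dim:]) if a == b)
    print(rot_dim, "rotated dims,", untouched, "pass-through dims")
```

The dims past `rot_dim` carry no positional signal of their own in this sketch, which is the trade-off the partial settings explore.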
0
2021-04-19
Partial Rotary Tests
Results for rotary embeddings applied to only part of q/k (dim per head = 64). Pink: Learned Abs Baseline; Brown: Rotary applied to 25% (16/64); Green: Rotary applied to 50% (32/64); Blue: Rotary applied to 100% (64/64)
0
2021-04-19
Rope Implementation Comparison
Comparison between Ben's simplified RoPE implementation and the original one
0
2021-04-18
Rotary Test 3
150M param model on OWT2 with learned embeddings (blue) vs. rotary embeddings (green) vs. rpe (brown) vs. rpe with caching (peach)
0
2021-04-15
Rotary Test 2
150M param model on OWT2 with learned embeddings (blue) vs. rotary embeddings (green) vs. rpe (brown)
0
2021-04-15
Rotary Test
150M param model on OWT2 with learned embeddings (blue) vs. rotary embeddings (green)
0
2021-04-13
0
2021-04-10
0
2021-04-07
0
2021-04-05
Coreweave 4 GPU vs. Amazon
(model parallel only)
1
2021-03-31
0
2021-03-31
AMAZON V. COREWEAVE V. NVIDIA
Comparison of hardware differences between Amazon P4DN instances, an NVIDIA-provided DGX box, and Coreweave machines.
0
2021-03-30
0
2021-03-30
TopK Attention
Experiments using a top-k operator within the self-attention block
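The report does not say where the top-k operator sits, so the following is only one plausible reading: for a single query, keep the k largest attention scores and softmax over those, giving every other position zero weight. The function name is hypothetical.

```python
import math


def topk_attention_weights(scores, k):
    """Softmax over the k largest scores; all other positions get weight 0."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of scores")
    # Indices of the k largest scores (ties broken by position).
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    max_score = max(scores[i] for i in keep)  # subtract max for numerical stability
    exps = {i: math.exp(scores[i] - max_score) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]


# One query attending over four keys, keeping only the two strongest.
print(topk_attention_weights([2.0, 0.5, 1.0, -1.0], k=2))
```

Setting k to the number of keys recovers ordinary softmax attention weights.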
0
2021-03-25
Scaling (across multiple nodes)
General scheme is mp = 2 (across NVLink bridges), pp = n_gpus/2 (across all other connections), mb_size = 16, g.a.s = 64. This may not be the ideal setup but should give us a general idea of the architecture's scalability. The NVLink-pairs topology appears to scale sub-linearly, and the logs suggest that the pipeline parallel connections become the bottleneck as we scale.
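The arithmetic behind that scheme, as a small sketch. It assumes mb_size is the micro-batch size per data-parallel replica and g.a.s is the number of gradient accumulation steps; the function and key names are made up for illustration.

```python
def parallel_layout(n_gpus, mp=2, mb_size=16, gas=64):
    """Derive the layout for mp = 2, pp = n_gpus / 2 and the samples per optimizer step."""
    if n_gpus % mp != 0:
        raise ValueError("n_gpus must be divisible by mp")
    pp = n_gpus // mp            # pipeline stages span all non-NVLink connections
    dp = n_gpus // (mp * pp)     # mp * pp uses every GPU, so this is always 1
    return {
        "mp": mp,
        "pp": pp,
        "dp": dp,
        "samples_per_step": mb_size * gas * dp,
    }


for n in (4, 8, 16, 32):
    print(n, parallel_layout(n))
```

Under these assumptions the samples per step stay constant as GPUs are added and only the pipeline gets deeper, which fits the observation that the pipeline parallel connections become the bottleneck.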
0
2021-03-22
0
2021-03-07
0
2021-03-07
New training script, pp=2, mp=1, regular adam
16 GPUs (wandb says 8 but rank goes from 0 to 15), nhidden=1024, num_layers=24, sparsity on. Unusually low loss.
0
2021-02-27
Snapshot Feb 24 2021, 2:18pm
GPT2_XL_pipe, regular adam. mp=2, pp=2, size reduced to match 1-bit adam model.
0
2021-02-24
Snapshot Feb 24 2021, 2:9am
GPT2-XL, regular adam, mp=2, pp=2, ZeRO-1
0
2021-02-23
Snapshot Feb 23 2021, 8:50pm
One-bit adam, no changes from the main branch
0
2021-02-23
Snapshot Feb 23 2021, 2:21am
3D parallelism with no ZeRO, ran for 500 steps.
0
2021-02-22
0
2021-02-22