eleutherai

GMLP (blue) has slightly more params (200M vs. ~150M) than the baseline (pink) and a larger batch size, so loss is plotted by tokens rather than steps to compensate. GMLP: 9x GMLP blocks + 3x local attention blocks. Baseline: 12x dense attention blocks.
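As a rough illustration of why a token axis makes runs with different batch sizes comparable, here is a minimal sketch. The function name `to_token_axis` and both batch sizes are hypothetical, since the report does not state the actual batch sizes.

```python
def to_token_axis(steps, batch_size, seq_len):
    """Convert optimizer step counts to cumulative tokens seen,
    assuming a fixed batch size (in sequences) and sequence length."""
    return [step * batch_size * seq_len for step in steps]


# Hypothetical settings: neither batch size is given in the report.
baseline_tokens = to_token_axis([0, 1000, 2000], batch_size=32, seq_len=2048)
gmlp_tokens = to_token_axis([0, 500, 1000], batch_size=64, seq_len=2048)
```

With these made-up numbers, the run with the larger batch reaches the same token count in half as many steps, so plotting against steps would shift one curve relative to the other.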

sdtblck

2021-05-19

4 years ago

Staged Seqlen training 2

pink = baseline; blue = 512 for 90%, 2048 for 10%; green = 128 for 50%, 2048 for 50%

sdtblck

2021-05-19

4 years ago

Staged Seq Length Training Test

Shortformer (https://arxiv.org/abs/2012.15832) style. 2 models trained on an equal # of tokens: one at seqlen = 512 for 90% of training and seqlen = 2048 for the remaining 10%, and one at seqlen = 2048 for all of training. N.b. for some reason there's a bug in how wandb displays the runs: full-screen and then minimize the loss by tokens/wallclock time panels and they should display correctly. Also, loss by tokens isn't 100% accurate since it scales by the initial seq length, but the final losses should be at an equal # of tokens.

sdtblck

2021-05-18

4 years ago

Staged Seq Length Training

Shortformer (https://arxiv.org/abs/2012.15832) style. 2 models trained on an equal number of tokens: one at seqlen = 512 for 90% of training and seqlen = 2048 for the remaining 10%, and one at seqlen = 2048 for the whole training run. The loss-by-tokens logging is somewhat inaccurate because it weights everything according to the initial seq length, but the final number of tokens should be equal and hence the final losses are comparable.
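The staged schedule described above could be sketched roughly as follows. This is not the NeoX implementation: the function names, the integer-percent stage format, and the assumption that the number of sequences per step stays constant across stages are all hypothetical.

```python
def seq_len_at(tokens_seen, total_tokens, stages):
    """Return the sequence length in effect after `tokens_seen` tokens.

    `stages` is a list of (percent_of_total_tokens, seq_len) pairs whose
    percents sum to 100, e.g. [(90, 512), (10, 2048)].
    """
    cumulative_percent = 0
    for percent, seq_len in stages:
        cumulative_percent += percent
        boundary = total_tokens * cumulative_percent // 100
        if tokens_seen < boundary:
            return seq_len
    return stages[-1][1]  # past the end: stay at the final stage


def tokens_after_steps(steps, sequences_per_step, total_tokens, stages):
    """Count tokens step by step, using the sequence length active at each step.

    Assumes a constant number of sequences per step (an assumption, not
    something the report states).
    """
    tokens = 0
    for _ in range(steps):
        tokens += sequences_per_step * seq_len_at(tokens, total_tokens, stages)
    return tokens


stages = [(90, 512), (10, 2048)]
staged = tokens_after_steps(37, 8, 163840, stages)  # per-stage token count
naive = 37 * 8 * 512                                # scaled by initial seqlen only
```

Under these assumptions the naive count (steps times the initial sequence length) undercounts tokens once the second stage starts, which is the kind of inaccuracy the description mentions for the loss-by-tokens axis.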

sdtblck

2021-05-18

4 years ago

Adam v. Madgrad

Small (150M) param model

sdtblck

2021-05-17

4 years ago

NeoX 150M model training curves

sdtblck

2021-05-16

4 years ago

bf16 vs fp16

on A100 x4 box

sdtblck

2021-05-14

4 years ago

Project Dashboard

bmk

2021-05-02

4 years ago

Global rotary v. all local rotary

green = global; blue = local (sliding window); pink = local (stepped)

sdtblck

2021-05-02

4 years ago

All local v All global (both rotary)

dataset = enron emails

sdtblck

2021-05-01

4 years ago

Sparsity Configs

sdtblck

2021-05-01

4 years ago

Verify Pipe Parallel = 0 and Pipe Parallel >= 1 Models are the same

sdtblck

2021-04-28

4 years ago

Shared panel 21/04/27 23:04:63

shivanshupurohit

2021-04-27

4 years ago

Checkpoint loading parameter / gradient / optimizer state norms

sdtblck

2021-04-24

4 years ago

Checkpoint loading parameter / gradient / optimizer state norms

sdtblck

2021-04-24

4 years ago

Shared panel 21/04/20 20:04:36

shivanshupurohit

2021-04-20

5 years ago

Partial Rotary Tests v2

Results for rotary embeddings applied to only part of q/k. dim per head = 64. Pink - Learned Abs Baseline; Brown - Rotary applied to 25% (16/64); Green - Rotary applied to 50% (32/64); Blue - Rotary applied to 100% (64/64); Other Pink - Rotary applied to 25% (16/64) every other layer.
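To make "rotary applied to only part of q/k" concrete, here is a minimal pure-Python sketch for a single head vector. It is an illustration under stated assumptions, not the code used in these runs: the function name `apply_partial_rotary` is hypothetical, the base of 10000 follows the RoFormer paper, and pairing dimension i with dimension i + rot_dim/2 is one common convention (the original paper pairs adjacent dimensions instead).

```python
import math


def apply_partial_rotary(vec, position, rot_dim, base=10000.0):
    """Rotate only the first `rot_dim` dimensions of `vec` by position-dependent
    angles; the remaining dimensions pass through unchanged.

    Assumed convention: dimension i is paired with dimension i + rot_dim // 2.
    """
    if rot_dim % 2 != 0 or not 0 <= rot_dim <= len(vec):
        raise ValueError("rot_dim must be even and at most len(vec)")
    half = rot_dim // 2
    out = list(vec)
    for i in range(half):
        theta = position * base ** (-2.0 * i / rot_dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * cos_t - x2 * sin_t
        out[i + half] = x1 * sin_t + x2 * cos_t
    return out


# dim per head = 64, rotary on 25% of it (16/64), as in the "Brown" run.
q = [float(i + 1) for i in range(64)]
q_rot = apply_partial_rotary(q, position=5, rot_dim=16)
```

Setting `rot_dim=64` would correspond to the 100% case, and `rot_dim=32` to the 50% case. Because each pair is rotated by an angle proportional to position, the dot product of a rotated q and k depends only on their relative offset within the rotated dimensions.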

sdtblck

2021-04-19

5 years ago

Partial Rotary Tests

sdtblck

2021-04-19

5 years ago

Rope Implementation Comparison

Comparison between Ben's simplified RoPE implementation and the original one

stellaathena

2021-04-18

5 years ago

Rotary Test 3

150M param model on OWT2 with learned embeddings (blue) vs. rotary embeddings (green) vs. rpe (brown) vs. rpe with caching (peach)

sdtblck

2021-04-15

5 years ago

Rotary Test 2

150M param model on OWT2 with learned embeddings (blue) vs. rotary embeddings (green) vs. rpe (brown)

sdtblck

2021-04-15

5 years ago

Rotary Test

150M param model on OWT2 with learned embeddings (blue) vs. rotary embeddings (green)

sdtblck

2021-04-13

5 years ago

Rotary Pos Emb tests 1

sdtblck

2021-04-10

5 years ago

Sparse attn benchmarks

(on A100s)

sdtblck

2021-04-07

5 years ago

Snapshot Apr 4 2021, 10:3pm

stellaathena

2021-04-05

5 years ago

Coreweave 4 GPU vs. Amazon

(model parallel only)

sdtblck

2021-03-31

5 years ago

Amazon v. Coreweave 4GPU v. Coreweave Normal

sdtblck

2021-03-31

5 years ago

AMAZON V. COREWEAVE V. NVIDIA

Comparison of hardware differences between Amazon P4DN instances, an NVIDIA-provided DGX box, and Coreweave machines.

sdtblck

2021-03-30

5 years ago

AMAZON V COREWEAVE V NVIDIA

sdtblck

2021-03-30

5 years ago

TopK Attention

Experiments using topk operator within self attention block
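The report does not say how the topk operator was applied. One plausible reading, sketched here purely as an assumption, is to keep only the k largest attention scores for each query and run the softmax over those, giving every other key zero weight. The function name `topk_attention_weights` is hypothetical.

```python
import math


def topk_attention_weights(scores, k):
    """Softmax over the k largest scores of one query row; all other
    positions get weight 0. One possible form of top-k attention."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    max_score = max(scores[i] for i in keep)  # subtract max for numerical stability
    exps = {i: math.exp(scores[i] - max_score) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]


weights = topk_attention_weights([1.0, 3.0, 2.0, -1.0], k=2)
```

Here only the keys with scores 3.0 and 2.0 receive non-zero weight, and the weights still sum to 1.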

sdtblck

2021-03-25

5 years ago

Scaling (across multiple nodes)

General scheme is mp = 2 (across nvlink bridges), pp = n_gpus/2 (across all other connections), mb_size = 16, g.a.s = 64. This may not be the ideal setup but should give us a general idea of the architecture's scalability. As we can see, the nvlink pairs topology scales sub-linearly. From the logs, the bottleneck appears to become the pipeline parallel connections as we scale.
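The scheme above can be written out as a small sketch. It assumes that mb_size is the micro-batch size per GPU, that g.a.s means gradient accumulation steps, and that sequences per optimizer step follow the usual micro-batch x accumulation steps x data-parallel size relation. The function name `parallel_layout` is hypothetical.

```python
def parallel_layout(n_gpus, mp=2, micro_batch=16, grad_accum_steps=64):
    """Derive pipeline/data parallel sizes and sequences per optimizer step
    for the scheme mp = 2, pp = n_gpus / 2."""
    if n_gpus % mp != 0:
        raise ValueError("n_gpus must be divisible by mp")
    pp = n_gpus // mp             # pipeline stages span all remaining GPUs
    dp = n_gpus // (mp * pp)      # data-parallel replicas left over (1 here)
    return {
        "mp": mp,
        "pp": pp,
        "dp": dp,
        "sequences_per_step": micro_batch * grad_accum_steps * dp,
    }


layout = parallel_layout(16)
```

Under these assumptions, adding GPUs only deepens the pipeline: the data-parallel size stays at 1 and the number of sequences per step stays fixed, which fits the observation that the pipeline parallel connections become the bottleneck as the run scales.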

sdtblck

2021-03-22

5 years ago

dp / gas grid search

sdtblck

2021-03-07

5 years ago

Shared panel 21/03/07 22:03:56

shivanshupurohit

2021-03-07

5 years ago

New training script, pp=2, mp=1, regular adam