
620 Int8 Training

Integrate the int8 training feature from the AQT library into levanter/haliax.
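
To make the idea concrete, here is a minimal pure-JAX sketch of int8 matmul quantization: quantize both operands to int8 with per-tensor scales, multiply with int32 accumulation, then rescale. This is illustrative only; the actual integration uses AQT's quantized dot_general, which also handles calibration, stochastic rounding, and the backward pass.

```python
import jax
import jax.numpy as jnp

def int8_quantize(x: jax.Array):
    """Symmetric per-tensor int8 quantization; returns int8 values and a scale."""
    scale = jnp.max(jnp.abs(x)) / 127.0
    q = jnp.clip(jnp.round(x / scale), -127, 127).astype(jnp.int8)
    return q, scale

def int8_matmul(a: jax.Array, b: jax.Array) -> jax.Array:
    """Quantize both operands, multiply in int8 with int32 accumulation, rescale."""
    qa, sa = int8_quantize(a)
    qb, sb = int8_quantize(b)
    acc = jax.lax.dot(qa, qb, preferred_element_type=jnp.int32)
    return acc.astype(jnp.float32) * (sa * sb)

# Example: compare against the full-precision matmul.
key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
a = jax.random.normal(key_a, (128, 256), dtype=jnp.float32)
b = jax.random.normal(key_b, (256, 512), dtype=jnp.float32)
print(jnp.max(jnp.abs(int8_matmul(a, b) - a @ b)))  # small quantization error
```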

v5e-256 (eu-west4)

1.4B

  • The int8 training loss matches the baseline (32/16-bit mixed-precision training) to within 1%.
  • The naive default int8 config (magenta) performs poorly (40% MFU).
  • MaxText's default int8 config (green) outperforms the baseline (60% vs. 57% MFU); see the config sketch below.
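
The gap between the two int8 configs comes down to how the AQT quantized dot_general is configured. The sketch below is hedged: the exact entry points and the precise settings behind the "naive" and MaxText configs are assumptions based on the public google/aqt repo, not confirmed from these runs.

```python
# A hedged sketch of two ways to configure AQT's quantized dot_general.
# Names follow the public aqt.jax.v2 API; treat the exact functions and
# keyword arguments as assumptions that may differ by AQT version.
from aqt.jax.v2 import config as aqt_config

# Quantize all three matmuls of each layer (forward, grad-wrt-activations,
# grad-wrt-weights) to int8.
all_int8_cfg = aqt_config.config_v4(fwd_bits=8, dlhs_bits=8, drhs_bits=8)

# MaxText-style: leave the grad-wrt-weights (drhs) matmul unquantized,
# a common way to recover throughput without hurting training quality.
maxtext_like_cfg = aqt_config.config_v4(fwd_bits=8, dlhs_bits=8, drhs_bits=None)

# Either config is then injected as the dot_general used by the model's
# linear layers (e.g. via AQT's Flax wrappers or haliax's Linear plumbing).
```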

[Line charts: train/loss, throughput/tokens_per_second, throughput/mfu]


8B

  • The int8 training loss matches the baseline to within 1%.
  • Int8 (magenta) significantly outperforms the baseline (72% vs. 61% MFU).
  • Throughput gets a 17.4% bump (see the quick check below).
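
As a quick consistency check on these numbers (assuming MFU is reported against the same hardware peak for both runs, so it scales directly with tokens/sec):

```python
# Sanity check on the 8B numbers: if both MFU figures use the same hardware
# peak, the MFU ratio should match the tokens/sec ratio.
baseline_mfu, int8_mfu = 0.61, 0.72
implied_gain = int8_mfu / baseline_mfu - 1.0
print(f"gain implied by MFU: {implied_gain:.1%}")  # ~18.0%
# Reported throughput bump: 17.4%; the small gap is consistent with the MFU
# percentages being rounded.
```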




Multislice (2x v5e-256)

  • As expected, the gains carry over to multislice: a 14.7% increase in throughput.



v4-256 (us-central2)

  • Int8 training does not work on v4-256.