
QLoRA memory consumption

A comparison of memory consumption when fine-tuning Mistral 7B with axolotl under different configs

We fine-tune Mistral 7B on a sample of the Alpaca dataset.

We use the same effective batch size in all experiments:
$$\text{Effective\_bs} = \text{micro\_bs} \times \text{grad\_accumulation\_steps} = 16$$

  • If micro_bs = 1, then grad_accum = 16.
  • If micro_bs = 4, then grad_accum = 4.

This ensures the same number of optimizer steps and weight updates across all runs.
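As a concrete illustration, here is a minimal sketch of the two batch-size settings, using only the axolotl keys that appear in the run table below; everything else in the config is assumed to stay fixed:

```yaml
# micro_bs = 1: effective_bs = 1 * 16 = 16
micro_batch_size: 1
gradient_accumulation_steps: 16
---
# micro_bs = 4: effective_bs = 4 * 4 = 16
micro_batch_size: 4
gradient_accumulation_steps: 4
```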




Run comparison (axolotl config, runtime, and final summary value per run):

| | bs=1, gc=true, target=linear+head | bs=1, gc=true, target=linear | bs=1, gc=false, target=linear | bs=1, gc=false, target=q,v | bs=4, gc=true, target=linear |
|---|---|---|---|---|---|
| runtime | 19m 31s | 19m 26s | 13m 33s | 11m 40s | 15m 11s |
| micro_batch_size | 1 | 1 | 1 | 1 | 4 |
| gradient_accumulation_steps | 16 | 16 | 16 | 16 | 4 |
| gradient_checkpointing | true | true | false | false | true |
| gradient_checkpointing_kwargs | false | false | - | - | false |
| lora_target_linear | true | true | true | - | true |
| lora_target_modules | - | - | - | ["q_proj","v_proj"] | - |
| lora_modules_to_save | ["lm_head"] | - | - | - | ["lm_head"] |
| logging dir | ./qlora-out/runs/Jan07_12-32-12_1996c6d5ac6c | ./qlora-out/runs/Jan07_12-09-04_1996c6d5ac6c | ./qlora-out/runs/Jan07_11-05-34_1996c6d5ac6c | ./qlora-out/runs/Jan05_17-13-12_64dcd0883459 | ./qlora-out/runs/Jan05_15-43-47_64dcd0883459 |
| train (summary) | 0.85 | 0.86 | 0.85 | 0.85 | 0.97 |
[Chart: per-run curves (y-axis 0.0–1.1) for the five runs: bs=1, gc=true, target=linear+head; bs=1, gc=true, target=linear; bs=1, gc=false, target=linear; bs=1, gc=false, target=q,v; bs=4, gc=true, target=linear]
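For readers less familiar with axolotl's LoRA options, here is a minimal sketch of the three adapter-targeting variants compared above. It uses only keys that appear in the run table; all other settings are assumed fixed:

```yaml
# Variant 1: adapt every linear layer, and additionally keep lm_head
# fully trainable (it is trained and saved in full, not as an adapter).
lora_target_linear: true
lora_modules_to_save: ["lm_head"]
---
# Variant 2: adapt every linear layer only.
lora_target_linear: true
---
# Variant 3: adapt only the attention query/value projections
# (the smallest set of trainable adapter weights in this comparison).
lora_target_modules: ["q_proj", "v_proj"]
```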