Mixtral Experiments Notes
Searching for a strong Mixtral fine-tuning recipe
Created on December 30|Last edited on January 18
Training a Large Language Model (LLM) on your own data is complex. It requires curating a high-quality dataset, choosing a pre-trained model capable of solving your task, provisioning the necessary compute, and running the fine-tuning code efficiently. The data and the model are specific to your needs, but we can help you provision the compute thanks to our partner CoreWeave, and make your training as smooth as possible by leveraging W&B Launch.
Fine-Tuning Mixtral using Launch on CoreWeave
To test our brand new integration, we decided to grab a pod of the fastest GPUs on the planet, NVIDIA H100s with 80 GB of VRAM, and fine-tune the best open-source model out there: the newly released Mistral.ai Mixture of Experts, a.k.a. Mixtral 8x7B.
To do so, we created a fine-tuning job leveraging the Axolotl library; since this library already has a W&B integration, we had a model fine-tuning in no time!
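As a rough sketch of what that job setup looks like (the project and entity names below are placeholders, not the ones we actually used), the W&B-related keys live directly in the axolotl YAML, and the job is launched through axolotl's CLI:

```yaml
# W&B logging keys in the axolotl config (placeholder project/entity names)
wandb_project: mixtral-finetune
wandb_entity: my-team
wandb_log_model:            # left empty here; axolotl can also upload model artifacts

# The training itself is typically launched with axolotl's CLI, e.g.:
#   accelerate launch -m axolotl.cli.train config.yml
```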
See the results charts below.
Experiments
We will perform instruction tuning using qLoRA on the newly released Mixtral 8x7B model. The code used for these experiments can be found here.
The main parameters of the training that are shared across experiments are the dataset and the base model:
```yaml
# from the axolotl configuration
base_model: mistralai/Mixtral-8x7B-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer
load_in_4bit: true

datasets:
  - path: Open-Orca/SlimOrca
    type: sharegpt
    conversation: chatml
```
Dataset: We are using SlimOrca. This dataset has around 500k rows of GPT-4-generated completions in the ChatML format.
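For context, here is a sketch of the baseline adapter settings, reconstructed from the deltas noted in the experiments below (key names follow axolotl's config schema; treat anything not explicitly mentioned in the notes as an assumption):

```yaml
# Baseline qLoRA settings, reconstructed from the experiment deltas below
adapter: qlora
lora_r: 64                 # experiment 5 drops this to 16
lora_alpha: 32             # unchanged in experiment 5
lora_dropout: 0.1          # experiment 5 drops this to 0.05
lora_target_linear: true   # experiments 3 and 13 turn this off
warmup_steps: 25           # experiment 6 raises this to 150
weight_decay: 0.1          # experiment 8 drops this to 0.05
```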
1. Baseline
2. Baseline + target layers
- 2 epochs
- lora_target_modules: q_proj, k_proj, v_proj, o_proj, w1, w2, w3
3. lora_target_linear == false
- lora_target_linear == false (was true)
- lora_target_modules: q_proj, k_proj, v_proj, o_proj, w1, w2, w3 (see the sketch after this item)
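As a sketch of what experiment 3 amounts to in config terms: with lora_target_linear turned off, adapters are attached only to the listed modules (q/k/v/o are the attention projections; w1/w2/w3 are the expert feed-forward projections inside each Mixtral MoE block):

```yaml
# Experiment 3: adapt only the explicitly listed modules
lora_target_linear: false
lora_target_modules:
  - q_proj    # attention projections
  - k_proj
  - v_proj
  - o_proj
  - w1        # expert MLP projections in Mixtral's MoE blocks
  - w2
  - w3
```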
4. Baseline + higher learning rate 1e-3
5. Baseline 2 epochs + lr 1e-3 + lora_r = 16 + lora_a = 32 + lora_dropout = 0.05
- lora_r = 16 (was 64)
- lora_a = 32 (unchanged)
- lora_dropout = 0.05 (from 0.1)
- Based on NeurIPS Efficiency Challenge configs (see the Resources section below); the resulting overrides are sketched after this item
- 25 warmup steps
- Run: 8zqoda2b
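Written out as axolotl-style overrides (everything else stays at the baseline values sketched above):

```yaml
# Experiment 5 overrides (NeurIPS Efficiency Challenge inspired)
num_epochs: 2
learning_rate: 0.001
warmup_steps: 25
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
```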
6. Baseline 2 epochs + lr 1e-3 + 150 warmup steps
- warmup steps increased from 25 -> 150
- Run: l1k9f0mh
7. Exp 5 + lower lr (2e-4)
- Run: utkl02yv
8. Dolphin-2.7-Mixtral-8x7b config
- Run: yvuz36xp
- lora_r = 32
- lora_a = 16
- lora_dropout = 0.05
- warmup steps = 10
- weight_decay = 0.05 (previous was 0.1)
9. Exp 5 + paged_lion_8bit optimizer + bs == 2
- Run: k9pk3y3m, 64htlbhx, 8vzjmg4e
"optimizer": "paged_lion_8bit","micro_batch_size": 2
- A higher batch size might OOM; let's see if Lion can free up enough memory -- it did OOM, and the loss also went off the chart (lr 1e-3 is probably way too high). Restarted with bs == 1.
10. Exp 5 + qkvo layers
- Run: ts0b0bq3
- Changes:
"lora_target_modules": ["q_proj","k_proj","v_proj","o_proj"]
- Suggested by winglian
11. Exp 10 (qkvo layers) + lora exploration
- Run: zw5oob7b, o9x0pgm9
- qkvo + similar lora settings from axolotl's Mixtral config: https://github.com/OpenAccess-AI-Collective/axolotl/blob/4d2e842e46bf8bd6dd0fda4d2667a7e7d80b4cd4/examples/mistral/mixtral.yml#L36C1-L38C19
- lora_r = 32 (from 16)
- lora_alpha = 16 (from 32)
"lora_target_modules": ["q_proj","k_proj","v_proj","o_proj"]
12. New image with transformers (87ae2a4)!
RUN: mon0n1c9
Exp 10 with a new Docker image that uses transformers 87ae2a4
13. Experiment 10 but with lora_target_linear == False
Run: lzxhn4vd
lora_target_linear == False (was True)
"lora_target_linear" = False,"lora_target_modules": ["q_proj","k_proj","v_proj","o_proj"]
14. Experiment 10 using the new image
- Validating everything matches experiment 10
15. Experiment 10 with higher lr
RUN: rsa1t68p
- How high is too high?
15.1: learning_rate = 0.005  # too high
15.2: learning_rate = 0.002
16. New line of training runs with the re-entrant issue fixed and a lower max_seq_len
17. exp-16 + lora_r = 32, lora_a = 16
- comparing to axolotl mixtral yaml: https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/examples/mistral/mixtral.yml
RUN: o0x5xxk8
Results
[W&B panel: run set of 19 runs]
Ideas
- max_grad_norm == 2.0 (up from 1.0 default)
- similar to here: https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/accessory/exps/finetune/sg/dialog_ultrachat200kWizardcode_mixtral.sh
- learning rate == 0.00002
- lora_r == 16 (see the combined sketch after this list)
- ?
- ?
- ?
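Taken together, these ideas would amount to roughly the following overrides (an untested sketch; the values come straight from the bullets above):

```yaml
# Untested idea for a next run, combining the bullets above
max_grad_norm: 2.0       # up from the 1.0 default
learning_rate: 0.00002   # 2e-5
lora_r: 16
```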
Resources
NeurIPS Efficiency Challenge
A100 winners
- Qwen-14b
- --finetuning_type lora \
- --lora_rank 8 \
- --lora_target "c_attn","c_proj" \
- learning_rate 5e-5
```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class lora_config:
    r: int = 8
    lora_alpha: int = 32
    target_modules: List[str] = field(default_factory=lambda: ["q_proj", "v_proj"])
    bias = "none"
    task_type: str = "CAUSAL_LM"
    lora_dropout: float = 0.05
    inference_mode: bool = False

@dataclass
class llama_adapter_config:
    adapter_len: int = 10
    adapter_layers: int = 30
    task_type: str = "CAUSAL_LM"

@dataclass
class prefix_config:
    num_virtual_tokens: int = 30
    task_type: str = "CAUSAL_LM"
```
```
learning_rate 3e-5 \
weight_decay 0.1 \
warmup_ratio 0.04 \
max_grad_norm 0.3 \
lora_dropout 0.05 \
lora_r 16 \
lora_alpha 32 \
bits 16 \
double_quant \
use_lora True \
optim paged_adamw_32bit \
```