
Mixtral Experiments Notes

Searching for a strong Mixtral fine-tuning recipe
Created on December 30 | Last edited on January 18
Training a Large Language Model (LLM) on your data is complex. It requires curating a high-quality dataset, choosing a good pre-trained model capable of solving your task, provisioning the necessary compute to run the training, and running the fine-tuning code efficiently. The data and the model are very specific to your needs, but we can help you provision the compute thanks to our partner CoreWeave, and make your training as smooth as possible by leveraging W&B Launch.

Fine-Tuning Mixtral using Launch on CoreWeave

To test our brand-new integration, we decided to grab a pod of the fastest GPUs on the planet, NVIDIA H100s with 80 GB of VRAM, and fine-tune the best open-source model out there: the newly released Mistral AI Mixture of Experts, a.k.a. Mixtral 8x7B.
To do so, we created a fine-tuning job with the Axolotl library; since Axolotl already ships with a W&B integration, we were fine-tuning the model in no time!
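For reference, the W&B side of the Axolotl integration is just a few extra keys in the training config. This is a minimal sketch with placeholder project and entity names (not the values used for the runs in this report; key names follow Axolotl's example configs):

# W&B logging keys in the axolotl config (sketch, placeholder values)
wandb_project: mixtral-sft          # hypothetical project name
wandb_entity: my-team               # hypothetical entity / team
wandb_name: mixtral-qlora-baseline  # optional run name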

See results charts below

Experiments

We will perform instruction tuning using qLoRA on the newly released Mixtral 8x7B model. The code used for these experiments can be found here.
The main parameters of the training that are shared across experiments are the dataset and the base model:
# from the axolotl configuration
base_model: mistralai/Mixtral-8x7B-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer

load_in_4bit: true

datasets:
  - path: Open-Orca/SlimOrca
    type: sharegpt
    conversation: chatml
Dataset: We are using SlimOrca. This dataset has around 500k rows of GPT-4-generated completions in the ChatML format.

1. Baseline
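The baseline's LoRA hyperparameters aren't spelled out in this entry, but the "was" notes in later experiments pin most of them down (r = 64 and dropout = 0.1 per Experiment 5; lora_target_linear = true per Experiment 13). A reconstructed sketch of the adapter block, with the adapter key itself assumed:

# reconstructed baseline adapter block (sketch, not copied from the run config)
adapter: qlora            # assumed; pairs with load_in_4bit above
lora_r: 64                # Experiment 5 notes r was 64 here
lora_alpha: 32            # unchanged across experiments
lora_dropout: 0.1         # lowered to 0.05 in Experiment 5
lora_target_linear: true  # flipped to false in Experiments 3 and 13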

2. Baseline + target layers

  • 2 epochs
  • lora_target_modules (annotated in the sketch below):
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - w1
    - w2
    - w3
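For context, q/k/v/o_proj are the attention projections, while w1, w2, and w3 are the linear layers inside each Mixtral expert MLP (Hugging Face module names); the MoE router gate is left untouched. Annotated as it would appear in the config:

# Experiment 2 LoRA targets: attention projections + expert MLP projections
lora_target_modules:
  - q_proj   # attention query projection
  - k_proj   # attention key projection
  - v_proj   # attention value projection
  - o_proj   # attention output projection
  - w1       # expert MLP gate projection (MixtralBlockSparseTop2MLP.w1)
  - w2       # expert MLP down projection
  - w3       # expert MLP up projection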

3. lora_target_linear == false

  • lora_target_linear == false
  • lora_target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - w1
    - w2
    - w3

4. Baseline + higher learning rate 1e-3



5. Baseline 2 epochs + lr 1e-3 + lora_r = 16 + lora_a = 32 + lora_dropout = 0.05

  • lora_r = 16 (was 64)
  • lora_a = 32 (unchanged)
  • lora_dropout = 0.05 (from 0.1)
  • Based on NeurIPS Efficiency Challenge configs (see Resources section below); the combined settings are sketched after this list
  • 25 warmup steps
  • Run: 8zqoda2b
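Consolidating the changes above into Axolotl-style YAML (a sketch; the epoch count and learning rate come from the experiment title):

# Experiment 5 settings, consolidated (sketch)
num_epochs: 2
learning_rate: 0.001   # 1e-3, from the experiment title
warmup_steps: 25
lora_r: 16             # was 64
lora_alpha: 32         # unchanged
lora_dropout: 0.05     # was 0.1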


6. Baseline 2 epochs + lr 1e-3 + 150 warmup steps

  • warmup steps increased from 25 -> 150
  • Run: l1k9f0mh


7. Exp 5 + lower lr (2e-4)

  • Run: utkl02yv


8. Dolphin-2.7-Mixtral-8x7b config



9. Exp 5 + paged_lion_8bit optimizer + bs == 2

  • Runs: k9pk3y3m, 64htlbhx, 8vzjmg4e
  • Changes:
    "optimizer": "paged_lion_8bit",
    "micro_batch_size": 2
  • A higher batch size might OOM; let's see if Lion can free up enough memory. It did OOM, and the loss also went off the chart (lr 1e-3 is probably way too high), so we restarted with bs == 1 (see the sketch below).
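A sketch of the relevant settings for the restarted run, using the same key names as the shared config above (everything else carried over from Experiment 5):

# Experiment 9, restarted after the OOM (sketch)
optimizer: paged_lion_8bit
micro_batch_size: 1    # dropped back from 2 after the OOM
learning_rate: 0.001   # noted above as probably too high for Lion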


10. Exp 5 + qkvo layers

  • Run: ts0b0bq3
  • Changes:
"lora_target_modules": [
"q_proj",
"k_proj",
"v_proj",
"o_proj"
]
  • Suggested by winglian.


11. Exp 10 (qkvo layers) + lora exploration

"lora_target_modules": [
"q_proj",
"k_proj",
"v_proj",
"o_proj"
]


12. New image with transformers (87ae2a4)!

RUN: mon0n1c9
Exp 10 with a new Docker image that uses transformers commit 87ae2a4


13. Experiment 10 but with lora_target_linear == False

Run: lzxhn4vd
lora_target_linear == False (was True)
"lora_target_linear" = False,
"lora_target_modules": [
"q_proj",
"k_proj",
"v_proj",
"o_proj"
]


14. Experiment 10 using new image

  • Validating everything matches experiment 10


15. Experiment 10 with higher lr

RUN: rsa1t68p
  • How high is too high?
15.1: learning_rate = 0.005 # too high
15.2: learning_rate = 0.002


16. New round of training runs with the re-entrant issue fixed and a lower max_seq_len



17. exp-16 + lora_r = 32, lora_a = 16

RUN: o0x5xxk8

Results


Run set: 19 runs (interactive results charts are in the original W&B report).




Ideas



Resources

NeurIPS Efficiency Challenge

A100 winners
# PEFT adapter config dataclasses shared for the A100 winners
from dataclasses import dataclass, field
from typing import List

@dataclass
class lora_config:
    r: int = 8
    lora_alpha: int = 32
    target_modules: List[str] = field(default_factory=lambda: ["q_proj", "v_proj"])
    bias: str = "none"
    task_type: str = "CAUSAL_LM"
    lora_dropout: float = 0.05
    inference_mode: bool = False

@dataclass
class llama_adapter_config:
    adapter_len: int = 10
    adapter_layers: int = 30
    task_type: str = "CAUSAL_LM"

@dataclass
class prefix_config:
    num_virtual_tokens: int = 30
    task_type: str = "CAUSAL_LM"
learning_rate 3e-5 \
weight_decay 0.1 \
warmup_ratio 0.04 \
max_grad_norm 0.3 \
lora_dropout 0.05 \
lora_r 16 \
lora_alpha 32 \
bits 16 \
double_quant \
use_lora True \
optim paged_adamw_32bit \