
Mixtral Experiments Notes

Searching for a strong Mixtral fine-tuning recipe
Created on December 30 | Last edited on January 18
Training a Large Language Model (LLM) on your data is complex. It requires curating a high-quality dataset, choosing a good pre-trained model capable of solving your task, provisioning the necessary compute to run the training, and running the fine-tuning code efficiently. The data and the model are very specific to your needs, but we can help you provision the compute thanks to our partner CoreWeave, and make your training as smooth as possible by leveraging W&B Launch.

Fine-Tuning Mixtral using Launch on CoreWeave

To test our brand-new integration, we decided to grab a pod of the fastest GPUs on the planet, NVIDIA H100s with 80 GB of VRAM, and fine-tune the best open-source model out there: the newly released Mistral AI Mixture of Experts, a.k.a. Mixtral 8x7B.
To do so, we created a fine-tuning job with the Axolotl library; since Axolotl already ships with a W&B integration, we were fine-tuning the model in no time!
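For reference, the W&B side of the Axolotl integration is just a few extra keys in the training config. This is a minimal sketch with placeholder project and entity names (not the values used for the runs in this report; key names follow Axolotl's example configs):

# W&B logging keys in the axolotl config (sketch, placeholder values)
wandb_project: mixtral-sft          # hypothetical project name
wandb_entity: my-team               # hypothetical entity / team
wandb_name: mixtral-qlora-baseline  # optional run name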

See results charts below

Experiments

We will perform instruction tuning using qLoRA on the newly released Mixtral 8x7B model. The code used for these experiments can be found here.
The main parameters of the training that are shared across experiments are the dataset and the base model:
# from the axolotl configuration
base_model: mistralai/Mixtral-8x7B-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer

load_in_4bit: true

datasets:
  - path: Open-Orca/SlimOrca
    type: sharegpt
    conversation: chatml
Dataset: We are using SlimOrca. This dataset has around 500k rows of GPT-4-generated completions in the ChatML format.

1. Baseline
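The baseline's LoRA hyperparameters aren't spelled out in this entry, but the "was" notes in later experiments pin most of them down (r = 64 and dropout = 0.1 per Experiment 5; lora_target_linear = true per Experiment 13). A reconstructed sketch of the adapter block, with the adapter key itself assumed:

# reconstructed baseline adapter block (sketch, not copied from the run config)
adapter: qlora            # assumed; pairs with load_in_4bit above
lora_r: 64                # Experiment 5 notes r was 64 here
lora_alpha: 32            # unchanged across experiments
lora_dropout: 0.1         # lowered to 0.05 in Experiment 5
lora_target_linear: true  # flipped to false in Experiments 3 and 13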

2. Baseline + target layers

  • 2 epochs
  • lora_target_modules (annotated in the sketch below):
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - w1
    - w2
    - w3
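For context, q/k/v/o_proj are the attention projections, while w1, w2, and w3 are the linear layers inside each Mixtral expert MLP (Hugging Face module names); the MoE router gate is left untouched. Annotated as it would appear in the config:

# Experiment 2 LoRA targets: attention projections + expert MLP projections
lora_target_modules:
  - q_proj   # attention query projection
  - k_proj   # attention key projection
  - v_proj   # attention value projection
  - o_proj   # attention output projection
  - w1       # expert MLP gate projection (MixtralBlockSparseTop2MLP.w1)
  - w2       # expert MLP down projection
  - w3       # expert MLP up projection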

3. lora_target_linear == false

  • lora_target_linear == false
  • lora_target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - w1
    - w2
    - w3

4. Baseline + higher learning rate 1e-3



5. Baseline 2 epochs + lr 1e-3 + lora_r = 16 + lora_a = 32 + lora_dropout = 0.05

  • lora_r = 16 (was 64)
  • lora_a = 32 (unchanged)
  • lora_dropout = 0.05 (from 0.1)
  • Based on NeurIPS Efficiency Challenge configs (see Resources section below); the combined settings are sketched after this list
  • 25 warmup steps
  • Run: 8zqoda2b
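Consolidating the changes above into Axolotl-style YAML (a sketch; the epoch count and learning rate come from the experiment title):

# Experiment 5 settings, consolidated (sketch)
num_epochs: 2
learning_rate: 0.001   # 1e-3, from the experiment title
warmup_steps: 25
lora_r: 16             # was 64
lora_alpha: 32         # unchanged
lora_dropout: 0.05     # was 0.1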


6. Baseline 2 epochs + lr 1e-3 + 150 warmup steps

  • warmup steps increased from 25 -> 150
  • Run: l1k9f0mh


7. Exp 5 + lower lr (2e-4)

  • Run: utkl02yv


8. Dolphin-2.7-Mixtral-8x7b config



9. Exp 5 + paged_lion_8bit optimizer + bs == 2

  • Runs: k9pk3y3m, 64htlbhx, 8vzjmg4e
  • Changes:
    "optimizer": "paged_lion_8bit",
    "micro_batch_size": 2
  • A higher batch size might OOM; let's see if Lion can free up enough memory. It did OOM, and the loss also went off the chart (lr 1e-3 is probably way too high), so we restarted with bs == 1 (see the sketch below).
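A sketch of the relevant settings for the restarted run, using the same key names as the shared config above (everything else carried over from Experiment 5):

# Experiment 9, restarted after the OOM (sketch)
optimizer: paged_lion_8bit
micro_batch_size: 1    # dropped back from 2 after the OOM
learning_rate: 0.001   # noted above as probably too high for Lion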


10. Exp 5 + qkvo layers

  • Run: ts0b0bq3
  • Changes:
"lora_target_modules": [
"q_proj",
"k_proj",
"v_proj",
"o_proj"
]
  • Suggested by winglian.


11. Exp 10 (qkvo layers) + lora exploration

"lora_target_modules": [
"q_proj",
"k_proj",
"v_proj",
"o_proj"
]


12. New image with transformers (87ae2a4)!

RUN: mon0n1c9
Exp 10 with a new Docker image that uses transformers commit 87ae2a4


13. Experiment 10 but with lora_target_linear == False

Run: lzxhn4vd
lora_target_linear == False (was True)
"lora_target_linear" = False,
"lora_target_modules": [
"q_proj",
"k_proj",
"v_proj",
"o_proj"
]


14. Experiment 10 using new image

  • Validating everything matches experiment 10


15. Experiment 10 with higher lr

RUN: rsa1t68p
  • How high is too high?
15.1: learning_rate = 0.005 # too high
15.2: learning_rate = 0.002


16. New round of training runs with the re-entrant issue fixed and a lower max_seq_len



17. exp-16 + lora_r = 32, lora_a = 16

RUN: o0x5xxk8

Results


Run set: 19 runs (interactive results charts are in the original W&B report).




Ideas



Resources

NeurIPS Efficiency Challenge

A100 winners
# PEFT adapter config dataclasses shared for the A100 winners
from dataclasses import dataclass, field
from typing import List

@dataclass
class lora_config:
    r: int = 8
    lora_alpha: int = 32
    target_modules: List[str] = field(default_factory=lambda: ["q_proj", "v_proj"])
    bias: str = "none"
    task_type: str = "CAUSAL_LM"
    lora_dropout: float = 0.05
    inference_mode: bool = False

@dataclass
class llama_adapter_config:
    adapter_len: int = 10
    adapter_layers: int = 30
    task_type: str = "CAUSAL_LM"

@dataclass
class prefix_config:
    num_virtual_tokens: int = 30
    task_type: str = "CAUSAL_LM"
learning_rate 3e-5 \
weight_decay 0.1 \
warmup_ratio 0.04 \
max_grad_norm 0.3 \
lora_dropout 0.05 \
lora_r 16 \
lora_alpha 32 \
bits 16 \
double_quant \
use_lora True \
optim paged_adamw_32bit \