Trainer performance comparison: torchtune vs. axolotl vs. Unsloth
An interactive performance comparison of various trainers and GPUs
Created on June 17 | Last edited on June 19
This testing was conducted from June 15th through 17th (2024) on the latest drivers and software version (git HEAD). It'll probably be out of date soon, but hopefully will serve as a useful comparison and starting point for any future evaluations you make.
Summary
Today, we're going to look at the performance of torchtune, axolotl, and Unsloth. This started as a project to test torchtune on AMD RDNA3 GPUs (that's its own writeup; it worked, but not without some futzing). I chose the /llama3/8B_lora.yaml recipe as a base (Llama 3 8B LoRA tuning on yahma/alpaca_cleaned, r=8, alpha=16) and matched that as a baseline for comparison across the other tuners.
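For those more familiar with Hugging Face PEFT, the baseline LoRA hyperparameters translate to roughly the following. This is a sketch for illustration only; the real settings live in the torchtune YAML, and the target modules listed here are an assumption rather than the recipe's exact defaults.

```python
# Rough PEFT-style equivalent of the baseline LoRA hyperparameters (r=8, alpha=16).
# Illustration only: the actual runs used torchtune's YAML config, and the
# target_modules below are assumed, not copied from the recipe.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                   # LoRA rank
    lora_alpha=16,                         # LoRA scaling factor
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)
```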
The takeaway here is that, basically, if you have a supported Nvidia GPU and are fine-tuning a supported model architecture on a single GPU, then you should probably start with Unsloth. In as close to one-to-one testing as possible, Unsloth appears to train 24% faster on a 4090 and 28% faster on a 3090 than torchtune with torch.compile(). It also uses significantly less memory, allowing you to increase batch size or sequence length.
There might be something I'm still missing (I caught a number of errors in my original results), so you can check out the configs and training scripts here.
I'm also sharing my wandb logs, but I wasn't able to figure out how to apply grouping after the fact or to match up some of the reported numbers from the different trainers, so they're not as useful for comparison as I'd like. Here's a spreadsheet I made with a bit more of a sensible summary:

Note: sfttrainer supports an include_tokens_per_second=True TrainingArgument that will let you compare with torchtune.
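Enabling that looks something like the snippet below; it just adds tokens/sec to the trainer's speed metrics so the numbers line up with torchtune's per-second token counts. The output_dir and batch size here are placeholders, not the values used in these tests.

```python
# Minimal sketch: turn on tokens/sec reporting in a Hugging Face/TRL run so
# throughput can be compared with torchtune. output_dir and batch size are
# placeholders.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    include_tokens_per_second=True,  # adds tokens/sec to the speed metrics
)
# Pass `args` to SFTTrainer (or Trainer) as usual.
```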
RTX 4090 vs RTX 3090 vs W7900
First, let's get some hardware comparisons out of the way. Even using AdamW8bit, we're only able to fit bsz=1 in 24GB. The W7900 can run at bsz=2 (it ends up at about 26GB) and gets a small (~6%) speed boost from that. The 4090 is about 2.4 times faster than the 3090 or W7900 (which, despite drastically different specs*, surprisingly turn out almost exactly the same perf numbers). If you're doing a lot of local training, Ada is definitely worth the price premium.
* Per Nvidia's GA102 Technical Paper (pg. 44), the RTX 3090 has 71 FP16 Tensor TFLOPS (w/ FP32 accumulate, no sparsity, although it can be doubled to 142 with sparsity or FP16 accumulate) and 936 GB/s of memory bandwidth, while the W7900 has a theoretical peak of 122.64 FP16 TFLOPS and 864 GB/s of memory bandwidth.
W7900 Batch Size
Since we have 48GB of VRAM on the W7900, we can play around a bit with batch size and sequence length. Here, it looks like bsz=2 is the best performing (+6% over bsz=1), and bsz=4 barely improves over bsz=1. This was a bit surprising to me, but it is what it is. (Sequence length seemed to have a negligible effect on efficiency in initial testing, but probably deserves more testing.) Note: on Linux, the W7900's power is limited by the drivers to 240W (vs the 295W spec).
W7900 vs 7900 XTX
While we're looking at hardware, I also have a 7900 XTX, and it performs almost 17% faster than the W7900. The 7900 XTX uses the same chip as the W7900 (RDNA3 gfx1100 Navi 31), but it has a higher 303W (+26%) power limit in Linux (again lower than the 355W spec in Windows), with a theoretical 122.78 TFLOPS (+0%) and 960.0 GB/s (+11%) of memory bandwidth. Based on the scaling, it seems like there's more performance in the tank for the Navi 31 cores if you could feed them more power.
torch.compile()
The torchtune README briefly mentions torch compile, and you can set compile: True in your config. When enabled, the first pass has an upfront cost for the compilation, but after that, subsequent passes should run faster. It turns out that even with a relatively small tune (alpaca_cleaned is 51.8K samples (~9M tokens?) on ~20M trainable parameters for 1 epoch), the performance difference is still worth it. So I'd say in general, for real world tuning, you will probably always gain efficiency and end up with shorter runs with torch.compile().
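If you haven't used it before, the config flag is just turning on PyTorch's torch.compile under the hood. Here's a toy, standalone sketch (made-up model and sizes) showing where the one-time compilation cost comes from:

```python
# Toy torch.compile example: the first call pays a one-time compilation cost,
# later calls reuse the compiled graph and run faster. The model and tensor
# sizes are made up for illustration.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
compiled = torch.compile(model)  # same idea as `compile: True` in the recipe config

x = torch.randn(8, 1024)
for step in range(3):
    t0 = time.time()
    _ = compiled(x)
    print(f"step {step}: {time.time() - t0:.3f}s")  # step 0 includes compile time
```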
RTX 4090
On the 4090, the torch compile version gets a 15.3% throughput boost and an overall 11.7% runtime reduction.
RTX 3090
The torch compile version gets a 12.8% throughput boost and an overall 10.7% runtime reduction on the 3090.
W7900
On the W7900, our torch compile version gets a whopping 19.6% throughput boost and an overall 16% runtime reduction (both bsz=1 runs for comparison).
torchtune vs axolotl
I was focused on torchtune, so I only did a cursory comparison with axolotl, which, despite being bleeding edge (and often buggy; it's a bit of a crapshoot any time you update), has been my preferred trainer lately. I did a single run just to check whether there would be much of a performance difference.
Despite completely different sets of optimizations, it turns out there's not that much of a difference in training times: axolotl is only 3% slower, and that gap might actually be smaller for real-world training since axolotl does a lot more pre/post "stuff." Note: axolotl can also take a torch_compile flag, so you should be able to get similar benefits from torch compile (I didn't test it, though).
Unsloth
- Currently only single-GPU tuning is supported (OSS multi-GPU "soon")
- Supports only NVIDIA GPUs from 2018 onward, with a minimum CUDA Compute Capability of 7.0 (V100, T4, Titan V, RTX 20/30/40 series, A100, H100, L40, etc.)
Unsloth manages to train significantly faster than torch-compiled torchtune (and uses significantly less memory on both the 3090 and the 4090).
4090
Unsloth is 23.4% faster than the torch compiled version and uses 17.7% less memory.
3090
For the 3090, Unsloth does even better: 27.1% faster and 16.9% less memory.
One other difference with Unsloth is that instead of loading a YAML file, you set Unsloth up directly with a simple Python script. I actually sort of prefer this (e.g., I'd much rather directly see and apply the chat template so I know exactly what's being fed in).
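For a feel of what that looks like, here's a minimal sketch (not my exact training script, which is linked above; the model name, sequence length, and LoRA target modules are illustrative):

```python
# Minimal Unsloth setup sketch (not the exact script used for these runs;
# the model name, max_seq_length, and target_modules are illustrative).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b",  # assumed model id
    max_seq_length=2048,
    dtype=None,          # auto-detect (bf16 on Ampere+)
    load_in_4bit=False,
)

# Attach LoRA adapters to roughly match the torchtune baseline (r=8, alpha=16).
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.0,
    bias="none",
)
# From here, the model/tokenizer go into trl's SFTTrainer as usual.
```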
Note: due to packing differences, the step counts differ, but I did edit the torchtune AlpacaInstructTemplate to try to make sure the formatting matched.
With those caveats and notes out of the way, these tests seem to confirm that if you can use Unsloth, you should give it a spin first, especially if you have long or many single-GPU training runs. Saving 25% on time and costs is not nothing, and even with RoPE scaling, etc., having longer sequence lengths available for training is still better.