
What will you do with 10,000 H100 Hours?

Created on July 23 | Last edited on July 24
The first post didn’t take off as I’d hoped, so I’m trying a punchier title this time. Since then, we’ve teamed up with KISTI and ORACLE to run about 50 training jobs on two H200 nodes and two H100 nodes, racking up well over 2000 H200‑hours (equivalent to 4000 H100 hours, I suppose) in just one week.
As of now, we've collected over 10 million samples for Korean post-training. Training for five epochs on all of it would blow past our budget, requiring several thousand additional H200 hours on top of what we've already spent. So the goal is to find the optimal post-training mix: deciding which data to retain, which to discard, and which to upsample. Here are the lessons we learned.

Dataset Composition

Before jumping into results, here are some details on how our dataset was created. We have the following seven categories:
  • OpenThought3: We translated OpenThought3 prompts into Korean with gemini‑2.5‑flash‑lite‑preview‑06‑17, then discarded entries whose length shifted too much after translation (a filtering sketch follows this list).
  • rStar-Coder: Added only in the latest runs; we follow the same process as above to build this subset, using prompts from rStar-Coder.
  • Web-Daily/Code/Science/Medical: These are Korean instructions scraped from the web and tagged by topic. We remove samples that embed images or are unreasonably short or long. Web‑Daily deliberately retains noisy language and occasional unsafe prompts to stress‑test robustness, while the Code, Science, and Medical splits focus on their respective domains with cleaner prose and domain‑specific terminology.
  • MCQA-Augmented: Starting from the KMMLU-Train subset, we inject diverse answer styles, such as “Answer N” or \boxed{N}, and retrieve similar questions via BM25. We remix their options, sometimes yielding up to ten choices per question.
For all these subsets, we generate the responses with Qwen3-32B. This entire process took nearly 4,000 H100 hours and, by the way, it is still running.
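Here is a minimal sketch of the two filters described above (the translation length-shift check and the web-sample length/image check); the thresholds and the image heuristic are illustrative assumptions, not our exact settings.

```python
# Hedged sketch of the dataset filters described above; thresholds and the
# image-detection heuristic are illustrative, not the exact values we used.

def keep_translation(src_text: str, ko_text: str,
                     low: float = 0.5, high: float = 2.0) -> bool:
    """Drop translated entries whose character length shifted too much."""
    if not src_text or not ko_text:
        return False
    ratio = len(ko_text) / len(src_text)
    return low <= ratio <= high

def keep_web_sample(text: str, min_chars: int = 64, max_chars: int = 20_000) -> bool:
    """Drop web instructions that embed images or are unreasonably short/long."""
    if "<img" in text or "![" in text:  # crude check for embedded images
        return False
    return min_chars <= len(text) <= max_chars
```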

Evaluation Settings

To prevent benchmark overfitting, we maintain a simple train-test split. During training, we validate only on KMMLU-Redux, MCLM-Ko, HAE-RAE Bench, and Clinical QA. Six other benchmarks (KMMLU-Pro, GPQA, HRM8K, LiveCodeBench, CLiCK, and KoBALT) remain completely held out until the final evaluation.

For simplicity, the evaluation is run only once per model using the following setting:
{temperature = 0.7, top_p = 0.9, max_tokens = 32768}
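As a reference, this is roughly what the single-pass evaluation setting looks like with vLLM (a sketch; the checkpoint path and prompt are placeholders):

```python
# Hedged sketch of the one-shot evaluation setting above, assuming vLLM;
# the checkpoint path and prompt are placeholders.
from vllm import LLM, SamplingParams

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=32768)
llm = LLM(model="path/to/checkpoint")            # placeholder model path
outputs = llm.generate(["<benchmark prompt>"], sampling)
print(outputs[0].outputs[0].text)
```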

Training Results

Ablation #1: Performance per category.

We begin by measuring the marginal gains from each data bucket: OpenThought3, the four Web splits, and MCQA‑Augmented. To ensure the results are model-agnostic, every experiment is run in parallel on two bases: Kanana‑1.5‑8B‑Instruct and Gemma3‑4B‑Instruct. Every run also includes a small floor of MCQA data so that the answer formatting is learned properly.
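Concretely, each ablation mixture boils down to one category bucket plus the small MCQA floor; a minimal sketch assuming Hugging Face datasets, with hypothetical file names and floor size:

```python
# Minimal sketch of assembling one ablation mixture; file names and the
# MCQA floor size are hypothetical placeholders.
from datasets import load_dataset, concatenate_datasets

def build_ablation_mix(category_file: str, mcqa_file: str, mcqa_floor: int = 5_000):
    category = load_dataset("json", data_files=category_file, split="train")
    mcqa = load_dataset("json", data_files=mcqa_file, split="train")
    floor = mcqa.shuffle(seed=42).select(range(min(mcqa_floor, len(mcqa))))
    return concatenate_datasets([category, floor]).shuffle(seed=42)

# Each mixture is then fine-tuned on both bases (Kanana-1.5-8B-Instruct, Gemma-3-4B-it).
```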
Training & Evaluation results on Kanana-1.5-8B-Instruct.
Training & Evaluation results on Gemma-3-4b-it.
The results are clear. OpenThought3 is the standout, boosting every benchmark, including the knowledge‑heavy HAE‑RAE Bench, suggesting the base models weren’t fully leveraging their parametric knowledge. MCQA‑Augmented delivers the next‑largest gains, followed by Web‑Science and Web‑Code. Contrary to expectations, Web-Medical barely moves Clinical QA scores, so we initiated a second run with 100,000 Web-Medical instructions to see if scaling unlocks hidden value.
Training & Evaluation results with scaled web-medical data.
From our experiments, scaling the Web‑Medical dataset offers little to no benefit and even causes performance drops on some benchmarks across both models. As a result, we decided to exclude it entirely from our final training runs, ditching about 800k medical samples we had already generated. In hindsight, we should have run this ablation earlier, before data generation, so we wouldn't waste compute on data we might never use. My bad.

Ablation #2: Cross-category benefits.

Next, we wondered whether there were hidden relationships between the data buckets that could yield extra performance gains, so we started mixing datasets.

Training Kanana on Ko‑R1‑3.0.5 and 3.0.6 confirms our earlier decision: swapping Web‑Medical for Web‑Code lifts scores on both HAE‑RAE Bench and MCLM, underscoring Web‑Medical’s negligible value. The results also suggest an unexpected synergy between Web-Code data and the MCLM benchmark. Based on this, we began searching for higher-quality coding datasets and decided to translate and incorporate rStar-Coder into our mix.
On the other hand, Gemma benefits less from using code data.

Ablation #3: To Augment or Not To Augment

Next, we weighed how far to push KMMLU‑Train augmentation. The base set has roughly 200k items, but adding stylistic prompt variants could inflate it past a million. We also tested option augmentation (something similar to MMLU‑Pro) by using BM25 to fetch similar questions and splice in their answer choices, producing items with varying numbers of options. We randomly sampled 50k from each augmentation and trained on them in separate runs.
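For concreteness, here is a rough sketch of the BM25-based option augmentation step, assuming the rank_bm25 package and an illustrative record layout ("question", "options", "answer"); it is not our exact pipeline.

```python
# Hedged sketch of BM25-based option augmentation; the record layout,
# tokenization, and the number of spliced-in distractors are illustrative.
from rank_bm25 import BM25Okapi

def augment_options(questions, target_idx, k_extra=4, max_options=10):
    corpus = [q["question"].split() for q in questions]
    bm25 = BM25Okapi(corpus)
    target = questions[target_idx]
    scores = bm25.get_scores(target["question"].split())
    # walk through the most similar other questions and splice in their options
    neighbors = sorted(range(len(questions)), key=lambda i: -scores[i])
    extra = []
    for i in neighbors:
        if i == target_idx:
            continue
        extra.extend(o for o in questions[i]["options"] if o != target["answer"])
        if len(extra) >= k_extra:
            break
    options = (target["options"] + extra[:k_extra])[:max_options]
    return {**target, "options": options}
```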

Option augmentation backfired on most MCQA benchmarks, except for MT-MATH100, so we had to throw away the extra million samples we had generated. Two likely causes: our BM25‑based merges produced noisy distractors, and our benchmarks expect four options, so training data with more choices may be counter‑productive. For the final run, we decided to retain only a small slice of option-augmented data for diversity.

Ablation #4: Korean-English Mix

For our final model, we plan to incorporate English data as well, thereby increasing the number of data points. But how much English data can we mix without degrading Korean performance? So we tried 1:1 and 1:4 mixes (ko:en).
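A minimal sketch of how such mixes could be built with Hugging Face datasets (file paths are placeholders; sampling probabilities simply follow the target ratio):

```python
# Hedged sketch of building the ko:en mixes; file paths are placeholders.
from datasets import load_dataset, interleave_datasets

def build_mix(ko_file: str, en_file: str, ko_en_ratio=(1, 1), seed=42):
    ko = load_dataset("json", data_files=ko_file, split="train")
    en = load_dataset("json", data_files=en_file, split="train")
    total = sum(ko_en_ratio)
    probs = [ko_en_ratio[0] / total, ko_en_ratio[1] / total]
    return interleave_datasets([ko, en], probabilities=probs, seed=seed)

# mix_1_1 = build_mix("ko.jsonl", "en.jsonl", (1, 1))
# mix_1_4 = build_mix("ko.jsonl", "en.jsonl", (1, 4))
```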

For Kanana, more English data drags down KMMLU-Redux and HAE-RAE Bench, while adding rStar-Coder lifts math capabilities, similar to what we observed earlier. Gemma, by contrast, tolerates English mixing with virtually no loss. The difference suggests Kanana still relies on ample Korean reasoning data to excel, whereas Gemma can absorb English without penalty.

Extra Analysis

Here are some random analyses that caught my attention.

[W&B panel: Run set (2 runs)]

[W&B panel: Run set (2 runs)]

As you can see from the two examples above, even with identical data, Gemma exhibits sharper grad-norm fluctuations and occasional loss spikes; yet these irregularities do not noticeably affect the quality of the final checkpoint.


[W&B panel: Run set (4 runs)]

Also, in single‑GPU runs without gradient checkpointing (disabling gradient checkpointing is faster if you have enough GPU memory, by the way) or FSDP (Kanana‑308/309), the loss curves are smooth. Multi‑GPU setups, however, show staircase‑like drops at every epoch boundary. We couldn’t remove this artifact, but it doesn’t appear to affect final performance anyway.
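For reference, that speed/memory trade-off is a single flag in the Hugging Face Trainer (a sketch; all other training arguments omitted):

```python
# Hedged sketch of the gradient-checkpointing toggle; other arguments omitted.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                  # placeholder
    gradient_checkpointing=False,      # faster, but needs more GPU memory for activations
    # gradient_checkpointing=True,     # slower, but lowers activation memory
)
```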