Horangi LLM Leaderboard: Evaluating Korean Language Proficiency
Assessing the Korean language proficiency of prominent LLMs from the perspectives of both language comprehension and language generation.

The 'Horangi LLM Leaderboard' presents a novel approach to evaluating the Korean language proficiency of leading Large Language Models (LLMs). This comprehensive assessment platform utilizes two key tools:
- 'llm-kr-eval' for assessing language comprehension in a Q&A format (llm-kr-eval is a Korean adaptation of llm-jp-eval, originally developed in Japan, created for this leaderboard), and
- 'MT-Bench' for assessing generative capabilities through prompt dialogues.
The Horangi LLM Leaderboard employs a rigorous zero-shot evaluation method and, using W&B's Table feature, allows in-depth analysis of each question. The leaderboard enables interactive model comparisons and traces back to the original experiments.
This article delves into the leaderboard's features, evaluation methodology, detailed analyses, and specific evaluation tasks, offering insights into cutting-edge advancements in LLM performance assessment.
Here's what we'll be covering:
- Features of the Horangi Leaderboard 🐅
- The LLM Evaluation Method
- Evaluation by Category
- Deep Dive into llm-kr-eval
- Deep Dive into MT-bench-kr
- Explanation of Evaluation Tasks
Features of the Horangi Leaderboard 🐅
- Assessment of the Korean language proficiency of prominent LLM models
- Comprehensive evaluation using llm-kr-eval for assessing language comprehension in a Q&A format and MT-Bench for evaluating generative abilities through prompt dialogues 👓
- For llm-kr-eval, a stringent zero-shot evaluation to measure the model's raw capabilities 🌶️
- Using W&B's Table feature, not just average scores but also in-depth analysis of each question is possible 🔍
- Ability to interactively select models for comparison 🎰
- Ability to trace back from the W&B Report to the actual experiments conducted 🛣️
For those interested in learning more about this leaderboard, please see the following blog posts:
For running the leaderboard, please use Weights & Biases. For those interested in LLM development, W&B's whitepaper is also recommended.
For inquiries regarding this leaderboard in general, please contact contact-kr@wandb.com.

The LLM Evaluation Method
※ For llm-kr-eval, zero-shot evaluation is used, and scores are calculated over 100 questions from each test dataset. For the Wiki-derived data, the number of questions is set to 100 in total.
Overall average = (llm-kr-eval + MT-bench/10) / 2
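As a concrete illustration, here is a minimal sketch of that aggregation in Python. It assumes llm-kr-eval scores lie in the 0-1 range and MT-bench-kr scores on the usual 1-10 judge scale, which is why MT-bench is divided by 10 before averaging; the numbers below are made up.

```python
# Minimal sketch of the leaderboard's overall-average formula (illustrative values).
# Assumption: llm-kr-eval is already on a 0-1 scale, MT-bench on a 1-10 judge scale.
llm_kr_eval_score = 0.62   # hypothetical average over the llm-kr-eval tasks
mt_bench_score = 7.4       # hypothetical average MT-bench-kr judge score (1-10)

overall_average = (llm_kr_eval_score + mt_bench_score / 10) / 2
print(f"Overall average: {overall_average:.3f}")  # -> 0.680
```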
Run set (31)
Evaluation by Category
You can check the scores for each category of llm-kr-eval and MT-bench-kr (the categories are explained later). Please select the models you want to compare by pressing the 👁️ button in the table below.
Model list (4)
Deep Dive into llm-kr-eval
Detailed Analysis of the llm-kr-eval Leaderboard
Model list (31)
Detailed Output of llm-kr-eval
List of Outputs
Select the model you want to check by pressing the 👁️ mark in the Model list. For example, if you want to filter by the 'kornli' dataset, press the ▽ button at the bottom left of runs.summary["kaster_output_table_dev"] and enter the following query (refer to this article for a general explanation of queries).
row["target_dataset"]=="kornli"
For the example outputs, we have used 20 questions from each development dataset. Please note that the test data is not used in the example questions displayed below.
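If you prefer to work outside the UI, the same filter can be reproduced locally. The sketch below assumes you have exported the kaster output table to a CSV file (the file name is hypothetical); it is not part of the leaderboard's own tooling.

```python
# Reproducing the panel query row["target_dataset"]=="kornli" locally with pandas.
# "kaster_output_table_dev.csv" is a hypothetical export of the logged table.
import pandas as pd

df = pd.read_csv("kaster_output_table_dev.csv")
kornli_rows = df[df["target_dataset"] == "kornli"]
print(f"{len(kornli_rows)} rows for the kornli dataset")
```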
Model list (31)
Deep Dive into MT-bench-kr
Detailed Analysis of the MT-bench-kr Leaderboard
Model list (31)
MT-bench-kr output
You can display the model you want to check by pressing the 👁️ mark in the Model list. For instance, if you want to filter by the category 'coding', press the ▽ button at the bottom left of runs.summary["mtbench_output_table"] and enter the following query (refer to this article for a general explanation of queries).
row["category"]=="coding"
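Beyond filtering, the same table can be used to reproduce category-level averages locally. The sketch below is a rough illustration, assuming the table has been exported to a CSV and contains a per-question judge score in a column named 'score'; the file and column names are assumptions, not guaranteed by the leaderboard.

```python
# Hypothetical sketch: average MT-bench-kr judge scores per category from an
# exported table. The CSV file name and the "score" column name are assumptions.
import pandas as pd

df = pd.read_csv("mtbench_output_table.csv")
per_category = df.groupby("category")["score"].mean().sort_values(ascending=False)
print(per_category)
```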
Model list (31)
Explanation of Evaluation Tasks
This leaderboard is primarily operated by Weights & Biases. It presents the evaluation results of open and proprietary LLM models for the following tasks.
If you have requests for additional model validations, please contact us at contact-kr@wandb.com using your corporate or organization email address. Additionally, since our GitHub is public, you are also welcome to conduct evaluations in your own environment.
For the evaluation tasks, we are using the following:
- The evaluation framework and datasets of llm-kr-eval, a Korean adaptation of llm-jp-eval (originally developed in Japan) created for this leaderboard.
- The tasks of MT-bench, published by lm-sys; the Korean tasks were created by Weights & Biases.
The GitHub repository for this leaderboard is https://github.com/wandb/llm-leaderboard/tree/korean. Please feel free to use it; we also welcome pull requests.
llm-kr-eval
llm-kr-eval was developed for this leaderboard based on llm-jp-eval, which was originally developed in Japan. It provides datasets preprocessed from publicly available evaluation datasets, offered as 'kaster' (k + asterisk) for both tuning and evaluation purposes (details: Dataset.md in the llm-kr-eval GitHub). llm-kr-eval itself is also available on GitHub. It provides the following features:
- Generating instruction data (kaster) in the same format as the evaluation data prompts.
- Converting existing Korean evaluation data into datasets for text-generation tasks.
- Executing evaluations of large-scale language models across multiple datasets.
The list of supported datasets is as follows. The labels in parentheses indicate the evaluation method for each dataset: 'exact' for exact match, 'char_f1' for character-based F1 score, 'set_f1' for sentence-based F1 score, 'pearson'/'spearman' for correlation coefficients, and 'bleu' for BLEU score.
Examples from each dataset are provided in alpaca format, though the format is adapted appropriately to suit each model; a sketch of this record structure is shown after the list below.
NLI (Natural Language Inference): KorNLI(exact), KoBEST_HellaSwag(exact), KoBEST_COPA(exact)
QA (Question Answering): KoBEST_WiC(exact), KMMLU(exact)
RC (Reading Comprehension): KorSTS(pearson, spearman), KoBEST_SN(exact)
EL (Entity Linking): KLUE-NER(set_f1), KLUE-RE(exact)
FA (Fundamental Analysis): Korean-CommonGen(bleu)
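As referenced above, here is a minimal sketch of what an alpaca-style kaster record looks like. The field names follow the standard alpaca format; the instruction wording and the KorNLI label shown are illustrative paraphrases, not taken verbatim from the dataset.

```python
# Illustrative alpaca-style record for an NLI example (e.g., KorNLI).
# The actual Korean instruction text and label set are defined in llm-kr-eval.
kaster_record = {
    "instruction": "Decide whether the premise entails, contradicts, or is neutral "
                   "toward the hypothesis.",   # illustrative English paraphrase
    "input": "premise: ...\nhypothesis: ...",  # the question text goes here
    "output": "entailment",                    # the expected answer string
}
```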
MT-bench
MT-bench is a meticulously curated benchmark for LLMs, developed by lm-sys, that consists of multi-turn questions (paper / github). There was no Korean dataset for MT-bench, so one was prepared for this leaderboard (Korean tasks GitHub). These questions are designed to assess the ability of LLMs to follow the flow of a conversation and its instructions across multiple turns. They include both "general use cases" and "challenging instructions." There are 80 questions in total, categorized into the following eight categories.
- Writing
- Roleplay
- Extraction
- Reasoning
- Math
- Coding
- Knowledge I (STEM)
- Knowledge II (humanities/social science)
The following figure is a citation from the original paper, showing an example of the English version of the problems.
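To complement the figure, here is a hedged sketch of how a single multi-turn question is typically structured, following the question_id / category / turns layout used by the original MT-bench question files; the id and the turn texts below are invented for illustration.

```python
# Illustrative MT-bench-style record: one question with two conversation turns.
# The id and turn texts are invented; only the field layout mirrors MT-bench.
mtbench_question = {
    "question_id": 101,        # hypothetical id
    "category": "writing",
    "turns": [
        "Write a short blog post about a recent trip to Jeju Island.",
        "Rewrite your previous answer as a formal press release.",
    ],
}
```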