Horangi LLM Leaderboard: Evaluating Korean Language Proficiency
Assessing the Korean language proficiency of prominent LLMs from the perspectives of both language comprehension and language generation.

The 'Horangi LLM Leaderboard' presents a novel approach to evaluating the Korean language proficiency of leading Large Language Models (LLMs). This comprehensive assessment platform utilizes two key tools:
- 'llm-kr-eval' for assessing language comprehension in a Q&A format (llm-kr-eval is a Korean adaptation of llm-jp-eval, originally developed in Japan, created for this leaderboard), and
- 'MT-Bench' for assessing generative capabilities through prompt dialogues.
The Horangi LLM Leaderboard employs a rigorous zero-shot evaluation method and, using W&B's Table feature, allows in-depth analysis of each question. The leaderboard enables interactive model comparisons and traces back to the original experiments.
This article delves into the leaderboard's features, evaluation methodology, detailed analyses, and specific evaluation tasks, offering insights into cutting-edge advancements in LLM performance assessment.
Here's what we'll be covering:
- Features of the Horangi Leaderboard 🐅
- The LLM Evaluation Method
- Evaluation by Category
- Deep Dive into llm-kr-eval
- Deep Dive into MT-bench-kr
- Explanation of Evaluation Tasks
Features of the Horangi Leaderboard 🐅
- Assessment of the Korean language proficiency of prominent LLM models
- Comprehensive evaluation using llm-kr-eval for assessing language comprehension in a Q&A format and MT-Bench for evaluating generative abilities through prompt dialogues 👓
- For llm-kr-eval, a stringent zero-shot evaluation to measure the model's raw capabilities 🌶️
- Using W&B's Table feature, not just average scores but also in-depth analysis of each question is possible 🔍
- Ability to interactively select models for comparison 🎰
- Ability to trace back from the W&B Report to the actual experiments conducted 🛣️
For those interested in learning more about this leaderboard, please see the following blog posts:
For running the leaderboard, please use Weights & Biases. For those interested in LLM development, W&B's whitepaper is also recommended.
For inquiries regarding this leaderboard in general, please contact contact-kr@wandb.com.

The LLM Evaluation Method
※ For llm-kr-eval, zero-shot evaluation is used, and scores are calculated over 100 questions from each test dataset. For the Wiki-derived data, the number of questions is set to 100 in total.
Overall average = (llm-kr-eval + MT-bench/10) / 2
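As a concrete illustration, here is a minimal sketch of that aggregation in Python. It assumes llm-kr-eval scores lie in the 0-1 range and MT-bench-kr scores on the usual 1-10 judge scale, which is why MT-bench is divided by 10 before averaging; the numbers below are made up.

```python
# Minimal sketch of the leaderboard's overall-average formula (illustrative values).
# Assumption: llm-kr-eval is already on a 0-1 scale, MT-bench on a 1-10 judge scale.
llm_kr_eval_score = 0.62   # hypothetical average over the llm-kr-eval tasks
mt_bench_score = 7.4       # hypothetical average MT-bench-kr judge score (1-10)

overall_average = (llm_kr_eval_score + mt_bench_score / 10) / 2
print(f"Overall average: {overall_average:.3f}")  # -> 0.680
```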
Run set (31)
Evaluation by Category
You can check the scores for each category of llm-kr-eval and MT-bench-kr (the categories are explained later). Please select the models you want to compare by pressing the 👁️ button in the table below.
Model list (4)
Deep Dive into llm-kr-eval
Detailed Analysis of the llm-kr-eval Leaderboard
Model list (31)
Detailed Output of llm-kr-eval
List of Outputs
Select the model you want to check by pressing the 👁️ mark in the Model list. For example, if you want to filter by the 'kornli' dataset, press the ▽ button at the bottom left of runs.summary["kaster_output_table_dev"] and enter the following query (refer to this article for a general explanation of queries).
row["target_dataset"]=="kornli"
For the example outputs, we have used 20 questions from each development dataset. Please note that the test data is not used in the example questions displayed below.
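If you prefer to work outside the UI, the same filter can be reproduced locally. The sketch below assumes you have exported the kaster output table to a CSV file (the file name is hypothetical); it is not part of the leaderboard's own tooling.

```python
# Reproducing the panel query row["target_dataset"]=="kornli" locally with pandas.
# "kaster_output_table_dev.csv" is a hypothetical export of the logged table.
import pandas as pd

df = pd.read_csv("kaster_output_table_dev.csv")
kornli_rows = df[df["target_dataset"] == "kornli"]
print(f"{len(kornli_rows)} rows for the kornli dataset")
```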
Model list (31)
Deep Dive into MT-bench-kr
Detailed Analysis of the MT-bench-kr Leaderboard
Model list (31)
MT-bench-kr output
You can display the model you want to check by pressing the 👁️ mark in the Model list. For instance, if you want to filter by the category 'coding', press the ▽ button at the bottom left of runs.summary["mtbench_output_table"] and enter the following query (refer to this article for a general explanation of queries).
row["category"]=="coding"
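Beyond filtering, the same table can be used to reproduce category-level averages locally. The sketch below is a rough illustration, assuming the table has been exported to a CSV and contains a per-question judge score in a column named 'score'; the file and column names are assumptions, not guaranteed by the leaderboard.

```python
# Hypothetical sketch: average MT-bench-kr judge scores per category from an
# exported table. The CSV file name and the "score" column name are assumptions.
import pandas as pd

df = pd.read_csv("mtbench_output_table.csv")
per_category = df.groupby("category")["score"].mean().sort_values(ascending=False)
print(per_category)
```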
Model list (31)
Explanation of Evaluation Tasks
This leaderboard is primarily operated by Weights & Biases. It presents the evaluation results of open and proprietary LLM models for the following tasks.
If you have requests for additional model validations, please contact us at contact-kr@wandb.com using your corporate or organization email address. Additionally, since our GitHub is public, you are also welcome to conduct evaluations in your own environment.
For the evaluation tasks, we are using the following:
- The evaluation framework and datasets of llm-kr-eval, a Korean adaptation of llm-jp-eval (originally developed in Japan) created for this leaderboard.
- The tasks of MT-bench, published by lm-sys; the Korean tasks were created by Weights & Biases.
The GitHub repository for this leaderboard is https://github.com/wandb/llm-leaderboard/tree/korean. Please feel free to use it; we also welcome pull requests.
llm-kr-eval
llm-kr-eval was developed for this leaderboard based on llm-jp-eval, which was originally developed in Japan. It provides datasets preprocessed from publicly available evaluation datasets, offered as 'kaster' (k + asterisk) for both tuning and evaluation purposes (details: Dataset.md in the llm-kr-eval GitHub). llm-kr-eval itself is also available on GitHub. It provides the following features:
- Generating instruction data (kaster) in the same format as the evaluation data prompts.
- Converting existing Korean evaluation data into datasets for text-generation tasks.
- Executing evaluations of large-scale language models across multiple datasets.
The list of supported datasets is as follows. The labels in parentheses indicate the evaluation method for each dataset: 'exact' for exact match, 'char_f1' for character-based F1 score, 'set_f1' for sentence-based F1 score, 'pearson'/'spearman' for correlation coefficients, and 'bleu' for BLEU score.
Examples from each dataset are provided in alpaca format, though the format is adapted appropriately to suit each model; a sketch of this record structure is shown after the list below.
NLI (Natural Language Inference): KorNLI(exact), KoBEST_HellaSwag(exact), KoBEST_COPA(exact)
QA (Question Answering): KoBEST_WiC(exact), KMMLU(exact)
RC (Reading Comprehension): KorSTS(pearson, spearman), KoBEST_SN(exact)
EL (Entity Linking): KLUE-NER(set_f1), KLUE-RE(exact)
FA (Fundamental Analysis): Korean-CommonGen(bleu)
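As referenced above, here is a minimal sketch of what an alpaca-style kaster record looks like. The field names follow the standard alpaca format; the instruction wording and the KorNLI label shown are illustrative paraphrases, not taken verbatim from the dataset.

```python
# Illustrative alpaca-style record for an NLI example (e.g., KorNLI).
# The actual Korean instruction text and label set are defined in llm-kr-eval.
kaster_record = {
    "instruction": "Decide whether the premise entails, contradicts, or is neutral "
                   "toward the hypothesis.",   # illustrative English paraphrase
    "input": "premise: ...\nhypothesis: ...",  # the question text goes here
    "output": "entailment",                    # the expected answer string
}
```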
MT-bench
MT-bench is a meticulously curated benchmark for LLMs, developed by lm-sys, that consists of multi-turn questions (paper / github). There was no Korean dataset for MT-bench, so one was prepared for this leaderboard (Korean tasks GitHub). These questions are designed to assess the ability of LLMs to follow the flow of a conversation and its instructions across multiple turns. They include both "general use cases" and "challenging instructions." There are 80 questions in total, categorized into the following eight categories.
- Writing
- Roleplay
- Extraction
- Reasoning
- Math
- Coding
- Knowledge I (STEM)
- Knowledge II (humanities/social science)
The following figure is a citation from the original paper, showing an example of the English version of the problems.
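To complement the figure, here is a hedged sketch of how a single multi-turn question is typically structured, following the question_id / category / turns layout used by the original MT-bench question files; the id and the turn texts below are invented for illustration.

```python
# Illustrative MT-bench-style record: one question with two conversation turns.
# The id and turn texts are invented; only the field layout mirrors MT-bench.
mtbench_question = {
    "question_id": 101,        # hypothetical id
    "category": "writing",
    "turns": [
        "Write a short blog post about a recent trip to Jeju Island.",
        "Rewrite your previous answer as a formal press release.",
    ],
}
```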