
Nejumi LLM Leaderboard3

Evaluate the Japanese language capabilities of prominent LLMs from the broad perspectives of language comprehension, application skills, and alignment


① Features of the Nejumi Leaderboard
  • Evaluates the Japanese language capabilities of prominent LLMs from the broad perspectives of language comprehension, application skills, and alignment 📊
  • Using WandB's Table feature, you can dive deep into individual questions rather than just average scores (see the sketch after this list) 🔍
  • Allows interactive selection of models for comparison 🎰
  • Ability to trace back to actual experiments from WandB Reports 🛣️
  • Evaluation scripts are also public! Possible to build private leaderboards in-house 🤫
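
As a rough illustration of the per-question drill-down, here is a minimal sketch of logging per-question results as a WandB Table. The project name, columns, and rows are illustrative assumptions, not the leaderboard's actual schema.

```python
# A minimal sketch of logging per-question results as a WandB Table.
# Project name, columns, and rows are illustrative, not the leaderboard's schema.
import wandb

run = wandb.init(project="nejumi-private-leaderboard")  # hypothetical project

# One row per question, so individual outputs can be inspected in the UI
# instead of only the aggregated averages.
table = wandb.Table(columns=["dataset", "question", "model_output", "expected", "score"])
table.add_data("jcommonsenseqa", "...", "choice0", "choice0", 1.0)
table.add_data("mawps", "5 + 7 = ?", "11", "12", 0.0)

run.log({"per_question_results": table})
run.finish()
```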
② Nejumi Leaderboard Evaluation Scripts
The evaluation scripts and evaluation methods are publicly available.
③ Leaderboard Usage and Inquiries
Some of the datasets used for the Alignment (ALT) category prohibit redistribution or unauthorized commercial use; when running the evaluations yourself, please check each dataset's license terms. When issuing press releases, please cite only the General Language Processing (GLP) score as an indicator.
💡
For general inquiries about this leaderboard or if you're interested in creating a private leaderboard, please contact contact-jp@wandb.com
④ Information on LLM Evaluation
For those interested in the taxonomy, evaluation metrics, and evaluation methods used in the leaderboard, please refer to the "Best Practices for LLM Evaluation" section in the W&B whitepaper. Note that Weights & Biases has also published other LLM whitepapers which you may find useful.



Overall Evaluation

  • For llm-jp-eval (jaster), we use a 2-shot approach and evaluate 100 questions from each test dataset. For the Wiki datasets, the number of questions is set to total 100 across the entire dataset.
  • Each score is scaled to the range 0–1 (1 being best) before aggregation, so the averaged scores are also out of 1 (see the sketch after this list).
  • Definitions:
GLP: General Language Processing
ALT: Alignment
Total AVG = (Avg. GLP + Avg. ALT) / 2
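
A minimal sketch of this aggregation, assuming the per-category scores have already been scaled to 0–1; the category names and values below are illustrative:

```python
# A minimal sketch of the aggregation above. Per-category scores are assumed
# to be pre-scaled to 0-1; the category names and values are illustrative.
glp_scores = {"expression": 0.71, "translation": 0.64, "reasoning": 0.58}
alt_scores = {"controllability": 0.80, "ethics": 0.75, "bias": 0.69}

avg_glp = sum(glp_scores.values()) / len(glp_scores)  # Avg. GLP
avg_alt = sum(alt_scores.values()) / len(alt_scores)  # Avg. ALT

# Total AVG = (Avg. GLP + Avg. ALT) / 2
total_avg = (avg_glp + avg_alt) / 2
print(f"GLP: {avg_glp:.3f}  ALT: {avg_alt:.3f}  Total AVG: {total_avg:.3f}")
```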




GLP_expression

🗂️ Relevant Evaluation Datasets/Frameworks

MT-bench: roleplay, humanities, writing

📋 Results

🔍 Detailed Results

GLP_translation

🗂️ Relevant Evaluation Datasets/Frameworks

jaster: alt-e-to-j, alt-j-to-e, wikicorpus-e-to-j, wikicorpus-j-to-e (each evaluated with 0-shot and 2-shot)
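
As a rough illustration of the 0-shot/2-shot setup, here is a minimal sketch of assembling a 2-shot jaster-style translation prompt. The instruction text, field labels, and examples are hypothetical; the actual evaluation scripts define their own templates.

```python
# A minimal sketch of a 2-shot jaster-style prompt for the translation tasks.
# Instruction text, field labels, and examples are hypothetical.
def build_prompt(instruction: str, few_shot: list[dict], question: str) -> str:
    parts = [instruction]
    for ex in few_shot[:2]:  # 2-shot: prepend two worked examples
        parts.append(f"入力: {ex['input']}\n出力: {ex['output']}")
    parts.append(f"入力: {question}\n出力:")  # the model completes this line
    return "\n\n".join(parts)

prompt = build_prompt(
    "次の日本語を英語に翻訳してください。",  # "Translate the following Japanese into English."
    [{"input": "こんにちは。", "output": "Hello."},
     {"input": "ありがとう。", "output": "Thank you."}],
    "お元気ですか。",
)
print(prompt)
```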

📋 Results

🔍 Detailed Results

GLP_summarization

Not implemented yet

GLP_information extraction

🗂️ Relevant Evaluation Datasets/Frameworks

jaster: jsquad (each evaluated with 0-shot and 2-shot)

📋 Results

🔍 Detailed Results

GLP_reasoning

🗂️ Relevant Evaluation Datasets/Frameworks

MT-bench: reasoning

📋 Results

🔍 Detailed Results

GLP_mathematical reasoning

🗂️ Relevant Evaluation Datasets/Frameworks

jaster: mawps, mgsm (each evaluated with 0-shot and 2-shot)
MT-bench: math

📋 Results

🔍 Detailed Results

GLP_entity extraction

🗂️ Relevant Evaluation Datasets/Frameworks

jaster: wiki_ner, wiki_coreference, chABSA (each evaluated with 0-shot and 2-shot)
MT-bench: extraction

📋 Results

🔍 Detailed Results

GLP_knowledge/QA

🗂️ Relevant Evaluation Datasets/Frameworks

jaster: JCommonsenseQA, JEMHopQA, JMMLU, NIILC, aio (each evaluated with 0-shot and 2-shot)
MT-bench: stem

📋 Results

🔍 Detailed Results

GLP_english

🗂️ Relevant Evaluation Datasets/Frameworks

jaster: MMLU (each evaluated with 0-shot and 2-shot)

📋 Results

🔍 Detailed Results

GLP_semantic analysis

🗂️ Relevant Evaluation Datasets/Frameworks

jaster: JNLI, JaNLI, JSeM, JSICK, Jamp (each evaluated with 0-shot and 2-shot)

📋 Results

🔍 Detailed Results

GLP_syntactic analysis

🗂️ Relevant Evaluation Datasets/Frameworks

jaster: JCoLA-in-domain, JCoLA-out-of-domain, JBLiMP, wiki_reading, wiki_pas, wiki_dependency (each evaluated with 0-shot and 2-shot)

📋 Results

🔍 Detailed Results

ALT_controllability

🗂️ Relevant Evaluation Datasets/Frameworks

jaster: datasets whose response format can be evaluated automatically (chabsa, commonsensemoralja, jamp, janli, jblimp, jcola-in-domain, jcola-out-of-domain, jcommonsenseqa, jmmlu, jnli, jsem, jsick, kuci, mawps, mgsm, mmlu_en, wiki_dependency, wiki_ner); see the sketch below
LCTG-Bench
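
A minimal sketch of this kind of automatic format check, which scores whether a response matches the expected output format independently of its correctness; the per-dataset rules below are illustrative assumptions, not the leaderboard's actual criteria:

```python
# A minimal sketch of an automatic response-format check: controllability
# scores whether the output matches the expected format, independent of
# correctness. The per-dataset rules are illustrative assumptions.
import re

def format_ok(dataset: str, response: str) -> bool:
    response = response.strip()
    if dataset == "mawps":    # a bare number is expected
        return re.fullmatch(r"-?\d+(\.\d+)?", response) is not None
    if dataset == "jnli":     # one label from a fixed set is expected
        return response in {"entailment", "contradiction", "neutral"}
    if dataset == "wiki_ner": # a single line of extracted entities is expected
        return bool(response) and "\n" not in response
    return True               # datasets without a rule pass by default

print(format_ok("mawps", "42"))    # True
print(format_ok("jnli", "maybe"))  # False
```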

📋 Results

🔍 Detailed Results

ALT_ethics

🗂️ Relevant Evaluation Datasets/Frameworks

jaster: JCommonsenseMorality

📋 Results

🔍 Detailed Results

ALT_toxicity

🗂️ Relevant Evaluation Datasets/Frameworks

LINE Yahoo Reliability Evaluation Dataset

📋 Results

🔍 Detailed Results

ALT_bias

🗂️ Relevant Evaluation Datasets/Frameworks

JBBQ

📋 Results

🔍 Detailed Results

ALT_truthfulness

Not implemented yet

ALT_robustness

🗂️ Relevant Evaluation Datasets/Frameworks

📋 Results

🔍 Detailed Results

Appendix

Deeper dive into llm-jp-eval

llm-jp-eval overview (0-shot / 4-shot)

llm-jp-eval leaderboard details (0-shot / 4-shot)

Deeper dive into MT-bench-jp

MT-bench overview

MT-bench output details

Taxonomy

Explanation of evaluation datasets

llm-jp-eval (jaster)

Japanese MT-Bench

LCTG-Bench

LINE Yahoo Reliability Evaluation Dataset

BBQ/JBBQ

MMLU/JMMLU

MT-bench-jp leaderboard details

MT-bench-jp output details