Nejumi LLM Leaderboard3
Evaluating the Japanese language capabilities of prominent LLMs from the broad perspectives of language comprehension, application skills, and alignment

① Features of the Nejumi Leaderboard
- Evaluates the Japanese language capabilities of prominent LLMs from the broad perspectives of language comprehension, application skills, and alignment 📊
- Using WandB's Table feature, you can dive deep into individual questions rather than just average scores (see the sketch after this list) 🔍
- Allows interactive selection of models for comparison 🎰
- Lets you trace back to the actual experiments from WandB Reports 🛣️
- The evaluation scripts are also public, so you can build your own private leaderboard in-house 🤫
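As a rough illustration of the Table-based drill-down mentioned in the list above, the following is a minimal sketch that logs per-question results to a WandB Table; the project name, column names, and values are hypothetical placeholders, not the leaderboard's actual schema.

```python
import wandb

# Minimal sketch: log per-question evaluation results as a WandB Table so that
# individual questions can be inspected in the UI, not just average scores.
# The project name, columns, and rows below are hypothetical placeholders.
run = wandb.init(project="my-private-llm-leaderboard", name="example-eval")

table = wandb.Table(columns=["dataset", "question", "model_output", "expected", "score"])
table.add_data("jsquad", "example question", "model answer", "gold answer", 1.0)
table.add_data("mawps", "example word problem", "42", "41", 0.0)

run.log({"per_question_results": table})
run.finish()
```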
② Nejumi Leaderboard Evaluation Scripts
The evaluation scripts and evaluation methods are publicly available at:
③ Leaderboard Usage and Inquiries
Some of the datasets used for ALIGNMENT (ALT) prohibit redistribution or unauthorized commercial use. When running the evaluation yourself, please check each dataset's license terms. When issuing press releases, please use only the General Language Processing (GLP) score as an indicator.
💡
For general inquiries about this leaderboard or if you're interested in creating a private leaderboard, please contact contact-jp@wandb.com
④ Information on LLM Evaluation
For those interested in the taxonomy, evaluation metrics, and evaluation methods used in the leaderboard, please refer to the "Best Practices for LLM Evaluation" section in the W&B whitepaper. Note that Weights & Biases has also published other LLM whitepapers which you may find useful.

Best Practices for Training LLMs from Scratch
In this whitepaper, we share the know-how on LLM development that we have accumulated to date.

LLM Fine-Tuning and Prompt Engineering
This whitepaper walks you through fine-tuning and prompt engineering end to end.

Best Practices for Evaluating Large Language Models (LLMs)
Drawing on Weights & Biases' experience developing and operating Nejumi.AI, one of the largest Japanese LLM evaluation leaderboards in Japan, this whitepaper shares best practices for evaluating generative AI and LLMs.

Overall Evaluation
- For llm-jp-eval (jaster), each test dataset is evaluated on 100 questions, in both 0-shot and 2-shot settings. For the Wiki datasets, the number of questions is capped at 100 in total across the entire dataset.
- Each score is scaled to the range 0 to 1 (1 being best) before aggregation, so the resulting average score is out of 1.
- Definitions:
GLP: General Language Processing
ALT: Alignment
Total AVG = (Avg. GLP + Avg. ALT) / 2
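As a rough illustration of this aggregation, here is a minimal sketch in Python; the category names and scores are hypothetical placeholders, not the leaderboard's actual subcategories or results.

```python
# Minimal sketch of the aggregation described above.
# Each subcategory score is assumed to already be scaled to the 0-1 range;
# the category names and values are hypothetical placeholders.
glp_scores = {"expression": 0.72, "translation": 0.65, "reasoning": 0.58}
alt_scores = {"controllability": 0.81, "ethics": 0.77, "toxicity": 0.69}

avg_glp = sum(glp_scores.values()) / len(glp_scores)
avg_alt = sum(alt_scores.values()) / len(alt_scores)
total_avg = (avg_glp + avg_alt) / 2  # Total AVG = (Avg. GLP + Avg. ALT) / 2

print(f"GLP: {avg_glp:.3f}  ALT: {avg_alt:.3f}  Total: {total_avg:.3f}")
```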
GLP_expression
🗂️ Relevant Evaluation Datasets/Frameworks
MT-bench: roleplay, humanities, writing
📋 Results
🔍 Detailed Results
GLP_translation
🗂️ Relevant Evaluation Datasets/Frameworks
jaster: alt-e-to-j, alt-j-to-e, wikicorpus-e-to-j, wikicorpus-j-to-e (Each conducted with 0-shot and 2-shot)
📋 Results
🔍 Detailed Results
GLP_summarization
Not implemented yet
GLP_information extraction
🗂️ Relevant Evaluation Datasets/Frameworks
jaster: jsquad (Each conducted with 0-shot and 2-shot)
📋 Results
🔍 Detailed Results
GLP_reasoning
🗂️ Relevant Evaluation Datasets/Frameworks
MT-bench: reasoning
📋 Results
🔍 Detailed Results
GLP_mathematical reasoning
🗂️ Relevant Evaluation Datasets/Frameworks
jaster: mawps, mgsm (Each conducted with 0-shot and 2-shot)
MT-bench: math
📋 Results
🔍 Detailed Results
GLP_entity extraction
🗂️ Relevant Evaluation Datasets/Frameworks
jaster: wiki_ner, wiki_coreference, chABSA (Each conducted with 0-shot and 2-shot)
MT-bench: extraction
📋 Results
🔍 Detailed Results
GLP_knowledge/QA
🗂️ Relevant Evaluation Datasets/Frameworks
jaster: JCommonsenseQA, JEMHopQA, JMMLU, NIILC, aio (Each conducted with 0-shot and 2-shot)
MT-bench: stem
📋 Results
🔍 Detailed Results
GLP_english
🗂️ Relevant Evaluation Datasets/Frameworks
jaster: MMLU (Each conducted with 0-shot and 2-shot)
📋 Results
🔍 Detailed Results
GLP_semantic analysis
🗂️ Relevant Evaluation Datasets/Frameworks
jaster: JNLI, JaNLI, JSeM, JSICK, Jamp (Each conducted with 0-shot and 2-shot)
📋 Results
🔍 Detailed Results
GLP_syntactic analysis
🗂️ Relevant Evaluation Datasets/Frameworks
jaster: JCoLA-in-domain, JCoLA-out-of-domain, JBLiMP, wiki_reading, wiki_pas, wiki_dependency (Each conducted with 0-shot and 2-shot)
📋 Results
🔍 Detailed Results
ALT_controllability
🗂️ Relevant Evaluation Datasets/Frameworks
jaster: Metrics that can automatically evaluate the format of responses (chabsa, commonsensemoralja, jamp, janli, jblimp, jcola-in-domain, jcola-out-of-domain, jcommonsenseqa, jmmlu, jnli, jsem, jsick, kuci, mawps, mgsm, mmlu_en, wiki_dependency, wiki_ner); see the sketch below
LCTG bench
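To give a sense of what a format-based automatic metric can look like, below is a minimal sketch of a choice-membership check for a multiple-choice style task; this is an illustrative assumption, not the actual jaster or LCTG-Bench implementation.

```python
# Minimal sketch of an automatic format ("controllability") check:
# does the model's raw output consist solely of one of the allowed labels?
# Illustrative assumption only; not the leaderboard's actual metric code.
def follows_choice_format(output: str, allowed_labels: list[str]) -> bool:
    return output.strip() in allowed_labels

labels = ["entailment", "contradiction", "neutral"]  # hypothetical NLI-style label set
print(follows_choice_format("entailment", labels))                 # True
print(follows_choice_format("I think it is entailment.", labels))  # False
```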
📋 Results
🔍 Detailed Results
ALT_ethics
🗂️ Relevant Evaluation Datasets/Frameworks
jaster: JCommonsenseMorality
📋 Results
🔍 Detailed Results
ALT_toxicity
🗂️ Relevant Evaluation Datasets/Frameworks
📋 Results
🔍 Detailed Results
ALT_bias
🗂️ Relevant Evaluation Datasets/Frameworks
JBBQ
📋 Results
🔍 Detailed Results
ALT_truthfulness
Not implemented yet
ALT_robustness
🗂️ Relevant Evaluation Datasets/Frameworks
📋 Results
🔍 Detailed Results
Appendix
Deeper dive into llm-jp-eval
llm-jp-eval overview (0-shot / 4-shot)
llm-jp-eval leaderboard details (0-shot / 4-shot)
Deeper dive into MT-bench-jp
MT-bench overview
MT-bench output details
Taxonomy
Explanation of evaluation datasets
llm-jp-eval (jaster)
Japanese MT-Bench
LCTG-Bench
LINE Yahoo Reliability Evaluation Dataset
BBQ/JBBQ
MMLU/JMMLU
MT-bench-ja leaderboard details
MT-bench-jp output details