
Nejumi LLM Leaderboard 4

Enhancing Evaluation of Application Development Capabilities and AI Safety to Support Practical LLM Selection
Created on September 17 | Last edited on September 18

With Nejumi Leaderboard 4, we set out to raise the resolution of evaluation in response to the saturation problem of existing benchmarks.
📄 For those who believe that "evaluating models is important, but evaluating generative AI applications is even more critical," here is our latest white paper.

Evaluation Taxonomy

Main Leaderboard

Sorted by total score (the number in the left column is the evaluation job ID, not the rank).

Table columns (Japanese headers glossed in English):

- model_name
- model_size_category
- TOTAL_SCORE
- 汎用的言語性能(GLP)_AVG (General Language Performance average)
- GLP_応用的言語性能 (applied language performance)
- GLP_推論能力 (reasoning ability)
- GLP_知識・質問応答 (knowledge and question answering)
- GLP_基礎的言語性能 (basic language performance)
- GLP_アプリケーション開発 (application development)
- GLP_表現 (expression)
- GLP_翻訳 (translation)
- GLP_情報検索 (information retrieval)
- GLP_抽象的推論 (abstract reasoning)
- GLP_論理的推論 (logical reasoning)
- GLP_数学的推論 (mathematical reasoning)
- GLP_一般的知識 (general knowledge)
- GLP_専門的知識 (specialized knowledge)
- GLP_意味解析 (semantic analysis)
- GLP_構文解析 (syntactic analysis)
- GLP_コーディング (coding)
- GLP_関数呼び出し (function calling)
- アラインメント(ALT)_AVG (Alignment average)
- ALT_制御性 (controllability)
- ALT_倫理・道徳 (ethics and morality)
- ALT_毒性 (toxicity)
- ALT_バイアス (bias)
- ALT_真実性 (truthfulness)
- ALT_堅牢性 (robustness)
- AVG_jaster_0shot
- AVG_jaster_2shots
- AVG_mtbench
- AVG_swebench
- model_size
- model_release_date
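The per-category columns roll up into the GLP and ALT averages and a total score. The report does not spell out the aggregation formula, so the sketch below simply assumes an unweighted mean at each level; the category names and score values are illustrative, not taken from the leaderboard.

```python
# Hypothetical aggregation sketch. Assumption (not stated in the report):
# TOTAL_SCORE is the simple mean of the GLP average and the ALT average,
# each of which is the simple mean of its subcategory scores.
glp_scores = {"reasoning": 0.72, "knowledge_qa": 0.65, "application_dev": 0.58}
alt_scores = {"controllability": 0.81, "toxicity": 0.90, "robustness": 0.77}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

glp_avg = mean(glp_scores.values())      # average over GLP subcategories
alt_avg = mean(alt_scores.values())      # average over ALT subcategories
total_score = mean([glp_avg, alt_avg])   # equal weight to GLP and ALT

print(round(total_score, 3))
```

If the real leaderboard weights subcategories differently, only the two inner `mean` calls would change; the equal GLP/ALT weighting here is the key assumption.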
[Interactive table: Run set (77 runs)]
Breakdown of Each Model’s Features by Category

In the model list below, use the 👁️ icon to toggle which models are displayed.

[Interactive table: Run set (2 runs)]