
Heron VLM Leaderboard powered by Nejumi/WandB

This is the initial version of a leaderboard for evaluating Vision & Language models in Japanese.
Created on June 9 | Last edited on June 24

The Japanese version is available here.

What is the Heron VLM Leaderboard?

This leaderboard provides an evaluation and automatic scoring method for Vision & Language models. The evaluation is based on two benchmark datasets, LLaVA Bench (in the wild) and Heron Bench.
Technical support is provided by Turing Inc., while W&B Japan is responsible for building and operating the leaderboard.

Visual Question Answering (VQA)


[Leaderboard chart and table. Run set: 19. Columns: model_name, Average Score, ave_llava_itw, llava_complex, llava_conv, llava_detail, ave_heron, heron_complex, heron_conv, heron_detail.]
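The column names suggest that each ave_* column summarizes three sub-scores (complex, conv, detail). The following is a minimal sketch under that assumption only; the leaderboard's actual aggregation is not stated here, and the helper name average_subscores and the example scores are hypothetical.

```python
from statistics import mean


def average_subscores(scores: dict[str, float], prefix: str) -> float:
    """Unweighted mean of the <prefix>_complex, <prefix>_conv and <prefix>_detail scores.

    Assumption: the ave_* columns are plain averages of these three sub-scores.
    """
    keys = [f"{prefix}_{part}" for part in ("complex", "conv", "detail")]
    return mean(scores[key] for key in keys)


# Hypothetical scores, for illustration only.
scores = {"heron_complex": 6.0, "heron_conv": 7.0, "heron_detail": 8.0}
print(average_subscores(scores, "heron"))  # 7.0
```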


Model Comparison

From the list of Run sets, select the pair of models you want to compare by clicking on the eye icons 👁️.
The radar chart, LLaVA Bench table, and Heron Bench table are linked and displayed together. This allows you to simultaneously view the performance profiles of any pair of models and the differences in their responses to the same questions, enabling an interactive comparison.



LLaVA Bench (in the wild) Output Details

Select the model you want to inspect from the Model list by clicking its 👁️ icon. For example, to show only the rows in the coding category, click the ▽ button at the bottom left of runs.summary["llava_table"] and enter the following query (reference: general explanation article about queries).
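The same kind of category filter can also be applied offline once a table has been exported. The following is a minimal sketch, assuming each row is available as a dictionary with a category field. The helper name filter_by_category and the example rows (including the question_id and score fields) are hypothetical, and this is plain Python rather than the W&B query syntax.

```python
from typing import Any

Row = dict[str, Any]


def filter_by_category(rows: list[Row], category: str) -> list[Row]:
    """Return only the rows whose 'category' field equals `category`."""
    return [row for row in rows if row.get("category") == category]


# Hypothetical rows standing in for an exported llava_table.
rows = [
    {"question_id": 1, "category": "coding", "score": 7},
    {"question_id": 2, "category": "detail", "score": 5},
    {"question_id": 3, "category": "coding", "score": 9},
]
coding_rows = filter_by_category(rows, "coding")
print([row["question_id"] for row in coding_rows])  # [1, 3]
```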



Heron Bench Output Details

Select the model you want to inspect from the Model list by clicking its 👁️ icon. For example, to show only the rows in the coding category, click the ▽ button at the bottom left of runs.summary["heron_table"] and enter the following query (reference: general explanation article about queries).



Correspondence with Scores Published in arXiv

The scores here are on a different scale from those in the original paper by Turing Inc.: the paper reports values relative to GPT-4's scores, whereas this leaderboard reports absolute values. The correspondence between the two has been verified, as shown in the figure below.
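The relationship between the two scales can be illustrated with a small sketch. It assumes that a relative value is the model's absolute score divided by GPT-4's absolute score on the same benchmark, expressed as a percentage. The function name to_relative_score and the example numbers are hypothetical and are not taken from the paper or the leaderboard.

```python
def to_relative_score(model_score: float, gpt4_score: float) -> float:
    """Express an absolute score as a percentage of the GPT-4 score.

    Assumption: "relative to GPT-4" means model_score / gpt4_score * 100.
    """
    if gpt4_score <= 0:
        raise ValueError("gpt4_score must be positive")
    return 100.0 * model_score / gpt4_score


# Hypothetical numbers, for illustration only.
print(to_relative_score(model_score=4.0, gpt4_score=8.0))  # 50.0
```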


Automated Model Evaluation by WandB Automations

TBD
