
Eisvogel: Evaluating German Language Proficiency

Evaluating LLMs for German language proficiency on a wide variety of tasks. This is an evolving leaderboard, with new models and tasks added periodically.
A kingfisher (Alcedo atthis) with a wooden plaque.
In the rapidly evolving field of machine learning, large language models (LLMs) have shown remarkable advancements. However, assessing their proficiency in specific languages, such as German, remains a critical challenge. To address this, we introduce Eisvogel, a German LLM leaderboard: a dedicated platform for evaluating and comparing LLMs on their German language capabilities.
Our leaderboard isn’t a static measure; as the landscape of language models expands, so will the leaderboard. Currently, our evaluation suite is built on the Holistic Evaluation of Language Models (HELM) framework, which is by now a time-tested industry standard.


🇩🇪 German LLM Leaderboard (01/01/2025)



The table above will be updated periodically with new models and more tasks.
If you are training multilingual foundation models or fine-tuned models specific to the German language, or if you have an interesting task/dataset for testing LLMs, do reach out at
[ayusht at wandb.com].

We would love to incorporate your task and evaluate your model.
💡
This report will document the technical details and methodology of building this German evaluation suite.

Glossary

  • Mean Win Rate (MWR): The mean win rate reflects how often a model scores higher than the other models across scenarios. For each scenario we compute the default accuracy metric—such as exact match or F1 score—and then, instead of averaging metrics that have different scales and interpretations, we record how frequently the model outperforms each of the other models. The mean win rate is the average of these win fractions across all scenarios, which makes it a meaningful aggregate even when the underlying metrics differ in scale or units. Note, however, that the mean win rate is only meaningful relative to a fixed set of compared models and carries no meaning in isolation. (A small worked example follows this glossary.)
  • EM: The exact match metric evaluates correctness by checking if the model’s output matches the reference answer precisely, character by character. For an exact match, the generated output must be identical to the reference answer as a string, with no variations or discrepancies allowed.
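To make the mean win rate concrete, here is a minimal worked sketch in Python. The model names and scores are made up for illustration and are not leaderboard results; HELM's own implementation may also handle ties differently.

# Hypothetical per-scenario scores (e.g., exact match); NOT real leaderboard numbers.
scores = {
    "model_a": {"mmmlu_de": 0.71, "mgsm_de": 0.55},
    "model_b": {"mmmlu_de": 0.64, "mgsm_de": 0.60},
    "model_c": {"mmmlu_de": 0.58, "mgsm_de": 0.40},
}

def mean_win_rate(scores):
    """For each model, average (over scenarios) the fraction of other models it beats."""
    models = list(scores)
    scenarios = list(next(iter(scores.values())))
    mwr = {}
    for m in models:
        per_scenario = []
        for s in scenarios:
            others = [o for o in models if o != m]
            wins = sum(scores[m][s] > scores[o][s] for o in others)
            per_scenario.append(wins / len(others))
        mwr[m] = sum(per_scenario) / len(per_scenario)
    return mwr

print(mean_win_rate(scores))
# model_a beats both models on mmmlu_de (1.0) and one of two on mgsm_de (0.5) -> MWR 0.75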

Technical Details/Methodology

Holistic Evaluation of Language Models (HELM) is a comprehensive framework designed to evaluate the capabilities and performance of language models across a wide range of tasks and metrics. Our German LLM leaderboard is built on top of a fork of HELM.
HELM does not extensively taxonomize the world’s languages; its built-in scenarios predominantly target English-only models and cover English dialects and varieties. To build a multilingual (here, German) evaluation suite on top of HELM, we wrote custom scenarios for the following tasks in German:
  • Multilingual Massive Multitask Language Understanding (MMMLU)
  • Multilingual Grade School Math Benchmark (MGSM)
You can find the configuration file (.conf) for our German evaluation suite here.
Before settling on HELM for our evaluations, we explored using lm-evaluation-harness. However, as highlighted in recent discussions and publications, the majority of API-based models do not support returning logits with echo=True, complicating the use of loglikelihood-based evaluations for multi-choice tasks like MMLU and ARC. Given these limitations and the challenges in adapting logit biases for meaningful task evaluations, we opted for HELM, which effectively addresses these concerns by facilitating generative evaluations. This approach aligns better with the capabilities of most API-based models and ensures a more consistent evaluation framework. This is further highlighted by one of the maintainers of lm-evaluation-harness in this issue comment.
Note, however, that we are aware of the limitations of generation-based evaluations, which we will address in a later section. Let's cover the tasks we have implemented so far and look at the results granularly.

Multilingual Massive Multitask Language Understanding (MMMLU)

MMLU (and its translated counterpart MMMLU) covers question answering tasks posed as multiple-choice QA (select A, B, C, or D), where the model is allowed to generate a maximum of 1 token. The benchmark contains a diverse set of 57 tasks testing problem solving and general knowledge across STEM, the humanities, the social sciences, and more. Since the model is only allowed to generate 1 token, we are essentially probing the knowledge encoded in the weights during pre-training (and during instruction fine-tuning, depending on the selected model).

Technical details

  • The MMMLU benchmark was created by OpenAI by translating the original MMLU benchmark with human annotators. We selected this benchmark because of its higher degree of translation accuracy. The complete MMMLU benchmark covers 14 languages; check out the data card here. Of course, we are only using the DE_DE subset of the dataset for this leaderboard.
  • The translation is done only for the test set. Since we do not have access to a translated train or validation set, we evaluate all models in a zero-shot setting.
    • The system prompt is given below:
    • Beantworten Sie die folgenden Multiple-Choice-Fragen zu {}. Jede Frage hat vier Antwortmöglichkeiten: A, B, C oder D. Wählen Sie die passendste Antwort und geben Sie nur den entsprechenden Buchstaben an. (English: "Answer the following multiple-choice questions about {}. Each question has four answer options: A, B, C, or D. Choose the most appropriate answer and give only the corresponding letter.")
    • This system prompt is a modification [explain modification in appendix] of the MMLU prompt, in that we explicitly ask the model to return its response as A, B, C, or D. Not doing so in a zero-shot setting led to incomparable results due to the varying verbosity of the models. We acknowledge that a single system prompt for every model is not the ideal way to probe encoded knowledge, but explicitly asking each model to do what is intended gives every model a fair chance.
  • Exact Match (EM) is the metric of choice. The correctness condition is that the model's generation matches the correct reference exactly as a string. (A minimal sketch of this zero-shot setup follows this list.)
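Below is a minimal sketch of what this zero-shot, one-token-generation setup looks like outside of HELM. It is illustrative only: it assumes the DE_DE subset of openai/MMMLU on the Hugging Face Hub (column names taken from the data card and may need adjusting), and generate() is a placeholder for whichever model API is being evaluated; the actual runs go through our HELM fork.

from datasets import load_dataset

# Assumption: DE_DE split of OpenAI's MMMLU on the Hugging Face Hub.
dataset = load_dataset("openai/MMMLU", "DE_DE", split="test")

SYSTEM_PROMPT = (
    "Beantworten Sie die folgenden Multiple-Choice-Fragen zu {}. "
    "Jede Frage hat vier Antwortmöglichkeiten: A, B, C oder D. "
    "Wählen Sie die passendste Antwort und geben Sie nur den entsprechenden Buchstaben an."
)

def build_prompt(row):
    """Format one zero-shot multiple-choice question (column names assumed from the data card)."""
    return (
        f"{row['Question']}\n"
        f"A. {row['A']}\nB. {row['B']}\nC. {row['C']}\nD. {row['D']}\n"
        "Antwort:"
    )

def exact_match(prediction, reference):
    """EM: the generated string must equal the reference exactly."""
    return prediction.strip() == reference.strip()

def evaluate_subject(generate, rows, subject_name):
    """Zero-shot accuracy for one subject; `generate` is a placeholder model call (max_tokens=1)."""
    system = SYSTEM_PROMPT.format(subject_name)
    correct = sum(
        exact_match(generate(system, build_prompt(row), max_tokens=1), row["Answer"])
        for row in rows
    )
    return correct / len(rows)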

Results

Below are the model-wise scores on the MMMLU benchmark. We also provide an exact match (accuracy) vs. inference runtime plot to show the trade-off.


Since MMMLU consists of 57 unique tasks, it is useful to aggregate the results subject-wise and compare the performance of different models along those axes. Here we have categorised the 57 unique tasks into 5 categories (a small aggregation sketch follows the mapping below):
{
"STEM": ['abstract_algebra', 'college_biology', 'college_chemistry', 'college_physics', 'astronomy', 'high_school_biology', 'high_school_chemistry', 'high_school_physics', 'high_school_mathematics', 'machine_learning', 'formal_logic', 'college_computer_science', 'high_school_computer_science', 'computer_security'],
"Medical & Health": ['clinical_knowledge', 'college_medicine', 'medical_genetics', 'virology', 'human_aging', 'nutrition', 'professional_medicine', 'professional_psychology', 'human_sexuality'],
"Social Sciences": ['high_school_us_history', 'high_school_world_history', 'sociology', 'public_relations', 'philosophy', 'jurisprudence', 'international_law', 'high_school_government_and_politics'],
"Business & Economics": ['business_ethics', 'econometrics', 'management', 'marketing', 'high_school_macroeconomics', 'high_school_microeconomics', 'professional_accounting'],
"Law & Ethics": ['moral_disputes', 'moral_scenarios', 'international_law', 'jurisprudence', 'philosophy', 'legal_ethics', 'security_studies']
}
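As referenced above, here is a small sketch of how per-task exact-match scores can be rolled up into these categories. Only an excerpt of the mapping is repeated, and the per-task numbers in the usage example are hypothetical, not leaderboard results.

# Excerpt of the category mapping above (use the full mapping in practice).
CATEGORIES = {
    "STEM": ["abstract_algebra", "college_biology", "college_chemistry"],
    "Medical & Health": ["clinical_knowledge", "college_medicine", "medical_genetics"],
}

def category_scores(per_task_em, categories):
    """Average the per-task EM scores within each category; tasks without a score are skipped."""
    out = {}
    for category, tasks in categories.items():
        values = [per_task_em[t] for t in tasks if t in per_task_em]
        out[category] = sum(values) / len(values) if values else None
    return out

# Hypothetical per-task EM values:
print(category_scores(
    {"abstract_algebra": 0.40, "college_biology": 0.65, "clinical_knowledge": 0.70},
    CATEGORIES,
))
# {'STEM': 0.525, 'Medical & Health': 0.7}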
Toggle the eye icon in the run set below to select the models you want to compare. Ideally, keep only a few models selected for the best viewing experience.

Run set
5


MGSM

MGSM is a benchmark of grade-school math problems. The same 250 problems from GSM8K were each translated by human annotators into 10 languages (German included). The benchmark was created to support question answering on basic mathematical problems that require multi-step reasoning.

Technical Details

  • We use a 5-shot setting to evaluate the LLMs. (We could have used up to 8 shots.)
  • We evaluate both with Chain-of-Thought (CoT) reasoning and without it. (A prompt-building and answer-extraction sketch follows the templates below.)
    • The default max_tokens for the non-CoT evaluation is 400. For CoT we use max_tokens=600.
    • The prompt structure for the non-CoT evaluation is:
    • Frage: <few shot example question 1>
      A: Die Antwort ist <ans 1>
      
      Frage: <few shot example question 2>
      A: Die Antwort ist <ans 2>
      
      Frage: <few shot example question 3>
      A: Die Antwort ist <ans 3>
      
      Frage: <few shot example question 4>
      A: Die Antwort ist <ans 4>
      
      Frage: <few shot example question 5>
      A: Die Antwort ist <ans 5>
      
      Frage: <question>
      A:
    • The prompt structure for the CoT-based evaluation is:
    • Frage: <few shot example question 1>
      A: <ans 1 with reasoning step>
      
      Frage: <few shot example question 2>
      A: <ans 2 with reasoning step>
      
      Frage: <few shot example question 3>
      A: <ans 3 with reasoning step>
      
      Frage: <few shot example question 4>
      A: <ans 4 with reasoning step>
      
      Frage: <few shot example question 5>
      A: <ans 5 with reasoning step>
      
      Frage: <question>
      A:
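As mentioned above, here is an illustrative sketch of how the 5-shot non-CoT prompt can be assembled and how a numeric answer can be pulled out of a generation. The German exemplar, the helper names, and the extraction regex are assumptions for illustration, not our exact HELM scenario code.

import re

# Hypothetical few-shot exemplar; the actual exemplars come from the MGSM training examples.
FEW_SHOT = [
    ("Roger hat 5 Tennisbälle. Er kauft 2 Dosen mit je 3 Bällen. Wie viele Bälle hat er jetzt?", "11"),
    # ... four more (question, answer) pairs for the 5-shot setting ...
]

def build_mgsm_prompt(few_shot_pairs, question):
    """Assemble the non-CoT prompt in the 'Frage: ... / A: Die Antwort ist ...' format shown above."""
    blocks = [f"Frage: {q}\nA: Die Antwort ist {a}" for q, a in few_shot_pairs]
    blocks.append(f"Frage: {question}\nA:")
    return "\n\n".join(blocks)

def extract_answer(generation):
    """Simplistic extraction: take the last number in the generation as the predicted answer."""
    numbers = re.findall(r"-?\d[\d.,]*", generation)
    return numbers[-1].strip(".,") if numbers else None

print(extract_answer("Die Antwort ist 72."))  # -> "72"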

Results



Since we evaluated using both direct and CoT prompting, let us look at the results through this lens. Clearly, Chain-of-Thought prompting improves the ability of the LLMs to do mathematical reasoning.



Efficiency

Efficiency is the other metric we report on the Eisvogel leaderboard.
"We report the observed inference runtime, by recording both the actual runtime and an estimated idealized runtime for the given request with an optimized software implementation run on A100 GPU(s), taking into account both the number of tokens in the prompt of the request, and the number of generated output tokens". Find more details on how efficiency is calculated in the section 4.9 of the HELM paper.
  • If you are looking for a blazing-fast model, the Gemini 1.5 Flash variants are really good. However, they aren't very performant on the benchmarks, as shown in the next section.
  • The Claude model variants are quite slow. They are performant, but given that a few models are equally performant (or better) and faster, Claude loses its charm.
  • The Mistral and GPT model variants hit that sweet spot.


We are in the process of adding more models to the evaluation suite, along with more analysis. Do reach out to us with suggestions. :)
💡

Accuracy vs Efficiency (Mean Win Rate)

Looking at accuracy and inference runtime together can lead to more insights. Below, we use the mean win rate to compare the models along both the accuracy and efficiency axes (a small plotting sketch follows the list below).
  • On the left side of this scatter plot, the Gemini Flash series of models is fast but not that accurate on our benchmark.
  • On the right side of this scatter plot, Claude 3.5 Sonnet is the slowest model but comes with decent performance on the benchmark.
  • The Mistral Large models perform really well on our benchmark, with the Mistral Large (2411) variant hitting the sweet spot of high accuracy and decent latency. GPT-4o follows closely behind.
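As referenced above, here is a minimal plotting sketch for this kind of accuracy-vs-runtime view. The axis choice and the numbers are assumptions for illustration and do not reproduce the chart embedded in the report.

import matplotlib.pyplot as plt

# Hypothetical (accuracy mean win rate, mean observed runtime in seconds) pairs.
models = {
    "fast_but_weak": (0.35, 0.8),
    "balanced": (0.75, 2.5),
    "slow_but_strong": (0.80, 9.0),
}

fig, ax = plt.subplots()
for name, (accuracy_mwr, runtime_s) in models.items():
    ax.scatter(accuracy_mwr, runtime_s)
    ax.annotate(name, (accuracy_mwr, runtime_s))
ax.set_xlabel("Accuracy (mean win rate)")
ax.set_ylabel("Mean observed inference runtime (s)")
ax.set_title("Accuracy vs. efficiency")
plt.show()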

Run set
1


Conclusion

The Eisvogel German LLM Leaderboard provides a dedicated platform for evaluating large language models on German language proficiency, covering diverse tasks such as MMMLU for knowledge testing and MGSM for mathematical reasoning, and it will continue to grow.
Built on HELM, the leaderboard offers a robust, scalable framework to assess models on multilingual benchmarks through generation-based evaluations, accommodating differences in metric scales with mean win rate as a central performance metric.
The evaluations demonstrate that while models like GPT-4o mini excel in efficiency, models such as Mistral Large 2 and Claude 3.5 Sonnet achieve stronger performance in complex reasoning tasks, albeit with longer inference times. The leaderboard will continue to evolve, integrating more models and tasks, making it a valuable resource for those developing or fine-tuning German language models.

