
First Serbian LLM Eval

I’m happy to announce the first Serbian LLM eval. In this report, I’ll share how I built it (with the help of many great people from my community!).
Created on December 16 | Last edited on December 20
Big shout-out to my sponsors: Weights & Biases, and the individuals and companies from the Serbia/Balkan region who supported the project.
LLM evaluation is hard.
It’s difficult even in English, because model creators are constantly adding (mostly inadvertently) evals/test data (or variations thereof) to their training datasets, rendering their results & conclusions incorrect.
But the situation is even more difficult in low-resource languages: evaluation benchmarks do not exist.
Whereas the strategy in English is to keep adding new & harder eval tasks continuously, the strategy in lower-resource languages is less streamlined. :) (understatement of the century)
In this report, I’ll share the work I’ve done with the help of my amazing Discord community to create the first Serbian LLM eval.
On a side note: I think that ultimately the best solution for reliable evals will be to have crowd-sourced arenas of the type that lmsys folks have here and here. Having that sort of "real-time" (they update ~ once a month) Elo score akin to how we rank chess players & gamers just makes a lot of sense. Either that or synthetic evals (LLMs as a judge idea, etc.).

The pipeline I built consists of the following 4 steps:
  1. Find a set of relevant English eval benchmarks
  2. Use the best available machine translation system for the target language (for Serbian, that turned out to be Google Translate)
  3. Use GPT-4 to refine the translations
  4. Potentially (although, as I’ll argue below, likely unnecessarily) add humans to the loop and get to the “gold standard.”
This report will break these four steps down into more detail. Let’s go!

1) Picking the right English evals

An important property of evals is that they’re used by many other models & papers.
I am currently doing continued pre-training of yugoGPT based on Mistral-7B, so naturally, I looked at the evals that the Mistral folks ran so that I can later compare against them.
I made an intersection of Mistral’s eval suite with the following:
  • Set of tasks that are available in the excellent lm-evaluation-harness by EleutherAI (at least before they integrate their big-refactor branch into the main branch - that will add many more tasks)
  • Set of tasks that make sense for a “non-English” LLM → e.g., I don’t care as much about code performance in Serbian; coding is mostly done in English.
Later down the line, we can add more tasks, but one has to prioritize.
This reasoning gave me the following set of tasks:
  • Common sense reasoning: Hellaswag, Winogrande, PIQA, OpenbookQA, ARC-Easy, ARC-Challenge
  • World knowledge: NaturalQuestions, TriviaQA
  • Reading comprehension: BoolQ
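For reference, here’s roughly what running this task set through the lm-evaluation-harness Python API looks like. This is a sketch only; exact task identifiers and argument names differ between harness versions, and this isn’t the exact invocation I used:

```python
# Rough sketch of evaluating a model on the selected tasks with the
# (pre-big-refactor) lm-evaluation-harness. Task identifiers and argument names
# vary between harness versions; treat this as an illustration, not a recipe.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                                  # HuggingFace causal LM backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",  # placeholder model
    tasks=[
        "hellaswag", "winogrande", "piqa", "openbookqa",
        "arc_easy", "arc_challenge", "boolq", "triviaqa",
        # NaturalQuestions is omitted here; its task identifier differs across versions.
    ],
    num_fewshot=0,
    batch_size=8,
    device="cuda:0",
)
print(results["results"])
```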

2) Machine Translation

After choosing the initial set of tasks, the next obvious step is translating them.
Earlier this summer, I compared various MT systems’ performance in the English → Serbian direction. I tried some publicly available options like Meta’s NLLB & M2M-100 projects, Systran, and Opus MT, and I also tried running GPT-4 directly, but ultimately the best performance came from Google Translate. I tested these systems on the English-Serbian direction of the Flores 200 dataset, so the results won’t be super robust (talking about evals being hard…), as Flores 200 is very clean (academic in nature), but that was the best thing I had available.
Now that I knew what to use, I organized my Discord server to translate the selected 67.9M characters (MT systems usually bill per 1M characters). Those are only the characters that are actually needed and used by the lm-evaluation-harness, so if you manually analyze the files here on HuggingFace and notice some English text, that’s the reason why.
At $20 per 1M characters, Google Translate would cost me ~$1,360. But luckily, Google is kind enough to offer 500k characters free each month. We started the translation effort one day before the end of the month :)
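As a quick sanity check on that number, here’s a tiny back-of-the-envelope sketch; the character-counting helper and file layout below are made up, but the real count covered only the fields that the harness actually uses:

```python
# Back-of-the-envelope Google Translate cost estimate. The file path and field
# handling below are hypothetical; the real count only covered the fields that
# lm-evaluation-harness actually uses.
import json

PRICE_PER_MILLION_CHARS = 20.0  # USD

def count_chars(path: str) -> int:
    """Sum the characters of every string field in a JSON-lines file."""
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            total += sum(len(v) for v in doc.values() if isinstance(v, str))
    return total

total_chars = 67_900_000  # or: sum(count_chars(p) for p in eval_files)
cost = total_chars / 1_000_000 * PRICE_PER_MILLION_CHARS
print(f"Estimated cost: ${cost:,.0f}")  # -> Estimated cost: $1,358
```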
Additionally, when you first open up your Google Cloud account, you get $300 for free.
Those factors, combined with an amazing group of people (Vera Prohaska, Chu Kin Chan, Joe Makepeace, Toby Farmer, Malvi Bid, Raphael Vienne, Nenad Aksentijevic, Isaac Nicolas, Brian Pulfer), meant we translated the whole corpus in a matter of days.
I was delegating ranges and making sure we were all on the same page. We didn’t have any special process - they would send me the results either via Discord (as the files are small) or via email. I would merge, comparing against source English data to reduce the chance of introducing bugs.
But, unfortunately, bugs happen. I had a shallow-copy bug (using Python’s copy instead of deepcopy) that caused me to mess up all of the answer translations in triviaqa’s train split. I initially thought that was just a minor portion of the dataset, but it turns out that:
  • triviaqa train is by far the largest split with ~137k documents.
  • the answer field, somewhat counter-intuitively, takes up ~30M characters, whereas the question portion takes significantly less.
And there goes another $600.
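For the curious, here’s a minimal sketch of the pitfall (the document structure below is made up, not the actual triviaqa schema):

```python
# Minimal illustration of the shallow-copy bug (hypothetical doc structure).
# copy.copy() only copies the top-level dict, so the nested "answer" dict is
# shared between the original and the "copy".
import copy

doc = {"question": "Who wrote Hamlet?", "answer": {"value": "William Shakespeare"}}

shallow = copy.copy(doc)
shallow["answer"]["value"] = "Viljem Šekspir"  # meant to change only the copy...
print(doc["answer"]["value"])                  # ...but the original now prints "Viljem Šekspir"

# copy.deepcopy() recursively copies nested structures, so the original stays intact.
original = {"question": "Who wrote Hamlet?", "answer": {"value": "William Shakespeare"}}
deep = copy.deepcopy(original)
deep["answer"]["value"] = "Viljem Šekspir"
print(original["answer"]["value"])             # still "William Shakespeare"
```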
Luckily, the community helped again! :) So, in reality, we spent $0.
A subset of the people from above helped tackle this bug, and Aldin Cimpo also joined the effort! Raphael Vienne helped ensure everyone was on the same page and knew exactly what needed to be done (explaining which changes to pull, which parts of the code to comment out to get this to work, etc.). After two days, we were done! The last step here was to transliterate from Serbian Cyrillic into Serbian Latin, which was fairly easy given this pip package.
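For context, Serbian Cyrillic → Latin transliteration is essentially a one-to-one character mapping (plus a few digraphs), so the step is mechanical. Here’s a minimal sketch of the idea; this is not the actual pip package we used:

```python
# Minimal sketch of Serbian Cyrillic -> Latin transliteration.
# This is NOT the pip package used for the eval, just an illustration: the
# mapping is essentially one-to-one, with a few digraphs (Lj, Nj, Dž).
CYR_TO_LAT = {
    "А": "A", "Б": "B", "В": "V", "Г": "G", "Д": "D", "Ђ": "Đ", "Е": "E",
    "Ж": "Ž", "З": "Z", "И": "I", "Ј": "J", "К": "K", "Л": "L", "Љ": "Lj",
    "М": "M", "Н": "N", "Њ": "Nj", "О": "O", "П": "P", "Р": "R", "С": "S",
    "Т": "T", "Ћ": "Ć", "У": "U", "Ф": "F", "Х": "H", "Ц": "C", "Ч": "Č",
    "Џ": "Dž", "Ш": "Š",
}
# Add lowercase variants automatically.
CYR_TO_LAT.update({k.lower(): v.lower() for k, v in CYR_TO_LAT.items()})

def transliterate(text: str) -> str:
    """Map each Cyrillic character to its Latin counterpart; leave the rest as-is."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)

print(transliterate("Љубичица је цвет"))  # -> "Ljubičica je cvet"
```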
The snapshot of the eval at this point is labeled with v0 and can be found here: https://huggingface.co/datasets/gordicaleksa/serbian-llm-eval-v0
(triviaqa has been updated to the version w/o the nasty bug)
After this translation effort, I organized a datathon to find interesting failure cases of machine translation & gather some insights that I can use to create a better refinement pipeline. That brings us to the next part!

3) GPT-4 translation refinement

The main insights that I collected during the datathon are the following:
Machine translation systems lack world knowledge and are consequently too literal.
It’s clear to me at this point that LLMs will replace standard encoder-decoder architectures in translation tasks. Or "legacy" MT systems will be used only as the grounding signal to help reduce the hallucinations - which is precisely what I’ve done here.
Note: a representative failure case was this one: Google Translate would take the band name “Green Day” in English and translate it quite literally into Serbian as "zeleni dan", thus completely messing up the translation (the right thing to do here would be to leave the English name in quotation marks).
The first thing I tried was to give GPT-4 the machine translation as the source of "inspiration", i.e. grounding, and force it to reason about its translation before it outputs the actual Serbian translation. In parallel, I was also playing with the option without the explicit reasoning step.
After multiple experiments across all the datasets where I was taking 10 MT failure cases and then manually comparing the two options (w/ and w/o reasoning) side by side, it became clear that having no explicit reasoning is superior.
There are 2 reasons behind that conclusion:
  • It spends fewer tokens, as both the prompt (input tokens) and the output (output tokens) are shorter
  • It consistently gave better translations
Here is an example prompt I used to elicit explicit reasoning:
I would like you to help me refine an existing translation from Google Translate from English into Serbian (Latin script).

Google translation often has grammatical, syntax errors as well as demonstrations of a basic lack of world knowledge that cause poor translations.

It also erroneously removes the "_" symbol that is vital because these sentences are from Winogrande evaluation task.

Before you give the final Serbian translation in Latin script ensure you provide step-by-step reasoning for why the Google Translation failed.
If the translation is good, explain why it is a good translation and proceed to generate a correct, high-quality translation.

Make sure you first reason before outputting the final Serbian translation. During your reasoning process insert both options into the original sentence and make sure it fits perfectly according to Serbian grammar.

Make sure never to remove the "_". And if "_" is missing you need to reinsert it back using the English translation to understand where to insert it.

Output format should be:

REASONING:
[your reasoning goes here, use bullet points]
SERBIAN:
"sentence": refined sentence goes here
"option1": refined option 1 goes here
"option2": refined option 2 goes here

Note: sentence and options are obviously related so do use the information in "sentence" to help you translate "option1"/"option2" and vice versa.
You must be able to replace the "_" with "option1" or "option2" and get a cohesive, naturally sounding sentence.

Remember: All output text should be in Serbian except that "REASONING", "SERBIAN", "sentence", "option1", "option2" are special tokens that must not be translated.

Below is the input (both original in English and output from Google Translate). Generate a high quality correct translation.

ENGLISH:
"sentence": {src_sentence}
"option1": {src_option1}
"option2": {src_option2}

SERBIAN (GOOGLE TRANSLATE):
"sentence": {trg_sentence}
"option1": {trg_option1}
"option2": {trg_option2}
The prompts varied from task to task. In some other tasks, I also had few-shot examples of how to reason, but they all failed similarly.
On the other hand, here is the simple prompt that won:
I would like you to help me translate English into Serbian (Latin script).

You are given a translation from Google translate as a source of inspiration, but bear in mind that it often has grammatical, syntax errors as well as demonstrations of a basic lack of world knowledge that cause poor translations.

Make sure never to remove the "_". And if "_" is missing you need to reinsert it back using the English translation to understand where to insert it.

Output format should be:

SERBIAN:
"sentence": refined sentence goes here
"option1": refined option 1 goes here
"option2": refined option 2 goes here

Remember: All output text should be in Serbian except that "SERBIAN", "sentence", "option1", "option2" are special tokens that must not be translated.

Below is the input (both original in English and output from Google Translate). Generate a high quality correct translation in Serbian.

ENGLISH:
"sentence": {src_sentence}
"option1": {src_option1}
"option2": {src_option2}

SERBIAN (GOOGLE TRANSLATE):
"sentence": {trg_sentence}
"option1": {trg_option1}
"option2": {trg_option2}
If you speak any of the HBS languages (Serbian et al.), you can check out how I compared these two prompts in my Discord server here.
The snapshot of the eval at this point is labeled with v1 and can be found here:
https://huggingface.co/datasets/gordicaleksa/serbian-llm-eval-v1
This part of the pipeline was generously sponsored by Weights & Biases ❤️. A bit later, some local companies and individuals were also kind enough to help financially; you can see the list of sponsors here. I burned through a lot of GPT-4 credits. :)
Finally, the only thing left was to consider whether we want to have human annotators improve the evals even further, and that brings me to the last section.

4) Human in the loop

In parallel, while I was doing the GPT-4 refinement, Nikola Ljubesic (a prolific Croatian NLP researcher) reached out asking me to send him the winogrande translation so that his annotator could go through it, fix it, and better understand the failure modes of the translation. I thought it’d also be super cool to compare the results with GPT-4 later.
After going through an email thread with him, here are some findings:
  • According to the annotator, GPT-4 significantly improves machine translation (consistent with the tests I’ve done while refining the MT outputs)
  • In a couple of cases, the annotator said GPT-4’s translation was better than both hers and the MT system’s. She also said that, given GPT-4’s output as the starting point, she would have done a much better job herself.
  • Serbian, and HBS languages in general, are morphologically rich, and it’s easy to use the gender of the nouns to infer the solution (without even relying on the semantics of the sentence). GPT-4 is not good at these subtleties of the language. But, ultimately, I believe that won’t matter at scale.
Here is why I think that: if we understand how the eval on a specific task like winogrande is done (see the tables below for some samples in both English & Serbian), it puts things in a slightly different perspective. We do the following (a short code sketch follows the list):
  • Take option1, insert it into the sentence (replacing the _), and feed that first part of the sentence to the model (everything up to _ + option1)
  • Take option2, and repeat the process the same as above
  • Finally, independently for cases 1) and 2), the model scores the continuation of the sentence (everything after the _): the (log)probabilities of the continuation tokens are summed up, and the option with the higher score wins.
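Here’s a minimal sketch of that scoring procedure (my own illustration, using HuggingFace transformers; it’s not the lm-evaluation-harness implementation, and tokenization boundary handling is simplified):

```python
# Minimal sketch of scoring a single Winogrande item with a causal LM.
# Not the lm-evaluation-harness code; boundary handling is simplified.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities of the continuation tokens, given the context."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # Logits at position i predict the token at position i + 1, so we score
    # every token that lies past the context.
    for pos in range(ctx_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

def score_winogrande(sentence: str, option1: str, option2: str) -> int:
    """Return 1 or 2, whichever option makes the continuation more likely."""
    prefix, suffix = sentence.split("_", maxsplit=1)
    score1 = continuation_logprob(prefix + option1, suffix)
    score2 = continuation_logprob(prefix + option2, suffix)
    return 1 if score1 > score2 else 2
```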


[W&B tables: sample Winogrande items in English & Serbian]

Thus, small grammatical/morphological/gender errors likely won’t matter much and won’t significantly change eval results when running on hundreds of thousands of samples. This hypothesis would have to be tested, but it’s hard (expensive) to annotate the whole eval with human annotators to verify whether it holds.
Therefore, given a budget and an option, I would prioritize GPT-4 over human annotation, given all the data points I've seen so far. I believe it’s more important to continuously update the evals instead of having static gold standard evals. As we saw from the lmsys folks, all of these eval samples will eventually enter LLM’s dataset and render conclusions incorrect.
I dealt with the annotators often during my first job back at Microsoft on the HoloLens project. I know how complicated these labeling pipelines tend to get if you want to do things with the least amount of human errors being introduced. To take the human annotator route, I’d prefer to build at least a minimal Streamlit UI + basic error checking to make it easier for annotators to do their job.
So there you have it. That’s a brief story of how the first Serbian LLM eval was built! May the future AGIs bless us.

My next steps

  • Share the yugoGPT evaluation results on the Serbian LLM eval
  • Share a web app where you’ll be able to play with yugoGPT
  • Open-source yugoGPT base model
As a last update: last week I experienced the 2nd training crash, this time due to my bad checkpointing policy (the green run failed because of a GPU failure, but the blue one was my fault :))


If you want to cite this work

@misc{serbian-llm-eval,
  author = "Gordić, Aleksa",
  title = "Serbian LLM Eval",
  year = "2023",
  howpublished = {\url{https://huggingface.co/datasets/gordicaleksa/serbian-llm-eval-v1}},
}