
Polyglot Releases Korean GPT Models Towards Balanced Multilingual Language Model Goal

Polyglot, a team within EleutherAI, has released a set of open-source Korean-language GPT models, a first step toward its goal of building focused, non-English-centric multilingual language models.
Created on September 28 | Last edited on October 3
The Polyglot team at EleutherAI has released a set of open-source Korean-language GPT models as the first step toward its goal of creating multilingual language models.
Links to the pre-trained model weights and model cards for Polyglot-Ko, as well as the training runs (tracked on Weights & Biases), can be found in their GitHub repository. Polyglot-Ko is currently available in 1.3B and 3.8B parameter sizes, with a 6.7B model coming soon. In benchmarks, the models matched or outperformed the best publicly available Korean models.
These Korean-language models are the first step toward the team's first multilingual model, which will focus on widely spoken East Asian languages (plus English).
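For readers who want to try the released models, here is a minimal sketch of generating text with Polyglot-Ko using the Hugging Face transformers library. The model identifier "EleutherAI/polyglot-ko-1.3b" is an assumption; check the model cards linked from the Polyglot GitHub repository for the exact identifiers and usage notes.

```python
# Minimal sketch: load a Polyglot-Ko checkpoint and generate a short Korean continuation.
# The Hub id below is assumed -- consult the official model cards for the exact name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/polyglot-ko-1.3b"  # assumed identifier for the 1.3B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "안녕하세요, 저는"  # "Hello, I am ..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```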

Polyglot makes multilingual models less English-centric

The Polyglot team sees a problem with how large multilingual language models have recently been built: more and more models aim to cover as many languages as possible, yet the datasets they are trained on contain a disproportionately large share of English text.
Between the excess of English training data and the push to include as many languages as possible, these models end up unfocused and biased toward English, leaving the languages with little training data behind even while technically supporting them.
To make multilingual models less English-centric, more performant in non-English languages, and more focused overall, the Polyglot team will dedicate itself to building multilingual language models on better-balanced data. Polyglot's models will forgo covering dozens of languages at once in favor of closely related languages likely to have high synergy with one another.
Their first multilingual goal, Polyglot-East-Asian, will cover a suite of East Asian languages: Korean, Chinese, Japanese, Indonesian, Malay, Vietnamese, and Thai, as well as English.

Find out more

If you have Discord, join the EleutherAI server and read the announcement message in this channel.
Learn about the Polyglot team's efforts on their GitHub repository.
Tags: ML News