Polyglot-Ko: Open-Source Korean Autoregressive Language Model
Technical Report of Polyglot-Ko
1. What's the Polyglot project?
2. Datasets
2.1. Data Collection
2.2. Data Analysis
2.3. Data preprocessing
2.3.1. Personally Identifiable Information Filtering
2.3.2. Quality Filtering
3. Training procedure & Models
3.1. Polyglot-ko-1.3B
3.2. Polyglot-ko-3.8B
3.3. Polyglot-ko-5.8B
4. Performance Evaluation
4.1. COPA (F1 score)
4.2. HellaSwag (F1 score)
5. Checkpoints Merging & Model Upload
6. What's Next?
1. What's the Polyglot project?
Polyglot is a project for building multilingual language models. Various multilingual models such as mBERT, BLOOM, and XGLM have already been released, so why should we make new ones? Before answering that question, we would ask another: "Why do people around the world keep building monolingual models in their own languages even though many multilingual models already exist?" One of the most significant reasons is dissatisfaction with the non-English performance of current multilingual models. We therefore want to build multilingual models with stronger non-English performance. That is why we are making multilingual models again, and why we named the project 'Polyglot'.
However, the Polyglot Korean models introduced in this post are not multilingual. When we started this research, we already had 1.2TB of Korean data collected by TUNiB. Rather than waiting until we had gathered a large multilingual corpus, we decided to first train Korean models with the data we already had: 1) we could practice multi-node training, 2) the resulting models could serve as a baseline for performance comparison with the multilingual models, and 3) the models themselves would be useful to many Korean companies and researchers. For these reasons, we built the Korean models before the multilingual ones.
2. Datasets
2.1. Data Collection
We collected 863 GB of Korean language data (1.2TB before processing), a large-scale dataset curated together with TUNiB. The data collection process abided by South Korean laws. Because this dataset was collected specifically for training the Polyglot-Ko models, it will not be released for public use.
| Source | Size (GB) | Link |
|---|---|---|
| Korean blog posts | 682.3 | - |
| Korean news dataset | 87.0 | - |
| Modu corpus | 26.4 | corpus.korean.go.kr |
| Korean patent dataset | 19.0 | - |
| Korean Q & A dataset | 18.1 | - |
| KcBert dataset | 12.7 | github.com/Beomi/KcBERT |
| Korean fiction dataset | 6.1 | - |
| Korean online comments | 4.2 | - |
| Korean wikipedia | 1.4 | ko.wikipedia.org |
| Clova call | < 1.0 | github.com/clovaai/ClovaCall |
| Naver sentiment movie corpus | < 1.0 | github.com/e9t/nsmc |
| Korean hate speech dataset | < 1.0 | - |
| Open subtitles | < 1.0 | opus.nlpl.eu/OpenSubtitles.php |
| AIHub various tasks datasets | < 1.0 | aihub.or.kr |
| Standard Korean language dictionary | < 1.0 | stdict.korean.go.kr/main/main.do |
2.2. Data Analysis
We analyzed the data in order to reduce risks that can arise during training and inference. For example, empty or very short texts, repeated words and characters, and duplicated data instances can be problematic during model training, while personally identifiable information (PII) can be problematic at inference time.
As a result of examining the data from the perspective of reducing these risks, we were able to classify them into four types:
- Data available to train: mainly news and Wikipedia data, which contain a lot of information and sufficiently long text.
- Data that needs surrounding contextual information to be useful for training: mainly blog and news data consisting of many short texts that were scraped incorrectly.
- Data with a lot of hate speech: mainly comment datasets from some community websites.
- NLP task data: data from NLP tasks such as text classification or entity recognition. It can be used for model training, but needs to be handled separately when evaluating the model.
While examining these data types, we also found quality issues in the data to be used for training. Since these issues can cause serious problems during model training, we catalogued them and added corresponding steps to the text preprocessing pipeline (a small sketch of some of these checks follows the list below).
- Empty text: no text in the data instance
- Unnecessary space: multiple unnecessary spaces in the data instance
- De-identification: personally identifiable information in the data instance
- Uncleaned HTML tags: HTML tags are not removed
- Deduplication: duplicated data instance (based on exact match)
- Broken code: only part of HTML or Markdown exists
- Short text: too short data instance
- Repeated character: repeated characters in the data instance
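As a concrete illustration, here is a minimal sketch of how a few of the checks above (empty/short text, unnecessary spaces, repeated characters, exact-match deduplication) could be implemented. The function names and thresholds are our own illustrative choices, not the actual preprocessing code.

```python
import hashlib
import re

def is_empty_or_short(text: str, min_chars: int = 64) -> bool:
    """Flag empty or too-short data instances."""
    return len(text.strip()) < min_chars

def normalize_spaces(text: str) -> str:
    """Collapse runs of unnecessary whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()

def has_repeated_chars(text: str, max_run: int = 10) -> bool:
    """Flag instances containing long runs of the same repeated character."""
    return re.search(r"(.)\1{%d,}" % max_run, text) is not None

def deduplicate_exact(documents):
    """Drop exact-match duplicates using a hash of the normalized text."""
    seen = set()
    unique = []
    for doc in documents:
        key = hashlib.md5(normalize_spaces(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```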
In particular, since the purpose of the Polyglot models is Korean text generation, HTML-like code was removed as much as possible.
Another important aspect of preprocessing is text length. The longer a text, the more contextual information the model can learn from it during training; the shorter a text, the less context is available for the text to be generated. We therefore wanted to see how the text lengths of each dataset we collected were distributed. We gathered about 50 datasets, sampled from each, and examined the length distribution of each dataset as a box plot.

The figure above shows the token length distribution for all datasets we collected. Most of the datasets contain short texts, and only a few contain long texts; among these, the long texts are mostly news articles and blog posts. Based on these results, we were able to set a filtering condition on text length, and we could expect the contextual information from news and blogs to be learned the most by the model during training.
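A minimal sketch of this kind of per-dataset length analysis is shown below. It assumes each dataset is available as a list of raw text strings, and uses the released Polyglot-Ko tokenizer as a stand-in for the tokenizer used at analysis time; the sampling size and plotting details are illustrative.

```python
import random
import matplotlib.pyplot as plt
from transformers import AutoTokenizer

# Stand-in tokenizer; the actual analysis used the project's own tokenizer.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/polyglot-ko-1.3b")

def sample_token_lengths(texts, n_samples=1000):
    """Sample documents from a dataset and measure their token lengths."""
    sampled = random.sample(texts, min(n_samples, len(texts)))
    return [len(tokenizer.encode(t)) for t in sampled]

def plot_length_distribution(datasets):
    """`datasets` is assumed to be a dict mapping dataset name -> list of documents."""
    lengths = {name: sample_token_lengths(texts) for name, texts in datasets.items()}
    plt.boxplot(list(lengths.values()), labels=list(lengths.keys()))
    plt.ylabel("tokens per document")
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.savefig("length_distribution.png")
```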
2.3. Data preprocessing
Based on these analysis results, we performed the following data preprocessing.
2.3.1. Personally Identifiable Information Filtering
We tried to minimize the damage caused by leakage of personally identifiable information by masking the following information in the data. At the same time, some potentially sensitive data is still useful for applying the language model to various tasks, so we masked only the bare minimum of critical personal information (a small regex-based sketch follows the list below).
- Phone Number: Masked landline and wireless phone (cell phone) numbers in the text to ‘<|tel|>’
- Resident Registration Number: Masked the resident registration number in the text to ‘<|rrn|>’
- Bank Account Number: Masked bank account numbers in the text to '<|acc|>'. We covered the account number formats of 12 banks: IBK, KB, NH, Shinhan, Woori, KEB, City Bank, DGB, BNK, SC, K-Bank, Kakao Bank
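For illustration, here is a hedged sketch of how such masking could be done with regular expressions. The patterns below are simplified examples for Korean phone numbers and resident registration numbers, not the exact rules used in our preprocessing, and the per-bank account number patterns are omitted.

```python
import re

# Simplified illustrative patterns; the actual preprocessing used more
# exhaustive rules (including account number formats for the 12 banks).
PHONE_PATTERN = re.compile(r"\b0\d{1,2}[-. ]?\d{3,4}[-. ]?\d{4}\b")
RRN_PATTERN = re.compile(r"\b\d{6}[- ]?[1-4]\d{6}\b")

def mask_pii(text: str) -> str:
    """Replace phone numbers and resident registration numbers with special tokens."""
    text = RRN_PATTERN.sub("<|rrn|>", text)    # mask RRNs first (13 digits)
    text = PHONE_PATTERN.sub("<|tel|>", text)  # then landline / mobile numbers
    return text

print(mask_pii("Contact: 010-1234-5678, RRN: 900101-1234567"))
# -> "Contact: <|tel|>, RRN: <|rrn|>"
```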
2.3.2. Quality Filtering
We borrowed the filtering logic described in [Gopher](https://arxiv.org/abs/2112.11446) and used it for data preprocessing. We judged that these filtering methods could sufficiently address the problems mentioned above (a minimal sketch of two such filters follows the list below).
- document length filter: filter out a text instance if its length does not meet a certain condition.
- mean word length filter: filter out a text if its average word length is unusually long.
- symbol-to-word ratio filter: filter out a text if it contains significantly more symbols than words; this was helpful for removing many HTML tags from the dataset.
- bullet and ellipsis filter: filter out a text if it contains many bullets or ellipses.
- alphabetic word ratio filter: filter out a text if it contains a large proportion of characters other than language characters.
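Below is a minimal sketch of what two of these filters might look like in practice. The thresholds and symbol set are illustrative only, not the exact values used by us or by the Gopher paper.

```python
import re

def symbol_to_word_ratio_filter(text: str, max_ratio: float = 0.1) -> bool:
    """Return True if the document should be kept.

    Documents with too many symbols (e.g. '#', '...', leftover HTML
    punctuation) relative to words are discarded.
    """
    words = text.split()
    if not words:
        return False
    symbols = len(re.findall(r"[#*<>{}\[\]|\\^~]|\.{3}", text))
    return symbols / len(words) <= max_ratio

def mean_word_length_filter(text: str, min_len: float = 1.0, max_len: float = 10.0) -> bool:
    """Discard documents whose average word length is implausibly short or long."""
    words = text.split()
    if not words:
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    return min_len <= mean_len <= max_len

def keep_document(text: str) -> bool:
    return symbol_to_word_ratio_filter(text) and mean_word_length_filter(text)
```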
3. Training procedure & Models
For Polyglot model training, we used EleutherAI's GPT-NeoX codebase and, with the great help of Stability AI, trained the models on 256 A100 GPUs (32 nodes * 8 GPUs).
[Weights & Biases run set (480 runs): training curves]
3.1. Polyglot-ko-1.3B
The 1.3B model uses the following hyperparameters.
| Hyperparameter | Value |
|---|---|
| n_parameters | 1,331,810,304 |
| n_layers | 24 |
| d_model | 2,048 |
| d_ff | 8,192 |
| n_heads | 16 |
| d_head | 128 |
| n_ctx | 2,048 |
| n_vocab | 30,003 / 30,080 |
| Positional Encoding | Rotary Position Embedding (RoPE) |
| RoPE Dimensions | 64 |
Since the 1.3B model was small enough to train without model parallelism, model parallelism was not applied, and the total batch size was set to 1024. During training we observed overfitting, with the loss on both the training and validation data dropping sharply from around 100,000 steps.
In fact, we also observed a sharp performance drop when running inference with those later checkpoints. Therefore, we selected the final model after evaluating and verifying checkpoints from before the point where overfitting occurred.
3.2. Polyglot-ko-3.8B
The 3.8B model uses the following hyperparameters.
| Hyperparameter | Value |
|---|---|
| n_parameters | 3,809,974,272 |
| n_layers | 32 |
| d_model | 3,072 |
| d_ff | 12,288 |
| n_heads | 24 |
| d_head | 128 |
| n_ctx | 2,048 |
| n_vocab | 30,003 / 30,080 |
| Positional Encoding | Rotary Position Embedding (RoPE) |
| RoPE Dimensions | 64 |
We used pipeline parallelism for training the 3.8B model, and the overall batch size was the same as for the 1.3B model. As shown in the training curves above, overfitting began around 100,000 steps, similar to the 1.3B model, so we stopped training at that point.
3.3. Polyglot-ko-5.8B
The 5.8B model uses the following hyperparameters.
| Hyperparameter | Value |
|---|---|
| n_parameters | 5,885,059,072 |
| n_layers | 28 |
| d_model | 4,096 |
| d_ff | 16,384 |
| n_heads | 16 |
| d_head | 256 |
| n_ctx | 2,048 |
| n_vocab | 30,003 / 30,080 |
| Positional Encoding | Rotary Position Embedding (RoPE) |
| RoPE Dimensions | 64 |
Pipeline parallelism was also applied to the 5.8B model, and the overall batch size was reduced to 1/4 of that used for the 1.3B and 3.8B models. Because of this smaller batch size, overfitting did not occur up to 320,000 steps, the total number of training steps, and we confirmed that model performance increased steadily as training progressed.
4. Performance Evaluation
We performed a few-shot prompting based evaluation of the Polyglot Korean models using the KOBEST benchmark dataset. Using the prompts described in the KOBEST paper, we compared our models against large-scale Korean models published on the Hugging Face Hub; the results are shown in the tables below. Here 'n' refers to the number of few-shot examples. For a fair comparison, all models were run under the same conditions and with the same prompts.
You can reproduce these results using the polyglot branch of lm-evaluation-harness and the following script:
```bash
python main.py \
    --model gpt2 \
    --model_args pretrained='EleutherAI/polyglot-ko-1.3b' \
    --tasks kobest_copa,kobest_hellaswag \
    --num_fewshot $YOUR_NUM_FEWSHOT \
    --batch_size $YOUR_BATCH_SIZE \
    --device $YOUR_DEVICE \
    --output_path $YOUR_PATH
```
4.1. COPA (F1 score)

Performance of COPA task
| Model | params | n=0 | n=5 | n=10 | n=50 |
|---|---|---|---|---|---|
| skt/ko-gpt-trinity-1.2B-v0.5 | 1.2B | 0.6696 | 0.6477 | 0.6419 | 0.6514 |
| kakaobrain/kogpt | 6.0B | 0.7345 | 0.7287 | 0.7277 | 0.7479 |
| facebook/xglm-7.5B | 7.5B | 0.6723 | 0.6731 | 0.6769 | 0.7119 |
| EleutherAI/polyglot-ko-1.3b | 1.3B | 0.7196 | 0.7193 | 0.7204 | 0.7206 |
| EleutherAI/polyglot-ko-3.8b | 3.8B | 0.7595 | 0.7608 | 0.7638 | 0.7788 |
| EleutherAI/polyglot-ko-5.8b | 5.8B | 0.7745 | 0.7676 | 0.7775 | 0.7887 |
4.2. HellaSwag (F1 score)

Performance of HellaSwag task
| Model | params | n=0 | n=5 | n=10 | n=50 |
|---|---|---|---|---|---|
| skt/ko-gpt-trinity-1.2B-v0.5 | 1.2B | 0.4036 | 0.4 | 0.4011 | 0.4214 |
| kakaobrain/kogpt | 6.0B | 0.4599 | 0.456 | 0.4616 | 0.4754 |
| facebook/xglm-7.5B | 7.5B | 0.4261 | 0.437 | 0.4409 | 0.4517 |
| EleutherAI/polyglot-ko-1.3b | 1.3B | 0.4013 | 0.3984 | 0.417 | 0.4416 |
| EleutherAI/polyglot-ko-3.8b | 3.8B | 0.4438 | 0.4786 | 0.4737 | 0.4822 |
| EleutherAI/polyglot-ko-5.8b | 5.8B | 0.4853 | 0.482 | 0.4968 | 0.5012 |
We observed that the performance of our models increases as both the number of parameters and the number of few-shot examples increase. In addition, our models showed relatively good performance compared to existing Korean large-scale models.
5. Checkpoints Merging & Model Upload
In order to upload the trained checkpoints to the Hugging Face model hub, we needed to merge the sharded GPT-NeoX checkpoints into a single checkpoint file and convert it to the Hugging Face Transformers format. We therefore wrote scripts for this and proceeded as follows (a rough sketch of the merging step follows the list).
- Create config file: The GPT-NeoX config used for Polyglot training cannot be used directly with Hugging Face Transformers. Therefore, we created a config.json containing the model-related parameters from the training YAML files, excluding the parameters related to training.
- Merge models by layers: Next, we merge the sharded checkpoints. So that each layer becomes a single checkpoint, the tensors of an individual layer stored on each GPU are gathered into one on the master process (gpu0) and saved per layer, as shown in the figure below (e.g. model_layer_1.bin).
- Merge and convert the model checkpoint: Finally, we combine the per-layer checkpoints into a single file, rename the keys to those used by Hugging Face Transformers, and save the result as one state dict (e.g. model.bin).
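A rough sketch of the final merging step is shown below, assuming per-layer files named `model_layer_*.bin` as described above. The `rename_key` function is a placeholder: the real conversion script maps each GPT-NeoX parameter name onto the corresponding Hugging Face GPT-NeoX key, which is omitted here.

```python
import glob
import torch

def rename_key(neox_key: str) -> str:
    """Placeholder for the GPT-NeoX -> transformers key mapping.

    The real conversion renames each parameter (attention, MLP, layer norms,
    embeddings) to the names expected by the Hugging Face GPT-NeoX classes;
    here we simply pass the key through.
    """
    return neox_key

def merge_layer_checkpoints(layer_dir: str, output_path: str = "model.bin") -> None:
    """Merge per-layer checkpoint files into a single state dict file."""
    merged = {}
    for path in sorted(glob.glob(f"{layer_dir}/model_layer_*.bin")):
        layer_state = torch.load(path, map_location="cpu")
        for key, tensor in layer_state.items():
            merged[rename_key(key)] = tensor
    torch.save(merged, output_path)
```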
We successfully deployed our models, and you can check them out on the Hugging Face Hub.
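For example, the released checkpoints can be loaded directly with Hugging Face Transformers; the prompt and generation settings below are just for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/polyglot-ko-1.3b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/polyglot-ko-1.3b", torch_dtype=torch.float16
).to("cuda")

# "Korean language models are" as a sample Korean prompt
inputs = tokenizer("한국어 언어 모델은", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```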
6. What’s Next?
Currently, we are training a 12.8B Korean language model and plan to eventually scale up to 40B. We went through a lot of trial and error while building the Korean models, and based on this experience we are also building two multilingual models covering many languages. The first is an East Asian model covering Korean, Chinese, Japanese, Indonesian, Malay, Vietnamese, Thai, and English. The second is a Romance-language model covering Spanish, Portuguese, French, Romanian, and Italian.
By democratizing and promoting access to language model technology worldwide, we look forward to supporting research and academia in every country. If you are interested in our research or would like to join us, please feel free to email us at kevin.ko@tunib.ai or join our Discord channel. Thanks.