
Polyglot-Ko: Open-Source Korean Autoregressive Language Model

Technical Report of Polyglot-Ko



1. What's the Polyglot project?

Polyglot is a project for building multilingual language models. Various multilingual models such as mBERT, BLOOM, and XGLM have already been released, so why should we make new multilingual models? Before answering that question, we would like to ask another: why do people around the world still build monolingual models in their own languages even though many multilingual models already exist? We would point to dissatisfaction with the non-English performance of current multilingual models as one of the most significant reasons. We therefore want to build multilingual models with stronger non-English performance. This is why we are making multilingual models again, and why we named the project 'Polyglot'.
However, the Polyglot Korean models introduced in this post are not multilingual. When we started our research, we already had 1.2TB of Korean data collected by TUNiB. Rather than waiting until we had collected a large amount of multilingual data, we decided to first build Korean models with the dataset we already had: 1) we could practice multi-node training, 2) the models could be used for performance comparisons with the multilingual models, and 3) the models themselves would be useful to many Korean companies and researchers. For these reasons, we built the Korean models before the multilingual ones.


2. Datasets

2.1. Data Collection

We collected 863 GB of Korean language data (1.2TB before processing), a large-scale dataset curated together with TUNiB. The data collection process abided by South Korean laws. This dataset was collected for the purpose of training Polyglot-Ko models, so it will not be released for public use.
| Source | Size (GB) | Link |
| --- | --- | --- |
| Korean blog posts | 682.3 | - |
| Korean news dataset | 87.0 | - |
| Modu corpus | 26.4 | corpus.korean.go.kr |
| Korean patent dataset | 19.0 | - |
| Korean Q & A dataset | 18.1 | - |
| KcBert dataset | 12.7 | github.com/Beomi/KcBERT |
| Korean fiction dataset | 6.1 | - |
| Korean online comments | 4.2 | - |
| Korean wikipedia | 1.4 | ko.wikipedia.org |
| Clova call | < 1.0 | github.com/clovaai/ClovaCall |
| Naver sentiment movie corpus | < 1.0 | github.com/e9t/nsmc |
| Korean hate speech dataset | < 1.0 | - |
| Open subtitles | < 1.0 | opus.nlpl.eu/OpenSubtitles.php |
| AIHub various tasks datasets | < 1.0 | aihub.or.kr |
| Standard Korean language dictionary | < 1.0 | stdict.korean.go.kr/main/main.do |



2.2. Data Analysis

We analyzed the data in order to reduce risks that can arise during training and inference. For example, empty or very short texts, repeated words and characters, and duplicated data instances can be problematic during model training, while personally identifiable information (PII) can be problematic at the inference stage.
After examining the data from this risk-reduction perspective, we were able to classify it into four types:
  1. Data that can be used for training as-is: mainly news and Wikipedia data, which carry a lot of information in sufficiently long texts.
  2. Data that needs surrounding contextual information to be useful for training: mainly blog and news data, which contained many short texts that had been scraped incorrectly.
  3. Data with a lot of hate speech: mainly comment datasets from some community websites.
  4. NLP task data: data for NLP tasks such as text classification or entity recognition. It can be used for model training, but needs to be handled separately when evaluating the model.
While examining these data types, we also found quality issues that had to be addressed before training. Since these issues can cause serious problems during model training, we catalogued them and added them to the text preprocessing pipeline:
  • Empty text: no text in the data instance
  • Unnecessary space: multiple unnecessary spaces in the data instance
  • De-identification: personally identifiable information in the data instance
  • Uncleaned HTML tags: HTML tags are not removed
  • Deduplication: duplicated data instance (based on exact match)
  • Broken code: only part of HTML or Markdown exists
  • Short text: too short data instance
  • Repeated character: repeated characters in the data instance
In particular, since the purpose of the Polyglot models is Korean text generation, HTML-like code was removed as much as possible, as in the minimal sketch below.
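The following is a rough illustration (not our exact pipeline) of how several of these issues can be handled together: it strips leftover HTML tags, normalizes whitespace, drops empty or very short instances, and removes exact duplicates.

```python
import re

def clean_corpus(texts, min_chars=32):
    """Minimal cleaning sketch; the threshold is illustrative."""
    seen = set()
    cleaned = []
    for text in texts:
        # Remove leftover HTML tags and collapse repeated whitespace.
        text = re.sub(r"<[^>]+>", " ", text)
        text = re.sub(r"\s+", " ", text).strip()
        # Drop empty or very short instances.
        if len(text) < min_chars:
            continue
        # Exact-match deduplication.
        if text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```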
Another important aspect of preprocessing is text length. The longer a text is, the more contextual information it contributes during training; the shorter it is, the less context is available for the text to be generated. We therefore wanted to see how text lengths are distributed in each dataset we collected. We sampled from the roughly 50 datasets and examined the length distribution of each one as a box plot.

The figure above shows the token length distribution for all datasets we collected. Most datasets consist of short texts, and only a few contain long texts; the long texts are mostly news articles and blog posts. Based on these results, we were able to set a filter condition on text length, and we expected that most of the long-range contextual information learned by the model during training would come from news and blog data.
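This kind of analysis can be reproduced roughly as sketched below, assuming a Hugging Face tokenizer and a dictionary of sampled documents per dataset (the dataset names and placeholder samples here are illustrative, not the exact ones we used).

```python
import matplotlib.pyplot as plt
from transformers import AutoTokenizer

# Any compatible tokenizer works for a rough length analysis;
# the released Polyglot-Ko tokenizer is one option.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/polyglot-ko-1.3b")

def token_lengths(samples):
    """Token count of each sampled document."""
    return [len(tokenizer.encode(text)) for text in samples]

# `dataset_samples` maps a dataset name to a list of sampled texts.
dataset_samples = {
    "news": ["..."],      # placeholder samples
    "blog": ["..."],
    "comments": ["..."],
}

lengths = {name: token_lengths(texts) for name, texts in dataset_samples.items()}
plt.boxplot(list(lengths.values()), labels=list(lengths.keys()))
plt.ylabel("tokens per document")
plt.show()
```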


2.3. Data Preprocessing

Based on these analysis results, we performed the following data preprocessing.

2.3.1. Personally Identifiable Information Filtering

We tried to minimize the harm from leaks of personally identifiable information by masking the following information in the data. At the same time, some of this data, even if potentially sensitive, helps the language model on various downstream tasks, so we masked only the bare minimum of critical personal information.
  • Phone Number: Masked landline and mobile (cell phone) numbers in the text with ‘<|tel|>’
  • Resident Registration Number: Masked resident registration numbers in the text with ‘<|rrn|>’
  • Bank Account Number: Masked bank account numbers in the text with ‘<|acc|>’. We handled account numbers of 12 banks: IBK, KB, NH, Shinhan, Woori, KEB, City Bank, DGB, BNK, SC, K-Bank, and Kakao Bank
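As an illustration, the phone-number masking described above can be done with simple regular expressions. The pattern below is a rough sketch for common Korean landline and mobile number formats, not the exact rule we used.

```python
import re

# Rough pattern for Korean landline and mobile numbers (illustrative only).
PHONE_PATTERN = re.compile(
    r"\b(01[016789]|02|0[3-6][1-5])[-.\s]?\d{3,4}[-.\s]?\d{4}\b"
)

def mask_phone_numbers(text: str) -> str:
    """Replace phone-number-like spans with the '<|tel|>' special token."""
    return PHONE_PATTERN.sub("<|tel|>", text)

print(mask_phone_numbers("문의는 010-1234-5678 로 부탁드립니다."))  # "Please contact 010-1234-5678."
# -> 문의는 <|tel|> 로 부탁드립니다.
```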

2.3.2. Quality Filtering

We borrowed the filtering logic described in [Gopher](https://arxiv.org/abs/2112.11446) and used it for data preprocessing. We determined that these filtering methods could sufficiently address the problems mentioned above; a minimal sketch of two of them follows the list.
  • Document length filter: filter a text instance if its length does not meet a minimum condition.
  • Mean word length filter: filter a text instance if its average word length is unusually long.
  • Symbol-to-word ratio filter: filter a text instance if it contains significantly more symbols than words; this was helpful for removing many HTML tags remaining in the dataset.
  • Bullet and ellipsis filter: filter a text instance if it contains many bullets or ellipses.
  • Alphabetic word ratio filter: filter a text instance if a large proportion of its characters are not language characters.
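The sketch below shows how two of these Gopher-style filters might look in practice; the thresholds and symbol list are illustrative, not the exact values we used.

```python
def mean_word_length_ok(text: str, max_mean_len: float = 10.0) -> bool:
    """Reject documents whose average word length is unusually long."""
    words = text.split()
    if not words:
        return False
    return sum(len(w) for w in words) / len(words) <= max_mean_len

def symbol_to_word_ratio_ok(text: str, max_ratio: float = 0.1) -> bool:
    """Reject documents with many symbols (e.g. '#', '<', '>') relative to words."""
    words = text.split()
    if not words:
        return False
    symbols = sum(text.count(s) for s in ("#", "...", "<", ">", "{", "}"))
    return symbols / len(words) <= max_ratio

def keep_document(text: str) -> bool:
    return mean_word_length_ok(text) and symbol_to_word_ratio_ok(text)
```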

3. Training procedure & Models

For Polyglot model training, we used EleutherAI's GPT-NeoX codebase, and with the great help of Stability AI, 256 A100 GPUs (32 nodes with 8 GPUs each) were used to train the models.





3.1. Polyglot-ko-1.3B

The 1.3B model uses the following hyperparameters.
| Hyperparameter | Value |
| --- | --- |
| n_parameters | 1,331,810,304 |
| n_layers | 24 |
| d_model | 2,048 |
| d_ff | 8,192 |
| n_heads | 16 |
| d_head | 128 |
| n_ctx | 2,048 |
| n_vocab | 30,003 / 30,080 |
| Positional Encoding | Rotary Position Embedding (RoPE) |
| RoPE Dimensions | 64 |
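As a sanity check, the reported parameter count can be roughly reproduced from the table. The breakdown below assumes a GPT-NeoX-style block with untied input and output embeddings (using the padded vocabulary size of 30,080), biases on the linear layers, and two LayerNorms per layer; it is a back-of-the-envelope sketch, not an official accounting.

```python
d_model, d_ff, n_layers, n_vocab = 2048, 8192, 24, 30080

embeddings = 2 * n_vocab * d_model                 # untied input + output embeddings
attention  = 4 * d_model * d_model + 4 * d_model   # QKV + output projection, with biases
mlp        = 2 * d_model * d_ff + d_ff + d_model   # two linear layers, with biases
layernorms = 2 * 2 * d_model                       # two LayerNorms per layer (weight + bias)
per_layer  = attention + mlp + layernorms

total = embeddings + n_layers * per_layer + 2 * d_model  # + final LayerNorm
print(f"{total:,}")  # 1,331,810,304
```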

Since the 1.3B model was small enough to train without model parallelism, model parallelism was not applied and the total batch size was set to 1024. While training the 1.3B model, we observed that the loss dropped sharply on both the training and validation data from around step 100,000, which we interpreted as overfitting.
In fact, inference results also showed a sharp drop in quality. Therefore, we evaluated and verified the checkpoints from before the point where the overfitting occurred and selected the final model among them.


3.2. Polyglot-ko-3.8B

The 3.8B model uses the following hyperparameters.
| Hyperparameter | Value |
| --- | --- |
| n_parameters | 3,809,974,272 |
| n_layers | 32 |
| d_model | 3,072 |
| d_ff | 12,288 |
| n_heads | 24 |
| d_head | 128 |
| n_ctx | 2,048 |
| n_vocab | 30,003 / 30,080 |
| Positional Encoding | Rotary Position Embedding (RoPE) |
| RoPE Dimensions | 64 |

We used pipeline parallelism for 3.8B model training, and the overall batch size was the same as for the 1.3B model. As shown in the figure above, overfitting started to occur around step 100,000, similar to the 1.3B model, so we stopped training at that point.


3.3. Polyglot-ko-5.8B

The 5.8B model uses the following hyperparameters.
| Hyperparameter | Value |
| --- | --- |
| n_parameters | 5,885,059,072 |
| n_layers | 28 |
| d_model | 4,096 |
| d_ff | 16,384 |
| n_heads | 16 |
| d_head | 256 |
| n_ctx | 2,048 |
| n_vocab | 30,003 / 30,080 |
| Positional Encoding | Rotary Position Embedding (RoPE) |
| RoPE Dimensions | 64 |

Pipeline parallelism was applied to the 5.8B model as well, and the overall batch size was reduced to 1/4 of that used for the 1.3B and 3.8B models (256 instead of 1024). Because of the smaller batch size, overfitting did not occur up to step 320,000, the total number of training steps, and we confirmed that model performance increased stably as training progressed.


4. Performance Evaluation

We performed a few-shot prompting based evaluation of the Polyglot Korean models using the KOBEST benchmark dataset. Using the prompts described in the KOBEST paper, we compared our models with the large-scale Korean models published on the Hugging Face Hub; the results are shown in the following tables. 'n' refers to the number of few-shot examples. For a fair comparison, all models were run under the same conditions and with the same prompts.
You can reproduce these results using the polyglot branch of lm-evaluation-harness and the following script.
python main.py \
--model gpt2 \
--model_args pretrained='EleutherAI/polyglot-ko-1.3b' \
--tasks kobest_copa,kobest_hellaswag \
--num_fewshot $YOUR_NUM_FEWSHOT \
--batch_size $YOUR_BATCH_SIZE \
--device $YOUR_DEVICE \
--output_path $YOUR_PATH


4.1. COPA (F1 score)

Performance on the COPA task
| Model | params | n=0 | n=5 | n=10 | n=50 |
| --- | --- | --- | --- | --- | --- |
| skt/ko-gpt-trinity-1.2B-v0.5 | 1.2B | 0.6696 | 0.6477 | 0.6419 | 0.6514 |
| kakaobrain/kogpt | 6.0B | 0.7345 | 0.7287 | 0.7277 | 0.7479 |
| facebook/xglm-7.5B | 7.5B | 0.6723 | 0.6731 | 0.6769 | 0.7119 |
| EleutherAI/polyglot-ko-1.3b | 1.3B | 0.7196 | 0.7193 | 0.7204 | 0.7206 |
| EleutherAI/polyglot-ko-3.8b | 3.8B | 0.7595 | 0.7608 | 0.7638 | 0.7788 |
| EleutherAI/polyglot-ko-5.8b | 5.8B | 0.7745 | 0.7676 | 0.7775 | 0.7887 |



4.2. HellaSwag (F1 score)

Performance on the HellaSwag task
| Model | params | n=0 | n=5 | n=10 | n=50 |
| --- | --- | --- | --- | --- | --- |
| skt/ko-gpt-trinity-1.2B-v0.5 | 1.2B | 0.4036 | 0.4000 | 0.4011 | 0.4214 |
| kakaobrain/kogpt | 6.0B | 0.4599 | 0.4560 | 0.4616 | 0.4754 |
| facebook/xglm-7.5B | 7.5B | 0.4261 | 0.4370 | 0.4409 | 0.4517 |
| EleutherAI/polyglot-ko-1.3b | 1.3B | 0.4013 | 0.3984 | 0.4170 | 0.4416 |
| EleutherAI/polyglot-ko-3.8b | 3.8B | 0.4438 | 0.4786 | 0.4737 | 0.4822 |
| EleutherAI/polyglot-ko-5.8b | 5.8B | 0.4853 | 0.4820 | 0.4968 | 0.5012 |


We observed that our models' performance improves as the number of parameters and the number of few-shot examples increase. In addition, they showed relatively good performance compared to the existing large-scale Korean models.


5. Checkpoints Merging & Model Upload

In order to upload the trained checkpoints to the Hugging Face model hub, we needed to merge the GPT-NeoX checkpoints into a single checkpoint file and convert it to the Hugging Face Transformers format. We therefore wrote scripts for this and proceeded as follows.
  1. Create the config file: The GPT-NeoX config used for Polyglot training cannot be used directly with Hugging Face Transformers. Therefore, we created a config.json containing the model-related parameters from the training YAML files, excluding the training-related parameters.
  2. Merge models by layer: The next step is to merge the split checkpoints. First, so that each layer becomes one checkpoint, the tensors of an individual layer stored on each GPU are merged in the master process (GPU 0) and saved per layer, as shown in the figure below (e.g. model_layer_1.bin).
    
  3. Merge and convert model checkpoints: Finally, the per-layer checkpoints are merged into one file, the keys are converted to the names used in Hugging Face Transformers, and the result is saved as a single state dict (e.g. model.bin), roughly as sketched below.
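The final step might look something like the following sketch. The file layout, key names, and renaming rule are assumptions for illustration; the actual conversion script in our repository handles the real GPT-NeoX-to-Transformers mapping.

```python
import glob
import re
import torch

def merge_layer_checkpoints(ckpt_dir: str, output_path: str) -> None:
    """Collect per-layer files (e.g. model_layer_1.bin) into one state dict."""
    merged = {}
    for path in glob.glob(f"{ckpt_dir}/model_layer_*.bin"):
        layer_idx = int(re.search(r"model_layer_(\d+)\.bin", path).group(1))
        layer_state = torch.load(path, map_location="cpu")
        for key, tensor in layer_state.items():
            # Hypothetical rename: prefix each tensor with its transformer block index.
            merged[f"gpt_neox.layers.{layer_idx}.{key}"] = tensor
    torch.save(merged, output_path)  # e.g. model.bin
```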
    
We successfully deployed our models, and you can check them out here.
All of our work is also being shared through our GitHub repository.
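For example, the released checkpoints can be loaded directly with Hugging Face Transformers; the snippet below is a minimal sketch, and the generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/polyglot-ko-1.3b"  # the 3.8b and 5.8b models load the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

inputs = tokenizer("한국어 언어 모델은", return_tensors="pt")  # "Korean language models are ..."
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```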


6. What’s Next?

Currently, we are training a 12.8B Korean language model, and we plan to eventually scale up to 40B. We went through a lot of trial and error while building the Korean models, and based on this experience we are also building two types of multilingual models. The first is an East Asian model covering Korean, Chinese, Japanese, Indonesian, Malay, Vietnamese, Thai, and English. The second is a Romance-language model covering Spanish, Portuguese, French, Romanian, and Italian.
By democratizing and promoting access to language model technology worldwide, we look forward to supporting the development of research and academia in every country. If you are interested in our research or would like to join us, please feel free to email us at kevin.ko@tunib.ai or join our Discord channel. Thanks.