Polyglot-Ko: Open-Source Korean Autoregressive Language Model
Technical Report of Polyglot-Ko
1. What's the Polyglot project?
2. Datasets
2.1. Data Collection
2.2. Data Analysis
2.3. Data preprocessing
2.3.1. Personally Identifiable Information Filtering
2.3.2. Quality Filtering
3. Training procedure & Models
3.1. Polyglot-ko-1.3B
3.2. Polyglot-ko-3.8B
3.3. Polyglot-ko-5.8B
4. Performance Evaluation
4.1. COPA (F1 score)
4.2. HellaSwag (F1 score)
5. Checkpoints Merging & Model Upload
6. What's Next?
1. What's the Polyglot project?
Polyglot is a project for building multilingual language models. Various multilingual models such as mBERT, BLOOM, and XGLM have already been released, so why should we make new ones? Before answering that question, we would ask another: "Why do people around the world keep building monolingual models in their own languages even though many multilingual models already exist?" One of the most significant reasons is dissatisfaction with the non-English performance of current multilingual models. We therefore want to build multilingual models with stronger non-English performance. That is why we are making multilingual models again, and why we named the project 'Polyglot'.
However, the Polyglot Korean models introduced in this post are not multilingual. When we started this research, we already had 1.2TB of Korean data collected by TUNiB. Rather than waiting until we had gathered a large multilingual corpus, we decided to first train Korean models with the data we already had: 1) we could practice multi-node training, 2) the resulting models could serve as a baseline for performance comparison with the multilingual models, and 3) the models themselves would be useful to many Korean companies and researchers. For these reasons, we built the Korean models before the multilingual ones.
2. Datasets
2.1. Data Collection
We collected 863 GB of Korean language data (1.2TB before processing), a large-scale dataset curated together with TUNiB. The data collection process abided by South Korean laws. Because this dataset was collected specifically for training the Polyglot-Ko models, it will not be released for public use.
| Source | Size (GB) | Link |
|---|---|---|
| Korean blog posts | 682.3 | - |
| Korean news dataset | 87.0 | - |
| Modu corpus | 26.4 | corpus.korean.go.kr |
| Korean patent dataset | 19.0 | - |
| Korean Q & A dataset | 18.1 | - |
| KcBert dataset | 12.7 | github.com/Beomi/KcBERT |
| Korean fiction dataset | 6.1 | - |
| Korean online comments | 4.2 | - |
| Korean wikipedia | 1.4 | ko.wikipedia.org |
| Clova call | < 1.0 | github.com/clovaai/ClovaCall |
| Naver sentiment movie corpus | < 1.0 | github.com/e9t/nsmc |
| Korean hate speech dataset | < 1.0 | - |
| Open subtitles | < 1.0 | opus.nlpl.eu/OpenSubtitles.php |
| AIHub various tasks datasets | < 1.0 | aihub.or.kr |
| Standard Korean language dictionary | < 1.0 | stdict.korean.go.kr/main/main.do |
2.2. Data Analysis
We analyzed the data in order to reduce risks that can arise during training and inference. For example, empty or very short texts, repeated words and characters, and duplicated data instances can be problematic during model training, while personally identifiable information (PII) can be problematic at inference time.
As a result of examining the data from the perspective of reducing these risks, we were able to classify them into four types:
- Data available to train: mainly news and Wikipedia data, which contain a lot of information and sufficiently long text.
- Data that needs surrounding contextual information to be useful for training: mainly blog and news data consisting of many short texts that were scraped incorrectly.
- Data with a lot of hate speech: mainly comment datasets from some community websites.
- NLP task data: data from NLP tasks such as text classification or entity recognition. It can be used for model training, but needs to be handled separately when evaluating the model.
While examining these data types, we also found quality issues in the data to be used for training. Since these issues can cause serious problems during model training, we catalogued them and added corresponding steps to the text preprocessing pipeline (a small sketch of some of these checks follows the list below).
- Empty text: no text in the data instance
- Unnecessary space: multiple unnecessary spaces in the data instance
- De-identification: personally identifiable information in the data instance
- Uncleaned HTML tags: HTML tags are not removed
- Deduplication: duplicated data instance (based on exact match)
- Broken code: only part of HTML or Markdown exists
- Short text: too short data instance
- Repeated character: repeated characters in the data instance
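As a concrete illustration, here is a minimal sketch of how a few of the checks above (empty/short text, unnecessary spaces, repeated characters, exact-match deduplication) could be implemented. The function names and thresholds are our own illustrative choices, not the actual preprocessing code.

```python
import hashlib
import re

def is_empty_or_short(text: str, min_chars: int = 64) -> bool:
    """Flag empty or too-short data instances."""
    return len(text.strip()) < min_chars

def normalize_spaces(text: str) -> str:
    """Collapse runs of unnecessary whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()

def has_repeated_chars(text: str, max_run: int = 10) -> bool:
    """Flag instances containing long runs of the same repeated character."""
    return re.search(r"(.)\1{%d,}" % max_run, text) is not None

def deduplicate_exact(documents):
    """Drop exact-match duplicates using a hash of the normalized text."""
    seen = set()
    unique = []
    for doc in documents:
        key = hashlib.md5(normalize_spaces(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```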
In particular, since the purpose of the Polyglot models is Korean text generation, HTML-like code was removed as much as possible.
Another important aspect of preprocessing is text length. The longer a text, the more contextual information the model can learn from it during training; the shorter a text, the less context is available for the text to be generated. We therefore wanted to see how the text lengths of each dataset we collected were distributed. We gathered about 50 datasets, sampled from each, and examined the length distribution of each dataset as a box plot.

The figure above shows the token length distribution for all datasets we collected. Most of the datasets contain short texts, and only a few contain long texts; among these, the long texts are mostly news articles and blog posts. Based on these results, we were able to set a filtering condition on text length, and we could expect the contextual information from news and blogs to be learned the most by the model during training.
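A minimal sketch of this kind of per-dataset length analysis is shown below. It assumes each dataset is available as a list of raw text strings, and uses the released Polyglot-Ko tokenizer as a stand-in for the tokenizer used at analysis time; the sampling size and plotting details are illustrative.

```python
import random
import matplotlib.pyplot as plt
from transformers import AutoTokenizer

# Stand-in tokenizer; the actual analysis used the project's own tokenizer.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/polyglot-ko-1.3b")

def sample_token_lengths(texts, n_samples=1000):
    """Sample documents from a dataset and measure their token lengths."""
    sampled = random.sample(texts, min(n_samples, len(texts)))
    return [len(tokenizer.encode(t)) for t in sampled]

def plot_length_distribution(datasets):
    """`datasets` is assumed to be a dict mapping dataset name -> list of documents."""
    lengths = {name: sample_token_lengths(texts) for name, texts in datasets.items()}
    plt.boxplot(list(lengths.values()), labels=list(lengths.keys()))
    plt.ylabel("tokens per document")
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.savefig("length_distribution.png")
```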
2.3. Data preprocessing
Based on these analysis results, we performed the following data preprocessing.
2.3.1. Personally Identifiable Information Filtering
We tried to minimize the damage caused by leakage of personally identifiable information by masking the following information in the data. At the same time, some potentially sensitive data is still useful for applying the language model to various tasks, so we masked only the bare minimum of critical personal information (a small regex-based sketch follows the list below).
- Phone Number: Masked landline and wireless phone (cell phone) numbers in the text to ‘<|tel|>’
- Resident Registration Number: Masked the resident registration number in the text to ‘<|rrn|>’
- Bank Account Number: Masked bank account numbers in the text to '<|acc|>'. We covered the account number formats of 12 banks: IBK, KB, NH, Shinhan, Woori, KEB, City Bank, DGB, BNK, SC, K-Bank, Kakao Bank
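For illustration, here is a hedged sketch of how such masking could be done with regular expressions. The patterns below are simplified examples for Korean phone numbers and resident registration numbers, not the exact rules used in our preprocessing, and the per-bank account number patterns are omitted.

```python
import re

# Simplified illustrative patterns; the actual preprocessing used more
# exhaustive rules (including account number formats for the 12 banks).
PHONE_PATTERN = re.compile(r"\b0\d{1,2}[-. ]?\d{3,4}[-. ]?\d{4}\b")
RRN_PATTERN = re.compile(r"\b\d{6}[- ]?[1-4]\d{6}\b")

def mask_pii(text: str) -> str:
    """Replace phone numbers and resident registration numbers with special tokens."""
    text = RRN_PATTERN.sub("<|rrn|>", text)    # mask RRNs first (13 digits)
    text = PHONE_PATTERN.sub("<|tel|>", text)  # then landline / mobile numbers
    return text

print(mask_pii("Contact: 010-1234-5678, RRN: 900101-1234567"))
# -> "Contact: <|tel|>, RRN: <|rrn|>"
```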
2.3.2. Quality Filtering
We borrowed the filtering logic described in [Gopher](https://arxiv.org/abs/2112.11446) and used it for data preprocessing. We judged that these filtering methods could sufficiently address the problems mentioned above (a minimal sketch of two such filters follows the list below).
- document length filter: filter out a text instance if its length does not meet a certain condition.
- mean word length filter: filter out a text if its average word length is unusually long.
- symbol-to-word ratio filter: filter out a text if it contains significantly more symbols than words; this was helpful for removing many HTML tags from the dataset.
- bullet and ellipsis filter: filter out a text if it contains many bullets or ellipses.
- alphabetic word ratio filter: filter out a text if it contains a large proportion of characters other than language characters.
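Below is a minimal sketch of what two of these filters might look like in practice. The thresholds and symbol set are illustrative only, not the exact values used by us or by the Gopher paper.

```python
import re

def symbol_to_word_ratio_filter(text: str, max_ratio: float = 0.1) -> bool:
    """Return True if the document should be kept.

    Documents with too many symbols (e.g. '#', '...', leftover HTML
    punctuation) relative to words are discarded.
    """
    words = text.split()
    if not words:
        return False
    symbols = len(re.findall(r"[#*<>{}\[\]|\\^~]|\.{3}", text))
    return symbols / len(words) <= max_ratio

def mean_word_length_filter(text: str, min_len: float = 1.0, max_len: float = 10.0) -> bool:
    """Discard documents whose average word length is implausibly short or long."""
    words = text.split()
    if not words:
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    return min_len <= mean_len <= max_len

def keep_document(text: str) -> bool:
    return symbol_to_word_ratio_filter(text) and mean_word_length_filter(text)
```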
3. Training procedure & Models
For Polyglot model training, we used EleutherAI's GPT-NeoX codebase and, with the great help of Stability AI, trained the models on 256 A100 GPUs (32 nodes * 8 GPUs).
[Weights & Biases run set (480 runs): training curves]
3.1. Polyglot-ko-1.3B
The 1.3B model uses the following hyperparameters.
| Hyperparameter | Value |
|---|---|
| n_parameters | 1,331,810,304 |
| n_layers | 24 |
| d_model | 2,048 |
| d_ff | 8,192 |
| n_heads | 16 |
| d_head | 128 |
| n_ctx | 2,048 |
| n_vocab | 30,003 / 30,080 |
| Positional Encoding | Rotary Position Embedding (RoPE) |
| RoPE Dimensions | 64 |
Since the 1.3B model was small enough to train without model parallelism, model parallelism was not applied, and the total batch size was set to 1024. During training we observed overfitting, with the loss on both the training and validation data dropping sharply from around 100,000 steps.
In fact, we also observed a sharp performance drop when running inference with those later checkpoints. Therefore, we selected the final model after evaluating and verifying checkpoints from before the point where overfitting occurred.
3.2. Polyglot-ko-3.8B
The 3.8B model uses the following hyperparameters.
| Hyperparameter | Value |
|---|---|
| n_parameters | 3,809,974,272 |
| n_layers | 32 |
| d_model | 3,072 |
| d_ff | 12,288 |
| n_heads | 24 |
| d_head | 128 |
| n_ctx | 2,048 |
| n_vocab | 30,003 / 30,080 |
| Positional Encoding | Rotary Position Embedding (RoPE) |
| RoPE Dimensions | 64 |
We used pipeline parallelism for training the 3.8B model, and the overall batch size was the same as for the 1.3B model. As shown in the training curves above, overfitting began around 100,000 steps, similar to the 1.3B model, so we stopped training at that point.
3.3. Polyglot-ko-5.8B
The 5.8B model uses the following hyperparameters.
| Hyperparameter | Value |
|---|---|
| n_parameters | 5,885,059,072 |
| n_layers | 28 |
| d_model | 4,096 |
| d_ff | 16,384 |
| n_heads | 16 |
| d_head | 256 |
| n_ctx | 2,048 |
| n_vocab | 30,003 / 30,080 |
| Positional Encoding | Rotary Position Embedding (RoPE) |
| RoPE Dimensions | 64 |
Pipeline parallelism was also applied to the 5.8B model, and the overall batch size was reduced to 1/4 of that used for the 1.3B and 3.8B models. Because of this smaller batch size, overfitting did not occur up to 320,000 steps, the total number of training steps, and we confirmed that model performance increased steadily as training progressed.
4. Performance Evaluation
We performed a few-shot prompting based evaluation of the Polyglot Korean models using the KOBEST benchmark dataset. Using the prompts described in the KOBEST paper, we compared our models against large-scale Korean models published on the Hugging Face Hub; the results are shown in the tables below. Here 'n' refers to the number of few-shot examples. For a fair comparison, all models were run under the same conditions and with the same prompts.
You can reproduce these results using the polyglot branch of lm-evaluation-harness and the following script:
```bash
python main.py \
    --model gpt2 \
    --model_args pretrained='EleutherAI/polyglot-ko-1.3b' \
    --tasks kobest_copa,kobest_hellaswag \
    --num_fewshot $YOUR_NUM_FEWSHOT \
    --batch_size $YOUR_BATCH_SIZE \
    --device $YOUR_DEVICE \
    --output_path $YOUR_PATH
```
4.1. COPA (F1 score)

Performance of COPA task
| Model | params | n=0 | n=5 | n=10 | n=50 |
|---|---|---|---|---|---|
| skt/ko-gpt-trinity-1.2B-v0.5 | 1.2B | 0.6696 | 0.6477 | 0.6419 | 0.6514 |
| kakaobrain/kogpt | 6.0B | 0.7345 | 0.7287 | 0.7277 | 0.7479 |
| facebook/xglm-7.5B | 7.5B | 0.6723 | 0.6731 | 0.6769 | 0.7119 |
| EleutherAI/polyglot-ko-1.3b | 1.3B | 0.7196 | 0.7193 | 0.7204 | 0.7206 |
| EleutherAI/polyglot-ko-3.8b | 3.8B | 0.7595 | 0.7608 | 0.7638 | 0.7788 |
| EleutherAI/polyglot-ko-5.8b | 5.8B | 0.7745 | 0.7676 | 0.7775 | 0.7887 |
4.2. HellaSwag (F1 score)

Performance of HellaSwag task
| Model | params | n=0 | n=5 | n=10 | n=50 |
|---|---|---|---|---|---|
| skt/ko-gpt-trinity-1.2B-v0.5 | 1.2B | 0.4036 | 0.4 | 0.4011 | 0.4214 |
| kakaobrain/kogpt | 6.0B | 0.4599 | 0.456 | 0.4616 | 0.4754 |
| facebook/xglm-7.5B | 7.5B | 0.4261 | 0.437 | 0.4409 | 0.4517 |
| EleutherAI/polyglot-ko-1.3b | 1.3B | 0.4013 | 0.3984 | 0.417 | 0.4416 |
| EleutherAI/polyglot-ko-3.8b | 3.8B | 0.4438 | 0.4786 | 0.4737 | 0.4822 |
| EleutherAI/polyglot-ko-5.8b | 5.8B | 0.4853 | 0.482 | 0.4968 | 0.5012 |
We observed that the performance of our models increases as both the number of parameters and the number of few-shot examples increase. In addition, our models showed relatively good performance compared to existing Korean large-scale models.
5. Checkpoints Merging & Model Upload
In order to upload the trained checkpoints to the Hugging Face model hub, we needed to merge the sharded GPT-NeoX checkpoints into a single checkpoint file and convert it to the Hugging Face Transformers format. We therefore wrote scripts for this and proceeded as follows (a rough sketch of the merging step follows the list).
- Create config file: The GPT-NeoX config used for Polyglot training cannot be used directly with Hugging Face Transformers. Therefore, we created a config.json containing the model-related parameters from the training YAML files, excluding the parameters related to training.
- Merge models by layers: Next, we merge the sharded checkpoints. So that each layer becomes a single checkpoint, the tensors of an individual layer stored on each GPU are gathered into one on the master process (gpu0) and saved per layer, as shown in the figure below (e.g. model_layer_1.bin).
- Merge and convert the model checkpoint: Finally, we combine the per-layer checkpoints into a single file, rename the keys to those used by Hugging Face Transformers, and save the result as one state dict (e.g. model.bin).
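A rough sketch of the final merging step is shown below, assuming per-layer files named `model_layer_*.bin` as described above. The `rename_key` function is a placeholder: the real conversion script maps each GPT-NeoX parameter name onto the corresponding Hugging Face GPT-NeoX key, which is omitted here.

```python
import glob
import torch

def rename_key(neox_key: str) -> str:
    """Placeholder for the GPT-NeoX -> transformers key mapping.

    The real conversion renames each parameter (attention, MLP, layer norms,
    embeddings) to the names expected by the Hugging Face GPT-NeoX classes;
    here we simply pass the key through.
    """
    return neox_key

def merge_layer_checkpoints(layer_dir: str, output_path: str = "model.bin") -> None:
    """Merge per-layer checkpoint files into a single state dict file."""
    merged = {}
    for path in sorted(glob.glob(f"{layer_dir}/model_layer_*.bin")):
        layer_state = torch.load(path, map_location="cpu")
        for key, tensor in layer_state.items():
            merged[rename_key(key)] = tensor
    torch.save(merged, output_path)
```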
We successfully deployed our models, and you can check them out on the Hugging Face Hub.
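For example, the released checkpoints can be loaded directly with Hugging Face Transformers; the prompt and generation settings below are just for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/polyglot-ko-1.3b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/polyglot-ko-1.3b", torch_dtype=torch.float16
).to("cuda")

# "Korean language models are" as a sample Korean prompt
inputs = tokenizer("한국어 언어 모델은", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```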
6. What’s Next?
Currently, we are training a 12.8B Korean language model and plan to eventually scale up to 40B. We went through a lot of trial and error while building the Korean models, and based on this experience we are also building two multilingual models covering many languages. The first is an East Asian model covering Korean, Chinese, Japanese, Indonesian, Malay, Vietnamese, Thai, and English. The second is a Romance-language model covering Spanish, Portuguese, French, Romanian, and Italian.
By democratizing and promoting access to language model technology worldwide, we look forward to supporting research and academia in every country. If you are interested in our research or would like to join us, please feel free to email us at kevin.ko@tunib.ai or join our Discord channel. Thanks.