
METRO: An LLM Pre-Training Solution That Outperforms Others

Microsoft researchers release a paper detailing their work on METRO, a new solution for pre-training accurate large-scale language models at comparatively efficient training speeds.
Microsoft researchers Xiong et al. have released a paper describing their work over the past year on training large-scale language models. The paper outlines their continued exploration of ELECTRA-style model pre-training and goes on to present a solution capable of training models that outperform existing large-scale language models on numerous benchmark tasks.


How does METRO help pre-train language models?

The team calls its solution METRO (Model-generated dEnoising TRaining Objective). METRO extends ELECTRA-style denoising pre-training with a suite of additional techniques and adjustments aimed at improving both training quality and efficiency.
The style of language model training in focus here is word masking: removing (masking) words in a sentence and having the model learn to fill in the blanks, i.e., to reconstruct the original, unmodified sentence from the corrupted input. This method is called MLM (masked language modeling) and is the approach that BERT, a popular language model, is known for using.
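As a rough sketch of the idea (not code from the paper), the MLM setup boils down to masking random tokens and keeping the originals as prediction targets:

```python
# Toy illustration of MLM-style masking: mask random tokens and record the
# originals as the targets the model must recover. Function names are
# hypothetical, chosen only for this example.

import random

def mask_for_mlm(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Return (masked_input, targets); targets are None where nothing was masked."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)       # the model must predict the original word here
        else:
            masked.append(tok)
            targets.append(None)      # this position is ignored by the MLM loss
    return masked, targets

print(mask_for_mlm("the cat sat on the mat".split(), mask_prob=0.4))
```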
METRO uses a modified version of this technique: instead of leaving masked blanks for the model to fill in, it replaces certain words with plausible alternatives generated by an auxiliary language model, and the main model learns to identify and recover from these corruptions. Because the corruptions are model-generated rather than simple mask tokens, the training signal is far more varied than with plain masking, allowing for more dynamic learning.
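Below is a hedged, toy-scale sketch of that corruption scheme, not the authors' implementation. `VOCAB`, `auxiliary_fill`, and `make_metro_example` are hypothetical stand-ins; in METRO the auxiliary model is a trained masked language model rather than a random sampler:

```python
# ELECTRA/METRO-style corruption in miniature: an auxiliary model fills masked
# positions with plausible words, and the main model's targets mark which
# tokens were replaced relative to the original sentence.

import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran", "rug"]

def auxiliary_fill(tokens):
    """Stand-in auxiliary model: propose a plausible word for each [MASK]."""
    return [random.choice(VOCAB) if t == "[MASK]" else t for t in tokens]

def make_metro_example(tokens, mask_prob=0.15):
    # 1. Mask a subset of positions, exactly as in MLM.
    masked = [("[MASK]" if random.random() < mask_prob else t) for t in tokens]
    # 2. Replace each mask with an auxiliary-model prediction.
    corrupted = auxiliary_fill(masked)
    # 3. Label every position: 1 if the token now differs from the original.
    #    The main model trains on the corrupted sentence, learning to detect
    #    (and recover from) the model-generated corruptions.
    labels = [int(c != o) for c, o in zip(corrupted, tokens)]
    return corrupted, labels

print(make_metro_example("the cat sat on the mat".split(), mask_prob=0.4))
```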
Beyond this, METRO includes a suite of techniques that improve the efficiency and stability of large-scale model training, making it practical to keep using highly effective model architectures that are otherwise prone to training instability.

METRO's trained models and benchmarks

Using METRO, the team trained a series of autoencoding language models under the name METRO-LM.
The paper shows that the METRO-LM models outperform many other models, such as DeBERTa and ERNIE, on a variety of benchmarks, including MNLI and the broader GLUE suite, all while maintaining highly efficient training times, relatively small model sizes, and impressive scalability.


Find out more

Tags: ML News