Microsoft unveils instruction pre-training: The future of LLM pre-training?
A new way to pre-train LLMs using augmented data!
In the quest to advance the capabilities of language models (LMs), a recent paper titled "Instruction Pre-Training: Language Models are Supervised Multitask Learners" offers a compelling approach that merges the benefits of supervised multitask learning with the scalability of pre-training. The paper, authored by researchers from Microsoft Research and Tsinghua University, details Instruction Pre-Training (Instruct PT) and demonstrates consistent gains over traditional pre-training methods.
Building On Next-Token Prediction
Unsupervised multitask pre-training, plain next-token prediction over raw corpora (what the paper calls Vanilla Pre-Training), has been the bedrock of recent advances in language models. However, unsupervised methods fall short of harnessing the full potential of supervised multitask learning, which offers better generalization and performance. The challenge lies in scaling supervised multitask learning effectively, and that is the gap Instruction Pre-Training aims to close.
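To ground the terminology: both Vanilla PT and Instruct PT are trained with the same next-token prediction objective; what differs is the text fed into it. Below is a minimal PyTorch sketch of that objective (my own illustration, not code from the paper):

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Causal LM loss: predict the token at position t+1 from tokens up to t.

    logits:    (batch, seq_len, vocab_size) output of any causal LM
    input_ids: (batch, seq_len) token ids of the training text
    """
    # Drop the last logit and the first token so positions line up:
    # the logit at position t is scored against the token at t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```

Instruction Pre-Training leaves this loss untouched; the innovation is entirely on the data side.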

Instruction Pre-Training
Instruction Pre-Training stands out as a novel framework that augments raw corpora with instruction-response pairs. These pairs are generated by an instruction synthesizer, which is fine-tuned on a diverse set of tasks. The key idea is to provide models with structured, task-oriented data during pre-training, thereby improving their ability to generalize to various downstream tasks.
The instruction synthesizer generates instruction-response pairs from raw text, and because it is tuned on a broad collection of existing datasets, the synthesized pairs achieve high coverage and correctness. In general pre-training from scratch, only part of the raw corpora is converted into instruction-augmented texts, which are then mixed with the remaining raw text and the synthesizer's fine-tuning data. For domain-adaptive continual pre-training, all domain-specific raw corpora are converted into instruction-augmented texts, boosting domain-specific performance. A rough sketch of this data flow appears below.
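The paper does not ship this pipeline as pseudocode, so the following is a loose Python illustration under stated assumptions: the `Synthesizer` callable stands in for the fine-tuned synthesizer LM, and the Instruction/Response template and the 20% default augmentation fraction are placeholders of mine, not the paper's exact format or ratio.

```python
import random
from typing import Callable

# A synthesizer maps raw text to instruction-response pairs. In the paper it
# is an LM fine-tuned on a broad mix of existing datasets; here it is just
# a callable type so the data flow is runnable.
Synthesizer = Callable[[str], list[dict]]

def augment(raw_text: str, synthesize: Synthesizer) -> str:
    """Append synthesized instruction-response pairs to the raw text, so the
    augmented example still trains with ordinary next-token prediction."""
    pairs = synthesize(raw_text)
    qa_block = "\n\n".join(
        f"Instruction: {p['instruction']}\nResponse: {p['response']}"
        for p in pairs
    )  # the Instruction/Response template is a placeholder, not the paper's
    return raw_text + "\n\n" + qa_block

def build_pretraining_mix(corpus: list[str], synthesize: Synthesizer,
                          augment_fraction: float = 0.2) -> list[str]:
    """General pre-training from scratch: convert only a fraction of the
    corpus (0.2 here is an arbitrary placeholder) and leave the rest raw.
    For domain-adaptive continual pre-training, use augment_fraction=1.0."""
    return [
        augment(doc, synthesize) if random.random() < augment_fraction else doc
        for doc in corpus
    ]
```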
Performance Highlights
Instruct PT is remarkably data-efficient: a 500M model pre-trained on 100B tokens matches the performance of a 1B model pre-trained on 300B tokens, three times the data. Additionally, models pre-trained with Instruct PT benefit more from subsequent instruction tuning, outperforming their Vanilla PT counterparts on benchmarks like MMLU.

In the biomedicine domain, Instruct PT outperforms both Vanilla PT and the base Llama3-8B model:

| Task | Llama3-8B | Vanilla PT | Instruct PT |
| --- | --- | --- | --- |
| PubMedQA | 59.8 | 65.1 | 68.7 |
| ChemProt | 27.6 | 42.4 | 47.2 |
| RCT | — | 72.4 | 73.4 |
| Average | 53.6 | 58.4 | 61.3 |

The RCT gain is small (73.4 vs. 72.4), but the consistency across tasks points to the robustness of the approach.
In the finance domain, Instruct PT again leads on most tasks:

| Task | Llama3-8B | Vanilla PT | Instruct PT |
| --- | --- | --- | --- |
| ConvFinQA | 49.9 | 62.9 | 74.6 |
| Headline | 81.1 | 84.7 | 87.1 |
| NER | 72.8 | 64.9 | 63.6 |
| Average | 70.1 | 72.0 | 74.7 |

The one exception is NER, where Instruct PT (63.6) trails not only Vanilla PT (64.9) but also the base model (72.8). Even so, the domain average for Instruct PT (74.7) remains the highest of the three.

Comparative Analysis
The study reveals that Llama3-8B models with Instruct PT can rival or even surpass the performance of much larger models like Llama3-70B. This is particularly evident in the finance domain, where Instruct PT achieves an average score of 74.7 compared to 71.9 for Llama3-70B. This highlights the efficiency and effectiveness of the Instruction Pre-Training approach in enhancing smaller models to compete with significantly larger models.
Instruction Pre-Training not only improves the overall performance across various benchmarks but also enhances the models' ability to generalize to unseen tasks. The use of instruction-response pairs provides additional supervised signals that help models learn better representations and improve their performance on both general and domain-specific tasks.
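To make those additional supervised signals concrete, here is a toy run of the `build_pretraining_mix` sketch from earlier, with a rule-based stand-in for the synthesizer LM (again, purely illustrative):

```python
# Toy synthesizer: a real one is a fine-tuned LM, not a fixed rule.
def toy_synthesize(text: str) -> list[dict]:
    return [{"instruction": "Summarize the passage above in one sentence.",
             "response": text[:60] + "..."}]

corpus = ["Aspirin irreversibly inhibits cyclooxygenase, reducing prostaglandin synthesis."]
augmented = build_pretraining_mix(corpus, toy_synthesize, augment_fraction=1.0)
print(augmented[0])
# The augmented text trains under the exact same next-token loss as raw
# text; the appended instruction-response pair is the extra supervised signal.
```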
Future Directions
Future work could focus on improving the accuracy and quality of the synthetic instruction-response pairs to further reduce potential hallucinations. Exploring the balance between the quantity and quality of synthetic data and scaling Instruction Pre-Training to larger datasets and models could provide further insights.
Conclusion
Instruction Pre-Training emerges as a powerful method to advance language model performance. By integrating supervised learning signals into the pre-training process, the approach significantly enhances both the general and domain-specific capabilities of language models. As researchers refine and expand the framework, we can expect more robust and efficient language models, capable of tackling a wider array of tasks with greater accuracy.
This paper offers a promising glimpse into the future of language model training, highlighting the potential of Instruction Pre-Training to redefine the landscape of natural language processing. For those in the machine learning community, this represents an exciting step forward in the ongoing evolution of AI capabilities.