Textbooks Are All You Need?
Perhaps LLMs and humans are more similar than we originally thought?
Created on September 18|Last edited on September 18
A few months ago, Microsoft released research on a new approach to creating synthetic datasets for Python programming. In their paper, "Textbooks Are All You Need," the researchers introduced a way to curate datasets in a "textbook-like" format. The method aims to teach machine learning models how to reason from clear, instructive examples, rather than having them simply process massive amounts of data. The resulting model, with just 1.3 billion parameters, performed competitively with much larger models such as Llama 7B. The research also used advanced models such as GPT-4 and GPT-3.5 to refine these datasets, filtering out irrelevant information and concentrating on useful examples for more efficient training. Additionally, the researchers used language models to generate a diverse range of full training examples, with prompts designed to align the data with a "textbook" format.
Textbooks are all you need V1 (Python Programming Only)
As can be seen, some striking emergent abilities appeared after training on the Code Exercises dataset, a mere 180M-token Python dataset generated by GPT-3.5.

Results from Textbooks are all you need v1
The researchers said the goal of this dataset was to align the model to perform function completion tasks based on natural language instructions. Below is a sample from the dataset.
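As a rough illustration of what a function-completion exercise looks like, here is a hypothetical example written for this post (it is not an entry from the actual Code Exercises dataset, and the function name is made up). The signature and docstring act as the natural-language instruction, and the body is what the model would be expected to complete.

```python
def count_vowels(text: str) -> int:
    """Return the number of vowels (a, e, i, o, u) in `text`, ignoring case."""
    # Everything below this comment is the "completion" part of the exercise:
    # the model sees the signature and docstring and must produce the body.
    return sum(1 for ch in text.lower() if ch in "aeiou")


if __name__ == "__main__":
    print(count_vowels("Textbooks"))
```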

From a human perspective, the left sample below certainly seems more valuable in an educational setting, so the idea of giving models higher-quality data so they learn faster and more efficiently seems reasonable. 

The work showed some amazing results for shrinking down Python programming models. However, there was no mention of whether this would translate to tasks besides programming.
Reasoning with Fewer Parameters
Well, it certainly looks promising. In Microsoft's new paper, "Textbooks Are All You Need II: phi-1.5 technical report," the researchers extended their methodology to areas beyond programming, such as common sense reasoning and natural language understanding. Using a similar "textbook" dataset for these tasks, they found that smaller models trained on this specialized data could achieve results competitive with their much larger counterparts while using significantly fewer resources. 
Training Data
The training set for Phi-1.5 combines the existing 7B tokens from its predecessor, Phi-1, with approximately 20B tokens of newly generated synthetic data designed to teach the model common sense reasoning and general world knowledge. The new synthetic data was generated from prompts seeded with 20,000 topics, with samples from web datasets added to the prompts for diversity. It's noteworthy that the only non-synthetic part of the training set is the 6B tokens of filtered code data from Phi-1's training.
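To put those numbers in perspective, the short calculation below works out what share of the training set is synthetic. It is a back-of-the-envelope sketch using only the figures quoted above; the 20B figure is approximate, so the resulting shares are too, and the variable names are just for illustration.

```python
# Token counts in billions, as quoted in the text above.
PHI1_TOKENS_B = 7.0      # Phi-1 training data reused for Phi-1.5
NEW_SYNTHETIC_B = 20.0   # newly generated synthetic data (approximate)
FILTERED_CODE_B = 6.0    # filtered code data, the only non-synthetic part


def training_mix() -> dict:
    """Summarize the synthetic vs. non-synthetic split of the training set."""
    total = PHI1_TOKENS_B + NEW_SYNTHETIC_B
    synthetic = total - FILTERED_CODE_B
    return {
        "total_b": total,
        "synthetic_b": synthetic,
        "synthetic_share": synthetic / total,
        "non_synthetic_share": FILTERED_CODE_B / total,
    }


mix = training_mix()
print(f"total: ~{mix['total_b']:.0f}B tokens")
print(f"synthetic: ~{mix['synthetic_share']:.0%}")
print(f"non-synthetic (filtered code): ~{mix['non_synthetic_share']:.0%}")
```

Under these figures, roughly three quarters of the tokens are synthetic.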


Filtered Web Data
To explore the impact of traditional web data, two other variants were developed: Phi-1.5-web-only and Phi-1.5-web. The first is trained purely on 95B tokens of filtered web data, while the second is trained on a mixture of web data, code data, and synthetic NLP data in proportions of approximately 40%, 20%, and 40%, respectively.
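One simple way to picture a 40/20/40 mixture is as weighted sampling over data sources. The sketch below is only an illustration of that idea: it assumes the percentages act as per-example sampling probabilities, which the text above does not specify, and the source names and `sample_sources` helper are made up for this example.

```python
import random
from collections import Counter

# Approximate mixture for Phi-1.5-web, as quoted in the text above.
MIXTURE = {"filtered_web": 0.4, "code": 0.2, "synthetic_nlp": 0.4}


def sample_sources(n: int, seed: int = 0) -> list:
    """Draw n data-source labels according to the mixture weights."""
    rng = random.Random(seed)  # seeded for reproducibility
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


counts = Counter(sample_sources(10_000))
for name in MIXTURE:
    print(f"{name}: {counts[name] / 10_000:.1%}")
```

With enough draws, the observed fractions settle close to the target weights.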
Performance Comparison
Phi-1.5-web-only already outperforms all existing models of similar size. Even more intriguing is the Phi-1.5-web model, which, when trained along with synthetic data, performs comparably to models five times its size. Phi-1.5 also holds its own against other models, showing the potential of training on synthetic data.




Open Sourced
Phi-1.5 is open-sourced without any instruction fine-tuning or other stages of alignment. The goal is to help the research community tackle pressing questions around large language models, including in-context learning, mechanistic interpretability, and mitigation strategies for hallucination and toxic content generation. Given its manageable size, it offers an easier platform for experimentation than larger models.
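For readers who want to experiment, here is a minimal sketch of loading the model with the Hugging Face `transformers` library. It assumes the weights are published on the Hugging Face Hub under the ID `microsoft/phi-1_5` and that `transformers` and PyTorch are installed; neither detail comes from the text above. Because the model has no instruction tuning, the hypothetical `build_prompt` helper frames the request as text to be continued rather than as a chat message.

```python
def build_prompt(question: str) -> str:
    """Frame a question as plain text for a base (non-instruction-tuned) model to continue."""
    return f"Question: {question}\nAnswer:"


def complete(prompt: str, model_id: str = "microsoft/phi-1_5", max_new_tokens: int = 64) -> str:
    """Generate a continuation of `prompt`. The default model ID is an assumption."""
    # Imported lazily so build_prompt can be used without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `complete(build_prompt("Why is the sky blue?"))` downloads the weights on first use, so expect a delay and some disk usage.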
Conclusion
The Phi-1.5 model stands as a testament to the growing efficiency and potential of machine learning architectures. With its unique approach to training data selection and its utilization of advanced architectural features, Phi-1.5 offers not only remarkable performance but also serves as a pathway to more resource-efficient and effective machine learning models in the future.
The papers:
