
Microsoft Publishes Orca: A 13B Parameter ChatGPT Rival

Microsoft researchers are quickly approaching ChatGPT performance by explanation-tuning a 13B-parameter large language model (LLM). Read more in this article.
AI researchers are constantly working to build better, smarter, and more powerful models. An interesting question that has recently surfaced is whether these models can supervise their own or other AI models' behavior. Research suggests that it's possible by sampling output from a model, generating revisions, and fine-tuning the model based on these revisions. This strategy has yielded models with more controlled behavior.

False Promises...

A wave of studies has employed Large Foundation Models (LFMs) like ChatGPT and GPT-4 as teachers to generate expansive datasets. Smaller models, such as Alpaca, WizardLM, and Vicuna, are then trained using these datasets. Though these models can match their teachers' style, they struggle to replicate their advanced reasoning and comprehension skills.

Misleading Evals

While conventional evaluation methods suggest that Vicuna retains 92% of ChatGPT’s quality, closer inspection reveals a much different picture. When compared against human labels on reasoning benchmarks, Vicuna maintains just 64% of ChatGPT’s quality on professional and academic exams and a mere 48% on complex benchmarks. This discrepancy reveals limitations in existing evaluation protocols and underscores the models' significant shortcomings in reasoning and comprehension.

Microsoft's Solution

To tackle these challenges, researchers from Microsoft propose two main strategies: explanation tuning and scaling up the task and instruction datasets. Explanation tuning augments pairs of queries and responses with detailed explanations from GPT-4, providing additional signals for learning. The second strategy leverages a large public collection of tasks and instructions, the Flan 2022 Collection. Concretely, ChatGPT answered 5 million of these augmented queries, and GPT-4 answered a subset of 1 million queries sampled from them, with ChatGPT acting as an intermediate teaching assistant to support progressive learning. The researchers trained a 13-billion-parameter LLM with explanation tuning, and the results were very surprising!

Explanation Tuning

Explanation tuning provides notable advantages over the ShareGPT approach used for training models like Vicuna. The fundamental strength of explanation tuning lies in its ability to offer deeper insight into the reasoning process of teacher models. It augments the standard imitation learning data, which consist of query-response pairs, with detailed explanations from a model like GPT-4. These explanations provide the student model with additional context and understanding of how the teacher model generates responses, thereby enriching the learning signals.


In contrast, Vicuna was trained using ShareGPT, a platform for sharing user-generated conversations with language models. ShareGPT, while valuable for capturing human-like conversational style, is limited in task diversity and reasoning depth. It favors creative content generation and information-seeking queries, but often overlooks complex reasoning tasks. Moreover, ShareGPT's user-reliant data collection poses scalability challenges, further limiting the models' learning experiences.
Explanation tuning equips the student model to more closely mimic the "thought" process of large language models, improving performance on sophisticated reasoning tasks. This makes explanation tuning a substantial improvement over the ShareGPT approach.
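Concretely, the difference shows up in the training data itself: an explanation-tuned example carries a system instruction that asks the teacher to reason step by step, and the teacher's detailed explanation becomes the training target. Here is a minimal sketch of that contrast; the field names, the example question, and the exact system instruction are illustrative assumptions rather than the paper's templates.

```python
# A plain imitation-learning pair: the student only sees the final answer.
plain_example = {
    "query": "If a train travels 120 km in 2 hours, what is its average speed?",
    "response": "60 km/h",
}

# An explanation-tuned pair: a system instruction asks the teacher (e.g. GPT-4)
# to reveal its reasoning, and that explanation is what the student learns to produce.
explanation_tuned_example = {
    "system_instruction": (
        "You are a helpful assistant. Think step by step and justify your answer."
    ),
    "query": "If a train travels 120 km in 2 hours, what is its average speed?",
    "response": (
        "Average speed is total distance divided by total time. "
        "The train covers 120 km in 2 hours, so 120 / 2 = 60. "
        "Therefore the average speed is 60 km/h."
    ),
}
```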

Training

Orca's training involved two stages. First, 5 million instructions, referred to as FLAN-5M, were generated and used for the initial training. An additional dataset, FLAN-1M, was created by randomly sampling 1 million queries from the FLAN-5M set. The Azure OpenAI API was used to collect responses from ChatGPT (GPT-3.5-turbo) for the FLAN-5M dataset and from GPT-4 for the FLAN-1M dataset.
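As a rough sketch of these two collection passes, the snippet below asks each teacher for an answer to every system-instruction/query pair. The function names and the sequential loop are assumptions made for readability, and it uses the standard openai client as a stand-in for the Azure OpenAI service the researchers actually used.

```python
import random
from openai import OpenAI

client = OpenAI()  # stand-in for the Azure OpenAI service used in the paper

def collect_responses(queries, model):
    """Ask the teacher model to answer each (system_instruction, user_query) pair."""
    answers = []
    for system_instruction, user_query in queries:
        completion = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_instruction},
                {"role": "user", "content": user_query},
            ],
        )
        answers.append(completion.choices[0].message.content)
    return answers

def build_teacher_datasets(flan_5m_queries, gpt4_subset_size=1_000_000):
    """FLAN-5M: ChatGPT answers every query; FLAN-1M: GPT-4 answers a random subset."""
    chatgpt_answers = collect_responses(flan_5m_queries, model="gpt-3.5-turbo")
    gpt4_queries = random.sample(flan_5m_queries, k=gpt4_subset_size)
    gpt4_answers = collect_responses(gpt4_queries, model="gpt-4")
    return chatgpt_answers, gpt4_answers
```

In practice this kind of collection is heavily batched and parallelized; the sequential loop is only meant to make the two passes explicit.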

The Datasets

Orca was then trained in two stages. The initial training used the FLAN-5M dataset, with responses from ChatGPT. The second stage utilized the FLAN-1M set, using responses from GPT-4. This method took advantage of ChatGPT as an intermediary teacher assistant, which helped bridge the capacity gap between Orca and GPT-4, and also provided cost and time benefits.
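A minimal sketch of that progressive schedule, using the Hugging Face Trainer as a stand-in for the paper's actual training stack, might look like the following. The base checkpoint, the hyperparameters, and the assumption that the two datasets are already tokenized are all illustrative, not details from the paper.

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

def finetune(model, dataset, output_dir):
    """One supervised fine-tuning pass over an already-tokenized instruction dataset."""
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=4,             # illustrative value, not from the paper
        per_device_train_batch_size=4,  # illustrative value, not from the paper
    )
    Trainer(model=model, args=args, train_dataset=dataset).train()
    return model

def train_orca(flan_5m_dataset, flan_1m_dataset):
    """Progressive fine-tuning: the ChatGPT-taught stage first, then the GPT-4-taught stage."""
    # Placeholder 13B student checkpoint; Orca builds on LLaMA-13B,
    # but this exact repo name is an assumption made for the sketch.
    model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-13b")
    model = finetune(model, flan_5m_dataset, output_dir="orca-stage1")  # stage 1: FLAN-5M / ChatGPT
    model = finetune(model, flan_1m_dataset, output_dir="orca-stage2")  # stage 2: FLAN-1M / GPT-4
    return model
```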

Efficiency

The training was carried out on 20 NVIDIA A100 GPUs, and data collection from ChatGPT and GPT-4 took 2 and 3 weeks respectively. Upon completion of training, the model's performance was rigorously evaluated on multiple abilities, including writing, comprehension, and analytical, mathematical, and logical reasoning.

Evaluation

Orca was then compared against several baseline models, including Text-Davinci-003 (TD-003), ChatGPT, GPT-4, and Vicuna, each of which has distinct capabilities and is optimized for different tasks. These comparisons help to assess the effectiveness of the training and the overall performance of the Orca model.
Furthermore, the researchers propose a more comprehensive evaluation protocol to assess Orca's generative, reasoning, and comprehension abilities. This includes auto-evaluation with GPT-4 on existing evaluation sets, academic benchmarks like Big-Bench Hard and TruthfulQA, and professional and academic exams such as the SAT, LSAT, GRE, and GMAT from AGIEval.
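The auto-evaluation step works roughly like an LLM-as-judge setup: GPT-4 scores each candidate answer against a reference. The sketch below is a hypothetical version of it; the prompt wording and the 1-to-10 scale are assumptions, not the exact protocol from the paper.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical judging prompt; the paper's actual prompt and scoring scale may differ.
JUDGE_PROMPT = """Rate the assistant's answer to the question on a scale of 1 to 10,
considering helpfulness, accuracy, and level of detail. Reply with the number only.

Question: {question}
Reference answer: {reference}
Assistant's answer: {candidate}"""

def judge(question, reference, candidate):
    """Ask GPT-4 to score a candidate answer against a reference answer."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate
        )}],
    )
    return completion.choices[0].message.content.strip()
```

Relative quality numbers like the ones reported below then come from comparing the aggregate scores the judge assigns to each model's answers.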
The evaluation results demonstrate that Orca, following its two-stage training process, is competitive with larger models like GPT-4 and exhibits substantial improvements over models like Vicuna. According to GPT-4's assessment across various datasets, Orca retained 95% of ChatGPT's quality and 85% of GPT-4's quality. This is a noteworthy achievement, as it represents a 10-point improvement over Vicuna in aggregate. Here are some of the result charts from the paper.

Results on the Big Bench Hard Dataset
Results on the AGIEval dataset

Open Source Milestone

Orca's impressive performance, despite its relatively small size of 13 billion parameters, is a testament to the effectiveness of explanation tuning and model-generated data. By leveraging explanation tuning and progressive learning, Orca showcases that size isn't the sole determinant of a model's capability. Microsoft's plan to open-source Orca is also super exciting, and it should be interesting to see what the community does with this model!