
AutoCoder: A New Benchmark in LLM Code Generation

A new model outperforming GPT-4 on HumanEval!
Researchers from the University of Connecticut and AIGCode have introduced AutoCoder, a Large Language Model (LLM) for code generation that surpasses previous models such as GPT-4 Turbo and GPT-4o. AutoCoder achieved a pass@1 score of 90.9% on the HumanEval benchmark, edging out the 90.2% scored by both GPT-4 Turbo (April 2024) and GPT-4o.

Key Innovations

AutoCoder stands out for its versatile code interpreter, which can install external packages, unlike its predecessors' interpreters, which are limited to built-in packages. This significantly broadens the range of code it can execute, making it more adaptable to real-world applications.
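To make the idea concrete, here is a minimal sketch of an interpreter loop that retries execution after installing any missing package. This illustrates the concept only; it is not AutoCoder's actual implementation, and the function name is hypothetical.

```python
# Minimal sketch: execute a code string, and if a module is missing,
# pip-install it and retry. Illustrative only, not AutoCoder's code.
import re
import subprocess
import sys


def run_with_auto_install(code: str, max_installs: int = 3) -> subprocess.CompletedProcess:
    """Execute a code string; if a module is missing, install it and retry."""
    for _ in range(max_installs + 1):
        result = subprocess.run([sys.executable, "-c", code],
                                capture_output=True, text=True)
        match = re.search(r"No module named '([\w\.]+)'", result.stderr)
        if result.returncode == 0 or match is None:
            return result  # success, or a failure unrelated to missing packages
        # Install the top-level package named in the error, then try again.
        subprocess.run([sys.executable, "-m", "pip", "install",
                        match.group(1).split(".")[0]], check=False)
    return result
```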

AIEV-INSTRUCT: The Training Method

AutoCoder's impressive performance is attributed to a novel training method called AIEV-INSTRUCT (Instruction Tuning with Agent-Interaction and Execution-Verified). This method involves creating a multi-turn dialogue dataset through agent interaction and external code execution verification. It reduces dependence on proprietary models and ensures the accuracy of the generated code dataset.

Stages of Training

AutoCoder's training process is divided into two main stages: the Teaching Stage and the Self-Learning Stage.
In the Teaching Stage, the process begins by initializing the necessary components: proprietary models like GPT-4 Turbo serve as agents in the roles of questioner and programmer. This setup ensures the generated data is diverse and does not converge to a specific dialogue template. Additionally, a Docker container is initialized as a Code Interpreter, responsible for installing required external packages and executing the code for verification.
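A minimal sketch of what such a Docker-backed Code Interpreter could look like, using the docker Python SDK. The base image, resource limits, and helper names are illustrative assumptions, not details from the paper:

```python
# Sketch of a long-lived sandbox container used to install packages and
# execute candidate code. Image name and limits are assumptions.
import docker

client = docker.from_env()

sandbox = client.containers.run(
    "python:3.10-slim",        # assumed base image
    command="sleep infinity",  # keep the container alive between exec calls
    detach=True,
    mem_limit="2g",
    network_disabled=False,    # network stays on so pip can fetch packages
)


def execute(code: str) -> tuple[int, str]:
    """Run a code string inside the sandbox; return (exit_code, output)."""
    exit_code, output = sandbox.exec_run(["python", "-c", code])
    return exit_code, output.decode()
```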
The next step involves proposing coding problems and solutions. GPT-4 Turbo generates these based on open-source code snippets and includes unit tests to verify the accuracy of the code. This generated code is then executed within the Code Interpreter. If any errors occur, the error messages (stderr) are appended to the dialogue messages and provided to the questioner. The questioner generates a natural language description based on these errors, prompting the programmer to continue modifying the code. This iterative process of generating, executing, and refining the code continues until it passes all unit tests. Once the code executes successfully, the complete dialogue, including the problem, solution, unit tests, and execution feedback, is added to the dataset.
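The paragraph above describes a generate-execute-refine loop; the sketch below captures its control flow. The questioner, programmer, and interpreter objects stand in for the GPT-4 Turbo agents and the Docker sandbox, and every name here is hypothetical, chosen only to mirror the prose:

```python
# High-level sketch of one Teaching Stage episode. All objects and method
# names are hypothetical stand-ins for the agents described in the text.
def teaching_stage_episode(snippet, questioner, programmer, interpreter,
                           max_rounds=5):
    problem = questioner.propose_problem(snippet)  # problem + unit tests
    messages = [problem]
    for _ in range(max_rounds):
        solution = programmer.write_code(messages)
        result = interpreter.run(solution.code, solution.unit_tests)
        if result.passed:
            # All unit tests pass: keep the full multi-turn dialogue.
            return messages + [solution, result]
        # Feed stderr back through the questioner as a natural-language
        # description of the failure, then let the programmer retry.
        messages += [solution, questioner.describe_error(result.stderr)]
    return None  # discard episodes that never pass their unit tests
```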
When the student model, AutoCoder, surpasses the proprietary teacher model in accuracy, the training transitions to the Self-Learning Stage. In this stage, AutoCoder itself assumes the roles of questioner and programmer, continuing the iterative process of generating and verifying code autonomously. This reduces reliance on the proprietary models and enables the model to refine its capabilities based on its own outputs.

Dataset Generation and Processing

The dataset for AutoCoder is generated using the AIEV-INSTRUCT method, starting from the Magicoder-Evol-Instruct and Magicoder-OSS-Instruct datasets, which together comprise 186,000 original code entries. After de-duplication and processing through the AIEV-INSTRUCT pipeline, the result is 169,000 high-quality code instruction samples spanning 241,000 rounds of dialogue. The dataset is then decontaminated against benchmark test sets such as HumanEval, MBPP, DS-1000, and MultiPL-E: any entry with more than 90% similarity to a test-set sample is removed, preserving the integrity of the evaluation.
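As an illustration of the decontamination step, the sketch below drops any training sample whose similarity to a benchmark sample exceeds 90%. The paper does not specify how similarity is computed, so difflib's ratio is used here purely as a stand-in:

```python
# Sketch of similarity-based decontamination. The 0.9 threshold comes from
# the text; the metric (difflib's SequenceMatcher) is an assumption.
from difflib import SequenceMatcher


def decontaminate(train_samples: list[str], benchmark_samples: list[str],
                  threshold: float = 0.9) -> list[str]:
    """Remove training samples too similar to any benchmark sample."""
    def too_similar(sample: str) -> bool:
        return any(SequenceMatcher(None, sample, bench).ratio() > threshold
                   for bench in benchmark_samples)
    return [s for s in train_samples if not too_similar(s)]
```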

Training Configuration

AutoCoder is trained on 40 A100 GPUs (80 GB each) on a Slurm (Simple Linux Utility for Resource Management) cluster, with the NVIDIA Collective Communications Library (NCCL) managing communication between GPUs.
The training process fine-tunes the base models, Deepseek-Coder 6.7B and 33B, on the AutoCoder-AIEV-Instruct dataset. Special tokens are added with the AutoTokenizer class from the transformers library to enable the Code Interpreter feature. Training parameters include a per-GPU batch size of 8, gradient accumulation over 4 steps, a learning rate of 5e-5, and mixed precision (bf16). The maximum sequence length is 5120, and training runs for 2 epochs, using the ZeRO Stage 3 feature of the DeepSpeed library for efficient parameter partitioning.
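Put together, the reported hyperparameters map onto a fine-tuning script roughly like the sketch below. The special-token strings, output directory, and DeepSpeed config filename are placeholders; the numeric values come from the text:

```python
# Sketch of the fine-tuning setup. Paths and token strings are placeholders;
# batch size, accumulation, lr, bf16, and epochs are from the text.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

base = "deepseek-ai/deepseek-coder-6.7b-base"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Special tokens marking Code Interpreter turns; the exact strings used by
# AutoCoder are not given here, so these are illustrative.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|execution|>", "<|end_execution|>"]})
model.resize_token_embeddings(len(tokenizer))

args = TrainingArguments(
    output_dir="autocoder-sft",          # placeholder path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    num_train_epochs=2,
    bf16=True,
    deepspeed="ds_zero3_config.json",    # ZeRO Stage 3 config (placeholder)
)
# The 5120-token maximum sequence length would be enforced during
# tokenization / data collation rather than in TrainingArguments.
```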

Evaluation

AutoCoder's performance is evaluated on several benchmarks, including HumanEval, HumanEval+, MBPP, MBPP+, MultiPL-E, and DS-1000, where its pass@1 scores compare favorably with state-of-the-art models. On HumanEval, AutoCoder achieved the highest pass@1 score; on the multilingual MultiPL-E benchmark it excelled in languages such as Java, C++, and Rust; and on the DS-1000 data science tasks it matched or surpassed leading models like GPT-4 Turbo.
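For reference, pass@k is conventionally computed with the unbiased estimator introduced alongside the HumanEval benchmark (Chen et al., 2021): given n generated samples per problem, c of which pass the unit tests, pass@k = 1 - C(n-c, k)/C(n, k). A standard implementation:

```python
# Unbiased pass@k estimator for a single problem with n samples, c of
# which pass all unit tests. pass@1 with n=1 reduces to the plain
# fraction of problems solved on the first try.
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Return the unbiased estimate of pass@k for one problem."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```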

Conclusion

AutoCoder represents a significant advancement in code generation LLMs, offering higher accuracy and versatility than previous models. The innovative AIEV-INSTRUCT method not only improves the model's performance but also reduces reliance on expensive proprietary datasets. This open-source model and its training methodology provide valuable resources for the AI community, paving the way for further innovations in code generation.
Tags: ML News