StarCoder: A New Open Source Large Language Model for Code
HuggingFace vs. OpenAI?
Created on May 5|Last edited on May 5
BigCode, an open scientific collaboration led by Hugging Face and ServiceNow, has introduced StarCoder and StarCoderBase, two state-of-the-art large language models (LLMs) for code.
The project aims to empower machine learning and open source communities through open governance and the responsible development of code LLMs. With closed-source models like OpenAI's GPT-4 gaining traction, it is great to see continued focus on open source models, promoting greater accessibility and collaboration.
Open Sourced LLM
StarCoder and StarCoderBase are 15.5B-parameter models trained on The Stack (v1.2), a large collection of permissively licensed GitHub repositories covering 80+ programming languages. The Stack comes with inspection tools and an opt-out process, and code covered by opt-out requests is excluded. The models use Multi Query Attention, have a context window of 8,192 tokens (four times the 2,048 tokens of the original GPT-3), and were trained with the Fill-in-the-Middle objective on 1 trillion tokens from The Stack. Training StarCoder took 24 days on 512 NVIDIA A100 GPUs. Below are some results that examine performance on tasks such as natural language reasoning and programming.
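As a brief aside on the Fill-in-the-Middle objective mentioned above: the idea is to cut a document into a prefix, a middle, and a suffix, then rearrange the pieces so the model sees the prefix and suffix first and learns to generate the middle. The sketch below is only an illustration, not BigCode's training code. The sentinel strings follow the ones shown on the StarCoder model card, and the helper names (`split_for_fim`, `build_fim_prompt`, `to_fim_training_example`) are hypothetical.

```python
# Sentinel tokens as shown on the StarCoder model card (treated here as an assumption).
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def split_for_fim(document: str, start: int, end: int) -> tuple[str, str, str]:
    """Cut a document into (prefix, middle, suffix) at character offsets."""
    if not 0 <= start <= end <= len(document):
        raise ValueError("start and end must satisfy 0 <= start <= end <= len(document)")
    return document[:start], document[start:end], document[end:]


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Build an inference-time prompt: the model is expected to continue with the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def to_fim_training_example(document: str, start: int, end: int) -> str:
    """Rearrange a document so the middle span comes last (prefix, suffix, then middle)."""
    prefix, middle, suffix = split_for_fim(document, start, end)
    return build_fim_prompt(prefix, suffix) + middle


if __name__ == "__main__":
    source = "def add(a, b):\n    return a + b\n"
    # Hypothetical cut points: hide the function body's expression.
    start = source.index("a + b")
    end = start + len("a + b")
    print(to_fim_training_example(source, start, end))
```

At inference time only `build_fim_prompt` is needed: the editor supplies the code before and after the cursor, and the model's completion is taken as the missing middle.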
Natural Language Reasoning Results

HumanEval Results

HumanEval Dataset
The HumanEval dataset is used to evaluate the functional correctness of code generated by programming models. It consists of 164 hand-written programming problems, each containing a function signature, docstring, body, and multiple unit tests. The problems were written by hand because models are typically trained on a large portion of GitHub, which already contains solutions to many publicly available problems; freshly written tasks reduce the chance that a model has simply memorized the answer.
HumanEval focuses on assessing language comprehension, reasoning, algorithms, and simple mathematics, providing a benchmark for measuring a model's problem-solving capabilities.
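To make "functional correctness" concrete, here is a minimal sketch of how a HumanEval-style problem can be checked: the model's completion is appended to the prompt (signature plus docstring), and the result counts as correct only if the unit tests run without error. The toy task below is invented for illustration and is not one of the 164 problems. The `Task` class and `passes_tests` helper are hypothetical names, and the field layout (a `prompt`, a `test` string defining `check(candidate)`, and an `entry_point`) is assumed to mirror the published dataset.

```python
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str       # function signature and docstring shown to the model
    test: str         # source code defining check(candidate) with assertions
    entry_point: str  # name of the function under test


def passes_tests(task: Task, completion: str) -> bool:
    """Return True if prompt + completion passes every assertion in the tests.

    Warning: exec runs arbitrary code and does not guard against infinite loops.
    Real evaluations of model output should run in a sandbox with a timeout.
    """
    program = (
        task.prompt
        + completion
        + "\n"
        + task.test
        + f"\ncheck({task.entry_point})\n"
    )
    namespace: dict = {}
    try:
        exec(program, namespace)
    except Exception:
        # Any failed assertion, syntax error, or runtime error counts as a failure.
        return False
    return True


# A toy, invented task in the same shape as a HumanEval problem.
toy_task = Task(
    prompt='def add(a, b):\n    """Return the sum of a and b."""\n',
    test=(
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    entry_point="add",
)

if __name__ == "__main__":
    print(passes_tests(toy_task, "    return a + b\n"))
    print(passes_tests(toy_task, "    return a - b\n"))
```

Scoring a model then amounts to generating completions for each problem and counting how many pass their tests.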
An Open Sourced Future
As closed-source models become more prevalent, the StarCoder project highlights the importance of open governance and collaboration. By focusing on open source models, the BigCode project makes it easier for developers and researchers to build on this foundation and create new applications for the benefit of the entire community.
Tags: ML News