
Learning From SantaCoder: Experiments In Creating Open-Source Code Generation Models

BigCode's SantaCoder model gives us more than just a shiny new toy - researchers detail all the steps and experimentation it took to create a small yet competitive large language model for code generation.
Created on December 27 | Last edited on December 28
Leading up to Christmas weekend, BigCode brought out Santa early with the release of SantaCoder, a new open-source, multilingual large language model for code generation. Its creation involved a great deal of experimentation, and the final model performs similarly to or better than other code generation models while staying at a comparatively small 1.1 billion parameters.

Before I go into the details: if you want to play with SantaCoder, you can do that here.

Creating SantaCoder

The goal of SantaCoder goes beyond just making a new code LLM; the researchers wanted to make sure that every step of the process was described in detail so that other researchers can build on top of their work. They explored many different options during the model creation process in search of what works best when training code LLMs.
SantaCoder's paper, currently in preprint and available for download at this link, provides all the details you would need to replicate their research. Researchers from various institutions, companies, and even independents contributed to SantaCoder's creation.
If you take a look at SantaCoder's Hugging Face page, you'll see a list of 8 different models to choose from. While the final model is the strongest one, the other models represent the steps toward its creation. The researchers tried many different things in determining the best way to create a code LLM, and those models are the result.

Preparing the base dataset

Each model uses a differently filtered dataset, but they all share the same base. The dataset used for SantaCoder is a subset of The Stack, a large dataset of public GitHub code curated by BigCode. Version 1.1 was used, which incorporates the first wave of opt-out code removal requests.
The subset includes only Python, Java, and JavaScript code. It was run through a thorough redaction process (described in detail in section 4 of the paper) to remove any personally identifiable information, such as emails, IP addresses, and secret keys.
In the end, the final base dataset includes 268 GB of Python, Java, and JavaScript files.
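The paper describes the exact redaction pipeline; as a rough illustration of the idea, a regex-based pass over a file might look like the sketch below. The patterns and placeholder strings here are simplified assumptions, not the project's actual rules.

```python
import re

# Simplified, illustrative patterns -- the pipeline described in the paper is
# more involved (it also detects secret keys, for example).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact_pii(source: str) -> str:
    """Replace emails and IPv4 addresses with placeholder strings."""
    source = EMAIL_RE.sub("<EMAIL>", source)
    source = IPV4_RE.sub("<IP_ADDRESS>", source)
    return source

print(redact_pii("# maintained by jane.doe@example.com, host 192.168.0.1"))
# -> # maintained by <EMAIL>, host <IP_ADDRESS>
```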

Experimenting with dataset preprocessing

One idea for improving the code LLM training process involved preprocessing, or filtering, the training data in some way. The researchers explored four different options that would be applied to the base dataset before training:
  • GitHub stars. Filter to only include code from GitHub repositories that have at least 5 stars. This selects for popularity and, therefore, potentially higher code quality. A 5-star minimum cuts out around 60% of the data.
  • Comment-to-code ratio. Good documentation is a strong indicator of high-quality code, so filtering by comment-to-code ratio should return higher-quality files. Keeping files whose comment ratio falls between 1% and 80% cuts out around 20% of the data.
  • Stronger near-deduplication. While The Stack already went through a process of deduplication (removing duplicate content), more aggressive filtering could improve performance. Deduplication is tricky and can be prone to false positives, but in the end, the researchers were able to cut out around 16%-20% of the data.
  • Tokenizer fertility. Language models consume data as tokens, which break text into chunks the model can process. Files with a low character-to-token ratio tended to indicate lower-quality code, so removing low-ratio files could cut out bad code. Setting a cutoff of around 2.5-2.9 characters per token cut out around 4%-5% of the data (see the sketch after this list).
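To make the last two filters concrete, here is a minimal sketch of how a comment-ratio check and a characters-per-token check might be applied to a single file. The tokenizer choice and the exact thresholds are assumptions for illustration; the paper tunes them per language.

```python
from transformers import AutoTokenizer

# Using the released SantaCoder tokenizer here is an assumption for
# illustration, not a statement about the original training setup.
tokenizer = AutoTokenizer.from_pretrained("bigcode/santacoder")

def comment_ratio(source: str) -> float:
    """Fraction of characters on comment lines (Python-style '#' only, for brevity)."""
    lines = source.splitlines()
    comment_chars = sum(len(line) for line in lines if line.lstrip().startswith("#"))
    total_chars = sum(len(line) for line in lines) or 1
    return comment_chars / total_chars

def chars_per_token(source: str) -> float:
    """Characters per token; very low values tend to flag unusual, low-quality files."""
    num_tokens = len(tokenizer(source)["input_ids"]) or 1
    return len(source) / num_tokens

def keep_file(source: str) -> bool:
    # Illustrative thresholds in the spirit of the comment and fertility filters.
    return 0.01 <= comment_ratio(source) <= 0.80 and chars_per_token(source) >= 2.5
```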


Experimenting with architecture

Two ideas were considered when deciding on architecture options for SantaCoder:
  • Fill-in-the-middle (FIM). This technique rearranges input tokens so the model learns to infill code snippets without harming its regular left-to-right generation ability (see the sketch after this list).
  • Multi-query attention (MQA) vs. multi-head attention (MHA). MHA is standard for transformer models, but MQA changes things up a little by sharing the key and value projections across all attention heads, lowering memory bandwidth and speeding up inference. It also lowers the parameter count from the 1.3 billion of the MHA implementation to 1.1 billion.
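As a rough sketch of what the fill-in-the-middle transformation does, a training document is split at two random points and re-ordered with sentinel tokens so that the middle chunk comes last. The sentinel spellings below follow SantaCoder's tokenizer, but treat the exact strings as an assumption; other FIM models differ slightly.

```python
import random

# Sentinel strings; treat the exact spellings as an assumption.
FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX = "<fim-prefix>", "<fim-middle>", "<fim-suffix>"

def fim_transform(document: str, rng: random.Random) -> str:
    """Split a document into prefix/middle/suffix and reorder it so the model
    learns to generate the middle conditioned on both surrounding contexts."""
    a, b = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:a], document[a:b], document[b:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

print(fim_transform("def add(a, b):\n    return a + b\n", random.Random(0)))
```

At inference time, the prompt is just the known prefix and suffix wrapped in the same sentinels, ending with the middle sentinel; the model's continuation is the infilled code.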

Learning from SantaCoder

Evaluating SantaCoder's variations

SantaCoder's various models were evaluated on two tasks: left-to-right generation and fill-in-the-middle infilling.
Left-to-right evaluates whether a model can complete a useful code snippet given a prompt. In SantaCoder's case, the prompt is in the form of the start of a function with parameters and descriptive comments.
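A prompt in that style looks roughly like the stub below (an illustrative stand-in, not an actual benchmark problem); the model's completion is then run against hidden unit tests.

```python
def running_maximum(numbers: list[int]) -> list[int]:
    """Return a list whose i-th element is the maximum of numbers[: i + 1].

    >>> running_maximum([1, 3, 2, 5])
    [1, 3, 3, 5]
    """
    # The model is asked to generate the function body from this point on.
```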
In the graph below, models trained on each filtered dataset are compared against each other on the HumanEval benchmark. These models all implement MQA and FIM from the architecture section above. Performance is reported as the probability that a model passes a given test within 100 attempts (pass@100).
You can see that, surprisingly, the GitHub-stars-filtered dataset causes a dramatic loss in performance, while the comment-filtered and deduplicated datasets give a good boost. The final model implements both the comment-ratio and deduplication preprocessing steps.
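Those percentages are typically computed with the unbiased pass@k estimator from the original HumanEval paper; assuming SantaCoder's evaluation follows the same convention, it looks like this:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that at
    least one of k samples passes, given that c of the n samples drawn passed."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. if 37 of 200 sampled completions passed the tests, estimate pass@100:
print(pass_at_k(n=200, c=37, k=100))
```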

Fill-in-the-middle evaluates whether a model can substitute a masked section of code with generated code to create a working function.
Here, similar to above, each model is compared against the others. Again, the GitHub stars filter ruins performance. The final model has more middling results this time around.


Comparing to other models

SantaCoder's final model was compared against a few other code generation models on the left-to-right and fill-in-the-middle benchmarks.

In this table, the first thing to notice is SantaCoder's parameter count compared to the other models. Performance values in the table are the probability that a model passes any given test.
For left-to-right evaluation, models were given 100 chances to generate a working solution for each prompt. For fill-in-the-middle, models had to exactly match the masked lines in example code snippets from the HumanEval dataset.
SantaCoder beats the other open-source code LLMs on nearly all metrics, aside from the expected losses against CodeGen-mono and Codex, which appear only on the Python tests. It does this while being less than half the size of the smallest model it was compared against.

Try SantaCoder for yourself

SantaCoder is open source, and you can check the Hugging Face repository for more information. Check out the demo app for a quick way to play with it.
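If you would rather run it locally, a minimal sketch with the transformers library might look like the following (assuming the bigcode/santacoder checkpoint, which ships custom model code and therefore needs trust_remote_code=True):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/santacoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

# Left-to-right completion of a simple function stub.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0]))
```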
Be sure to read SantaCoder's paper for all the details.
