The Stack: BigCode's New 3 TB Dataset Of Permissively Licensed Code

BigCode has revealed its first work today: a new 3 TB dataset of permissively licensed code scraped from GitHub, covering around 30 languages.

Teli Davies

Created on October 27|Last edited on October 27

BigCode, a project first announced last month as a collaboration between Hugging Face and ServiceNow that follows in the footsteps of BigScience and its BLOOM large language model, announced its first work today: a new large dataset of permissively licensed code scraped from GitHub.
﻿
The Stack
To create The Stack, the team used GH Archive to collect code files from publicly archived GitHub repositories. Over 92 TB of data was collected in the initial haul, but was whittled down to 3 TB after filtering for target extensions and licensing requirements.
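To make the filtering step concrete, here is a minimal sketch of keeping only files with a target extension and a permissive license. The extension and license lists, the `SourceFile` record, and the `keep` helper are all hypothetical names chosen for illustration. The article does not list the actual criteria BigCode used.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List, Optional

# Hypothetical allowlists for illustration only; the real lists used for
# The Stack are not given in this article.
TARGET_EXTENSIONS = {".py", ".cs", ".js"}
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}


@dataclass
class SourceFile:
    path: str                    # file path inside its repository
    repo_license: Optional[str]  # license identifier, or None if unknown


def keep(file: SourceFile) -> bool:
    """Return True if the file has a target extension and a permissive license."""
    extension = PurePosixPath(file.path).suffix.lower()
    license_id = (file.repo_license or "").lower()
    return extension in TARGET_EXTENSIONS and license_id in PERMISSIVE_LICENSES


def filter_files(files: List[SourceFile]) -> List[SourceFile]:
    """Keep only the files that pass both checks."""
    return [f for f in files if keep(f)]


candidates = [
    SourceFile("src/app.py", "MIT"),
    SourceFile("src/app.py", "GPL-3.0"),      # not on the permissive allowlist
    SourceFile("docs/logo.png", "MIT"),       # not a target extension
    SourceFile("lib/util.js", None),          # unknown license is dropped
    SourceFile("lib/Program.CS", "Apache-2.0"),
]
kept = filter_files(candidates)
print([f.path for f in kept])
```

In this sketch, a file with an unknown license is dropped rather than kept, which is the conservative choice when the goal is a permissively licensed corpus.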
The 3 TB dataset includes around 30 languages in total, including many popular ones like Python, C#, and JavaScript. However, some notable languages, such as Kotlin and Swift, appear to be left out. A near-deduplicated version, 1.5 TB in size, was also created. Finally, a small subset with 10,000 samples for each language present in the original dataset is available, coming in at only 2.6 GB total.
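For readers unfamiliar with near-deduplication, the idea is to drop files that are almost, but not exactly, identical to one already kept. The sketch below illustrates it with exact Jaccard similarity over token shingles. The method, the shingle size `k`, and the `threshold` value are assumptions made for illustration; the article does not describe how BigCode built the near-deduplicated version, and pipelines at this scale typically rely on approximate techniques rather than pairwise comparison.

```python
import re
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Split text into word tokens and return the set of k-token windows."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_deduplicate(docs: List[str], threshold: float = 0.7) -> List[str]:
    """Greedily keep a document only if it is not too similar to any kept one.

    The threshold is an illustrative assumption, not a value from the article.
    """
    kept: List[str] = []
    kept_shingles: List[Set[Shingle]] = []
    for doc in docs:
        current = shingles(doc)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(doc)
            kept_shingles.append(current)
    return kept


docs = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a + b  # sum\n",  # near-duplicate of the first
    "def mul(a, b):\n    return a * b\n",
]
unique = near_deduplicate(docs)
print(len(unique))
```

Here the second file differs from the first only by a trailing comment, so it is dropped, while the third file shares too little with the first to count as a near-duplicate. The pairwise comparison is quadratic in the number of kept files, so it is only practical for small collections.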
The BigCode team plans to keep updating the dataset to include new languages and make other improvements. Check here if you'd like to know more about joining the effort. If your code has been added to the dataset and you would like it removed or exempted from future updates, you can find more information on this page.
Find out more
Learn more about The Stack at its project page by clicking here.
Read the full research paper associated with The Stack's creation by clicking here.
Visit BigCode's Hugging Face and GitHub organizations to follow their work.

Tags: ML News
