A New Definition of Open Source AI, with training data information included
Created on October 29|Last edited on October 29
The Open Source Initiative (OSI) has released version 1.0 of the Open Source AI Definition, a standard under which an AI system counts as open source only if it comes with detailed information about its training data, alongside its code and model parameters. The definition is meant to ensure that AI systems uphold the core principles of open-source software, such as transparency, reuse, and collaboration.
Under the new definition, an AI system must provide not only its code and model architecture but also enough information about its training data for a skilled person to build a substantially equivalent system. The goal is to make these systems truly open by enabling other developers to understand, modify, and rebuild the models independently. The requirement goes beyond existing open-source practice by asking for much finer-grained disclosure of how the data was obtained and processed.
Training Data as a Fundamental Requirement
The definition emphasizes that all components necessary for meaningful modification of machine-learning systems must be made available, including detailed information about training data. For example, developers must provide descriptions of the data sources, scope, labeling methods, and filtering processes. In cases where third-party data is used, instructions for obtaining it must be included, even if the data is not directly shareable. This ensures that others can replicate or build similar systems using equivalent datasets.
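As a rough illustration, the disclosure items listed above can be modeled as a small record with a completeness check. This is only a sketch: the `DataInformation` class, its field names, and the `missing_items` helper are hypothetical and are not part of the Open Source AI Definition or of any OSI tooling.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataInformation:
    """Hypothetical record of what a release discloses about its training data."""

    sources: list[str] = field(default_factory=list)
    scope: str = ""
    labeling_methods: str = ""
    filtering_processes: str = ""
    # Assumed shape: third-party dataset name -> instructions for obtaining it.
    third_party: dict[str, str] = field(default_factory=dict)


def missing_items(info: DataInformation) -> list[str]:
    """Return the disclosure items that are empty or absent."""
    missing = []
    if not info.sources:
        missing.append("sources")
    for name in ("scope", "labeling_methods", "filtering_processes"):
        if not getattr(info, name).strip():
            missing.append(name)
    # Third-party data need not be shared, but how to obtain it must be stated.
    for dataset, instructions in info.third_party.items():
        if not instructions.strip():
            missing.append(f"instructions for obtaining {dataset}")
    return missing


if __name__ == "__main__":
    example = DataInformation(
        sources=["public web crawl (made-up example)"],
        scope="English text",
        filtering_processes="deduplication and language filtering",
        third_party={"LicensedNewsCorpus": ""},  # made-up dataset name
    )
    print(missing_items(example))
```

In this sketch, a release that names a third-party dataset without saying how to obtain it is flagged just like one that omits its labeling methods, which mirrors the article's point that unshareable data still needs instructions.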
Code and Model Parameters Open for Collaboration
In addition to data information, the complete code required to train and run the system must be released under OSI-approved licenses. This includes code for data preprocessing, model training, hyperparameter tuning, and inference. The definition also covers model parameters, such as weights, which must be made available under OSI-approved terms; as an example, it suggests sharing checkpoints from key intermediate stages of training to facilitate reuse and improvement.
By opening up all these elements, the Open Source AI Definition aims to foster an ecosystem where modifications and innovations are easily shared, aligning with the principles of collaborative improvement and transparency. This requirement also reflects a shift toward making AI development more inclusive and reproducible.
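The requirements in this section can be read as a release checklist. The sketch below is one hypothetical way to encode it: the `Release` class, the `review` function, and the part names are assumptions made for illustration. The license set is a deliberately small subset of licenses known to be OSI-approved, not the OSI's full list.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Illustrative subset only; the OSI maintains the authoritative license list.
ASSUMED_APPROVED_LICENSES = {"Apache-2.0", "MIT", "BSD-3-Clause"}

# Code categories named in the article (identifiers are hypothetical).
REQUIRED_CODE_PARTS = (
    "data_preprocessing",
    "model_training",
    "hyperparameter_tuning",
    "inference",
)


@dataclass
class Release:
    """Hypothetical summary of what an AI system release contains."""

    code_license: str
    code_parts: set[str] = field(default_factory=set)
    has_data_information: bool = False
    has_parameters: bool = False


def review(release: Release) -> list[str]:
    """List the checklist items the release fails; an empty list means none."""
    problems = []
    if release.code_license not in ASSUMED_APPROVED_LICENSES:
        problems.append(f"code license not in assumed set: {release.code_license}")
    for part in REQUIRED_CODE_PARTS:
        if part not in release.code_parts:
            problems.append(f"missing code: {part}")
    if not release.has_data_information:
        problems.append("missing data information")
    if not release.has_parameters:
        problems.append("missing model parameters")
    return problems
```

A weights-only release, for example, would fail on every code category and on data information, which is the gap the definition is meant to close.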
Legal Flexibility and Future Implications
The Open Source AI Definition does not require a specific legal mechanism for ensuring that model parameters are freely available: they may be free by their nature, or a license or other legal instrument may be needed. The OSI expects this to become clearer over time, as the legal system has more chances to address open-source AI. The initiative signals a growing recognition that AI models should not be considered fully open source unless they come with detailed training data information, source code, and parameters.
By integrating data transparency into its core principles, the definition sets a precedent for open AI systems: both the models and the information about the data behind them must be open to inspection. The effort marks a significant step in extending open-source principles beyond software into the realm of AI, enabling innovation without the constraints of proprietary systems.
Tags: ML News