Git-Theta: Git for ML Models
Meet Git-Theta, a tool that lets teams version and merge ML models much the way we merge code on GitHub, and a significant step forward for ML model version control.
Machine learning has dramatically reshaped how we analyze data and solve problems, launching us into a new world of computational intelligence. Yet managing and storing machine learning models with traditional methods has specific drawbacks. Contemporary tools such as DVC can track versions of a machine learning model, but they treat the model's checkpoint (essentially, its saved parameter values) as a single, undifferentiated large file.
Any modification to the model's parameters therefore incurs the same communication and storage costs as a complete model overhaul, which gets in the way of efficient model management, especially for collaboratively developed models. Here we introduce new research on a project called Git-Theta, which applies the fundamental principles of Git to greatly improve ML model version control.
What's Git?
Git has played a significant role in the evolution of software development over the past few decades, turning it from an unstructured practice into a systematic discipline that helps ensure software is implemented efficiently and correctly. It is a version control system that offers detailed tracking of changes, enabling collaborative software development by seamlessly integrating changes from multiple programmers. Git's strength lies in its ability to avoid unnecessary storage, work efficiently, and provide a robust system for managing codebases.
The Breakthrough
Unlike traditional files, which are revised incrementally, machine learning models have typically been replaced rather than updated. This is where recent research from organizations such as IBM has begun to change how we think about ML workflows and version control: it has been discovered that machine learning models can be 'fused', essentially combining parameters from multiple models without requiring retraining.
Amazingly, these fused models have demonstrated improved performance when compared to their unfused counterparts. Git-Theta leverages these insights, applying the principles of Git to the storage and management of machine learning models, which could lead to significant advancements in their collaborative and ongoing development.
Parameter Tracking
Git-Theta circumvents the issues associated with using Git or Git Large File Storage (LFS) to manage machine learning models by offering a tool specialized for tracking these models rather than simply replacing them. It taps into the inherent structure of model checkpoints to address the limitations of traditional Git and Git LFS.
By considering the parameter groups that make up a checkpoint (such as weight matrices or bias vectors in neural networks), Git-Theta substantially increases storage efficiency. It applies the concept of snapshots at the parameter group level, only storing new values when a change is detected in a group. Unchanged values are referenced to their prior versions, cutting down storage usage.
In instances where two checkpoints differ, Git-Theta offers much more useful diff information, indicating which parameter groups have actually been altered. Furthermore, Git-Theta can handle multiple different histories (for instance, from different contributors) efficiently, ignoring parameter groups that are equivalent across histories and applying user-specified merge operations to those that are different.
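To make the idea concrete, here is a minimal Python sketch of group-level snapshotting. The function and storage format below are hypothetical illustrations, not Git-Theta's actual API or on-disk layout: each parameter group is hashed, modified groups are stored in full, and unchanged groups are recorded as references to the version stored previously.

```python
import hashlib

import numpy as np

def group_digest(array: np.ndarray) -> str:
    """Content hash of a single parameter group (illustration only)."""
    return hashlib.sha1(array.tobytes()).hexdigest()

def snapshot(checkpoint, previous=None):
    """Store only the parameter groups whose contents changed.

    `checkpoint` maps group names (e.g. 'dense.weight') to arrays; `previous`
    is the prior snapshot. Unchanged groups become references to the version
    already stored, so their values are not duplicated.
    """
    previous = previous or {}
    new_snapshot = {}
    for name, values in checkpoint.items():
        digest = group_digest(values)
        prior = previous.get(name)
        if prior is not None and prior["digest"] == digest:
            # Unchanged group: keep a pointer to the previously stored payload.
            new_snapshot[name] = {"digest": digest, "ref": prior.get("ref", name)}
        else:
            # New or modified group: store its values.
            new_snapshot[name] = {"digest": digest, "payload": values.copy()}
    return new_snapshot

# Example: between two "commits", only the bias vector changes.
ckpt_v1 = {"dense.weight": np.ones((4, 4)), "dense.bias": np.zeros(4)}
snap_v1 = snapshot(ckpt_v1)

ckpt_v2 = {"dense.weight": np.ones((4, 4)), "dense.bias": np.full(4, 0.1)}
snap_v2 = snapshot(ckpt_v2, previous=snap_v1)

print("payload" in snap_v2["dense.weight"])  # False: stored as a reference
print("payload" in snap_v2["dense.bias"])    # True: new values stored
```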
Hashing?
Traditional Git employs a technique called hashing to manage files efficiently. It uses a hash function, specifically SHA-1, to generate a unique identifier, or hash, for each file or directory it oversees. When Git checks for changes in a repository, it compares these hash values instead of the actual file content.
This comparison makes the process efficient as it quickly identifies modifications between versions. The hashing technique also ensures data integrity, as any corruption or alteration in the file results in a different hash, indicating that the content is not in its original state.
However, this hashing approach does not work well for comparing machine learning models. In the context of ML models, parameters can vary slightly for reasons like numerical noise from different linear algebra implementations or minute differences introduced by different machines or library versions.
Such tiny changes, while not typically affecting ML results, can cause hash mismatches when using conventional hashing methods. This means that traditional Git would detect these as changes, leading to inefficient storage and potential confusion.
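This failure mode is easy to reproduce with ordinary exact hashing. In the short Python example below (purely illustrative), two parameter arrays that are numerically equivalent for ML purposes still produce entirely different SHA-1 digests:

```python
import hashlib

import numpy as np

weights = np.array([0.1, 0.2, 0.3], dtype=np.float64)
# The "same" parameters with a tiny amount of floating-point noise, as might
# come from a different BLAS library, machine, or framework version.
weights_noisy = weights + 1e-12

print(np.allclose(weights, weights_noisy))                # True: equivalent for ML purposes
print(hashlib.sha1(weights.tobytes()).hexdigest())        # one digest...
print(hashlib.sha1(weights_noisy.tobytes()).hexdigest())  # ...a completely different digest
```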
LSH Hashing
To tackle this challenge, Git-Theta employs a specialized form of hashing called Locality Sensitive Hashing (LSH). LSH hashes similar items to the same value, providing probabilistic bounds on false positives. It allows Git-Theta to determine if a parameter group in a model has changed without loading both the current and previous versions into memory.
This ensures that tiny changes due to noise don't result in hash mismatches, making it a more suitable and efficient approach to track and manage changes in machine learning models.
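As a rough illustration of the idea, the sketch below implements a generic Euclidean-style LSH (random projections quantized into buckets). This is not Git-Theta's exact scheme, but it shows how vectors that differ only by tiny numerical noise can still receive the same hash, while genuinely different parameters do not:

```python
import numpy as np

def lsh_signature(params, n_hashes=8, bucket_width=1e-3, seed=0):
    """Euclidean-style LSH: project onto random directions, then quantize into buckets.

    Nearby parameter vectors fall into the same buckets with high probability,
    so tiny numerical noise usually leaves the signature unchanged.
    """
    rng = np.random.default_rng(seed)  # fixed seed so hashes are reproducible across runs
    flat = np.ravel(params)
    projections = rng.standard_normal((n_hashes, flat.size)) @ flat
    offsets = rng.uniform(0.0, bucket_width, size=n_hashes)
    return tuple(np.floor((projections + offsets) / bucket_width).astype(int))

weights = np.random.default_rng(1).standard_normal(1024)
noisy = weights + 1e-9  # numerical noise from, say, a different backend

print(lsh_signature(weights) == lsh_signature(noisy))        # True with high probability
print(lsh_signature(weights) == lsh_signature(weights * 2))  # almost certainly False
```

In this sketch, the bucket width plays the role of a numerical tolerance: differences much smaller than it are very unlikely to change the signature, while substantive changes almost always do.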
Merging Models
Merging, in the context of Git-Theta, refers to the process of reconciling two branches that have diverged. Typically, in a traditional Git workflow, this involves combining code changes made on two different branches. However, with Git-Theta, the branches not only contain code, but also machine learning models with unique parameter sets. Merging, then, can become a complex task as it may involve intelligently combining these parameter sets.

Merging the RTE and ANLI branches increases performance on the RTE dataset!
To facilitate merging, Git-Theta supports Merge plug-ins. These plug-ins represent different strategies for combining two versions of the same parameter group from different branches. For instance, when two branches have modifications to the same parameter group, the Git-Theta merge driver builds a dynamic menu of merge strategies based on the available Merge plug-ins.
Each Merge plug-in provides a summary of its operations, a keyword to select its strategy, and the kind of merge conflicts it can resolve. This allows the merge driver to compile a menu with only the relevant plug-ins.
Git-Theta supports a variety of Merge plug-ins that resolve conflicts in different ways. Some plug-ins resolve conflicts by using changes from the current branch, changes from the other branch, or changes from their common ancestor. There's also an option to resolve conflicts by averaging the parameters from each branch.
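To illustrate what these strategies amount to, here is a small Python sketch. The function name and interface are hypothetical, not the actual plug-in API, but the logic mirrors the options above: keep the current branch's values, keep the other branch's, fall back to the common ancestor, or average the two.

```python
import numpy as np

def merge_parameter_group(ours, theirs, ancestor, strategy="average"):
    """Resolve a conflicting parameter group using a chosen strategy.

    Hypothetical helper mirroring the strategies described above; not the
    actual Git-Theta plug-in interface.
    """
    if strategy == "ours":      # keep the current branch's values
        return ours
    if strategy == "theirs":    # keep the other branch's values
        return theirs
    if strategy == "ancestor":  # fall back to the common ancestor
        return ancestor
    if strategy == "average":   # element-wise mean of the two branches
        return (ours + theirs) / 2.0
    raise ValueError(f"unknown merge strategy: {strategy}")

# Example: two branches fine-tuned the same weight matrix in different places.
ancestor = np.zeros((2, 2))
ours = ancestor + np.array([[0.2, 0.0], [0.0, 0.0]])
theirs = ancestor + np.array([[0.0, 0.0], [0.0, 0.4]])

print(merge_parameter_group(ours, theirs, ancestor, strategy="average"))
# [[0.1 0. ]
#  [0.  0.2]]
```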
Overall
Overall, the design and implementation of Git-Theta leverage the power of traditional Git while addressing the specific challenges posed by machine learning models. By introducing efficient parameter hashing, intelligent merging strategies, and a plug-in system for customizability, Git-Theta makes model versioning way more efficient and flexible!