
Initializing Models with Larger Ones

A new technique for transferring the capabilities of large pretrained models to smaller ones
The paper "Initializing Models with Larger Ones" is a collaborative effort by the University of Pennsylvania, UC Berkeley, MBZUAI, and Meta AI Research. The method it introduces, called weight selection, initializes a smaller model with a subset of the weights of a larger pretrained model from the same family. This transfers knowledge from the larger model to the smaller one, with the goal of improving accuracy and reducing training time. The approach is particularly attractive in resource-constrained settings, where it offers a new way to leverage large pretrained models when training small ones.

Initialization is Critical

The paper highlights the importance of weight initialization in optimizing neural networks. Traditionally, Xavier and Kaiming initializations have been the standard choices for networks trained from scratch. With the advent of pretrained models, however, fine-tuning them has become the preferred starting point. The weight selection method proposed in this paper builds on that shift, using a pretrained model as the source of initialization for a smaller model within the same family.
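For reference, the classic initializations draw weights from data-independent distributions that depend only on layer shape. A minimal PyTorch sketch of that baseline (the helper name and the choice of layers to touch are illustrative, not from the paper):

import torch.nn as nn

def classic_init(model: nn.Module, mode: str = "kaiming"):
    # Data-independent, from-scratch initialization, for contrast with weight selection.
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            if mode == "xavier":
                nn.init.xavier_uniform_(m.weight)
            else:
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            if m.bias is not None:
                nn.init.zeros_(m.bias)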

Experiments were conducted using nine image classification datasets with varying scales, including ImageNet-1K, CIFAR-10, CIFAR-100, and others. The models used for these experiments were ViT-T/16 and ConvNeXt-F, with ImageNet-21K pretrained ViT-S/16 and ConvNeXt-T serving as their "teachers" respectively. This study aimed to showcase the effectiveness of the weight selection method compared to standard training practices.
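To make the mechanism concrete, here is a minimal PyTorch sketch of this style of initialization: keep the teacher's first layers (which happens implicitly when matching parameter names, since layers are numbered from zero) and uniformly subsample each weight tensor down to the student's shape. The function names, the evenly spaced index picking, and the skip list are my own simplifications rather than the paper's exact recipe:

import torch

def select_elements(teacher_w: torch.Tensor, target_shape) -> torch.Tensor:
    # Uniformly subsample the teacher tensor along every dimension
    # until it matches the student tensor's shape.
    w = teacher_w
    for dim, (t_size, s_size) in enumerate(zip(teacher_w.shape, target_shape)):
        idx = torch.linspace(0, t_size - 1, steps=s_size).round().long()
        w = w.index_select(dim, idx)
    return w

def weight_selection_init(student, teacher_state_dict, skip=("head.",)):
    # Assumes student and teacher come from the same model family, so parameter
    # names line up; a shallower student then draws from the teacher's first-N
    # layers automatically. Parameters listed in `skip` (e.g. a classifier head
    # for a different label space) keep the student's own initialization.
    new_sd = dict(student.state_dict())
    for name, s_param in new_sd.items():
        if name in teacher_state_dict and not any(name.startswith(p) for p in skip):
            new_sd[name] = select_elements(teacher_state_dict[name], s_param.shape)
    student.load_state_dict(new_sd)
    return student

In the paper's setting, the student would be ViT-T/16 or ConvNeXt-F, and teacher_state_dict would come from the ImageNet-21K pretrained ViT-S/16 or ConvNeXt-T, respectively.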

Results

Results from these experiments indicated that weight selection consistently improved test accuracy, with the largest gains on the smaller datasets. It also cut training time: the same accuracy could be reached in roughly one-third of the epochs needed when training from random initialization. Weight selection also outperformed classic initialization methods such as Xavier and Kaiming, and it worked well in conjunction with knowledge distillation.
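Since the larger model is available anyway, a weight-selection-initialized student can additionally be trained against it with a standard soft-target distillation loss. The sketch below is the generic Hinton-style formulation; the temperature and weighting values are illustrative defaults, not numbers from the paper:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Cross-entropy on the true labels plus KL divergence between
    # temperature-softened student and teacher predictions.
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kd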


Further analysis compared weight selection with transfer learning and found that weight selection reached comparable accuracy significantly faster, without needing access to the pretraining dataset. The study also evaluated different pretrained models as teachers, concluding that supervised pretraining produced the most effective ones. The layer-selection step of the method was examined as well, revealing that first-N layer selection outperforms uniform layer selection; the two rules are sketched below.
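Written as index rules, the two strategies differ only in which teacher layers the student copies from. The function names and the 12-to-9-layer example are illustrative:

def first_n_layers(teacher_depth: int, student_depth: int):
    # First-N selection: copy from the teacher's earliest layers.
    return list(range(student_depth))

def uniform_layers(teacher_depth: int, student_depth: int):
    # Uniform selection: copy from evenly spaced teacher layers.
    step = teacher_depth / student_depth
    return [int(i * step) for i in range(student_depth)]

# Hypothetical example with a 12-layer teacher and a 9-layer student:
# first_n_layers(12, 9) -> [0, 1, 2, 3, 4, 5, 6, 7, 8]
# uniform_layers(12, 9) -> [0, 1, 2, 4, 5, 6, 8, 9, 10]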
Comparisons with other approaches such as pruning and mimetic initialization further demonstrated the advantage of weight selection. The benefit persisted over longer training schedules and held up under stronger training recipes. Linear probing was used to assess the raw capability of the initialized model as a feature extractor, confirming that weight selection yields substantially better features than random initialization. The study also highlighted the importance of carrying over all components of the pretrained model for the best initialization.
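Linear probing itself is simple to reproduce: freeze the backbone and train only a linear classifier on its features, so accuracy reflects feature quality rather than further representation learning. A minimal sketch, assuming the backbone returns pooled feature vectors and with illustrative hyperparameters:

import torch
import torch.nn as nn

def linear_probe(backbone: nn.Module, feat_dim: int, num_classes: int, loader, epochs: int = 10, lr: float = 1e-3):
    # Freeze the backbone; only the linear head is trained.
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()

    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)

    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = backbone(images)  # assumed to return pooled feature vectors
            loss = nn.functional.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head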

Possible Idea

Although not mentioned in the paper, one could extrapolate a similar idea to larger models and envision combining layers from different pretrained models. For instance, consider pairing the MLP layers of a model like Mistral-7B with the self-attention layers of Llama-2, creating a hybrid that draws on the strengths of each component.
Such a blend might offer interesting capabilities, perhaps without extensive fine-tuning: by selectively combining layers from different pretrained models, one might harness their individual strengths or arrive at novel behavior. I am not aware of anyone trying this, so take the idea with a grain of salt.
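Purely as a thought experiment, the mechanical part of such a swap is easy to express with the Hugging Face transformers API, since both models expose their decoder layers as a module list and share a 4096-dimensional residual stream across 32 layers. This is untested speculation; the checkpoint names are examples, and there is no evidence the resulting hybrid is useful without further training:

from transformers import AutoModelForCausalLM

# Speculative, untested sketch of the layer-mixing idea above.
llama = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
mistral = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Keep Llama's self-attention, swap in Mistral's MLP block for every layer.
# The MLPs are shape-compatible at the layer boundary (same hidden size),
# but the two sets of weights were never trained to work together.
for llama_layer, mistral_layer in zip(llama.model.layers, mistral.model.layers):
    llama_layer.mlp = mistral_layer.mlp

llama.save_pretrained("hybrid-llama-attn-mistral-mlp")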
In conclusion, the paper presents weight selection as a promising method for efficiently training smaller models by leveraging the knowledge encapsulated in larger, pretrained models. This approach opens up new possibilities for enhancing the performance of smaller models, especially in settings where resources are limited.


The Paper: "Initializing Models with Larger Ones"