Bigger Models = Less Cost?
Researchers find a way to reduce CLIP training by 25x
Contrastive Language-Image Pre-training (CLIP) has shown considerable promise in many applications, thanks to its ability to learn from rich image-text pairings. However, the substantial training costs associated with CLIP have been a major obstacle to its wider application. In a bid to make the training of CLIP models more efficient and accessible, a team of researchers from UC Santa Cruz has been delving deep into CLIP training, leading to some impressive findings.
Previous Methods
Previously, efforts to make CLIP training more accessible have included the OpenCLIP project and the release of the LAION-400M and LAION-5B datasets. While these efforts have made valuable contributions, the training cost associated with CLIP remains high. For instance, replicating OpenCLIP-B/32's 62.9% zero-shot top-1 ImageNet accuracy requires a staggering 36 hours of training on 128 A100 GPUs. In response to this challenge, researchers have sought ways to increase training efficiency.
Image Masking
One strategy for improving efficiency is random masking, used in models like FLIP, which randomly drops a portion of the image tokens during training and forces the model to learn from incomplete inputs. The researchers also investigated grid masking and block masking, which preserve more of the image's structure by keeping regularly spaced or contiguous patches.

The key insight is that masked image modeling and CLIP have opposite objectives. In masked image modeling, the goal is to reconstruct missing information from a masked input, so strategies like random masking that minimize the retained information work well. CLIP training, by contrast, aims to extract as much information as possible from the input in order to discriminate between different samples. This insight led to an even more direct solution: resizing images to a lower resolution, which reduces the token count while retaining the full image content. This simple resizing strategy outperformed all of the masking strategies, underscoring the importance of preserving full input information during CLIP training.
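
To make the comparison concrete, here is a minimal PyTorch sketch (not the authors' code) contrasting random token masking with simple image resizing. The patch size, keep ratio, and resolutions are illustrative assumptions.

```python
# Illustrative sketch: two ways to shrink the number of image tokens a
# ViT-style CLIP image encoder must process.
import torch
import torch.nn.functional as F

def random_mask_tokens(patch_tokens, keep_ratio=0.5):
    """Randomly keep a fraction of patch tokens (FLIP-style random masking)."""
    B, N, D = patch_tokens.shape
    num_keep = max(1, int(N * keep_ratio))
    # Independently shuffle token indices per sample and keep the first num_keep.
    idx = torch.rand(B, N, device=patch_tokens.device).argsort(dim=1)[:, :num_keep]
    return torch.gather(patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))

def resize_images(images, target_size=112):
    """Downsample images so patchifying yields fewer tokens,
    while every pixel still contributes to the retained information."""
    return F.interpolate(images, size=(target_size, target_size),
                         mode="bilinear", align_corners=False)

# Example: 224x224 images with 16x16 patches give 196 tokens;
# resizing to 112x112 gives 49 tokens, a 4x reduction in sequence length.
images = torch.randn(8, 3, 224, 224)
tokens = torch.randn(8, 196, 768)          # stand-in for patch embeddings
masked = random_mask_tokens(tokens, 0.25)  # 49 tokens, but 75% of patches dropped
smaller = resize_images(images, 112)       # same 49-token budget, full image kept
print(masked.shape, smaller.shape)
```

Both paths arrive at the same token budget, but resizing keeps information from every part of the image, which is what matters for CLIP's contrastive objective.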

Text Masking
For text inputs, the team proposed syntax masking as an improvement over random masking. Syntax masking assigns different masking priorities to parts of speech, keeping nouns first, then adjectives, and then other words. This method showed impressive results, particularly when the maximum text token length was very short. Like image resizing, this approach reduces the input length while preserving the most informative content, which is essential for CLIP.
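
Below is a minimal word-level sketch of the idea using NLTK part-of-speech tags. The priority ordering follows the description above (nouns first, then adjectives, then everything else), but the tokenizer, tagger, and truncation details are illustrative assumptions; the actual method operates on CLIP's subword tokens.

```python
# Illustrative sketch of syntax-aware text masking: keep nouns first, then
# adjectives, then other words, until the token budget is reached.
import nltk

# One-time setup (uncomment if the resources are not yet downloaded):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def priority(tag):
    """Lower number = kept earlier. Nouns, then adjectives, then the rest."""
    if tag.startswith("NN"):
        return 0
    if tag.startswith("JJ"):
        return 1
    return 2

def syntax_mask(caption, max_tokens=8):
    words = nltk.word_tokenize(caption)
    tagged = nltk.pos_tag(words)
    # Rank by (priority, original position) so higher-priority words survive
    # truncation, then restore the original word order for the kept subset.
    ranked = sorted(range(len(tagged)), key=lambda i: (priority(tagged[i][1]), i))
    kept = sorted(ranked[:max_tokens])
    return [words[i] for i in kept]

print(syntax_mask("a small brown dog is happily running across the green park",
                  max_tokens=6))
# Nouns ("dog", "park") and adjectives ("small", "brown", "green") are kept
# before filler words when the budget runs out.
```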


Inverse Scaling
The main finding of this study is an "inverse scaling law": the larger the model, the shorter the image and text token sequences that can be used during training, with only a minor impact on performance. In other words, larger models tolerate more aggressive token reduction than their smaller counterparts, which points to a direct way to cut training cost.


Putting these pieces together, the researchers trained a larger model than previously tested while applying image resizing and syntax masking, achieving competitive performance with roughly a 25x reduction in training compute. The results are encouraging for researchers and practitioners interested in leveraging CLIP models. By demonstrating that larger models require fewer input tokens during training, and by identifying effective strategies for reducing token length, the researchers have forged a path towards more efficient and cost-effective CLIP training. This study will hopefully open doors for more researchers to utilize CLIP, broadening the scope of its applications and accelerating future innovation in multimodal models!
Tags: ML News