ViT-g-14
For our EVA-CLIP, we initialize the vision encoder with the MIM pre-trained EVA weights and the language encoder with the weights of OpenAI CLIP-L.
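As a rough illustration of this two-tower initialization, the sketch below loads MIM pre-trained EVA weights into the vision tower and OpenAI CLIP-L weights into the text tower. The checkpoint paths, the `visual`/`text` attribute names, and the surrounding model object are assumptions for illustration, not the exact EVA-CLIP code.

import torch
import torch.nn as nn

# Hypothetical checkpoint paths -- the report does not name the actual files.
EVA_MIM_CKPT = "eva_mim_pretrained.pt"   # MIM pre-trained EVA vision weights
CLIP_L_CKPT = "openai_clip_l_text.pt"    # OpenAI CLIP-L text-tower weights


def init_two_towers(model: nn.Module) -> None:
    """Initialize the vision tower from EVA (MIM) and the text tower from CLIP-L.

    Assumes `model.visual` and `model.text` submodules whose parameter names
    line up with the checkpoints; mismatched keys are skipped.
    """
    eva_state = torch.load(EVA_MIM_CKPT, map_location="cpu")
    clip_text_state = torch.load(CLIP_L_CKPT, map_location="cpu")

    # strict=False: MIM decoder heads or CLIP projection layers without a
    # counterpart in the target towers are simply ignored.
    model.visual.load_state_dict(eva_state, strict=False)
    model.text.load_state_dict(clip_text_state, strict=False)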
Unlike OpenCLIP, we find that the fp16 format with dynamic loss scaling is stable enough throughout the whole course of training, so the bfloat16 format is unnecessary.
These modifications allow us to train a 1.1B CLIP with a batch size of 41k on 256 NVIDIA A100 40GB GPUs.
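Below is a minimal sketch of such an fp16 training step with dynamic loss scaling via PyTorch AMP; `model`, `optimizer`, `clip_loss`, and `loader` are assumed to exist, and the forward signature (image features, text features, logit scale) mirrors open_clip's convention rather than being taken from the EVA-CLIP trainer. For scale, a global batch of ~41k over 256 GPUs corresponds to roughly 160 samples per GPU.

import torch

# Dynamic loss scaling: the scale factor grows when steps succeed and is cut
# back when inf/nan gradients appear, keeping pure fp16 training stable.
scaler = torch.cuda.amp.GradScaler()

for images, texts in loader:                      # assumed dataloader
    images, texts = images.cuda(), texts.cuda()
    optimizer.zero_grad(set_to_none=True)

    # Run the forward pass in fp16 rather than bf16.
    with torch.cuda.amp.autocast(dtype=torch.float16):
        image_features, text_features, logit_scale = model(images, texts)
        loss = clip_loss(image_features, text_features, logit_scale)

    scaler.scale(loss).backward()   # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)          # unscale grads; skip the step on inf/nan
    scaler.update()                 # adjust the scale factor dynamically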
Model config:
{
  "embed_dim": 1024,
  "vision_cfg": {
    "image_size": 224,
    "layers": 40,
    "width": 1408,
    "head_width": 88,
    "mlp_ratio": 4.3637,
    "patch_size": 14,
    "drop_path_rate": 0.4
  },
  "text_cfg": {
    "context_length": 77,
    "vocab_size": 49408,
    "width": 768,
    "heads": 12,
    "layers": 12
  }
}
Eval metrics
Run set: 9 runs
Training metrics
Run set: 9 runs