ViT-g-14

For our EVA-CLIP, we initialize the vision encoder from the MIM pre-trained EVA and the language encoder from OpenAI CLIP-L.
The pre-training implementation is based on Open CLIP.
We also adopt the DeepSpeed optimization library with the ZeRO stage-1 optimizer to save memory.
Unlike Open CLIP, we find that the fp16 format with dynamic loss scaling is stable throughout the whole course of training, so the bfloat16 format is unnecessary.
These modifications allow us to train a 1.1B-parameter CLIP with a batch size of 41k on 256 NVIDIA A100 40GB GPUs.
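As a rough illustration of this memory-saving setup, here is a minimal DeepSpeed sketch with ZeRO stage-1 and fp16 dynamic loss scaling. The config values, the per-GPU batch size, and the tiny stand-in module are assumptions for illustration only, not the exact training code.

```python
import deepspeed
import torch
import torch.nn as nn

# Stand-in for the 1.1B CLIP model; in the real run this is the EVA vision
# tower plus the CLIP-L text tower built through the Open CLIP codebase.
model = nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# ZeRO stage-1 shards optimizer states across data-parallel ranks;
# "loss_scale": 0 enables dynamic loss scaling for fp16 training.
ds_config = {
    # Assumption: 160 per GPU x 256 GPUs = 40,960, i.e. the ~41k global batch.
    "train_micro_batch_size_per_gpu": 160,
    "zero_optimization": {"stage": 1},
    "fp16": {"enabled": True, "loss_scale": 0, "initial_scale_power": 16},
    "gradient_clipping": 1.0,
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config)

# Inside the training loop, DeepSpeed replaces loss.backward()/optimizer.step():
#   model_engine.backward(loss)
#   model_engine.step()
```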

One epoch corresponds to 1/10 of the total samples in LAION-400M.
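That is roughly 40M image-text pairs per epoch; at the ~41k global batch size above, one epoch is on the order of 40M / 41k ≈ 1,000 optimizer steps.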
model config:
{
  "embed_dim": 1024,
  "vision_cfg": {
    "image_size": 224,
    "layers": 40,
    "width": 1408,
    "head_width": 88,
    "mlp_ratio": 4.3637,
    "patch_size": 14,
    "drop_path_rate": 0.4
  },
  "text_cfg": {
    "context_length": 77,
    "vocab_size": 49408,
    "width": 768,
    "heads": 12,
    "layers": 12
  }
}
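
From this config, the vision tower has 1408 / 88 = 16 attention heads, an MLP hidden size of roughly 1408 × 4.3637 ≈ 6144, and (224 / 14)² = 256 patch tokens. Below is a hedged sketch of loading an EVA-CLIP g/14 checkpoint through Open CLIP for zero-shot classification; the registry name "EVA01-g-14", the pretrained tag, and "example.jpg" are assumptions that depend on the installed open_clip version and your data.

```python
import torch
import open_clip
from PIL import Image

# Assumed registry names; check open_clip.list_pretrained() for the exact
# EVA-CLIP g/14 entry available in your open_clip version.
model_name, pretrained_tag = "EVA01-g-14", "laion400m_s11b_b41k"

model, _, preprocess = open_clip.create_model_and_transforms(
    model_name, pretrained=pretrained_tag)
tokenizer = open_clip.get_tokenizer(model_name)
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical image
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize embeddings and compute zero-shot probabilities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)
```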

Eval metrics
[W&B charts for the run set (9 runs)]

Training metrics
[W&B charts for the run set (9 runs)]