CoCa CLIP loss conditioning

Created on March 20 | Last edited on April 10

[Figure: six training-curve panels plotted against step (0 to 140k); panel titles were not recovered. Run set: 5 runs.]


Results

The model was trained from epoch 60 to epoch 76 with CLIP similarity conditioning. Here is how it compares to the normal epoch 76 model:

Normal zero-shot ImageNet:
imagenet-zeroshot-val-top1: 0.7182
imagenet-zeroshot-val-top5: 0.9271
improves to
imagenet-zeroshot-val-top1: 0.7234
imagenet-zeroshot-val-top5: 0.9284
(a gain of 0.0052 in top-1 and 0.0013 in top-5)

MSCOCO caption generation:
{"dataset": "mscoco_captions", "model": "coca_ViT-L-14", "pretrained": "/fsx/iejmac/open_clip_dev/open_clip/src/logs/adpt_coca/epoch_76.pt", "task": "mscoco_generative", "metrics": {"Bleu_1": 0.3084563745235632, "Bleu_2": 0.1880991219962997, "Bleu_3": 0.1151609364894224, "Bleu_4": 0.07211067085076828, "METEOR": 0.1212304498145956, "ROUGE_L": 0.2605558512821119, "CIDEr": 0.34406495909559154, "SPICE": 0.09105146670053557}, "language": "en"}
improves to
{"dataset": "mscoco_captions", "model": "coca_ViT-L-14", "pretrained": "/fsx/iejmac/open_clip_dev/open_clip/src/logs/adpt_coca/adpt_epoch_76.pt", "task": "mscoco_generative", "metrics": {"Bleu_1": 0.3239379083831527, "Bleu_2": 0.1973981451426982, "Bleu_3": 0.12121751751996586, "Bleu_4": 0.07490514239710015, "METEOR": 0.12607178324939913, "ROUGE_L": 0.2671355955745214, "CIDEr": 0.3552257461803831, "SPICE": 0.09305231270489032}, "language": "en"}