
Dreamix, Amazon AI Bedrock, X.AI, AI21 Jurassic-1 Models, Grounded SAM, SEEM, & More

Created on April 18 | Last edited on April 19

Dreamix

Google and researchers at the Hebrew University of Jerusalem released Dreamix, a diffusion model for text-guided video editing and generation. It felt like just yesterday that I mentioned video diffusion models in a blog post, and here we are!
Their model is based on Imagen Video, a cascade of models:
  • a T5 XXL encoder
  • a text-conditioned video diffusion model that generates 16 frames of 40x24 RGB video, i.e. a tensor of shape (16, 24, 40, 3)
  • 3 spatial super-resolution diffusion models and 3 temporal super-resolution diffusion models

They fine-tuned their video diffusion model on a joint objective: given a downsampled and noisy version of the ground-truth video, the model must reconstruct the full video and also reconstruct each of its frames individually. This joint objective allows the model to learn both the appearance of objects within a frame and the motion associated with them.
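To make that concrete, here is a minimal PyTorch sketch of such a mixed objective. This is not the authors' code: the `corrupt` function, the noise schedule, and the 50/50 weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def corrupt(x, num_steps=1000):
    """Toy diffusion-style corruption: add Gaussian noise at a random timestep."""
    t = torch.randint(0, num_steps, (1,))
    sigma = (t.float() + 1) / num_steps          # stand-in for a real noise schedule
    noise = torch.randn_like(x)
    return x + sigma * noise, noise, t

def mixed_finetune_loss(model, video, alpha=0.5):
    """Joint objective sketch: one term reconstructs the full clip (motion),
    one term reconstructs shuffled individual frames (appearance).
    `model(x, t)` is a stand-in for a text/video-conditioned denoiser."""
    # Full-video term: denoise the whole clip so temporal dynamics are learned.
    noisy_video, noise_v, t_v = corrupt(video)
    video_loss = F.mse_loss(model(noisy_video, t_v), noise_v)

    # Frame-level term: shuffle frames to break temporal order, so the model
    # focuses on per-frame appearance rather than motion.
    frames = video[torch.randperm(video.shape[0])]
    noisy_frames, noise_f, t_f = corrupt(frames)
    frame_loss = F.mse_loss(model(noisy_frames, t_f), noise_f)

    return alpha * video_loss + (1 - alpha) * frame_loss
```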

The inference pipeline allows for video editing, image-to-video, and subject-driven video generation with all 3 applications conditioned on text.


Amazon AI Bedrock

Amazon recently released Bedrock, a managed cloud service for building with generative AI. It provides access to a diverse suite of foundation models.

Notice they are also training a foundation model of their own called Titan!

Elon Musk founds X.AI

Though Elon signed the open letter calling for a six-month pause on AI development, he still plans to compete against OpenAI with what he calls 'TruthGPT', a maximum-truth-seeking LLM, backed by his newly founded company: X.AI.

AI21 Jurassic-1 Models

AI21, an NLP research company, released AI21 Studio, a platform for using their models, and Jurassic-1, their family of LLMs. If you create an AI21 Studio account, don't forget to take advantage of the free $90 in credits! From the home page, they have an interface for uploading models and datasets, as well as task-specific API docs and examples. Check out their technical paper on the Jurassic-1 models here!

You can also check out an OpenAI-like query interface under the Playground tab.
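If you'd rather call the API directly than use the Playground, a completion request looks roughly like the sketch below. This is only my recollection of AI21 Studio's REST interface at the time, so treat the endpoint path, model name, and field names as assumptions and check the current docs before relying on them.

```python
import requests

API_KEY = "YOUR_AI21_STUDIO_KEY"  # from your AI21 Studio account page

# Endpoint and field names follow AI21 Studio's completion API as I remember it;
# the API has since evolved, so verify against the current documentation.
response = requests.post(
    "https://api.ai21.com/studio/v1/j1-jumbo/complete",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "prompt": "Write a haiku about large language models:",
        "maxTokens": 64,
        "temperature": 0.7,
        "numResults": 1,
    },
)
print(response.json()["completions"][0]["data"]["text"])
```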

Unlike OpenAI's GPT-3, they utilized a recent theory about the depth-versus-width tradeoff in LLMs, favoring 76 layers instead of GPT-3's 96. They also use a 256K-token vocabulary, much larger than other LLMs! The differences in tokenizer efficiency and architecture (depth vs. width) allowed for a significant boost in inference speed. However, this applies only to their Jurassic-1 models; as of this blog post, the Jurassic-2 models are already out.

Grounded Segment Anything

For context, Grounding DINO is a strong zero-shot object detector conditioned on user text, and the Segment Anything Model (SAM) from Meta is a versatile segmentation model.
The Grounded Segment Anything project combines Grounding DINO with Meta's SAM to produce a text-conditioned segmentation system! Check out their HuggingFace Space Demo here and their interactive demo here!
Here I am segmenting out an image of a giraffe, pig, and a book! I found it interesting that the model could identify the giraffe and pig despite their drastic differences from real giraffes and pigs, yet it missed the laptop in the scene.
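Under the hood, the pipeline is a two-stage hand-off: Grounding DINO turns the text prompt into boxes, and those boxes prompt SAM for masks. Here is a rough sketch; the file paths, thresholds, and prompt are placeholders, and the helper names may differ slightly between Grounding DINO versions.

```python
import torch
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# 1) Detect boxes for the text prompt with Grounding DINO.
dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("giraffe_pig_laptop.jpg")   # hypothetical input image
boxes, logits, phrases = predict(
    model=dino, image=image, caption="giraffe. pig. laptop.",
    box_threshold=0.35, text_threshold=0.25,
)

# Grounding DINO returns normalized cxcywh boxes; SAM expects pixel-space xyxy.
h, w, _ = image_source.shape
boxes_xyxy = boxes * torch.tensor([w, h, w, h])
boxes_xyxy[:, :2] -= boxes_xyxy[:, 2:] / 2
boxes_xyxy[:, 2:] += boxes_xyxy[:, :2]

# 2) Prompt SAM with those boxes to get segmentation masks.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)
masks = [
    predictor.predict(box=box.numpy(), multimask_output=False)[0]
    for box in boxes_xyxy
]
```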


Segment Everything Everywhere All at Once (SEEM)

SEEM is a SAM-style promptable segmentation model that accepts multi-modal input.



SEEM comprises a decoder, a mask predictor, and a concept classifier. The figure on the right (b) simply denotes the human-AI interaction loop. The diagram on the left details the SEEM model and is meant to be read from bottom to top. There are encoders for text, image, and other visual prompts (and even audio!). A few interesting things to note:
  • encoders encode text, image, and other visuals but there are also some learnable queries
  • there is a memory prompt responsible for encoding masks from previous iterations (see the sketch below)
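Here is a purely conceptual sketch of that interaction loop. This is not SEEM's actual code: `decoder`, `encode_prompt`, and the return values are hypothetical stand-ins meant only to show how the memory prompt carries masks from one round to the next.

```python
# Conceptual sketch only (not SEEM's implementation).
def interactive_segmentation(decoder, image_feats, encode_prompt, rounds):
    memory_prompt = None
    mask, concept = None, None
    for user_prompt in rounds:                       # clicks, boxes, scribbles, text, ...
        prompts = [encode_prompt(user_prompt)]
        if memory_prompt is not None:
            prompts.append(memory_prompt)            # carry over what was segmented so far
        mask, concept = decoder(image_feats, prompts)  # mask predictor + concept classifier
        memory_prompt = encode_prompt(mask)          # encode the new mask for the next round
    return mask, concept
```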
Click here for the HuggingFace Space demo!

Meta's DINOv2

Building on the original DINO paper, DINOv2 captures visual features even better. They released their model on GitHub.
DINOv2, in contrast to many other modern methods for encoding visual features, does not need to rely on fine-tuning to perform well (see the snippet after the list below). With improvements in joint embedding models (like image-text) gradually slowing down, the Meta researchers had to construct a better dataset, improve the training algorithm, and design a distillation pipeline in order to improve upon DINO.
  • To build their dataset, they collected a set of seed images that were then matched to similar images from a large uncurated pool to balance the distribution of concepts.
  • They used PyTorch 2.0 with xFormers.
  • Released ViT-S/B/L/g models (small, base, large, giant)
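Because the features work without fine-tuning, trying DINOv2 is as simple as pulling a frozen backbone from torch.hub and reading off embeddings. A quick sketch, assuming ImageNet-style preprocessing; the 384-dimensional output is specific to ViT-S/14:

```python
import torch

# The official repo exposes the released backbones via torch.hub as
# dinov2_vits14 / dinov2_vitb14 / dinov2_vitl14 / dinov2_vitg14.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# Use it as a frozen feature extractor: take the embeddings and put a light
# head (linear probe, k-NN) on top instead of fine-tuning the backbone.
dummy = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image
with torch.no_grad():
    features = model(dummy)            # (1, 384) global embedding for ViT-S/14
print(features.shape)
```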

Other Interesting Topics & Papers

DETR Beats YOLO in Real-Time Object Detection

The authors' model, RT-DETR, outperforms YOLO detectors in both speed and accuracy and will eventually be integrated into PaddleDetection!

AGIEval

AGIEval is one large benchmark of human-centric exams and tasks for models like GPT-4 to take on.

One Small Step for Generative AI, One Giant Leap for AGI

This survey paper analyzes the technology, applications, challenges, and limitations of ChatGPT and emerging early AGI models like GPT-4.

References

Dreamix
Amazon AI Bedrock
X.AI
AI21
Grounded Segment Anything
Segment Everything Everywhere All at Once
DINOv2
DETR Beats YOLO in Real-Time Object Detection
One Small Step for Generative AI, One Giant Leap for AGI
AGIEval
Tags: ML News