
Dreamix, Amazon AI Bedrock, X.AI, AI21 Jurassic-1 Models, Grounded SAM, SEEM, & More

Created on April 18 | Last edited on April 19

Dreamix

Google and researchers at the Hebrew University of Jerusalem released Dreamix, a diffusion model for text-guided video editing and generation. It felt like just yesterday that I mentioned video diffusion models in a blog post, and here we are!
Their model is based on Imagen Video, a cascade of models:
  • a T5 XXL encoder
  • a text-conditioned video diffusion model that generates 16 frames of 40x24 RGB video, i.e. a tensor of shape (16, 24, 40, 3)
  • 3 spatial super-resolution diffusion models and 3 temporal super-resolution diffusion models

They fine-tuned their video diffusion model on a joint objective: given a downsampled and noisy version of the ground-truth video, the model must reconstruct the full video and also reconstruct each of its frames individually. This joint objective allows the model to learn both the appearance of objects within a frame and the motion associated with them.
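To make that concrete, here is a minimal PyTorch sketch of such a mixed objective. This is not the authors' code: the `corrupt` function, the noise schedule, and the 50/50 weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def corrupt(x, num_steps=1000):
    """Toy diffusion-style corruption: add Gaussian noise at a random timestep."""
    t = torch.randint(0, num_steps, (1,))
    sigma = (t.float() + 1) / num_steps          # stand-in for a real noise schedule
    noise = torch.randn_like(x)
    return x + sigma * noise, noise, t

def mixed_finetune_loss(model, video, alpha=0.5):
    """Joint objective sketch: one term reconstructs the full clip (motion),
    one term reconstructs shuffled individual frames (appearance).
    `model(x, t)` is a stand-in for a text/video-conditioned denoiser."""
    # Full-video term: denoise the whole clip so temporal dynamics are learned.
    noisy_video, noise_v, t_v = corrupt(video)
    video_loss = F.mse_loss(model(noisy_video, t_v), noise_v)

    # Frame-level term: shuffle frames to break temporal order, so the model
    # focuses on per-frame appearance rather than motion.
    frames = video[torch.randperm(video.shape[0])]
    noisy_frames, noise_f, t_f = corrupt(frames)
    frame_loss = F.mse_loss(model(noisy_frames, t_f), noise_f)

    return alpha * video_loss + (1 - alpha) * frame_loss
```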

The inference pipeline allows for video editing, image-to-video, and subject-driven video generation with all 3 applications conditioned on text.


Amazon AI Bedrock

Amazon recently released Bedrock, a managed cloud service for building with generative AI. It provides access to a diverse suite of foundation models.

Notice they are also training a foundation model of their own called Titan!

Elon Musk founds X.AI

Though Elon signed the open letter calling for a six-month pause on AI development, he still plans to compete against OpenAI with what he calls 'TruthGPT', a maximum-truth-seeking LLM, backed by his newly founded company: X.AI.

AI21 Jurassic-1 Models

AI21, an NLP research company, released AI21 Studio, a platform for using their models, and Jurassic-1, their family of LLMs. If you create an AI21 Studio account, don't forget to take advantage of the free $90 in credits! From the home page, they have an interface for uploading models and datasets, as well as task-specific API docs and examples. Check out their technical paper on the Jurassic-1 models here!

You can also check out an OpenAI-like query interface under the Playground tab.
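If you'd rather call the API directly than use the Playground, a completion request looks roughly like the sketch below. This is only my recollection of AI21 Studio's REST interface at the time, so treat the endpoint path, model name, and field names as assumptions and check the current docs before relying on them.

```python
import requests

API_KEY = "YOUR_AI21_STUDIO_KEY"  # from your AI21 Studio account page

# Endpoint and field names follow AI21 Studio's completion API as I remember it;
# the API has since evolved, so verify against the current documentation.
response = requests.post(
    "https://api.ai21.com/studio/v1/j1-jumbo/complete",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "prompt": "Write a haiku about large language models:",
        "maxTokens": 64,
        "temperature": 0.7,
        "numResults": 1,
    },
)
print(response.json()["completions"][0]["data"]["text"])
```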

Unlike OpenAI's GPT-3, they utilized a recent theory about the depth-versus-width tradeoff in LLMs, favoring 76 layers instead of GPT-3's 96. They also use a 256K-token vocabulary, much larger than other LLMs! The differences in tokenizer efficiency and architecture (depth vs. width) allowed for a significant boost in inference speed. However, this applies only to their Jurassic-1 models; as of this blog post, the Jurassic-2 models are already out.

Grounded Segment Anything

For context, Grounding DINO is a strong zero-shot object detector conditioned on user text, and the Segment Anything Model (SAM) from Meta is a versatile segmentation model.
The Grounded Segment Anything project combines Grounding DINO with Meta's SAM to produce a text-conditioned segmentation system! Check out their HuggingFace Space Demo here and their interactive demo here!
Here I am segmenting out an image of a giraffe, pig, and a book! I found it interesting that the model could identify the giraffe and pig despite their drastic differences from real giraffes and pigs, yet it missed the laptop in the scene.
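Under the hood, the pipeline is a two-stage hand-off: Grounding DINO turns the text prompt into boxes, and those boxes prompt SAM for masks. Here is a rough sketch; the file paths, thresholds, and prompt are placeholders, and the helper names may differ slightly between Grounding DINO versions.

```python
import torch
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# 1) Detect boxes for the text prompt with Grounding DINO.
dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("giraffe_pig_laptop.jpg")   # hypothetical input image
boxes, logits, phrases = predict(
    model=dino, image=image, caption="giraffe. pig. laptop.",
    box_threshold=0.35, text_threshold=0.25,
)

# Grounding DINO returns normalized cxcywh boxes; SAM expects pixel-space xyxy.
h, w, _ = image_source.shape
boxes_xyxy = boxes * torch.tensor([w, h, w, h])
boxes_xyxy[:, :2] -= boxes_xyxy[:, 2:] / 2
boxes_xyxy[:, 2:] += boxes_xyxy[:, :2]

# 2) Prompt SAM with those boxes to get segmentation masks.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)
masks = [
    predictor.predict(box=box.numpy(), multimask_output=False)[0]
    for box in boxes_xyxy
]
```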


Segment Everything Everywhere All at Once (SEEM)

SEEM is a SAM-style promptable segmentation model that accepts multi-modal input.



SEEM comprises a decoder, a mask predictor, and a concept classifier. The figure on the right (b) simply denotes the human-AI interaction loop. The diagram on the left details the SEEM model and is meant to be read from bottom to top. There are encoders for text, image, and other visual prompts (and even audio!). A few interesting things to note:
  • encoders encode text, image, and other visuals but there are also some learnable queries
  • there is a memory prompt responsible for encoding masks from previous iterations (see the sketch below)
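Here is a purely conceptual sketch of that interaction loop. This is not SEEM's actual code: `decoder`, `encode_prompt`, and the return values are hypothetical stand-ins meant only to show how the memory prompt carries masks from one round to the next.

```python
# Conceptual sketch only (not SEEM's implementation).
def interactive_segmentation(decoder, image_feats, encode_prompt, rounds):
    memory_prompt = None
    mask, concept = None, None
    for user_prompt in rounds:                       # clicks, boxes, scribbles, text, ...
        prompts = [encode_prompt(user_prompt)]
        if memory_prompt is not None:
            prompts.append(memory_prompt)            # carry over what was segmented so far
        mask, concept = decoder(image_feats, prompts)  # mask predictor + concept classifier
        memory_prompt = encode_prompt(mask)          # encode the new mask for the next round
    return mask, concept
```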
Click here for the HuggingFace Space demo!

Meta's DINOv2

Building on the original DINO paper, DINOv2 captures visual features even better. They released their model on GitHub.
DINOv2, in contrast to many other modern methods for encoding visual features, does not need to rely on fine-tuning to perform well (see the snippet after the list below). With improvements in joint embedding models (like image-text) gradually slowing down, the Meta researchers had to construct a better dataset, improve the training algorithm, and design a distillation pipeline in order to improve upon DINO.
  • To build their dataset, they collected a set of seed images that were then matched to similar images from a large uncurated pool to balance the distribution of concepts.
  • They used PyTorch 2.0 with xFormers.
  • Released ViT-S/B/L/g models (small, base, large, giant)
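Because the features work without fine-tuning, trying DINOv2 is as simple as pulling a frozen backbone from torch.hub and reading off embeddings. A quick sketch, assuming ImageNet-style preprocessing; the 384-dimensional output is specific to ViT-S/14:

```python
import torch

# The official repo exposes the released backbones via torch.hub as
# dinov2_vits14 / dinov2_vitb14 / dinov2_vitl14 / dinov2_vitg14.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# Use it as a frozen feature extractor: take the embeddings and put a light
# head (linear probe, k-NN) on top instead of fine-tuning the backbone.
dummy = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image
with torch.no_grad():
    features = model(dummy)            # (1, 384) global embedding for ViT-S/14
print(features.shape)
```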

Other Interesting Topics & Papers

DETR Beats YOLO in Real-Time Object Detection

The authors' model, RT-DETR, outperforms YOLO detectors in both speed and accuracy and will eventually be integrated into PaddleDetection!

AGIEval

AGIEval is one large benchmark of human-centric exams and tasks for models like GPT-4 to take on.

One Small Step for Generative AI, One Giant Leap for AGI

This survey paper analyzes the technology, applications, challenges, and limitations of ChatGPT and emerging early AGI models like GPT-4.

References

Dreamix
Amazon AI Bedrock
X.AI
AI21
Grounded Segment Anything
Segment Everything Everywhere All at Once
DINOv2
DETR Beats YOLO in Real-Time Object Detection
One Small Step for Generative AI, One Giant Leap for AGI
AGIEval
Tags: ML News