
Compressed Vision: DeepMind's AI Pipeline For Memory-Efficient Video Compression

DeepMind researchers have unveiled Compressed Vision, a new pipeline that compresses visual data with AI, delivering massive memory-efficiency gains.
Visual classification is a classic application of machine learning, but handling visual data remains cumbersome even today. Most machine learning models are limited to short video clips because, even with standard compression methods, it's hard to fit much data into GPU memory.
Today, researchers at DeepMind reveal Compressed Vision, a framework for compressing visual data not with standard codecs, but with machine learning models.
Compressed Vision keeps visual data compressed all the way through to the final model input, including augmentation steps, allowing significantly more visual data to be fed to the model and GPU.

Compressed Vision also comes with a research paper released earlier this month.

Visual data compression for AI

Compressed Vision was designed as an efficient way to handle visual data in machine learning workflows. It is split into two parts: initial compression and downstream tasks. The efficiency of the pipeline comes from the fact that once visual data is compressed, it stays compressed through to the end, unlike the standard approach to working with visual data.
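To make that two-stage flow concrete, here's a minimal PyTorch sketch. The Compressor and DownstreamClassifier modules, their layer sizes, and all tensor shapes are illustrative stand-ins, not DeepMind's actual architectures; the point is only that the downstream model consumes stored latents rather than raw pixels:

```python
import torch
import torch.nn as nn

class Compressor(nn.Module):
    """Toy stand-in for the learned compressor (illustrative only)."""
    def __init__(self, in_ch=3, latent_ch=8):
        super().__init__()
        # Aggressive spatial downsampling; the real model compresses video,
        # including the time dimension.
        self.encode = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(32, latent_ch, kernel_size=4, stride=4),
        )

    def forward(self, frames):  # (B, 3, H, W) -> (B, latent_ch, H/16, W/16)
        return self.encode(frames)

class DownstreamClassifier(nn.Module):
    """Toy downstream model that only ever sees latents."""
    def __init__(self, latent_ch=8, n_classes=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(latent_ch, n_classes),
        )

    def forward(self, latents):
        return self.head(latents)

frames = torch.rand(2, 3, 224, 224)       # a small batch of frames
latents = Compressor()(frames)            # stage 1: compress once, store to disk
logits = DownstreamClassifier()(latents)  # stage 2: train directly on latents
```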


Compressing visual data with AI vs standard methods

The first and most important part of Compressed Vision is the compression itself. Input data is fed into the compression model, which transforms it into a significantly smaller representation that is then stored for later use by other machine learning models.
Storage matters when you're working with extremely large datasets, so compressing data well is valuable. With Compressed Vision's compression, datasets can be made dramatically smaller.
Because the compressed data is built by and for ML models, it can also be sent directly to the GPU as model input, allowing significantly longer videos to be used without running into memory limits.
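For a rough sense of scale, here's a back-of-envelope storage calculation using the compression rates quoted in this piece (30, 180, and 475); the clip dimensions are assumptions for illustration:

```python
# Back-of-envelope storage math. The clip shape is assumed for illustration;
# the rates (30, 180, 475) are those discussed in the article.
frames, height, width, channels = 16, 256, 256, 3  # a short raw RGB clip
raw_bytes = frames * height * width * channels     # uint8 pixels
print(f"raw clip: {raw_bytes / 1024:.0f} KiB")
for rate in (30, 180, 475):
    print(f"rate {rate:>3}x -> {raw_bytes / rate / 1024:.1f} KiB")
```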
To ensure the compressed data is of high quality, the researchers compared it against JPEG and MPEG encodings at various compression rates. They found that the visual quality of the neurally compressed data was significantly higher than that of JPEG and MPEG at high compression rates; however, the decompression process took much longer. Thankfully, no decompression step is needed in practice, because downstream ML models work with the compressed data directly.
Here you can see a comparison of neural, MPEG, and JPEG encodings at a compression rate of 180:

At extremely high compression rates, MPEG and JPEG output is hard for human eyes to parse, and feeding that data into a machine learning model could yield poor results.
Here, neural compression at compression rates of 30 and 475 is compared to the baseline:

At low compression rates, the result is indistinguishable from the baseline; at very high compression rates, however, you start to lose fine details. For machine learning models, those fine details are often not the focus of attention, so extremely high compression rates are perfectly feasible in practice.

Using neurally-compressed data for downstream tasks

The second piece of Compressed Vision is where the compressed data is applied for real tasks, like object or motion classification. These downstream tasks are effectively their own models made to work with the compressed data that the compressor produces.
Using standard video encoding during model runtime limits the amount of data that can be processed in any given step because visual data, especially video, takes up a lot of memory. Many models today work only with video clips a few seconds long, but with neurally compressed data, the researchers could send entire hour-long videos through the model.
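Here's a rough, illustrative calculation of why that is. The 475 rate comes from the comparisons above, while the frame size, frame rate, and memory budget are assumptions, and real limits also depend on activations and model weights:

```python
# Rough illustration only: ignores activations, weights, and overheads.
rate = 475                         # a compression rate from the comparisons above
fps, height, width = 25, 256, 256  # assumed clip properties
budget = 8 * 1024**3               # assume an 8 GiB memory budget

raw_bytes_per_sec = fps * height * width * 3  # uint8 RGB frames
raw_seconds = budget / raw_bytes_per_sec
print(f"raw frames:      ~{raw_seconds / 60:.0f} minutes fit")
print(f"latents at 475x: ~{raw_seconds * rate / 3600:.0f} hours fit")
```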

Augmenting neurally-compressed data

One key aspect of visual data's use in machine learning is the ability to modify it before it's sent into the model. For training tasks this is very valuable, because a single dataset element can be augmented into many variations, effectively multiplying the size of any dataset.
Visual augmentation in standard setups relies on conventional image-processing algorithms, which work fine on JPEG, MPEG, and many other standard formats but can't be applied to neurally compressed data. That's why the researchers set out to build models that augment their compressed data directly, and they succeeded.
Here's an example of increased saturation, visual crop, and rotation:

Remember, all of these transformations are performed on the compressed data in the neural latent space; they are learned and applied by AI models, with no hand-coded image-processing routines involved.
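As a sketch of what a learned latent-space augmentation could look like, here's a small PyTorch module that maps a clip's latents to the latents of a transformed clip, conditioned on an augmentation parameter. The architecture, conditioning scheme, and every name here are assumptions for illustration, not DeepMind's actual transformation networks:

```python
import torch
import torch.nn as nn

class LatentAugment(nn.Module):
    """Hypothetical learned augmentation applied directly to latents."""
    def __init__(self, latent_ch=8):
        super().__init__()
        # Conditioned on a scalar augmentation parameter (e.g. rotation angle).
        self.net = nn.Sequential(
            nn.Conv2d(latent_ch + 1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, latent_ch, kernel_size=3, padding=1),
        )

    def forward(self, z, param):
        # Broadcast the scalar parameter to a plane and concatenate with z.
        p = param.view(-1, 1, 1, 1).expand(-1, 1, *z.shape[-2:])
        return self.net(torch.cat([z, p], dim=1))

z = torch.rand(4, 8, 14, 14)               # stored latents for four clips
angle = torch.tensor([0.1, 0.5, -0.3, 0.0])
z_aug = LatentAugment()(z, angle)          # augmented latents, no decoding needed
```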

Find out more

Visit the Compressed Vision project web page for more examples and information. According to a notice on the page, models and the codebase are coming soon.
Read the full paper for all the details on Compressed Vision.