Overview: A Reduced-Precision Network for Image Reconstruction

This report explores QW-Net, a novel low-precision neural network architecture for image reconstruction.
Ayush Thakur

Deploying efficient neural networks on mobile and edge devices is becoming increasingly important. To achieve this, networks are "quantized" to use low-precision arithmetic, which reduces storage and computational costs and improves latency. Quantization can come with a trade-off in accuracy, but it has recently been applied successfully to image classification and natural language processing models.


However, quantized models for image reconstruction tend to show a noticeable loss in image quality. This report explores QW-Net, an architecture for image reconstruction in which close to 95% of the computation can be implemented with 4-bit integers.

Introduction

Aliasing

If you are a PC gamer, you have probably seen pixelated edges, commonly referred to as "jaggies". Mood spoiler, right? Well, they are caused by aliasing. Aliasing is a fundamental concept in signal processing and occurs any time we discretely sample a signal at a rate below its Nyquist rate.

aliasing.png

-> Figure 1: Right, the jagged lines; center, the red polygon being drawn on the screen with square pixels; left, what is rendered on the screen. (Source) <-

In computer graphics, aliasing manifests in many ways. Rasterization aliasing is one of them and is widely recognized as the jagged edges of a triangle, as shown in Figure 1. It happens when we rasterize geometry onto a 2D grid of pixels. Each pixel can have only one color, so the GPU samples the color at the very center of the pixel as each shape is drawn. When the computer renders the geometry, the center of each pixel is either inside the shape or outside it, hence the sawtooth pattern. Check out this excellent blog post to learn more about aliasing.
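To make the pixel-center sampling idea concrete, here is a minimal, illustrative sketch (not from the paper): each pixel takes a single binary sample at its center, so a diagonal edge turns into a staircase.

```python
import numpy as np

# Illustrative sketch of pixel-center sampling: every pixel gets exactly one
# sample at its center, so a diagonal edge rasterizes into a staircase pattern.
def rasterize_halfplane(width, height, slope=0.3):
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = xs + 0.5, ys + 0.5       # pixel centers
    inside = cy > slope * cx          # arbitrary diagonal edge (hypothetical shape)
    return inside.astype(np.uint8)    # one color per pixel: in (1) or out (0)

coverage = rasterize_halfplane(16, 8)
print(coverage)  # the 0/1 staircase along the edge is the "jaggies"
```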

Antialiasing

So how do we counter it? Antialiasing is a set of methods for eliminating the jaggies that appear on objects in PC games. There are several different types of antialiasing, each with its pros and cons, but all of them aim to create sharper images by removing jaggies. The term antialiasing broadly encompasses prefiltering, sampling, and reconstruction techniques that seek to avoid or remove undersampling artifacts.

Supersampling is a straightforward way to overcome aliasing, but it comes at the cost of increased rendering work. Multisample antialiasing (MSAA) is a cheaper alternative but is less commonly used in modern games. Temporal antialiasing (TAA) is a more recent family of techniques that leverages frame-to-frame coherence to amortize supersampling over time. However, it is susceptible to ghosting artifacts, loss of detail, and temporal instability, to name a few.

We can treat "jaggies" like noise and learn a convolutional neural network to denoise the images. U-Net-based architectures are well suited for such tasks because they process features at multiple scales. Also, game frames are sequential, which can be leveraged to produce temporally stable results by introducing recurrent connections inside the U-Net. Recent works in this space can even reconstruct images at a higher resolution than the input render. However, these methods either do not run in real time on a moderate GPU or do not have open-sourced code.

The authors of *A Reduced-Precision Network for Image Reconstruction* propose a novel architecture, QW-Net, that combines two U-shaped networks, a feature extraction network and a filtering network, to achieve high-quality, temporally stable results under aggressive quantization.

What is Neural Network Quantization?

State-of-the-art machine learning models are often bulky, making them inefficient for deployment in resource-constrained environments. These models are also called full-precision neural networks, as they conventionally use the float32 datatype for arithmetic. It turns out that DNNs can work with smaller, less precise datatypes, such as 8-bit integers.
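As a minimal sketch of what "working with 8-bit integers" means, the snippet below applies uniform affine quantization to a float32 array. The scale/zero-point scheme is the usual textbook convention, not the paper's quantization scheme.

```python
import numpy as np

# Uniform affine quantization of a float32 array to int8 (illustrative only;
# real frameworks also handle per-channel scales, calibration, etc.).
def quantize_int8(x):
    scale = max((x.max() - x.min()) / 255.0, 1e-8)
    zero_point = int(round(-128 - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, s, z = quantize_int8(x)
print(np.abs(x - dequantize(q, s, z)).max())  # small quantization error
```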

A quantized model executes some or all of its operations on tensors with integers rather than floating-point values. A full-precision network can be quantized either post-training or by training the network with simulated quantization. Post-training quantization can show a significant degradation in accuracy. In contrast, quantization-aware training has shown promise in achieving significantly better accuracy, but at the cost of longer training time. Sayak Paul covers an excellent comparison of both techniques in A Tale of Model Quantization in TF Lite.
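For reference, here is a hedged sketch of post-training quantization with TF Lite, the workflow compared in the article linked above. A tiny stand-in Keras model is used here; in practice this would be the trained full-precision network.

```python
import tensorflow as tf

# Stand-in model for illustration; replace with the trained full-precision network.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(3, 3),
])

# Post-training quantization: convert the trained Keras model to a quantized
# TFLite model using the default optimization (dynamic-range weight quantization).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)
```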

Why is it important here?

There are clear benefits of quantization here: low-precision weights and activations reduce storage and memory bandwidth, integer arithmetic reduces computational cost and power, and both together lower latency, which is essential if the network is to reconstruct frames within a real-time budget.

Overview of the Proposed Architecture: QW-Net

Real-time game image reconstruction can be achieved with GPU acceleration and model quantization. However, quantization errors severely impact image quality, especially with high-dynamic-range content. The authors aim to show the feasibility of a heavily quantized network for image reconstruction.

quantunet.png

-> Figure 2: Overview of the proposed QW-Net architecture <-

The proposed QW-Net architecture addresses the issue using two U-shaped networks: a feature extraction network and a filtering network. For a temporally stable result, a frame-recurrent approach is used, where the previously reconstructed frame is warped and concatenated with the input frame, forming the current input to the network. Let the feature extractor be $U_e$ and the filter network $U_f$; the reconstructed output $I_o^k$ at frame $k$ is given by,

$I_o^k = U_f(U_e(I_a^k, I_w^k), I_a^k, I_w^k)$

$I_w^k = W(I_o^{k-1}, I_v^k)$

Here, $I_a^k$ is the aliased input, $I_w^k$ is the warped previous output, $I_v^k$ is a 2D grid of motion vectors, and $W$ is a bilinear warp function.
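The recurrence above can be sketched as a simple loop. This is an illustrative reconstruction of the equations only (the paper's code is not public); `feature_extractor` ($U_e$), `filter_network` ($U_f$), and `bilinear_warp` ($W$) are assumed callables, and initializing the recurrence with the first aliased frame is an assumption.

```python
# Hypothetical sketch of the frame-recurrent reconstruction loop described above.
def reconstruct_sequence(aliased_frames, motion_vectors,
                         feature_extractor, filter_network, bilinear_warp):
    outputs = []
    prev_output = aliased_frames[0]  # assumption: start the recurrence from the first aliased frame
    for I_a, I_v in zip(aliased_frames, motion_vectors):
        I_w = bilinear_warp(prev_output, I_v)      # I_w^k = W(I_o^{k-1}, I_v^k)
        features = feature_extractor(I_a, I_w)     # U_e(I_a^k, I_w^k)
        I_o = filter_network(features, I_a, I_w)   # I_o^k = U_f(U_e(...), I_a^k, I_w^k)
        outputs.append(I_o)
        prev_output = I_o
    return outputs
```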

Feature Extractor

fetureextractor.png

-> Figure 3: Components of the feature extractor network <-

Filter Network

filter.png

-> Figure 4: Components of the filter network <-

Training and Quantization

The network was trained on blocks of $N_t \times N_y \times N_x$ images, where $N_x = N_y = 256$ are the spatial dimensions and $N_t$ is the number of frames; 8 frames are used per block.

Dataset

The authors prepared the dataset from four cinematic scenes publicly available for Unreal Engine 4 (UE4): Zengarden, Infiltrator, Kite, and Showdown. They used 13,712 blocks for training.

Loss function

The authors used two loss functions: a spatial loss $L_s$ and a temporal loss $L_t$.

The authors used the weighted sum of the two losses given by,

$L = 0.3L_s + 0.7L_t$
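As a sketch only: assuming an L1 spatial term and an L1 penalty on frame-to-frame differences for the temporal term (the paper defines its own versions of $L_s$ and $L_t$), the weighted combination could look like this in PyTorch.

```python
import torch

def combined_loss(output, reference, prev_output, prev_reference):
    # Illustrative stand-ins for the paper's terms:
    # L_s: spatial L1 between the reconstruction and the reference frame.
    L_s = torch.mean(torch.abs(output - reference))
    # L_t: L1 between the temporal change of the reconstruction and that of the reference.
    L_t = torch.mean(torch.abs((output - prev_output) - (reference - prev_reference)))
    return 0.3 * L_s + 0.7 * L_t
```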

Training

At each training iteration, a mini-batch of 64 blocks was used to get the reconstruction for each time step, and the computed loss was backpropagated through all time steps. The Ranger optimizer, which combines Rectified Adam and Lookahead, was used with default parameters and a learning rate of 0.0005.
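Below is a minimal sketch of one training step under these settings, assuming PyTorch, the hypothetical helpers sketched earlier (`reconstruct_sequence`, `combined_loss`, `bilinear_warp`, and a `model` that exposes the two sub-networks), and a third-party Ranger implementation such as the one in the `torch_optimizer` package. None of this is the authors' released code.

```python
import torch
import torch_optimizer  # third-party package that provides a Ranger implementation

# `model` wraps the feature extractor and filter network; `reconstruct_sequence`,
# `combined_loss`, and `bilinear_warp` are the hypothetical helpers sketched earlier.
optimizer = torch_optimizer.Ranger(model.parameters(), lr=5e-4)

def train_step(aliased_block, reference_block, motion_block):
    # Each block holds 8 consecutive 256x256 frames; a mini-batch holds 64 blocks.
    optimizer.zero_grad()
    outputs = reconstruct_sequence(aliased_block, motion_block,
                                   model.feature_extractor, model.filter_network,
                                   bilinear_warp)
    loss = 0.0
    for k in range(1, len(outputs)):  # accumulate the loss over all time steps
        loss = loss + combined_loss(outputs[k], reference_block[k],
                                    outputs[k - 1], reference_block[k - 1])
    loss.backward()   # backpropagate through all time steps
    optimizer.step()
    return loss.item()
```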

Quantization

Conclusion

The goal of this report is to summarize the paper and make it more accessible for readers. I have used lines from the paper in places because that was the best way to convey the information. I found this paper particularly challenging because of the absence of source code and my limited familiarity with the video rendering use case. What fascinated me was that the authors applied quantization to image reconstruction, and that they did so with only 4-bit precision. That is harsh!

Another critical thing to note is that the official paper is full of minute details, making it worth going through. Please feel free to share your thoughts about this work in the comments.