Deploying efficient neural nets on mobile and edge devices is becoming increasingly important. To achieve this, neural networks are "quantized" to use low-precision arithmetic, which reduces storage and computational costs and improves latency. Quantization can come with a trade-off in accuracy, but image classification and natural language processing models have recently been quantized successfully.
However, quantized models for image reconstruction lead to a noticeable loss in image quality. This report explores QW-Net, an architecture for image reconstruction in which close to 95% of the computation can be implemented with only 4-bit integers.
If you are a PC gamer, you must have seen pixelated edges. These are referred to as "jaggies". Mood spoiler, right! Well, this is caused by aliasing. Aliasing is a fundamental concept in signal processing and occurs any time we discretely sample a signal at a rate below the Nyquist rate for that signal.
-> Figure 1: Right, jagged lines; center, the red polygon being drawn on the screen with square pixels; and left, what is rendered on the screen. (Source) <-
In computer graphics, aliasing manifests in many ways in practice. Rasterization aliasing is one of them and is widely known as the jagged lines of a triangle, as shown in figure 1. It happens when we rasterize an image onto a 2D grid of pixels. Each pixel can only have one color, so the GPU selects the color that is at the very center of the pixel as each shape is drawn. When the computer renders this geometry, the center of each pixel is either inside the shape or not, hence the sawtooth pattern. Check out this excellent blog post to learn more about aliasing.
So how to counter it? Anti-aliasing is a method by which we can eliminate jaggies that appear in objects in PC games. There are several different types of anti-aliasing methods, each with its pros and cons, but each aims to create sharper images by removing jaggies. The term antialiasing broadly encompasses prefiltering, sampling, and reconstruction techniques that seek to avoid or remove undersampling artifacts.
Supersampling is a straightforward way to overcome aliasing, but it comes at the cost of increased rendering time. Multisampling antialiasing (MSAA) is a cheaper alternative but is less commonly used in modern games. Temporal antialiasing (TAA) is a more recent family of techniques that leverage frame-to-frame coherence to amortize supersampling over time. However, it is susceptible to ghosting artifacts, loss of detail, and temporal instability, to name a few.
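To make center sampling and supersampling concrete, here is a minimal sketch in plain Python. The shape (a half-plane under the line y = 0.5x), the grid size, and the function names are all illustrative assumptions. With one sample per pixel every pixel is either fully on or fully off, which is the sawtooth pattern. With a 4 × 4 sub-pixel grid the edge pixels receive fractional coverage, at 16 times the sampling cost.

```python
def inside(x, y):
    """Hypothetical shape: the half-plane below the line y = 0.5 * x."""
    return y < 0.5 * x


def rasterize(width, height, samples_per_axis=1):
    """Return a grid of coverage values in [0, 1].

    With samples_per_axis=1 each pixel is tested only at its center
    (aliased). Larger values average a regular sub-pixel grid
    (supersampling).
    """
    n = samples_per_axis
    image = []
    for py in range(height):
        row = []
        for px in range(width):
            hits = 0
            for sy in range(n):
                for sx in range(n):
                    # Sample positions are the centers of an n x n sub-grid.
                    x = px + (sx + 0.5) / n
                    y = py + (sy + 0.5) / n
                    hits += inside(x, y)
            row.append(hits / (n * n))
        image.append(row)
    return image


aliased = rasterize(8, 4, samples_per_axis=1)  # only 0.0 or 1.0 per pixel
smooth = rasterize(8, 4, samples_per_axis=4)   # fractional values on the edge
```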
We can think of "jaggies" as noise and train a convolutional neural network to denoise images. A U-Net-based architecture is well suited for such tasks as it processes features at multiple scales. Also, note that game frames are sequential. This can be leveraged to obtain temporally stable results by introducing recurrent connections inside the U-Net. Recent works in this space can reconstruct images at a higher resolution than the input render. However, these methods either do not run in real time on a moderate GPU or do not have open-source code.
The authors of "A Reduced-Precision Network for Image Reconstruction" have proposed a novel architecture that combines two U-shaped networks, a feature extractor network and a filtering network, to achieve high-quality, temporally stable results with aggressive quantization.
State-of-the-art machine learning models are often bulky, making them inefficient for deployment in resource-constrained environments. They are also called full-precision neural networks and use the float32 datatype as a convention for arithmetic. It turns out that DNNs can work with smaller datatypes with less precision, such as 8-bit integers.
A quantized model executes some or all of the operations on tensors with integers rather than floating-point values. A full-precision network can be quantized either post-training or by training the network with simulated quantization. Post-training quantization can show significant degradation of accuracy. In contrast, quantization-aware training has shown promise to achieve significantly better accuracy, but at the cost of longer training time. Sayak Paul gives an excellent comparison of both techniques in A Tale of Model Quantization in TF Lite.
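As a rough illustration of what quantizing a tensor means, the sketch below rounds a handful of made-up weights onto a signed integer grid and maps them back to floats. The symmetric scheme with a single scale per tensor is an assumption chosen for simplicity, not the quantizer from the paper or from any particular framework. Simulated quantization during training applies the same quantize-then-dequantize round trip in the forward pass.

```python
def quantize(values, num_bits):
    """Symmetric uniform quantization to signed integers (illustrative scheme)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate float values."""
    return [qi * scale for qi in q]


def max_error(values, num_bits):
    """Largest absolute round-trip error for the given bit width."""
    q, scale = quantize(values, num_bits)
    return max(abs(v - r) for v, r in zip(values, dequantize(q, scale)))


weights = [0.82, -0.31, 0.05, -0.77, 0.46]  # made-up example weights
for bits in (8, 4):
    print(bits, quantize(weights, bits)[0], max_error(weights, bits))
```

With only 15 usable levels at 4 bits, the round-trip error is much larger than at 8 bits, which is why aggressive quantization is hard for tasks that are sensitive to small value changes.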
Quantization has two main benefits:
Low-precision arithmetic is fast compared to floating-point arithmetic. Even though modern hardware can compute with floating-point numbers at the same speed as integers, an 8-bit computation is generally faster than a 32-bit computation. This improves the latency of game rendering systems, allowing real-time image reconstruction.
A quantized model has a significantly lower memory footprint.
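The memory saving is simple arithmetic, sketched below. The parameter count is a hypothetical round number and not a figure from the paper.

```python
def model_size_bytes(num_params, bits_per_param):
    """Storage needed for the weights alone, ignoring any metadata."""
    return num_params * bits_per_param / 8


num_params = 1_000_000  # hypothetical parameter count, not from the paper
sizes = {bits: model_size_bytes(num_params, bits) for bits in (32, 8, 4)}
for bits, size in sizes.items():
    print(f"{bits}-bit weights: {size / 1e6:.1f} MB")
```

Going from float32 to 8-bit integers cuts weight storage by 4x, and 4-bit integers cut it by 8x.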
Real-time game image reconstruction can be achieved with GPU acceleration and model quantization. However, quantization errors severely impact image quality, especially with high-dynamic-range content. The authors aim to show the feasibility of a heavily quantized network for image reconstruction.
-> Figure 2: Overview of the proposed QW-Net architecture <-
The proposed QW-Net architecture addresses the issue using two U-shaped networks: a feature extraction network and a filtering network. For a temporally stable result, a frame-recurrent approach is used, where the previously reconstructed frame is warped and concatenated with the input frame, forming the current input to the network. With the feature extractor denoted $U_e$ and the filter network $U_f$, the reconstructed output $I_o^k$ at frame $k$ is given by,
$I_o^k = U_f(U_e(I_a^k, I_w^k), I_a^k, I_w^k)$
$I_w^k = W(I_o^{k-1}, I_v^k)$
Here, $I_a^k$ is the aliased input, $I_w^k$ is the warped previous output, $I_v^k$ is a 2D grid of motion vectors, and $W$ is a bilinear warp function.
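The two equations above can be sketched as a loop in plain Python. This is only an illustration under several assumptions: images are single-channel 2D lists, each motion vector is the offset from a pixel in the current frame to its position in the previous frame, borders are clamped, and the first frame is bootstrapped with itself. `feature_extractor` and `filter_network` are hypothetical callables standing in for $U_e$ and $U_f$.

```python
import math


def bilinear_sample(image, x, y):
    """Sample a 2D list at fractional (x, y), clamping to the border."""
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def warp(prev_output, motion):
    """W(I_o^{k-1}, I_v^k): fetch each pixel from where it was last frame.

    Assumes motion[y][x] = (dx, dy) is the offset from the current pixel
    to its position in the previous frame.
    """
    h, w = len(prev_output), len(prev_output[0])
    return [[bilinear_sample(prev_output,
                             x + motion[y][x][0],
                             y + motion[y][x][1])
             for x in range(w)] for y in range(h)]


def reconstruct_sequence(frames, motions, feature_extractor, filter_network):
    """Frame-recurrent loop: I_o^k = U_f(U_e(I_a^k, I_w^k), I_a^k, I_w^k)."""
    outputs = []
    prev = frames[0]  # assumption: bootstrap with the first aliased frame
    for aliased, motion in zip(frames, motions):
        warped = warp(prev, motion)
        features = feature_extractor(aliased, warped)
        prev = filter_network(features, aliased, warped)
        outputs.append(prev)
    return outputs
```

The important detail is that each output is fed back as `prev`, so information accumulates across frames instead of each frame being reconstructed in isolation.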
-> Figure 3: Components of the feature extractor network <-
The feature extraction network is based on the U-Net architecture: encoder blocks downsample the image while decoder blocks reverse this process, and skip connections link the two. This is shown in figure 2.
The input processing unit takes in $I_a^k$ and $I_w^k$. It converts each of them to grayscale and computes their gradient magnitudes. They are concatenated with $I_a^k$ and $I_w^k$. This is the input to the first convolutional layer.
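A minimal sketch of this input processing step is given below. The report does not state which grayscale conversion or gradient operator is used, so the Rec. 709 luma weights, the central differences with clamped borders, and the 3-channel RGB layout are all assumptions made for illustration.

```python
import math


def to_grayscale(rgb):
    """rgb is a 2D list of (r, g, b) tuples. Rec. 709 luma weights (assumed)."""
    return [[0.2126 * r + 0.7152 * g + 0.0722 * b for (r, g, b) in row]
            for row in rgb]


def gradient_magnitude(gray):
    """Central differences with clamped borders (an assumed operator)."""
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            gx = (gray[y][min(x + 1, w - 1)] - gray[y][max(x - 1, 0)]) / 2
            gy = (gray[min(y + 1, h - 1)][x] - gray[max(y - 1, 0)][x]) / 2
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


def build_input(aliased, warped):
    """Per pixel: aliased RGB, warped RGB, and both gradient magnitudes."""
    grad_a = gradient_magnitude(to_grayscale(aliased))
    grad_w = gradient_magnitude(to_grayscale(warped))
    return [[(*aliased[y][x], *warped[y][x], grad_a[y][x], grad_w[y][x])
             for x in range(len(aliased[0]))]
            for y in range(len(aliased))]
```

Under these assumptions the first convolution layer sees 8 channels per pixel: 3 + 3 color channels and 2 gradient magnitudes.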
Each encoder block has two convolution layers with a 3 × 3 spatial footprint, each followed by batch normalization and Exponential Linear Unit (ELU) activation. The last stage in the encoder block is downsampling with 2 × 2 max pooling. The bottleneck has a feature depth of 160.
The decoder block starts with a 2 × 2 nearest-neighbor upsampling operation. The upsampled activations are concatenated with the skip connection and projected to the same size as the encoder output using a 1 × 1 convolution layer. A single 3×3 convolution layer follows it.
The feature extractor can be quantized to 4-bit integers as feature detection is more robust to quantization errors.
-> Figure 4: Components of the filter network <-
The filtering network has a similar topology to the feature extraction network, as shown in figure 2. The pair of downsampling and upsampling filters at each scale is coupled to the corresponding decoder block's output in the feature extraction network.
The input filter uses activations from the decoder block to predict a 3 × 3 kernel that is applied to the input image. A 1 × 1 convolution layer with softmax activation is used to predict the kernel, resulting in normalized weights. The input filter predicts 18 normalized filter weights corresponding to two 3 × 3 filters with 9 weights each. These filters are applied to $I_a^k$ and $I_w^k$, respectively, and the results are summed to produce a single image. The subsequent downsampling filters apply a 3 × 3 kernel to a single image. The last stage in each downsampling filter is a 2 × 2 average pooling operation. The bottleneck filter excludes this pooling operation.
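The kernel-prediction idea can be sketched for the single-image case used by the downsampling filters. In this illustration `logits[y][x]` is a hypothetical stand-in for the 9 raw outputs of the 1 × 1 convolution at each pixel, the image is a single-channel 2D list, and border clamping is an assumption.

```python
import math


def softmax(logits):
    """Turn raw values into positive weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_predicted_kernels(image, logits):
    """Filter a 2D image with a different 3x3 kernel at every pixel.

    logits[y][x] holds 9 raw values (a stand-in for the 1x1 convolution
    output). Borders are clamped, which is an assumption.
    """
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            weights = softmax(logits[y][x])
            acc = 0.0
            i = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += weights[i] * image[yy][xx]
                    i += 1
            row.append(acc)
        out.append(row)
    return out
```

Because the softmax weights are positive and sum to one, each output pixel is a weighted average of its neighborhood. Equal logits give a box blur, while a dominant center logit leaves the pixel nearly unchanged.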
The first stage in each upsampling filter is bilinear upsampling, after which the image is filtered and combined with the skip connection. The upsampling filters use 10 filter weights: 9 weights for the 3 × 3 filter kernel and one for scaling the skip connection. The authors use average pooling and bilinear upsampling in the filtering network as it results in better image quality. On the other hand, they use max pooling and nearest-neighbor upsampling in the feature extraction network as these are computationally cheaper and do not significantly impact feature extraction.
This network requires a higher precision but involves fewer computations than the feature extractor.
The network was trained on blocks of $N_y \times N_x \times N_t$ images, where the spatial dimensions $N_x$ and $N_y$ are both 256 and $N_t$ is the number of frames. 8 frames are used per block.
The authors prepared the datasets from four cinematic scenes publicly available for UE4. These scenes are Zengarden, Infiltrator, Kite, and Showdown. They had 13712 blocks for training.
The authors used two loss functions: a spatial loss $L_s$ and a temporal loss $L_t$.
Spatial Loss: It's a regular $L_1$ loss computed over $N_xN_y$ pixels and $N_t$ time steps.
Temporal Loss: The temporal loss is the mean absolute error in the temporal gradient and aims to achieve temporal stability.
The authors used the weighted sum of the two losses given by,
$L = 0.3L_s + 0.7L_t$
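The weighted loss can be sketched in plain Python as follows. Sequences are lists of single-channel 2D frames for simplicity, and the temporal gradient is taken as the plain difference between consecutive frames. That finite-difference form is an assumption, since the report only says the loss is the mean absolute error in the temporal gradient.

```python
def spatial_loss(outputs, targets):
    """Mean absolute error over all pixels and time steps."""
    total, count = 0.0, 0
    for out, tgt in zip(outputs, targets):
        for row_o, row_t in zip(out, tgt):
            for o, t in zip(row_o, row_t):
                total += abs(o - t)
                count += 1
    return total / count


def temporal_gradient(frames):
    """Frame-to-frame differences, one per consecutive pair of frames."""
    return [[[c - p for c, p in zip(row_c, row_p)]
             for row_c, row_p in zip(cur, prev)]
            for prev, cur in zip(frames, frames[1:])]


def temporal_loss(outputs, targets):
    """Mean absolute error between the temporal gradients."""
    return spatial_loss(temporal_gradient(outputs), temporal_gradient(targets))


def total_loss(outputs, targets, w_s=0.3, w_t=0.7):
    """L = 0.3 * L_s + 0.7 * L_t, using the weights quoted above."""
    return (w_s * spatial_loss(outputs, targets)
            + w_t * temporal_loss(outputs, targets))
```

A reconstruction that is off by a constant brightness only pays the spatial term, while one that flickers around the target from frame to frame is also penalized by the more heavily weighted temporal term.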
At each training iteration, a mini-batch of 64 blocks was used to get the reconstruction for each time step, and the computed loss was backpropagated through all time steps. The authors used the Ranger optimizer, which combines Rectified Adam and Lookahead, with default parameters and a learning rate of 0.0005.
The quantization approach targets GPU architectures that support accelerated tensor computations with 8-bit and 4-bit integers.
All layers of the feature extraction network are quantized to 4-bit weights and activations, except the first convolution layer, which uses 8-bit weights and 4-bit activations.
The network is trained with full precision and then the weights and activations are quantized by fine-tuning the network with simulated quantization. This improves model performance.
In the filtering network, only the 1 × 1 convolution layers that predict the filter kernels are quantized, and those use 8-bit precision.
The goal of this report is to summarize the paper, making it more accessible for readers. I have used lines from the paper in places because that was the best way to convey the information. I found this paper particularly challenging because of the absence of source code and my lack of familiarity with the video rendering use case. What fascinated me was that the authors tried to apply quantization to image reconstruction, and that they did so with only 4-bit precision. That is harsh!
Another critical thing to note is that the official paper is full of minute details, making it worth going through. Please feel free to share your thoughts about this work in the comments.