
Dynamic Sky Replacement: The Sky Is Within Our Grasp!

This article explores an interesting paper called Castle in the Sky: Dynamic Sky Replacement and Harmonization in Videos.
Created on November 18 | Last edited on November 23
You might have used Instagram or Snapchat filters. Cat whiskers that look cute and vampire teeth that look spooky. Have you ever wondered how it's possible? Well, a short technical answer is that facial landmarks are automatically detected, and components like whiskers are placed on the appropriate landmarks. Interesting!
How about replacing the sky with something of your choice? You want the video that you shot on a cloudy day to have the warmth of a sunny day. Alternatively, you want a cool thunderstorm effect in the background. In this article, we will explore a technique that enables dynamic sky replacement and harmonization.

Project Website | Paper | Colab Notebook









An Introduction to Dynamic Sky Replacement

The sky is one of the vital components of outdoor photography as well as videography. The photographer usually has to deal with uncontrollable weather and lighting conditions, which lead to an overexposed or plain-looking sky. To work around these conditions, the photographer can use special hardware equipment that might not be affordable for everyone.
Software-based automatic sky editing is an affordable option, and recent computer vision advancements can benefit this space. Existing methods either require laborious and time-consuming manual work or have specific camera requirements. To overcome these issues, the authors of this paper proposed a new solution that can generate realistic and dramatic sky backgrounds in videos with controllable styles.
The proposed method, which we will overview in the next section, is:
  • Purely vision-based,
  • Requires no user interactions,
  • Can be applied to either online or offline processing scenarios,
  • And does not require any special hardware.
The video by the authors shows some excellent results produced by the proposed method.



Overview of the Proposed Method



The proposed method consists of three key components:
  • Sky matting network
  • Motion estimator
  • Skybox

Sky Matting Network

Image matting plays an essential role in image and video editing and encompasses many methods for separating the foreground of interest from an image. The foreground, which in our case is everything except the sky, is separated by predicting a soft "matte".
Contrary to previous methods that rely on binary pixel-wise classification (foreground vs. sky), the proposed sky matting network produces a soft sky matte for a more accurate detection result and a more visually pleasing blending effect.
The authors use a deep convolutional U-shaped network that consists of an encoder $E$ and a decoder $D$. This network predicts a coarse sky matte. A coarse-to-fine refinement module then takes in the coarse matte and the high-resolution input frame to produce a refined sky matte.

Details On U-shaped Network



  • The authors used ResNet-50 as the encoder network.
  • The decoder is not symmetric to the encoder but uses convolutional layers with upsampling layers.
  • Since the sky region usually appears at the upper part of the image, the conventional convolutional layers are replaced with coordinate convolutional layers at the encoder's input layer and all the decoder layers.
  • Skip connections were applied between the encoder and the decoder layers with the same spatial size.
  • An $L_2$ loss was applied in the raw pixel space between the predicted sky matte and the ground-truth matte.
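
To make the architecture description above more concrete, here is a minimal PyTorch sketch of what such a coordinate-convolution U-shaped matting network could look like. This is not the authors' implementation: the class names (CoordConv2d, SkyMattingNet, matte_loss), the choice of encoder stages, and the layer widths are illustrative assumptions.

```python
# Illustrative sketch (not the authors' exact code): a ResNet-50 encoder with a
# lightweight upsampling decoder, coordinate convolutions, and an L2 matte loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class CoordConv2d(nn.Module):
    """Conv layer that appends normalized (x, y) coordinate channels to its input."""
    def __init__(self, in_ch, out_ch, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 2, out_ch, **kwargs)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))

class SkyMattingNet(nn.Module):
    """U-shaped network: ResNet-50 encoder, asymmetric decoder with skip connections."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # Replace the encoder's input conv with a coordinate conv, as the paper suggests.
        self.stem = nn.Sequential(
            CoordConv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
            resnet.bn1, resnet.relu)
        self.enc1 = nn.Sequential(resnet.maxpool, resnet.layer1)  # 256 ch, 1/4 scale
        self.enc2 = resnet.layer2                                 # 512 ch, 1/8 scale
        self.enc3 = resnet.layer3                                 # 1024 ch, 1/16 scale
        # Decoder: coordinate convs + upsampling, not symmetric to the encoder.
        self.dec3 = CoordConv2d(1024 + 512, 256, kernel_size=3, padding=1)
        self.dec2 = CoordConv2d(256 + 256, 128, kernel_size=3, padding=1)
        self.dec1 = CoordConv2d(128 + 64, 64, kernel_size=3, padding=1)
        self.head = CoordConv2d(64, 1, kernel_size=3, padding=1)

    def forward(self, x):
        s0 = self.stem(x)   # 1/2
        s1 = self.enc1(s0)  # 1/4
        s2 = self.enc2(s1)  # 1/8
        s3 = self.enc3(s2)  # 1/16
        d = F.interpolate(s3, size=s2.shape[-2:], mode="bilinear", align_corners=False)
        d = F.relu(self.dec3(torch.cat([d, s2], dim=1)))  # skip connection from enc2
        d = F.interpolate(d, size=s1.shape[-2:], mode="bilinear", align_corners=False)
        d = F.relu(self.dec2(torch.cat([d, s1], dim=1)))  # skip connection from enc1
        d = F.interpolate(d, size=s0.shape[-2:], mode="bilinear", align_corners=False)
        d = F.relu(self.dec1(torch.cat([d, s0], dim=1)))  # skip connection from stem
        return torch.sigmoid(self.head(d))  # coarse sky matte in [0, 1]

def matte_loss(pred, gt):
    """L2 loss between the predicted coarse matte and the (resized) ground-truth matte."""
    gt_small = F.interpolate(gt, size=pred.shape[-2:], mode="bilinear", align_corners=False)
    return F.mse_loss(pred, gt_small)
```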

Details of the Refinement Module

  • The coarse sky matte is first upsampled to the original input resolution in this stage.
  • The module takes in the upsampled coarse sky matte and the high-resolution input frame (the guidance image). The authors used the Guided Filtering technique instead of an upsampling convolutional operation or adversarial training.
  • This technique is used because it has better behaviors near edges and has high efficiency and simplicity.
  • By using the blue channel (for better contrast) of the guidance image, the filtering transfers the structures of the guidance image to the low-resolution sky matte. It produces a more detailed result with minimal computational overhead.
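
A minimal sketch of this refinement step is shown below, assuming OpenCV's guided filter from opencv-contrib-python (cv2.ximgproc.guidedFilter). The function name refine_sky_matte and the radius/eps values are illustrative assumptions, not the paper's exact settings.

```python
# Illustrative refinement step: upsample the coarse matte, then guided-filter it
# using the frame's blue channel as the guidance image.
import cv2
import numpy as np

def refine_sky_matte(coarse_matte, frame_bgr, radius=20, eps=0.01):
    """coarse_matte: float32 in [0, 1], low resolution; frame_bgr: uint8 full-res frame."""
    h, w = frame_bgr.shape[:2]
    # Upsample the coarse matte to the original input resolution.
    matte = cv2.resize(coarse_matte, (w, h), interpolation=cv2.INTER_LINEAR)
    # Use the blue channel of the frame as the guidance image (better sky contrast).
    guide = frame_bgr[:, :, 0].astype(np.float32) / 255.0
    # Guided filtering transfers the guidance image's structure to the matte.
    refined = cv2.ximgproc.guidedFilter(guide=guide, src=matte.astype(np.float32),
                                        radius=radius, eps=eps)
    return np.clip(refined, 0.0, 1.0)
```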

Motion Estimation

This component is responsible for capturing the motion of the sky. Why is that necessary? You will want the sky video captured by the "virtual camera" to be rendered and synchronized under the real camera's motion.
  • Previous methods estimate the motion of the real camera. In contrast, the proposed method assumes that the sky and in-sky objects are located at infinity and that their movement relative to the foreground is affine, and it estimates the motion of these objects directly.
  • The motion is estimated from the frames of the input video. The iterative Lucas-Kanade method with pyramids is used to compute optical flow, so a set of sparse feature points located within the sky area can be tracked frame by frame. If there are not enough such feature points, depth estimation is run on the current frame to compute the affine parameters.
  • For each pair of adjacent frames, given the two sets of 2D feature points, RANSAC-based robust affine estimation is used to compute the optimal 2D transformation.
  • A background template image is used to obtain the final sky background at frame $t$: the template is aligned by applying the computed affine parameters.
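
The sketch below illustrates this per-frame motion estimate with OpenCV: pyramidal Lucas-Kanade tracking of sparse points inside the sky region, followed by RANSAC-based affine fitting. The function name estimate_sky_motion and the parameter values are assumptions for illustration, not the authors' exact configuration.

```python
# Illustrative per-frame sky motion estimate: track sparse features inside the sky
# region with pyramidal Lucas-Kanade optical flow, then fit a 2D affine transform
# robustly with RANSAC.
import cv2
import numpy as np

def estimate_sky_motion(prev_gray, curr_gray, prev_sky_mask):
    """prev_sky_mask: uint8 binary mask of the sky region in the previous frame.
    Returns a 2x3 affine matrix mapping the previous frame's sky to the current one."""
    # Detect good features to track, restricted to the sky region.
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01,
                                       minDistance=8, mask=prev_sky_mask)
    if pts_prev is None or len(pts_prev) < 3:
        return None  # too few sky features; a fallback (e.g. depth-based) would be needed

    # Pyramidal Lucas-Kanade optical flow from the previous frame to the current one.
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts_prev, None,
                                                   winSize=(21, 21), maxLevel=3)
    good_prev = pts_prev[status.ravel() == 1]
    good_curr = pts_curr[status.ravel() == 1]
    if len(good_prev) < 3:
        return None

    # RANSAC-based robust affine estimation between the two 2D point sets.
    M, _ = cv2.estimateAffine2D(good_prev, good_curr, method=cv2.RANSAC,
                                ransacReprojThreshold=3.0)
    return M  # accumulate these per-frame transforms to align the skybox template
```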

Sky Image Blending

  • Let $I^t$, $A^t$, and $B^t$ be the video frame, the predicted sky matte, and the aligned sky template image at time $t$, and let $Y^t$ be the output frame. $Y^t$ is a linear combination of $I^t$ and $B^t$, with $A^t$ as their pixel-wise combination weights:
  • $Y^t = (1 - A^t)I^t + A^tB^t$
  • From this equation, we can see that the soft matte is simply the probabilistic prediction of the encoder-decoder network, refined by the guided filter.
  • A simple linear combination might give unrealistic results due to the different color tones and intensities of the input frame and the aligned background, so recoloring and relighting techniques are used to transfer color and intensity from the background to the foreground.
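
As a quick illustration, this is what the blending equation above looks like in NumPy (without the recoloring and relighting steps the authors apply on top of it; the function name blend_sky is a placeholder of mine).

```python
# Minimal NumPy sketch of Y^t = (1 - A^t) I^t + A^t B^t.
import numpy as np

def blend_sky(frame, sky_matte, aligned_sky):
    """frame, aligned_sky: float32 HxWx3 in [0, 1]; sky_matte: float32 HxW in [0, 1]."""
    A = sky_matte[..., None]                      # broadcast the matte over the color channels
    return (1.0 - A) * frame + A * aligned_sky    # pixel-wise linear combination
```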

Results

Now let us admire the awesomeness of this proposed technique. The authors have used the method for video augmentation (sky replacement) and weather/lighting translation. Let us look at both of them separately.
I have built the linked Colab notebook on top of the one provided by the authors, simplifying it so that you can easily augment your own videos.

Reproduce the Results in the Colab Notebook →

Video Sky Augmentation






Weather/Lighting Translation



