
An Explanation of Style Transfer With a Showdown of Different Techniques

In this article, we explore the concept of style transfer, including what it is and how it works, and we evaluate different approaches for neural style transfer.
In this article, we'll explore what style transfer is, how it works, and contrast a few different approaches to style transfer:

[Image panels: the input image alongside outputs from the gram, OT and Vincent style losses]


What Is Style Transfer?

Style transfer is practically ancient as far as AI art techniques go. Introduced in 2015 in A Neural Algorithm of Artistic Style by Gatys et al., style transfer allows you to create something with the style of one image and the 'content' of another.
Here are a few examples of a cityscape in the style of canonical paintings:
Figure 2 from the original paper.
The approach presented in this 2015 paper is still the most popular for this technique, and most tutorials on style transfer treat this as the only available solution.
However, some newer techniques have been developed in the intervening years which may outperform the original. In this article, we will look at two of these and compare their results with those produced by the 'classic' algorithm.
If you'd like to see a video explanation of some of these ideas, I cover style transfer in the following fastai lesson:


Content Loss

The first component of style transfer is having some way to capture the structure (a.k.a. "content") of an image that is somewhat independent of style. In other words, we'd like a way to measure the similarity between two images that captures what objects are where without relying on simple features like color.
How do we do this? Consider a neural network trained to classify images. Deep networks can learn hierarchical features, where early layers capture things like shapes and color gradients while later layers capture more semantic features.
We can exploit this fact to calculate a "perceptual loss," which measures the similarity of two images according to how they are "perceived" by a pre-trained network. Specifically, we'll extract the outputs from one or more internal layers and compare these between images. If the features are similar, we assume the images depict the same thing.
Extracting feature maps with a pre-trained model
Here's how we extract the features from some target layers in the code. This report assumes the use of a pre-trained network called VGG16 – an older architecture that turns out to be well-suited to style transfer applications.
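The snippets in this report assume a few names have already been set up: a frozen, pre-trained `vgg16` feature extractor, an ImageNet `normalize` function, a `def_device` string, and fastcore's `L` list class. Roughly, that setup looks something like the sketch below (the exact notebook code may differ):

import torch
import torch.nn.functional as F
from torchvision.models import vgg16 as make_vgg16, VGG16_Weights
from fastcore.foundation import L

def_device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Convolutional layers of a pre-trained VGG16, frozen and in eval mode
vgg16 = make_vgg16(weights=VGG16_Weights.DEFAULT).features.to(def_device).eval()
for p in vgg16.parameters(): p.requires_grad_(False)

# ImageNet normalization, applied before feeding images to VGG
imagenet_mean = torch.tensor([0.485, 0.456, 0.406], device=def_device)[:, None, None]
imagenet_std  = torch.tensor([0.229, 0.224, 0.225], device=def_device)[:, None, None]
def normalize(imgs): return (imgs - imagenet_mean) / imagenet_std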
def calc_features(imgs, target_layers=(18, 25)):
    x = normalize(imgs)
    feats = []
    for i, layer in enumerate(vgg16[:max(target_layers)+1]):
        x = layer(x)  # run the (normalized) image through each layer in turn
        if i in target_layers:
            feats.append(x.clone())  # keep a copy of the activations at the target layers
    return feats
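To turn these features into a content loss, we compare them between the image we're optimizing and the content image, typically with a mean squared error. A minimal sketch of what that might look like (the class name and structure here are illustrative, mirroring the style loss classes later in this report):

class ContentLossToTarget():
    def __init__(self, target_im, target_layers=(18, 25)):
        self.target_layers = target_layers
        # Pre-compute the content image's features once, without tracking gradients
        with torch.no_grad(): self.target_features = calc_features(target_im, target_layers)

    def __call__(self, input_im):
        input_features = calc_features(input_im, self.target_layers)
        return sum(F.mse_loss(x, y) for x, y in
                   zip(input_features, self.target_features))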
By targeting different layers, we can control what kinds of features we're optimizing for. The notebook below shows some examples of images optimized for similarity at different layers:


Style Loss #1: Gram Matrices (2015)

We mentioned above that earlier layers in the network tend to capture features such as color and texture. Perhaps we can use the same approach we applied to content loss as our "style loss" just by focusing on earlier layers?
No such luck: unfortunately, the feature maps from the early layers don't only show which features are present; they also record where in the image those features are. Enter the gram matrix, a way to remove the spatial component and focus only on what kinds of features are present.
Calculating the gram matrix from a small feature map (simplified)
We can calculate gram matrices for each target layer in the network using some einsum magic:
def calc_grams(img, target_layers=(1, 6, 11, 18, 25)):
    # For each target layer: flatten the spatial dims and take channel-by-channel
    # dot products, normalized by the number of spatial locations (a C x C matrix)
    return L(torch.einsum('chw, dhw -> cd', x, x) / (x.shape[-2]*x.shape[-1])
             for x in calc_features(img, target_layers))
This is a measure of what stylistic features are present and how correlated they are (i.e., which features tend to occur together) without any spatial component. So we can now compare the gram matrices of two images to measure how stylistically similar they are and use the difference (typically the mean squared error) as our loss function when optimizing our output image.
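In code, a gram-based style loss might look something like the following sketch (the class name and details are illustrative; the actual notebook implementation may differ):

class GramStyleLossToTarget():
    def __init__(self, target_im, target_layers=(1, 6, 11, 18, 25)):
        self.target_layers = target_layers
        # Gram matrices of the style image, computed once up front
        with torch.no_grad(): self.target_grams = calc_grams(target_im, target_layers)

    def __call__(self, input_im):
        input_grams = calc_grams(input_im, self.target_layers)
        return sum(F.mse_loss(x, y) for x, y in
                   zip(input_grams, self.target_grams))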
Style Transfer with the normal Gram Matrix approach
If we combine both a style loss and a content loss, we will get an output that matches the overall layout of the content image while incorporating the same stylistic features as the style image. By varying which layers we focus on and scaling the different components of the loss function, we can get a number of different effects.
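For concreteness, a bare-bones optimization loop combining the two losses might look like this (the loss weight, learning rate and step count are illustrative placeholders, not tuned values from the experiments below; `content_im` and `style_im` are image tensors on `def_device`):

content_loss = ContentLossToTarget(content_im)
style_loss = GramStyleLossToTarget(style_im)

im = content_im.clone().requires_grad_()          # start from the content image and optimize its pixels
opt = torch.optim.Adam([im], lr=0.05)
for step in range(300):
    opt.zero_grad()
    loss = content_loss(im) + 0.1 * style_loss(im)  # scale the two components to taste
    loss.backward()
    opt.step()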

Style Loss #2: Optimal Transport

The inspiration for this version comes from the following excellent video, which also explains the loss function better than I can in text. Give it a watch before moving on to the code!



I modified the code shown in the video to fit with the FastAI notebook. It can be hard to tell what is going on at first glance. We're projecting the feature distributions onto different random lines, then computing an optimal transport-based loss for each by sorting the projected points and finding the MSE between corresponding points. See the video above for a good explanation.
class OTStyleLossToTarget():
    def __init__(self, target_im, target_layers=(1, 6, 11, 18, 25), size=256):
        self.target_layers = target_layers
        with torch.no_grad(): self.target_features = calc_features(target_im, target_layers)

    def project_sort(self, x, proj):
        return torch.einsum('cn,cp->pn', x, proj).sort()[0]

    def ot_loss(self, source, target, proj_n=32):
        # Get shapes
        (c, h, w), (ct, ht, wt) = source.shape, target.shape
        # 32 random projections
        projs = F.normalize(torch.randn(c, proj_n).to(def_device), dim=0)
        # Project the source & target. source_proj.shape = (32, h*w)
        source_proj = self.project_sort(source.reshape(c, h*w), projs)
        target_proj = self.project_sort(target.reshape(c, ht*wt), projs)
        # Make it so target shape matches source even if the images are different size
        target_interp = F.interpolate(target_proj[None], h*w, mode='nearest')[0]
        # Sum of the squared errors
        return (source_proj-target_interp).square().sum()

    def __call__(self, input_im):
        input_features = calc_features(input_im, self.target_layers)
        return sum(self.ot_loss(x, y) for x, y in
                   zip(input_features, self.target_features))
Using this loss instead of the gram-matrix-based approach gives a slightly different result, which I quite like:
Style Transfer with an Optimal Transport based loss
Be sure to check out the table at the end for more comparisons.

Style Loss #3: "Vincent's Loss"

The final loss function here was invented by a student in the FastAI course a number of years ago. The original code is available here in TensorFlow, and I've translated it into PyTorch in the accompanying notebook. Vince also made a notebook explaining the downsides to the gram-based approach here.
Style transfer using this Wasserstein Distance approach
His method assumes that the feature distributions can be approximated as Gaussians, which are fully described by their means and covariances. This allows us to compare two images by taking the L2 Wasserstein distance (a.k.a. Earth Mover's Distance, a way to quantify the distance between two distributions) between these Gaussian approximations of their feature distributions.
It's a lot of scary terms, but in essence this is yet another way to describe what kinds of features are present without recording where they occur spatially, and to compare these 'summary statistics' between images to give us a final loss value.
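For the curious, the Wasserstein-2 distance between two Gaussians has a closed form that depends only on their means and covariances, so a loss along these lines can be sketched roughly as below. This is my own illustrative re-derivation for (C, H, W) feature maps, not Vincent's exact code:

def gaussian_w2_loss(source_feats, target_feats):
    # Approximate each (C, H, W) feature map as a Gaussian over its C channels
    def stats(x):
        flat = x.reshape(x.shape[0], -1)              # channels x pixels
        mu = flat.mean(dim=1)
        centred = flat - mu[:, None]
        cov = centred @ centred.T / flat.shape[1]     # channel covariance matrix
        return mu, cov

    def sqrtm(m):
        # Matrix square root of a symmetric PSD matrix via eigendecomposition
        vals, vecs = torch.linalg.eigh(m)
        return vecs @ torch.diag(vals.clamp(min=0).sqrt()) @ vecs.T

    mu_s, cov_s = stats(source_feats)
    mu_t, cov_t = stats(target_feats)
    root_t = sqrtm(cov_t)
    # W2^2 = ||mu_s - mu_t||^2 + Tr(cov_s + cov_t - 2*sqrt(sqrt(cov_t) cov_s sqrt(cov_t)))
    cross = sqrtm(root_t @ cov_s @ root_t)
    return (mu_s - mu_t).square().sum() + torch.trace(cov_s + cov_t - 2 * cross)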

Comparing Results

I've logged the results for a number of different content-style pairs with different style losses and settings (use the arrows to explore the different results and check them out full-screen):


Different parameters (image size, learning rate, target layers, etc.) can lead to drastically different outcomes, whichever style loss is being used. Still, in general, the newer techniques do tend to look better to me. Browse through and let me know which you like best!

Applications

In addition to regular style transfer, these kinds of loss functions can be useful for
  • Guiding diffusion models - for example, see my recent report on 'Mid-U guidance' and think about how the UNet itself could act like the pre-trained network we've been using as a feature extractor in this report
  • Training 'Texture Neural Cellular Automata', which I covered in detail here
  • As additional loss functions when training generative models. For example, with diffusion models, you can train on a perceptual loss against the denoised image rather than just using the MSE on the noise prediction like most papers do. This often helps! (A rough sketch follows this list.)
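As an illustration of that last point, here is how a perceptual term could be bolted onto a standard noise-prediction objective. This is a hypothetical sketch using the usual DDPM conventions (the function name, arguments and weighting are assumptions for illustration, not code from a specific codebase), reusing calc_features from earlier:

def diffusion_loss_with_perceptual(model, noisy, noise, clean, t, alphas_cumprod, w=0.05):
    # Standard noise-prediction MSE
    noise_pred = model(noisy, t)
    mse = F.mse_loss(noise_pred, noise)
    # Estimate of the denoised image implied by the noise prediction (DDPM formula)
    alpha_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x0_pred = (noisy - (1 - alpha_bar).sqrt() * noise_pred) / alpha_bar.sqrt()
    # Perceptual term comparing VGG features of the estimate and the clean image
    perceptual = sum(F.mse_loss(x, y) for x, y in
                     zip(calc_features(x0_pred), calc_features(clean)))
    return mse + w * perceptual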


Ethics

Before we end this article, it is important to highlight the potential ethical implications of something that can remix the content or style of an image. Please don't use photographs without knowing the permissions granted by the creator, and likewise, be sensitive before "stealing" the style from your favorite artist. Instead, consider finding interesting natural textures or source material that is in the public domain to work with.

Conclusions

In this article, we've seen how pre-trained neural networks and some clever maths allow us to capture rich measures of the 'style' of an image, which we can put to use creating all sorts of beautiful images.
If you'd like to try out these algorithms yourself, check out the notebook to test the different loss functions on your own pictures. Have fun, and please let me know what you make :)