Comparative Analysis of Virtual Try-On Methods
An exploration of different techniques transforming the online shopping experience
Table of Contents
- Introduction
- Background
  - Clothing Agnostic Representation
  - TPS Transformation
- CP-VTON
  - Geometric Matching Module
  - Try-On Module
- ACGPN
  - Semantic Generation Module
  - Clothes Warping Module
  - Content Fusion Module
- M3D-VTON
  - Monocular Prediction Module
  - Depth Refinement Module
  - Texture Fusion Module
- Qualitative Comparison
Introduction
Virtual try-on is a machine learning task offering subjects the ability to 'try on' items of their choice, such as clothing. Doing so creates a virtual trial room — a useful feature, particularly for retailers.
Historically, the virtual try-on task was approached with methods from computational physics. However, these methods were difficult to scale: the high cost of the required hardware and of collecting 3D-annotated data inhibited their large-scale deployment.
Several new virtual try-on methods were proposed to address those limitations. We'll be comparing three popular techniques: CP-VTON, ACGPN, and M3D-VTON.
Background
The concept of machine-learning-based virtual try-on can be traced back to VITON, which introduced two central ideas that have carried over into almost every subsequent virtual try-on method: the Clothing Agnostic Representation and the TPS Transformation.
Clothing Agnostic Representation
One of the main technical challenges of the virtual try-on task is collecting a dataset of triples $(I, c, I_t)$, where $I$ is the image of a person, $c$ is the image of the desired clothing item, and $I_t$ is the target image of the same person wearing the clothing item $c$.
To overcome this, the authors of VITON proposed a "Clothing Agnostic Representation" where the input image is transformed into a 22-channel representation consisting of the following components:
- Pose Heatmap: an 18-channel feature map with each channel corresponding to one human pose keypoint, drawn as an 11x11 rectangle.
- Body Shape: a 1-channel feature map of a blurred binary mask that roughly covers different parts of the human body, such as arms and legs, in order to retain the body shape of the person.
- Reserved Regions: the representation also includes an RGB image containing identifying information about the person that does not need to be changed, such as the face and hair.
Using this representation, one can form a training triple $(p, c, I)$, where $p$ is the clothing agnostic representation of the person image $I$ and $c$ is the clothing item the person is already wearing in $I$. Since all the information regarding the clothes is processed out of $p$, the system is "agnostic" to the clothing and learns to generate clothing for a person in any pose without requiring a new target image to train over.

A Clothing Agnostic Representation for an image consists of identifying information like body shape and pose keypoints.
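To make the representation concrete, here is a minimal NumPy sketch of how the 22 channels might be assembled. The keypoint format, blur radius, and function name are illustrative assumptions, not the authors' preprocessing code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def build_agnostic_representation(pose_keypoints, body_mask, reserved_rgb):
    """Assemble a 22-channel clothing-agnostic representation (a sketch).

    pose_keypoints: list of 18 (x, y) coordinates (or None if a keypoint is undetected).
    body_mask:      (H, W) binary mask roughly covering the body.
    reserved_rgb:   (H, W, 3) image of the regions to keep (face, hair).
    """
    h, w = body_mask.shape

    # 18-channel pose heatmap: one 11x11 rectangle per keypoint.
    pose_map = np.zeros((18, h, w), dtype=np.float32)
    for i, kp in enumerate(pose_keypoints):
        if kp is None:
            continue
        x, y = int(kp[0]), int(kp[1])
        y0, y1 = max(0, y - 5), min(h, y + 6)
        x0, x1 = max(0, x - 5), min(w, x + 6)
        pose_map[i, y0:y1, x0:x1] = 1.0

    # 1-channel body shape: a blurred binary mask, so the network only sees a
    # rough silhouette rather than exact clothing boundaries.
    body_shape = gaussian_filter(body_mask.astype(np.float32), sigma=4)[None]

    # 3-channel reserved regions (face, hair) in RGB, scaled to [0, 1].
    reserved = reserved_rgb.astype(np.float32).transpose(2, 0, 1) / 255.0

    # Concatenate into the final 22-channel representation.
    return np.concatenate([pose_map, body_shape, reserved], axis=0)
```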
TPS Transformation
A Thin Plate Spline (TPS) is the 2D equivalent of a cubic spline in one dimension.
Given a set of data points, a weighted combination of TPS basis functions centered around each data point gives the interpolation function that passes through the points exactly while minimizing the "bending energy", defined as the integral over $\mathbb{R}^2$:

$$I_f = \iint_{\mathbb{R}^2} \left( \frac{\partial^2 f}{\partial x^2} \right)^2 + 2\left( \frac{\partial^2 f}{\partial x \, \partial y} \right)^2 + \left( \frac{\partial^2 f}{\partial y^2} \right)^2 \, dx \, dy$$
The parameters of this transformation for a given pose are learned, and the transformation is then used to warp the clothing into the desired shape before it is superimposed on the target person.
Since a TPS minimizes bending energy, it mimics how a piece of clothing would deform in the real world while remaining completely differentiable, which allows for backpropagation.

A TPS Transformation is used to warp the clothing to fit a target's shape.
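For intuition, the sketch below applies a TPS warp to a clothing image given matched control points, using SciPy's thin-plate-spline interpolator. In the models discussed here the TPS parameters are predicted by a network and the warp is applied as a differentiable grid sample, so this NumPy/SciPy version is only an illustration of the transformation itself.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.ndimage import map_coordinates

def tps_warp(cloth, src_pts, dst_pts):
    """Warp `cloth` (H, W, C) so that points at dst_pts map back to src_pts.

    src_pts, dst_pts: (N, 2) arrays of matching (row, col) control points.
    The TPS is fit as an inverse mapping (target -> source) so that every
    output pixel can look up where it came from in the original image.
    """
    src_pts = np.asarray(src_pts, dtype=np.float64)
    dst_pts = np.asarray(dst_pts, dtype=np.float64)
    h, w = cloth.shape[:2]

    # Thin-plate-spline interpolation of the inverse coordinate mapping.
    tps = RBFInterpolator(dst_pts, src_pts, kernel="thin_plate_spline")

    # Evaluate the mapping on the full output grid.
    rows, cols = np.mgrid[0:h, 0:w]
    grid = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(np.float64)
    src_coords = tps(grid)  # (H*W, 2) source coordinates for each output pixel

    # Sample each channel of the clothing image at the mapped coordinates.
    warped = np.stack(
        [
            map_coordinates(cloth[..., c], src_coords.T, order=1, mode="nearest").reshape(h, w)
            for c in range(cloth.shape[2])
        ],
        axis=-1,
    )
    return warped
```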
CP-VTON
While VITON was able to produce a realistic representation of how clothes would fit on a user, it was unable to retain the defining characteristics of the clothes due to the coarse-to-fine structure of the model.
The authors of CP-VTON aimed to improve this by introducing a new Geometric Matching Module (GMM) and Try-On Module (TOM).
Geometric Matching Module
The loss for the Geometric Matching Module is calculated as:

$$\mathcal{L}_{GMM}(\theta) = \left\| T_{\theta}(c) - c_t \right\|_1$$

where $T_{\theta}$ is a TPS transformation with parameters $\theta$, $c$ is the in-shop clothing image, and $c_t$ is the target shape (the clothing as it appears worn on the person).
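In code, this objective boils down to warping the in-shop clothing with the predicted sampling grid and taking an L1 distance to the clothing worn on the person. The sketch below assumes a PyTorch setup with hypothetical tensor names; it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def gmm_step(cloth, tps_grid, target_cloth):
    """One GMM training objective (a sketch).

    cloth:        (B, 3, H, W) in-shop clothing image
    tps_grid:     (B, H, W, 2) sampling grid derived from the predicted TPS
                  parameters, in normalized [-1, 1] coordinates
    target_cloth: (B, 3, H, W) clothing region extracted from the person image
    """
    # Differentiably warp the clothing with the predicted grid.
    warped_cloth = F.grid_sample(cloth, tps_grid, align_corners=True)
    # L1 distance between the warped clothing and the target shape.
    return F.l1_loss(warped_cloth, target_cloth)
```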
The training metrics for this can be seen below:
Try-On Module
The loss for the Try-On Module is calculated as:

$$\mathcal{L}_{TOM} = \lambda_{L1} \left\| I_o - I_t \right\|_1 + \lambda_{VGG} \, \mathcal{L}_{VGG}(I_o, I_t) + \lambda_{mask} \left\| 1 - M \right\|_1$$

with the perceptual term

$$\mathcal{L}_{VGG}(I_o, I_t) = \sum_{i} \lambda_i \left\| \phi_i(I_o) - \phi_i(I_t) \right\|_1$$

where $I_o$ is the generated try-on result, $I_t$ is the ground-truth image, $M$ is the predicted composition mask, and $\phi_i$ is the feature map of the i-th layer in the visual perception network $\phi$, a VGG network trained on ImageNet.
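A perceptual loss of this kind can be sketched in PyTorch as below. The chosen VGG layers and per-layer weights are illustrative assumptions, not CP-VTON's exact configuration, and the inputs are assumed to be already ImageNet-normalized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class VGGPerceptualLoss(nn.Module):
    """L1 distance between VGG feature maps of two images (a sketch)."""

    def __init__(self, layer_ids=(3, 8, 17, 26, 35), weights=(1/32, 1/16, 1/8, 1/4, 1.0)):
        super().__init__()
        # Frozen VGG19 feature extractor pre-trained on ImageNet.
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layer_ids = set(layer_ids)
        self.weights = dict(zip(layer_ids, weights))

    def forward(self, x, y):
        loss = 0.0
        # Run both images through VGG and accumulate weighted L1 feature distances.
        for idx, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if idx in self.layer_ids:
                loss = loss + self.weights[idx] * F.l1_loss(x, y)
        return loss
```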
The training metrics for this can be seen below:
ACGPN
The goal of ACGPN is to preserve image content that VITON and CP-VTON tend to lose, such as hands and arms, which can occlude the clothing in certain poses.
It consists of three modules: Semantic Generation Module, Clothes Warping Module, and Content Fusion Module.
Semantic Generation Module
The Semantic Generation Module separates the target clothing region and preserves the person's body parts. The loss of the module is computed as:

$$\mathcal{L}_{SGM} = \mathcal{L}_{cGAN} + \lambda \, \mathcal{L}_{CE}$$

where $\mathcal{L}_{cGAN}$ is the conditional GAN loss and $\mathcal{L}_{CE}$ is the pixel-wise cross-entropy over the predicted semantic layout.
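A rough PyTorch sketch of such a combined objective is shown below; the weighting `lambda_ce` and the non-saturating form of the GAN term are assumptions for illustration, not ACGPN's exact formulation.

```python
import torch
import torch.nn.functional as F

def sgm_generator_loss(disc_fake_logits, pred_layout_logits, target_layout, lambda_ce=10.0):
    """Generator-side objective: conditional GAN loss + pixel-wise cross-entropy.

    disc_fake_logits:   discriminator output on the generated (condition, layout) pair
    pred_layout_logits: (B, num_classes, H, W) predicted semantic layout
    target_layout:      (B, H, W) ground-truth class indices
    lambda_ce:          weighting between the two terms (hypothetical value)
    """
    # Adversarial term: push the discriminator to label the generated layout as real.
    adv_loss = F.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits)
    )
    # Pixel-wise cross-entropy against the ground-truth human parsing.
    ce_loss = F.cross_entropy(pred_layout_logits, target_layout)
    return adv_loss + lambda_ce * ce_loss
```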
Clothes Warping Module
The Clothes Warping Module warps the clothing image according to the generated semantic layout, using a second-order difference constraint to stabilize the TPS warp during training. The loss for this module is computed as:

$$\mathcal{L}_{CWM} = \left\| T_{\theta}(c) - c_t \right\|_1 + \lambda \, \mathcal{L}_{const}$$

where $T_{\theta}(c)$ is the warped clothing, $c_t$ is the clothing as worn on the target person, and $\mathcal{L}_{const}$ is the second-order difference constraint on the TPS control points.
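The idea behind the constraint can be sketched as a penalty on second-order differences of the predicted TPS control points; ACGPN's full formulation is more elaborate, so treat this as an approximation.

```python
import torch

def second_order_constraint(control_points: torch.Tensor) -> torch.Tensor:
    """Second-order difference penalty on a grid of TPS control points (a sketch).

    control_points: (B, H, W, 2) predicted control-point coordinates.
    Penalizing 2*p_i - p_{i-1} - p_{i+1} along both grid axes discourages
    abrupt, self-crossing deformations and keeps the warp locally smooth.
    """
    # Second-order differences along grid rows (horizontal neighbours).
    dx = control_points[:, :, 2:, :] - 2 * control_points[:, :, 1:-1, :] + control_points[:, :, :-2, :]
    # Second-order differences along grid columns (vertical neighbours).
    dy = control_points[:, 2:, :, :] - 2 * control_points[:, 1:-1, :, :] + control_points[:, :-2, :, :]
    return dx.abs().mean() + dy.abs().mean()
```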
Content Fusion Module
The Content Fusion Module works to retain the fine-scale details of the target human and generate the final try-on image.
The results for training with ACGPN are visualized below:
M3D-VTON
M3D-VTON takes a novel approach: it generates a 3D try-on result from 2D input. The authors state that their method aims to combine the benefits of 2D and 3D virtual try-on, since a 3D result is more realistic and gives a better sense of the product, while 2D data is easier to capture and process on the consumer end.
The 3D training data was synthetically produced using PIFu-HD.
M3D-VTON consists of three modules: Monocular Prediction Module, Depth Refinement Module, and Texture Fusion Module.
Monocular Prediction Module
The Monocular Prediction Module plays a preparatory role in the model. It provides constructive guidance for the other two modules by warping the clothing, predicting a person segmentation, and estimating a base 3D shape using a multi-target network.
It consists of three branches:
- Clothing Warping Branch: this branch learns the parameters of the TPS warp and aligns the clothes to the target person. Its loss is computed as an L1 loss between the warped clothing and the clothing worn by the target person:

$$\mathcal{L}_{warp} = \left\| \hat{c} - c_t \right\|_1$$
- Conditional Segmentation Estimation Branch: this branch predicts a person segmentation for the rest of the model to follow and is optimized with a pixel-wise cross-entropy loss.
- Depth Estimation Branch: this branch generates a base 3D shape of the target person. The 3D shape is represented in a "double depth" form, i.e., a front and a back depth map corresponding to the respective sides of the 3D human representation. The loss for this branch is computed as:

$$\mathcal{L}_{depth} = \left\| \hat{d}^{f} - d^{f} \right\|_1 + \left\| \hat{d}^{b} - d^{b} \right\|_1$$

where $\hat{d}^{f}$, $\hat{d}^{b}$ are the predicted front and back depth maps and $d^{f}$, $d^{b}$ are the corresponding ground-truth maps.
The module combines these three losses into a weighted sum, as shown in the sketch below:

$$\mathcal{L}_{MPM} = \lambda_{warp} \mathcal{L}_{warp} + \lambda_{seg} \mathcal{L}_{seg} + \lambda_{depth} \mathcal{L}_{depth}$$
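Putting the three branches together, a simplified PyTorch version of the combined MPM objective might look like the following; the weights and tensor names are assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F

def mpm_loss(warped_cloth, target_cloth,
             seg_logits, target_seg,
             pred_front_depth, pred_back_depth,
             gt_front_depth, gt_back_depth,
             w_warp=1.0, w_seg=1.0, w_depth=1.0):
    """Weighted sum of the three MPM branch losses (weights are illustrative)."""
    # Clothing warping branch: L1 between warped and target clothing.
    warp_loss = F.l1_loss(warped_cloth, target_cloth)
    # Conditional segmentation branch: pixel-wise cross-entropy.
    seg_loss = F.cross_entropy(seg_logits, target_seg)
    # Depth estimation branch: L1 on the "double depth" (front + back) maps.
    depth_loss = F.l1_loss(pred_front_depth, gt_front_depth) + F.l1_loss(pred_back_depth, gt_back_depth)
    return w_warp * warp_loss + w_seg * seg_loss + w_depth * depth_loss
```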
The training results for this module are visualized below:
Depth Refinement Module
The depth map from the MPM fails to capture fine geometric details, such as those of the clothing and the face. This is largely because the L1 loss used in the MPM mainly penalizes low-frequency differences in the depth maps. To compensate for this, the authors of M3D-VTON use a Depth Refinement Module that aims to add high-frequency details to the depth map. Its loss is calculated as:

$$\mathcal{L}_{DRM} = \mathcal{L}_{depth} + \lambda_{grad} \, \mathcal{L}_{grad}$$

where

$$\mathcal{L}_{grad} = \sum_{i} \left\| S(\hat{d})_i - S(d)_i \right\|_1$$

where $\left\| \cdot \right\|_1$ denotes the L1 loss at the i-th depth point, and $S(\cdot)$ denotes the Sobel edge detector.
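A Sobel-based gradient term of this kind can be sketched in PyTorch as below; the kernel handling and weighting are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

# Fixed 3x3 Sobel kernels for horizontal and vertical gradients.
_SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
_SOBEL_Y = _SOBEL_X.transpose(2, 3)

def depth_gradient_loss(pred_depth: torch.Tensor, gt_depth: torch.Tensor) -> torch.Tensor:
    """L1 loss between Sobel gradients of predicted and ground-truth depth maps.

    pred_depth, gt_depth: (B, 1, H, W). Matching gradients rather than raw values
    pushes the network to reproduce high-frequency structure (wrinkles, face edges).
    """
    kx = _SOBEL_X.to(pred_depth)
    ky = _SOBEL_Y.to(pred_depth)
    grad = lambda d: torch.cat([F.conv2d(d, kx, padding=1), F.conv2d(d, ky, padding=1)], dim=1)
    return F.l1_loss(grad(pred_depth), grad(gt_depth))
```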
The training results for this module are as follows:
Texture Fusion Module
Finally, the Texture Fusion Module is responsible for synthesizing a photo-realistic body texture for the 3D mesh generated by the DRM, which it achieves by fusing the warped clothing with the original image of the person.
Specifically, given the unchanged person part $I_p$, the warped clothing $\hat{c}$, the segmentation mask $S$, and the frontal body depth map $d^{f}$, the TFM combines the 2D information provided by $I_p$, $\hat{c}$, and $S$ with the depth information from $d^{f}$ to generate a coarse try-on result $I_{coarse}$ and a fusion mask $M$.
The final try-on result $I_o$ is then generated by compositing the warped clothing onto the coarse result with the fusion mask:

$$I_o = M \odot \hat{c} + (1 - M) \odot I_{coarse}$$
The TFM is trained using a perceptual loss between the try-on result $I_o$ and the ground truth $I_t$, the L1 loss between $I_o$ and $I_t$, as well as the L1 loss between the estimated fusion mask $M$ and the real clothing mask $M_c$:

$$\mathcal{L}_{TFM} = \lambda_{perc} \, \mathcal{L}_{perc}(I_o, I_t) + \lambda_{L1} \left\| I_o - I_t \right\|_1 + \lambda_{mask} \left\| M - M_c \right\|_1$$
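The compositing step and the combined TFM objective can be sketched as follows; the loss weights and the `perceptual_fn` callable are placeholders, not the authors' code.

```python
import torch
import torch.nn.functional as F

def tfm_compose(coarse_result, warped_cloth, fusion_mask):
    """Fuse the coarse try-on result with the warped clothing via the predicted mask."""
    return fusion_mask * warped_cloth + (1.0 - fusion_mask) * coarse_result

def tfm_loss(final_result, target_image, fusion_mask, cloth_mask,
             perceptual_fn, w_perc=1.0, w_l1=1.0, w_mask=1.0):
    """Perceptual + L1 + mask-L1 objective (weights illustrative).

    perceptual_fn: any callable returning a perceptual distance, e.g. the
    VGG-based loss sketched earlier in this report.
    """
    perc = perceptual_fn(final_result, target_image)
    l1 = F.l1_loss(final_result, target_image)
    mask_l1 = F.l1_loss(fusion_mask, cloth_mask)
    return w_perc * perc + w_l1 * l1 + w_mask * mask_l1
```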
The results for this training step are as follows:
Qualitative Comparison
As you can tell, each virtual try-on method has its own strengths and trade-offs. Using Weights & Biases, it's easy to compare models qualitatively by logging their results into Tables.
All the 3D models in the last column can be expanded and inspected by going full screen.
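For reference, a table like the one above can be produced with a few lines of the wandb API; the `results` iterable and project name below are placeholders for your own runs.

```python
import wandb

# Log one row per test example so the methods can be compared side by side.
run = wandb.init(project="virtual-try-on-comparison")

table = wandb.Table(columns=["person", "clothing", "CP-VTON", "ACGPN", "M3D-VTON"])
for person_img, cloth_img, cpvton_img, acgpn_img, m3d_img in results:  # your own outputs
    table.add_data(
        wandb.Image(person_img),
        wandb.Image(cloth_img),
        wandb.Image(cpvton_img),
        wandb.Image(acgpn_img),
        wandb.Image(m3d_img),
    )
    # 3D outputs (e.g. M3D-VTON meshes) can instead be logged with wandb.Object3D.

run.log({"qualitative_comparison": table})
run.finish()
```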