Monocular to 3D Virtual Try-On
An examination of a new technique that lets shoppers try on clothes in 3D, plus insights into the training process with W&B
Created on July 27|Last edited on September 9
Table of Contents
- Introduction
- Understanding M3D-VTON
  - Monocular Prediction Module (MTM)
  - Depth Refinement Module (DRM)
  - Texture Fusion Module (TFM)
- Point Cloud Generation and Rendering
- Bringing it All Together
Introduction
Though the name is a bit cumbersome, "3D Virtual Try-On" sounds like what you'd expect: a way to virtually try on clothes outside an actual brick-and-mortar location.
Let's take a look at the paper M3D-VTON: A Monocular-to-3D Virtual Try-On Network (M3D-VTON) to gain a better understanding. As the name suggests, this work provides a state-of-the-art algorithm for generating a 3D model of a person trying on a garment, directly from two individual monocular images: one of the person and one of the item of clothing.
In this report, we'll go through a detailed explanation of the various components of the pipeline and use tools from the W&B ecosystem to provide interactive and insightful telemetry of the training process.
Understanding M3D-VTON
At the base level, the M3D-VTON pipeline consists of three parts: the Monocular Prediction Module (MPM, also known as MTM), the Depth Refinement Module (DRM), and the Texture Fusion Module (TFM).

Complete M3D-VTON pipeline
Monocular Prediction Module (MTM)
The MTM module serves as a preparatory stage, responsible for garment alignment, segmentation map prediction, and depth map prediction, all of which serve as guidance for the other two modules down the line.
These tasks are accomplished by the three branches of the MTM module, namely:
- cloth warping branch
- segmentation prediction branch
- depth estimation branch
The MTM module performs all of these tasks using features extracted from the clothing image $C$ and the cloth-agnostic person representation $A$.
The cloth-agnostic person representation $A$ is a 29-channel feature map consisting of a 25-channel pose map (obtained from OpenPose), a 3-channel parsed human representation (obtained by applying JPPNet to the person image $I$), and a 1-channel coarse person mask, i.e., a binary segmentation map of the person obtained by thresholding.
💡
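To make that input format concrete, here is a minimal sketch of how such a 29-channel representation could be assembled. The array names, the zero-filled placeholders, and the 512×320 example resolution are illustrative assumptions rather than the paper's exact preprocessing code.

```python
import numpy as np

# Hypothetical pre-computed inputs (channel-first; shapes and resolution
# are purely illustrative)
H, W = 512, 320
pose_map = np.zeros((25, H, W), dtype=np.float32)     # OpenPose keypoint heatmaps
parse_map = np.zeros((3, H, W), dtype=np.float32)     # JPPNet human-parsing channels
person_mask = np.zeros((1, H, W), dtype=np.float32)   # coarse binary person mask

# Cloth-agnostic person representation: 25 + 3 + 1 = 29 channels
person_repr = np.concatenate([pose_map, parse_map, person_mask], axis=0)
assert person_repr.shape == (29, H, W)
```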
Cloth Warping Branch
Much like other virtual try-on algorithms, this branch performs a texture-preserving alignment of the target cloth with the person using a geometric matching network.
This is achieved using a Thin Plate Spline (TPS) transformation whose parameters are estimated by correlating cloth features and person features extracted by two separate encoders. The paper also proposes a novel self-adaptive pre-alignment step to aid parameter estimation: before the TPS warp, the clothing image $C$ is transformed to have the same position and dimensions as the arm-torso region of the person image $I$ (obtained from the person segmentation map).
The pre-alignment is an affine transform on the clothing image: each clothing pixel coordinate $p$ is mapped to

$$p' = \rho\,(p - c_{C}) + c_{AT}$$

...where $c_{AT}$ and $c_{C}$ represent the centers of the arm-torso region and of the clothing, respectively, and $\rho$ is a rescaling factor computed from the heights and widths of the arm-torso region and the clothing. Intuitively, this operation centers the clothing on the arm-torso region and resizes it to the same size, making it easier for the TPS to compare the two. Finally, the TPS parameters are learned by minimizing the following loss:

$$\mathcal{L}_{warp} = \left\lVert \hat{C} - I_{C} \right\rVert_1$$

...where $\hat{C}$ is the predicted warped cloth obtained by applying the TPS to the pre-aligned clothing, and $I_{C}$ is the clothing that's already on the person (found by applying the arm-torso mask to the person image $I$).
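Below is a rough sketch of what the self-adaptive pre-alignment could look like in code. The function name and the choice of rescaling factor (here, the square root of the bounding-box area ratio) are illustrative assumptions rather than the paper's exact formulation.

```python
import cv2
import numpy as np

def prealign_cloth(cloth_rgb, cloth_mask, arm_torso_mask):
    """Center the in-shop clothing on the person's arm-torso region and
    rescale it so the two regions roughly match in size."""
    ys_c, xs_c = np.nonzero(cloth_mask)
    ys_p, xs_p = np.nonzero(arm_torso_mask)

    # Centers of the clothing region and of the arm-torso region
    c_cloth = np.array([xs_c.mean(), ys_c.mean()])
    c_person = np.array([xs_p.mean(), ys_p.mean()])

    # Rescaling factor from the two bounding-box sizes (assumed form)
    h_c, w_c = ys_c.ptp() + 1, xs_c.ptp() + 1
    h_p, w_p = ys_p.ptp() + 1, xs_p.ptp() + 1
    rho = np.sqrt((h_p * w_p) / (h_c * w_c))

    # Affine map p' = rho * (p - c_cloth) + c_person, expressed as a 2x3 matrix
    M = np.array([[rho, 0.0, c_person[0] - rho * c_cloth[0]],
                  [0.0, rho, c_person[1] - rho * c_cloth[1]]], dtype=np.float32)
    h, w = cloth_rgb.shape[:2]
    return cv2.warpAffine(cloth_rgb, M, (w, h), borderValue=(255, 255, 255))
```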
Segmentation Branch
The segmentation map provides guidance to the Texture Fusion Module (TFM) while it inpaints the target cloth onto the person.
This is important since it helps avoid clothing-skin penetration. The segmentation branch produces the human segmentation result with a dedicated segmentation decoder, which is trained by comparing the predicted and ground-truth segmentation maps using a pixel-level cross-entropy loss $\mathcal{L}_{seg}$.
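As a quick illustration, the pixel-level cross-entropy term maps directly onto PyTorch's built-in loss; the batch size, image resolution, and number of parsing classes below are placeholders.

```python
import torch
import torch.nn.functional as F

# Placeholder shapes: batch of 4, 7 parsing classes (an assumption), 512x320 images
seg_logits = torch.randn(4, 7, 512, 320)            # raw output of the segmentation decoder
seg_target = torch.randint(0, 7, (4, 512, 320))     # ground-truth class index per pixel

# Pixel-level cross-entropy used to train the segmentation branch
loss_seg = F.cross_entropy(seg_logits, seg_target)
```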
Depth Estimation Branch
This branch is responsible for predicting the depth map of the person in the double-depth format.
Double-depth format refers to storing a depth map for each side of the person: front and back.
💡
The features obtained from the cloth and person encoders are reused in this branch to predict the double-depth map of the person. This is done by concatenating the features produced by both encoders and passing them to a depth decoder. The decoder is trained by penalizing the predicted maps ($D^{f}_{i}$ and $D^{b}_{i}$) with respect to the ground-truth depth maps $D^{f}_{gt}$ (front) and $D^{b}_{gt}$ (back) obtained from the dataset.
Formally:

$$\mathcal{L}_{depth} = \left\lVert D^{f}_{i} - D^{f}_{gt} \right\rVert_1 + \left\lVert D^{b}_{i} - D^{b}_{gt} \right\rVert_1$$

The MTM module is trained by jointly optimizing all three loss components:

$$\mathcal{L}_{MTM} = \lambda_{warp}\,\mathcal{L}_{warp} + \lambda_{seg}\,\mathcal{L}_{seg} + \lambda_{depth}\,\mathcal{L}_{depth}$$
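A sketch of how the three terms could be combined into the joint MTM objective; the tensor names and the loss weights are placeholders, not the paper's exact values.

```python
import torch.nn.functional as F

def mtm_loss(warped_cloth, cloth_on_person,
             seg_logits, seg_target,
             depth_front, depth_back, depth_front_gt, depth_back_gt,
             w_warp=1.0, w_seg=1.0, w_depth=1.0):
    """Weighted sum of the warping, segmentation, and depth losses."""
    loss_warp = F.l1_loss(warped_cloth, cloth_on_person)        # TPS warping loss
    loss_seg = F.cross_entropy(seg_logits, seg_target)          # segmentation loss
    loss_depth = (F.l1_loss(depth_front, depth_front_gt) +      # double-depth L1 loss
                  F.l1_loss(depth_back, depth_back_gt))
    return w_warp * loss_warp + w_seg * loss_seg + w_depth * loss_depth
```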
Experiments & Visualizations:
MTM
Depth Refinement Module (DRM)
The depth map produced by the MTM module does not contain finer (higher-frequency) depth details like clothing pleats, since its inputs lack the warped clothing information. On top of that, MTM's depth branch is trained with a plain L1 loss, which has a smoothing effect: it is dominated by low-frequency structure and tends to wash out high-frequency detail.
Hence, in this module, the primary task is to refine the depth maps predicted by MTM by adding higher frequency features.
The refinement process is done in two steps:
- Extract brightness change information from warped clothing.
- Refine depth using a UNET-like generator.
The brightness changes caused by cloth pleats are extracted by applying the Sobel operator to the warped cloth $\hat{C}$ and to the segmented-out person part. The gradient images obtained from both are concatenated to produce the image gradient $G$. Finally, $\hat{C}$, $G$, and the initial depth maps are fed to the UNet-like generator to produce the refined depth maps $D^{f}_{r}$ and $D^{b}_{r}$.
To aid the network in learning higher-frequency features, the regular L1 loss is replaced by a log-L1 loss, which compresses large residuals and therefore gives relatively more weight to the small depth differences that make up fine details.
Formally:

$$\mathcal{L}_{\log\text{-}L1} = \frac{1}{N}\sum_{i=1}^{N}\log\left(1 + d_i\right)$$

...where $N$ is the total number of front and back depth map points, and $d_i$ denotes the L1 difference between the refined and ground-truth depth maps at point $i$.
To further aid the optimization for depth estimation, especially at the boundaries, another loss function called the depth gradient loss is incorporated.
$$\mathcal{L}_{grad} = \frac{1}{N}\sum_{i=1}^{N}\left(\left|\nabla_x d_i\right| + \left|\nabla_y d_i\right|\right)$$

...where $\nabla_x$ and $\nabla_y$ denote the Sobel operator along the width and height of the image, applied to the depth residual.
The depth refinement module is trained by jointly optimizing both of these losses:

$$\mathcal{L}_{DRM} = \lambda_{\log}\,\mathcal{L}_{\log\text{-}L1} + \lambda_{grad}\,\mathcal{L}_{grad}$$

For our experiments, we keep the weights $\lambda_{\log}$ and $\lambda_{grad}$ fixed.
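Here's a minimal PyTorch sketch of the two DRM loss terms, assuming the depth maps are (B, 1, H, W) tensors; the explicit Sobel kernels and the placeholder weights are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# 3x3 Sobel kernels along the image width (x) and height (y)
SOBEL_X = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def log_l1_loss(depth_pred, depth_gt):
    # log(1 + |residual|), averaged over all depth points
    return torch.log1p((depth_pred - depth_gt).abs()).mean()

def depth_gradient_loss(depth_pred, depth_gt):
    # Sobel gradients of the residual map, penalized with L1
    diff = depth_pred - depth_gt                        # (B, 1, H, W)
    gx = F.conv2d(diff, SOBEL_X.to(diff), padding=1)
    gy = F.conv2d(diff, SOBEL_Y.to(diff), padding=1)
    return gx.abs().mean() + gy.abs().mean()

def drm_loss(depth_pred, depth_gt, w_log=1.0, w_grad=1.0):
    # Weighted sum of the two refinement losses (weights are placeholders)
    return w_log * log_l1_loss(depth_pred, depth_gt) + \
           w_grad * depth_gradient_loss(depth_pred, depth_gt)
```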
Experiments & Visualizations
DRM
Texture Fusion Module (TFM)
Running in parallel with the Depth Refinement Module, the Texture Fusion Module is responsible for generating a photo-realistic image of the person with the warped cloth inpainted.
More formally, it fuses the unchanged person part with the warped cloth using the outputs of the MTM module, namely the warped cloth $\hat{C}$, the segmentation map $S$, and the initial front-depth estimate $D^{f}_{i}$. Built around a UNet-like generator, the Texture Fusion Module combines these 2D cues with the depth information to generate a coarse try-on result $I_{coarse}$ and a fusion mask $M$ (1 for cloth, 0 for everything else).
Using a depth map for this task provides the Texture Fusion Module with enough information to synthesize even complicated self-occlusion cases.
💡
The final 'refined' frontal try-on image $I_{t}$ is then composed using the fusion mask:

$$I_{t} = M \odot \hat{C} + (1 - M) \odot I_{coarse}$$

...where $\odot$ denotes element-wise multiplication.
The texture fusion module is trained using a combination of: a perceptual loss $\mathcal{L}_{perc}$ between the refined synthesis result $I_{t}$ and the ground-truth person image; an L1 loss $\mathcal{L}_{1}$ between $I_{t}$ and the ground-truth person image; and an L1 loss $\mathcal{L}_{mask}$ between the estimated fusion mask $M$ and the ground-truth clothing-on-person mask.
Perceptual loss refers to encoding the two images (generated and ground truth) with the feature extractor section of a pre-trained classifier (like VGG-19) and computing the L1 loss between the feature vectors.
💡
The final loss for the full module is formulated as:

$$\mathcal{L}_{TFM} = \lambda_{perc}\,\mathcal{L}_{perc} + \lambda_{1}\,\mathcal{L}_{1} + \lambda_{mask}\,\mathcal{L}_{mask}$$
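The sketch below shows the mask-based fusion and a simple version of the three loss terms; the VGG-19 layer cut-off, the omitted input normalization, and the equal loss weights are assumptions for illustration.

```python
import torch.nn.functional as F
import torchvision

# A frozen VGG-19 feature extractor for the perceptual loss
vgg = torchvision.models.vgg19(weights="DEFAULT").features[:21].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def fuse_tryon(coarse_tryon, fusion_mask, warped_cloth):
    # Keep the warped cloth where the mask says "cloth" (M = 1),
    # keep the coarse synthesis everywhere else (M = 0)
    return fusion_mask * warped_cloth + (1.0 - fusion_mask) * coarse_tryon

def tfm_loss(refined, person_gt, fusion_mask, cloth_mask_gt):
    loss_l1 = F.l1_loss(refined, person_gt)              # pixel-level L1
    loss_mask = F.l1_loss(fusion_mask, cloth_mask_gt)    # fusion-mask L1
    loss_perc = F.l1_loss(vgg(refined), vgg(person_gt))  # perceptual loss on VGG features
    return loss_perc + loss_l1 + loss_mask               # equal weights as placeholders
```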
Experiments & Visualizations
TFM
Point Cloud Generation and Rendering
The final (front-view) texture generated from the TFM is then used to create a simulated back-view texture.
This is done by using the Fast Marching Method to inpaint the face region of a copy of the try-on texture (generated by TFM) with the color of the surrounding hair. This new texture is then mirrored to simulate the back-view texture. Both textures are then combined with the final double-depth map obtained from the DRM module to generate a 3D point cloud of the full person.
Since the texture images (front and back) are generated with guidance from the initial depth estimates, they are already aligned with the final depth maps. Hence, no extra alignment step is needed to generate the point cloud.
💡
[Optional] The point cloud obtained can also be triangulated using a surface-reconstruction pipeline like Screened Poisson Reconstruction.
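A simplified sketch of how the double-depth map and the two textures could be lifted into a colored point cloud with a plain orthographic back-projection; the real pipeline's coordinate conventions, mirroring details, and any filtering steps are omitted.

```python
import numpy as np

def depth_to_point_cloud(depth_front, depth_back, tex_front, tex_back):
    """Turn front/back depth maps (H, W) and textures (H, W, 3) into an
    (N, 6) array of x, y, z, r, g, b points."""
    H, W = depth_front.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    points = []
    for depth, tex in ((depth_front, tex_front), (depth_back, tex_back)):
        valid = depth != 0                           # keep only body pixels
        xyz = np.stack([xs[valid], ys[valid], depth[valid]], axis=1)
        rgb = tex[valid]                             # matching (N, 3) colors
        points.append(np.concatenate([xyz, rgb], axis=1))
    return np.concatenate(points, axis=0)            # full-person point cloud
```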
Bringing it All Together
Visualizing 3D assets using W&B is surprisingly easy. All it takes is two lines of code!
```python
# point_cloud: a NumPy array of shape (N, 3) for xyz points,
# (N, 4) for xyz + category, or (N, 6) for xyz + rgb colors
data = wandb.Object3D(point_cloud)
wandb.log({"point_cloud": data})
```
Finally, let's test out the algorithm end to end on an arbitrary person and clothing image. We'll visualize all the results in a structured format using our latest addition to the feature list, W&B Weave.
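One way to build such a structured view is to log each test example as a row of a wandb.Table that mixes images with 3D point clouds. The project name, column names, and dummy arrays below are only examples.

```python
import numpy as np
import wandb

# Dummy stand-ins for the actual pipeline outputs (shapes are illustrative)
person_img = np.zeros((512, 320, 3), dtype=np.uint8)
cloth_img = np.zeros((512, 320, 3), dtype=np.uint8)
tryon_img = np.zeros((512, 320, 3), dtype=np.uint8)
point_cloud = np.concatenate(
    [np.random.rand(1000, 3), np.random.randint(0, 256, (1000, 3))], axis=1)

run = wandb.init(project="m3d-vton", job_type="evaluate")
table = wandb.Table(columns=["person", "cloth", "try-on", "3d model"])
table.add_data(wandb.Image(person_img),
               wandb.Image(cloth_img),
               wandb.Image(tryon_img),
               wandb.Object3D(point_cloud))
run.log({"evaluation": table})
run.finish()
```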
All the 3D models visualized in the 4th column can be moved around by going full screen.
💡
Evaluate