
Monocular to 3D Virtual Try-On

An examination of a new technique that lets shoppers try on clothes in 3D space, plus insights into the process with W&B
Created on July 27|Last edited on September 9




Introduction

Though the name is a bit cumbersome, "3D Virtual Try-On" sounds like what you'd expect: a way to virtually try on clothes outside an actual brick-and-mortar location.
Let's take a look at a paper entitled Monocular to 3D Virtual Try-On (M3D-VTON) to gain a better understanding. As the name suggests, this work provides a state-of-the-art algorithm to generate 3D models of humans trying on a garment — directly from two individual monocular images: the person and the item of clothing.
In this report, we'll go through a detailed explanation of the various components of the pipeline and use tools from the W&B ecosystem to provide interactive and insightful telemetry of the training process.
The associated codebase can be found here.

Understanding M3D-VTON

At its core, the M3D-VTON pipeline consists of three parts: the Monocular Prediction Module (referred to as both MPM and MTM), the Depth Refinement Module (DRM), and the Texture Fusion Module (TFM).
Complete M3D-VTON pipeline

Monocular Prediction Module (MTM)

The MTM module serves as a preparatory stage, responsible for garment alignment, segmentation-map prediction, and depth-map prediction, all of which serve as guidance for the other two modules down the line.
These tasks are accomplished by the three branches of the MTM module, namely:
  • cloth warping branch
  • segmentation prediction branch
  • depth estimation branch
The MTM module performs all these tasks using features obtained from the cloth image $C$ and the cloth-agnostic person representation $A$.
The cloth-agnostic person representation $A$ is a 29-channel feature map, consisting of a 25-channel pose map (obtained from OpenPose), a 3-channel parsed human representation $I^p$ (obtained by applying JPPNet to the person image $I$), and a 1-channel (coarse) person mask, i.e., a binary segmentation map of the person (obtained by thresholding $I^p$). A small sketch of how these channels come together follows this note.
💡
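To make the 29-channel representation concrete, here is a minimal NumPy sketch of how $A$ could be assembled. The array shapes and the thresholding rule are placeholder assumptions, not the official preprocessing code.

import numpy as np

# Hypothetical pre-computed inputs (shapes are placeholders)
pose_map = np.zeros((25, 512, 320), dtype=np.float32)      # 25-channel OpenPose keypoint heatmaps
parsed_person = np.zeros((3, 512, 320), dtype=np.float32)  # 3-channel JPPNet parsing of the person image
person_mask = (parsed_person.sum(axis=0, keepdims=True) > 0).astype(np.float32)  # 1-channel coarse binary mask

# Cloth-agnostic person representation A: (29, H, W)
A = np.concatenate([pose_map, parsed_person, person_mask], axis=0)
assert A.shape[0] == 29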

Cloth Warping Branch

Much like other virtual try-on algorithms, this branch performs a texture-preserving alignment of the target cloth with the person using a geometric matching network.
This is achieved using a Thin Plate Spline (TPS) transformation whose parameters $\theta$ are calculated by correlating cloth features (extracted with $\mathcal{E}_C$) and person features (extracted with $\mathcal{E}_A$). The paper proposes a novel self-adaptive pre-alignment process to aid the parameter estimation: it transforms the cloth image $C$ to have the same position and dimensions as the arm-torso region $I^{at}$ (obtained from the person segmentation map $I^p$) of the person image $I$.
The pre-alignment is formulated as an affine transform on $C$:
C^{aff} = \begin{bmatrix} R & 0 \\ 0 & R \end{bmatrix} C + \begin{bmatrix} x^c_{I^{at}} - x^c_C \\ y^c_{I^{at}} - y^c_C \end{bmatrix}

...where $(x^c_{I^{at}}, y^c_{I^{at}})$ and $(x^c_C, y^c_C)$ represent the centers of the arm-torso region $I^{at}$ and the clothing $C$ respectively, and $R$ is the rescaling factor determined as:
R = \begin{cases} \frac{h^{at}_I}{h_C}, & \frac{w_C}{h_C} \ge \frac{w^{at}_I}{h^{at}_I} \\ \frac{w^{at}_I}{w_C}, & \frac{w_C}{h_C} \lt \frac{w^{at}_I}{h^{at}_I} \end{cases}

...where $(h^{at}_I, w^{at}_I)$ and $(h_C, w_C)$ are the heights and widths of $I^{at}$ and $C$. Intuitively, this operation centers the clothing $C$ on $I^{at}$ and resizes $C$ to the same size as $I^{at}$, thereby making it easier for the TPS to compare the two. Finally, the TPS parameters $\theta$ are learned by minimizing the following loss:
\mathcal{L}_w = \| C^w - I^c \|_1

...where $C^w$ is the predicted warped cloth obtained by applying the TPS to $C^{aff}$, and $I^c$ is the clothing already worn by the person (found by applying the arm-torso mask $I^{at}$ to the person image $I$).
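To make the self-adaptive pre-alignment more tangible, here is a minimal NumPy/OpenCV sketch of the affine transform described above. The function name is hypothetical, binary masks for the cloth and arm-torso region are assumed to be available, and the translation is computed so that the scaled cloth centre lands on the arm-torso centre, matching the intuition (but not necessarily the exact implementation) of the paper.

import numpy as np
import cv2

def pre_align(cloth, cloth_mask, arm_torso_mask):
    # Scale and translate `cloth` so it roughly covers the arm-torso region (illustrative sketch)
    ys_c, xs_c = np.nonzero(cloth_mask)
    ys_p, xs_p = np.nonzero(arm_torso_mask)
    h_c, w_c = ys_c.max() - ys_c.min(), xs_c.max() - xs_c.min()
    h_p, w_p = ys_p.max() - ys_p.min(), xs_p.max() - xs_p.min()
    # Rescaling factor R from the case analysis above
    R = h_p / h_c if (w_c / h_c) >= (w_p / h_p) else w_p / w_c
    # Translation that moves the (scaled) cloth centre onto the arm-torso centre
    tx = xs_p.mean() - R * xs_c.mean()
    ty = ys_p.mean() - R * ys_c.mean()
    M = np.float32([[R, 0, tx], [0, R, ty]])
    H, W = arm_torso_mask.shape
    return cv2.warpAffine(cloth, M, (W, H))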

Segmentation Branch

The segmentation map provides guidance to the Texture Fusion Module (TFM) while inpainting the target cloth $C$ onto the person $I$.
This is important since it helps avoid clothing-skin penetration. The segmentation branch predicts the human segmentation map with the segmentation decoder $\mathcal{D}_s$. Training is done by comparing the predicted and ground-truth segmentation maps using a pixel-level cross-entropy loss term $\mathcal{L}_s$.
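As a quick illustration, the pixel-level cross-entropy term can be written in a couple of lines of PyTorch; the batch size, resolution, and number of parsing classes below are placeholders.

import torch
import torch.nn.functional as F

seg_logits = torch.randn(4, 20, 512, 320)      # decoder output: (batch, classes, H, W)
seg_gt = torch.randint(0, 20, (4, 512, 320))   # ground-truth label map: (batch, H, W)
loss_s = F.cross_entropy(seg_logits, seg_gt)   # pixel-level cross-entropy L_s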

Depth Estimation Branch

This branch is responsible for predicting the depth map of the person in the double-depth format.
The double-depth format refers to storing a depth map for each side of the person, front and back. [cite]
💡
The features obtained from the encoders $\mathcal{E}_A$ and $\mathcal{E}_C$ are reused in this branch to predict the double depth map of the person $I$. This is done by concatenating the features produced by both encoders and passing them to the depth decoder $\mathcal{D}_Z$. The decoder is trained by penalizing the predicted maps ($D^i_f$ and $D^i_b$) with respect to the ground-truth depth maps $D^{gt}_f$ (front) and $D^{gt}_b$ (back) obtained from the dataset.
Formally:
\mathcal{L}_z = \|D^i_f - D^{gt}_f\|_1 + \|D^i_b - D^{gt}_b\|_1
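In PyTorch, this double-depth objective is simply the sum of two per-map L1 terms; the random tensors below are placeholders for the predicted and ground-truth maps.

import torch
import torch.nn.functional as F

pred_front, pred_back = torch.rand(4, 1, 512, 320), torch.rand(4, 1, 512, 320)  # D^i_f, D^i_b
gt_front, gt_back = torch.rand(4, 1, 512, 320), torch.rand(4, 1, 512, 320)      # ground-truth front/back depth
loss_z = F.l1_loss(pred_front, gt_front) + F.l1_loss(pred_back, gt_back)        # L_z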


The MTM module is trained by jointly optimizing all three loss components as:
\mathcal{L}_{\textrm{MPM}} = \mathcal{L}_w + \mathcal{L}_s + \mathcal{L}_z



Experiments & Visualizations


Interactive W&B panels for the MTM training run are embedded here.



Depth Refinement Module (DRM)

The depth map produced by the MTM module does not contain finer (higher-frequency) depth details like clothing pleats, since its inputs lack the warped clothing information ($C^w$). On top of that, the plain L1 loss used for depth prediction has a smoothing effect, favoring low-frequency structure over fine detail.
Hence, in this module, the primary task is to refine the depth maps predicted by MTM by adding higher frequency features.
The refinement process is done in two steps:
  1. Extract brightness change information from warped clothing.
  2. Refine depth using a UNet-like generator.
The brightness changes caused by cloth pleats are extracted by applying the Sobel operator to the warped cloth $C^w$ and the segmented-out person part $I^p$. The gradient images obtained from both are concatenated to produce the image gradient $I^g$. Finally, $I^g$, $C^w$, $I^p$ and the initial depth map $D^i$ are fed to the UNet-like generator to produce the refined depth map $D^r$.
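A rough OpenCV sketch of the gradient-extraction step is shown below; the random image stands in for the warped cloth $C^w$, and the way the gradients are stacked into $I^g$ (the person part $I^p$ would be processed the same way) is an assumption about the layout rather than the official one.

import numpy as np
import cv2

warped_cloth = (np.random.rand(512, 320, 3) * 255).astype(np.uint8)  # stand-in for C^w
gray = cv2.cvtColor(warped_cloth, cv2.COLOR_BGR2GRAY)
grad_x = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)  # horizontal brightness changes (e.g., from pleats)
grad_y = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)  # vertical brightness changes
I_g = np.stack([grad_x, grad_y], axis=0)             # partial image gradient I^g for the cloth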
To help the network learn higher-frequency features, the regular L1 loss is replaced by a log-L1 loss, which places relatively more weight on small depth errors and therefore encourages the network to capture fine details.
Formally:
\mathcal{L}_{\textrm{depth}} = \frac{1}{N} \sum\limits_{i = 1}^N \ln (\epsilon_i + 1)

...where $N$ is the total number of front and back depth-map points, and $\epsilon_i = \|D^r_i - D^{gt}_i\|_1$ denotes the L1 difference between the refined and ground-truth depth maps at point $i$.
To further aid the optimization for depth estimation, especially at the boundaries, another loss function called the depth gradient loss is incorporated.
\mathcal{L}_{\textrm{grad}} = \frac{1}{N} \sum\limits_{i = 1}^N \ln (\nabla_x(\epsilon_i) + 1) + \ln (\nabla_y(\epsilon_i) + 1)

...where $\nabla_x$ and $\nabla_y$ are the Sobel operators along the width and height of the image.
The Depth Refinement Module is trained by jointly optimizing both losses $\mathcal{L}_{\textrm{depth}}$ and $\mathcal{L}_{\textrm{grad}}$ as:
\mathcal{L}_{\textrm{DRM}} = \lambda_{\textrm{depth}} \,\mathcal{L}_{\textrm{depth}} + \lambda_{\textrm{grad}} \,\mathcal{L}_{\textrm{grad}}

For our experiments, we set $\lambda_{\textrm{depth}} = 1.0$ and $\lambda_{\textrm{grad}} = 0.5$.
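Putting the two DRM objectives together, a minimal PyTorch sketch could look like the following. Applying the Sobel kernels to the per-point error $\epsilon$ and taking the absolute response before the log are implementation assumptions made here for numerical safety.

import torch
import torch.nn.functional as F

def sobel_grads(x):
    # Approximate the horizontal and vertical gradients of a (B, 1, H, W) map with 3x3 Sobel kernels
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    return F.conv2d(x, kx, padding=1).abs(), F.conv2d(x, ky, padding=1).abs()

refined, gt = torch.rand(4, 1, 512, 320), torch.rand(4, 1, 512, 320)  # placeholders for D^r and D^gt
eps = (refined - gt).abs()                                            # per-point L1 error
loss_depth = torch.log(eps + 1).mean()                                # log-L1 depth loss
gx, gy = sobel_grads(eps)
loss_grad = (torch.log(gx + 1) + torch.log(gy + 1)).mean()            # depth gradient loss
loss_drm = 1.0 * loss_depth + 0.5 * loss_grad                         # lambda_depth = 1.0, lambda_grad = 0.5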

Experiments & Visualizations


Interactive W&B panels for the DRM training run are embedded here.


Texture Fusion Module (TFM)

Running in parallel to the Depth Refinement Module, the Texture Fusion Module is responsible for generating a photo-realistic image of the person with the warped cloth inpainted.
More formally, it fuses the unchanged person part $I^p$ with the outputs of the MTM module, namely the warped cloth $C^w$, the segmentation map $S$, and the initial front-depth estimate $D^i_f$. Built around a UNet-like generator $\mathcal{G}_T$, the Texture Fusion Module combines the 2D cues from $S$, $I^p$, and $C^w$ with the depth information $D^i_f$ to generate a coarse try-on result $\tilde{I^c}$ and a fusion mask $\tilde{M}$ (1 for cloth, 0 for everything else).
Using a depth map for this task provides the Texture Fusion Module with enough information to synthesize even complicated self-occlusion cases.
💡
The final 'refined' frontal try-on image ($I^t_f$) is then formulated as:
I^t_f = C^w \odot \tilde{M} + \tilde{I^c} \odot (1 - \tilde{M})
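The fusion itself is a single element-wise blend; in PyTorch it reads directly off the equation (tensor shapes below are placeholders).

import torch

warped_cloth = torch.rand(1, 3, 512, 320)   # C^w
coarse_try_on = torch.rand(1, 3, 512, 320)  # coarse try-on result from the generator
fusion_mask = torch.rand(1, 1, 512, 320)    # fusion mask, values in [0, 1]
refined_try_on = warped_cloth * fusion_mask + coarse_try_on * (1 - fusion_mask)  # I^t_f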

The texture synthesis module is trained using a combination of a perceptual loss $\mathcal{L}_{\textrm{perc}}$ between the refined synthesis result $I^t_f$ and the ground-truth person image $I$; an L1 loss $\mathcal{L}_{\textrm{try-on}}$ between $I^t_f$ and $I$; and an L1 loss $\mathcal{L}_{\textrm{mask}}$ between the estimated fusion mask $\tilde{M}$ and the ground-truth clothing-on-person mask $M$.
Perceptual loss refers to encoding the two images (generated and ground truth) with the feature-extractor portion of a pre-trained classifier (like VGG-19) and computing the L1 loss between the resulting feature maps, as in the sketch after this note.
💡
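For reference, here is a minimal sketch of such a VGG-19 perceptual loss in PyTorch. The choice of layers, the omission of ImageNet normalization, and the use of torchvision's pretrained weights are illustrative assumptions, not the paper's exact setup.

import torch
import torch.nn.functional as F
import torchvision

vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(generated, target, layers=(3, 8, 17, 26)):
    # Sum of L1 distances between intermediate VGG-19 activations (layer indices chosen for illustration)
    loss, x, y = 0.0, generated, target
    for i, layer in enumerate(vgg):
        x, y = layer(x), layer(y)
        if i in layers:
            loss = loss + F.l1_loss(x, y)
        if i >= max(layers):
            break
    return loss

loss_perc = perceptual_loss(torch.rand(1, 3, 512, 320), torch.rand(1, 3, 512, 320))  # placeholder tensors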
The final loss for the full module is formulated as:
\mathcal{L}_{\textrm{TFM}} = \mathcal{L}_{\textrm{perc}} + \mathcal{L}_{\textrm{try-on}} + \mathcal{L}_{\textrm{mask}}


Experiments & Visualizations


Interactive W&B panels for the TFM training run are embedded here.


Point Cloud Generation and Rendering

The final (front-view) texture generated from the TFM is then used to create a simulated back-view texture.
This is done by using the Fast Marching Method [cite] to inpaint the face region of a copy of the try-on texture $I^t_f$ (generated by the TFM) with the surrounding hair color. This new texture is then mirrored to simulate the back-view texture $I^t_b$. These textures are then combined with the final double depth map $D^r$ obtained from the DRM module to generate the 3D point cloud of the full person.
Since the generation of the texture images ($I^t_f$ and $I^t_b$) is guided by the initial depth estimates $D^i$, they are already aligned with the final depth map $D^r$. Hence, no extra alignment step is needed to generate the point cloud.
💡
[Optional] The resulting point cloud can also be triangulated using a surface reconstruction pipeline like Screened Poisson Reconstruction [cite].
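The back-view texture and the point-cloud assembly described above can be sketched with OpenCV and NumPy as follows. The face mask, the image resolution, and the use of raw pixel coordinates for x and y are placeholder assumptions; cv2.INPAINT_TELEA is OpenCV's Fast Marching Method inpainting.

import numpy as np
import cv2

H, W = 512, 320
front_tex = (np.random.rand(H, W, 3) * 255).astype(np.uint8)  # stand-in for the try-on texture I^t_f
face_mask = np.zeros((H, W), dtype=np.uint8)
face_mask[40:120, 110:210] = 255                              # hypothetical face region

# Inpaint the face with the surrounding (hair) colors via the Fast Marching Method, then mirror
back_tex = cv2.inpaint(front_tex, face_mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
back_tex = np.fliplr(back_tex)                                # simulated back-view texture I^t_b

# Combine the two textures with the refined double depth map D^r into a colored point cloud
front_depth, back_depth = np.random.rand(H, W), np.random.rand(H, W) + 1.0  # placeholders for D^r
ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

def to_points(depth, tex):
    xyz = np.stack([xs.ravel(), ys.ravel(), depth.ravel()], axis=1)
    rgb = tex.reshape(-1, 3)
    return np.concatenate([xyz, rgb], axis=1)                 # (N, 6): x, y, z, r, g, b

point_cloud = np.concatenate([to_points(front_depth, front_tex),
                              to_points(back_depth, back_tex)], axis=0)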

Bringing it All Together

Visualizing 3D assets using W&B is surprisingly easy. All it takes is two lines of code!
# `point_cloud` is assumed to be a NumPy array of shape (N, 4): x, y, z plus a per-point color/category value
data = wandb.Object3D(point_cloud)
wandb.log({"point_cloud": data})
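wandb.Object3D also accepts (N, 3) arrays of bare points and (N, 6) arrays with per-point RGB values, so a colored point cloud like the one assembled in the previous section can be logged the same way.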
Finally, let's test out the algorithm end to end on an arbitrary person and clothing image. We'll visualize all the results in a structured format using our latest addition to the feature list, W&B Weave.
All the 3D models visualized in the 4th column can be moved around by going full screen.
💡

Interactive W&B panels for the end-to-end evaluation runs are embedded here.