
Human Dynamics From Monocular Video With Dynamic Camera Movements

This article provides an in-depth overview of the paper "Human Dynamics from Monocular Video with Dynamic Camera Movements" by Ri Yu, Hwangpil Park, and Jehee Lee.




In this article, we take an in-depth look into the paper "Human Dynamics from Monocular Video with Dynamic Camera Movements" written by Ri Yu, Hwangpil Park, and Jehee Lee. We will begin with an explanation of pose estimation before outlining our method and diving into the topic of human dynamics.
Let's dive in!

What is Pose Estimation?

Pose Estimation is the general task of estimating the orientations and poses of various objects by localizing their joints (articulations). You'll frequently see Human Pose Estimation in real-world use cases like gaming and animation. However, it can be an incredibly hard problem to tackle. Why? Briefly:
  • Strong occlusions by other body parts
  • Small and barely visible joints lead to invisible key points
  • Variants due to clothing, body type, and lighting
  • A general need for "holistic" (global) context
  • A large space of possible poses, owing to the many degree-of-freedom (DOF) configurations the human body can take
Deep Learning was successfully applied to Pose Estimation by the likes of Jain et al. (2013) and Toshev et al. (2014) using various training paradigms based on Convolutional Neural Networks (CNNs). This improved upon the older "part-based models" (joint-specific models built on heavy pre-processing and feature engineering), which lacked efficiency and global context (they modeled individual joints without reference to the overall pose).
Follow-up work such as the Cascaded Pyramid Network by Chen et al. (2018) improved upon these methods by enabling multi-person pose estimation, following the modern two-stage pipeline of first creating human bounding boxes with a detector and then using a network for keypoint localization (creating a skeleton) within those bounding boxes.
While 2D pose estimation is the task of estimating a 2-dimensional pose $(x, y)$, 3D pose estimation is a more challenging problem that incorporates depth as well, i.e., estimating 3-dimensional poses $(x, y, z)$. A lot of important work in 3D pose estimation relies on global context using methodologies like skeleton fitting, temporal networks, human-object interactions, and adversarial priors.
For a long time, optical motion capture was the standard way to acquire high-quality human motion data, albeit at high cost and with elaborate setup requirements. Videos, on the other hand, are a readily available and extremely low-cost source of human motion.
However, most work in video-based pose estimation assumes that the video is a static shot from a fixed monocular camera, and often exploits this known setup and calibration.

Method

Figure 1: Overview of the System. (Source: Figure 2 from the paper)
The pipeline takes a video clip and a simulated character model as input, along with an environmental model consisting of geometric primitives. This model helps estimate the scene geometry arrangement while reconstructing 3D human interactions with the primitives.
There are a total of five models involved:
  • 2D Pose Estimator: The authors use pre-trained models from the OpenPose framework for predicting 2D poses. Although these models were trained on large datasets, those datasets differ from the dynamic motions we aim to reconstruct (dancing, parkour), so the estimator sometimes fails. To overcome this, the 2D pose estimator is fed $0^{\circ}$, $90^{\circ}$, $180^{\circ}$, and $270^{\circ}$ rotated versions of each frame, and the best pose output is selected based on temporal coherence and confidence values (see the sketch after this list).
  • 3D Pose Estimator: The authors use pre-trained models from VIBE for reconstructing 3D poses. Whenever 3D pose estimation fails, the missing poses are filled in by linear interpolation. The estimated 3D poses are then converted, via inverse kinematics, into body representations in a format suited for physics simulation.
  • Contact Estimator: Since this method relies heavily on learning human dynamics from physics priors, contact is an important source of information for learning interactions. Building on prior work by Rempe et al., who developed networks to detect foot-ground contact from monocular video assuming a static view, the authors modified the system to work with dynamic-view videos. Rendered images are labeled with contacts based on the height and velocity of each foot, and a network is trained on them to estimate contact in a supervised manner. The output of the 2D Pose Estimator is used as the input, and the network predicts a binary value at each foot node (the pose estimator essentially outputs a graph of links and nodes). The hand-object contact information, however, is given manually.
  • Policy Learner: We'll get into this shortly.
  • Scene Geometry Builder: This one too.
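To make the rotation trick for the 2D pose estimator concrete, here is a minimal sketch of the "rotate, detect, and pick the best" step. The `estimate_2d_pose` function is a hypothetical stand-in for an OpenPose call, and the scoring rule that combines confidence with temporal coherence is an illustrative assumption, not the paper's exact criterion.

```python
import numpy as np

def unrotate_keypoints(kps, k, orig_h, orig_w):
    """Map (x, y) keypoints detected in a frame rotated by k*90 deg CCW (np.rot90)
    back into the original frame's coordinate system."""
    x, y = kps[:, 0], kps[:, 1]
    k = k % 4
    if k == 0:
        return kps.copy()
    if k == 1:
        return np.stack([orig_w - 1 - y, x], axis=1)
    if k == 2:
        return np.stack([orig_w - 1 - x, orig_h - 1 - y], axis=1)
    return np.stack([y, orig_h - 1 - x], axis=1)

def best_rotated_pose(frame, prev_pose, estimate_2d_pose, coherence_weight=0.01):
    """Run the 2D pose estimator on 0/90/180/270-degree rotations of a frame and
    keep the detection with the best confidence / temporal-coherence score.

    `estimate_2d_pose(image) -> (keypoints[J, 2], confidences[J])` stands in for an
    OpenPose-style detector; the scoring weights below are illustrative.
    """
    h, w = frame.shape[:2]
    best_score, best_pose = -np.inf, None
    for k in range(4):
        keypoints, conf = estimate_2d_pose(np.rot90(frame, k=k))
        keypoints = unrotate_keypoints(keypoints, k, h, w)
        # Temporal coherence: penalize large jumps from the previous frame's pose.
        jump = np.linalg.norm(keypoints - prev_pose, axis=1).mean() if prev_pose is not None else 0.0
        score = conf.mean() - coherence_weight * jump
        if score > best_score:
            best_score, best_pose = score, keypoints
    return best_pose
```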

Human Dynamics

The dynamics model of our humanoid is an articulated figure consisting of links and joints controlled by Proportional-Derivative (PD) servos. Our aim is to learn a control policy (controller) that mimics the 3D human movements in the video, based on the estimated 2D and 3D poses, contact labels, and object-interaction hints provided earlier in the pipeline.
If $\pi(a_t \mid s_t, e)$ is the control policy, where $a_t$ is the action and $s_t$ is the state of the dynamics model at a given time $t$ in an environment $e$, then the performed action provides a target pose $\bar{q}_t$ for PD control at time $t$ of the form:
$$\bar{q}_t = \hat{q}_t + a_t$$

where $\hat{q}_t$ is a reference pose constructed from the joint orientations given by 3D pose estimation.
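To see how the PD servos use this target pose, here is a minimal sketch of the torque computation. The gains `KP` and `KD` and the function name are illustrative placeholders, not values or code from the paper.

```python
import numpy as np

# Illustrative PD gains; the paper does not specify these values here.
KP, KD = 300.0, 30.0

def pd_torques(q_ref, action, q, q_dot, kp=KP, kd=KD):
    """Compute joint torques that drive the character toward the policy's target pose.

    q_ref  : reference joint angles from 3D pose estimation (q_hat_t)
    action : policy output a_t (offset applied to the reference pose)
    q      : current simulated joint angles
    q_dot  : current simulated joint velocities
    """
    q_bar = q_ref + action          # target pose: q_bar_t = q_hat_t + a_t
    return kp * (q_bar - q) - kd * q_dot
```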
The state is defined as an aggregated vector of the form:
$$s_t = [\phi_t, p_t, R_t, v_t, \omega_t]$$

where:
  • $\phi_t \in [0, 1]$ is the time normalized by the clip length
  • $p_t$ is the positions of the body links
  • $R_t$ is the orientations of the body links, represented as unit quaternions
  • $v_t$ is the linear velocities
  • $\omega_t$ is the angular velocities
The environment $e$ is an aggregated vector consisting of the height, size, and location of objects, which comes in handy during the scene-fitting process.
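As a rough illustration of how the state vector $s_t$ might be assembled from the simulated character (the array shapes and function name are assumptions for illustration, not the authors' implementation):

```python
import numpy as np

def build_state(t, clip_length, link_positions, link_quats, link_lin_vel, link_ang_vel):
    """Concatenate the state s_t = [phi_t, p_t, R_t, v_t, omega_t] into one flat vector.

    link_positions : (L, 3) world positions of the body links
    link_quats     : (L, 4) unit quaternions for the link orientations
    link_lin_vel   : (L, 3) linear velocities
    link_ang_vel   : (L, 3) angular velocities
    """
    phi = np.array([t / clip_length])            # normalized time in [0, 1]
    return np.concatenate([
        phi,
        link_positions.ravel(),
        link_quats.ravel(),
        link_lin_vel.ravel(),
        link_ang_vel.ravel(),
    ])
```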

Quick Recap of Reinforcement Learning

Let's do a quick overview of a discrete-time dynamics system using the deep reinforcement learning paradigm.
We search for a control policy, say $\pi$, that maximizes the discounted cumulative reward $V^{\pi}(s)$. At any given time $t$, the agent in some state $s_t$ takes some action $a_t$. The environment $e$, as a consequence, responds to that action by moving to state $s_{t+1}$ and giving a reward $r_t$. The discounted cumulative reward of the policy, with discount factor $\gamma$, is given by:
$$V^{\pi}(s_t) = \mathbb{E}_{s_0 = s_t,\, a_i \sim \pi} \left[ \sum_{i=0}^{\infty} \gamma^{i} r_i \right], \qquad \gamma \in [0, 1)$$

And our optimal control policy would be:
$$\pi^{*} = \arg\max_{\pi} V^{\pi}$$
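As a tiny worked example of the discounted return defined above (a minimal sketch; the reward values are made up):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_i gamma^i * r_i for a finite list of rewards."""
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total

# Example: three steps of made-up rewards.
print(discounted_return([1.0, 0.5, 0.25]))  # 1.0 + 0.99*0.5 + 0.99**2 * 0.25 ≈ 1.74
```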


Reward Design

Most RL Algorithms follow this general framework, and the key differentiating success factor is the reward design. For this paper, the reward is quite intricate, consisting of many sub-terms. Let's have a closer look:

Good Tracking of Estimated Poses $r_{\text{track}}$

This reward encourages accurate tracking of the estimated poses and consists of five sub-terms:
  1. Pose: $r_q$ favors a good match between the estimated and simulated poses. If $\hat{q}$ is a vector of joint angles of the estimated model and $q$ is a vector of joint angles of the simulated model, this reward is calculated as:
    $r_q = \exp\left(-\alpha_q \, \|\hat{q} - q\|^{2}\right)$
  2. Velocity: $r_v$ favors a good match between the estimated and simulated velocities, where $v$ is a vector of joint velocities computed by finite differences:
    $r_v = \exp\left(-\alpha_v \, \|\hat{v} - v\|^{2}\right)$
  3. Body Orientation: $r_{\text{ori}}$ favors a good match between the body up-vectors estimated from the video and from the simulation. Assuming a steady up direction throughout the video, $\hat{\theta}$ is the angle in the image plane between the view up-vector $v_Y$ (a 2D vector pointing upward in the image plane) and the body up-vector $v_b$ (a 2D vector from the pelvis to the neck obtained from OpenPose); it acts as a hint for the body up-vector in 3D space. $\theta$ is the angle between the y-axis in 3D space and the vector from the character's pelvis joint to its head joint:
    $r_{\text{ori}} = \exp\left(-\alpha_{\text{ori}} \, \|\hat{\theta} - \theta\|^{2}\right)$
  4. Contact: $r_{\text{contact}}$ favors a good match between the estimated contact states $\hat{c}$ and the simulated contact states $c$ (only the four end-effectors, i.e., the hands and feet, are considered). As mentioned earlier, these states are binary flags:
    $r_{\text{contact}} = \exp\left(-\alpha_c \sum_{l \in \{rf,\, lf,\, rh,\, lh\}} \|\operatorname{xor}(\hat{c}_l, c_l)\|^{2}\right)$
  5. Regularization: $r_{\text{reg}}$ minimizes joint torques via standard L2 regularization to prevent excessive forces and unnecessary movements. If $\tau$ is an aggregated vector of torques:
    $r_{\text{reg}} = \exp\left(-\alpha_{\text{reg}} \, \|\tau\|^{2}\right)$
Thus, the total tracking reward is given by:
$$r_{\text{track}} = w_q r_q + w_v r_v + w_{\text{ori}} r_{\text{ori}} + w_{\text{contact}} r_{\text{contact}} + w_{\text{reg}} r_{\text{reg}}$$
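To make the structure of $r_{\text{track}}$ concrete, here is a minimal sketch that combines the five sub-terms. All $\alpha$ scales and weights $w$ are illustrative placeholders, not the values used in the paper.

```python
import numpy as np

def tracking_reward(q_hat, q, v_hat, v, theta_hat, theta, c_hat, c, tau,
                    alphas=dict(q=2.0, v=0.1, ori=5.0, c=1.0, reg=0.001),
                    weights=dict(q=0.5, v=0.1, ori=0.1, c=0.2, reg=0.1)):
    """Combine the five tracking sub-rewards; all alpha/weight values are illustrative."""
    r_q = np.exp(-alphas['q'] * np.sum((q_hat - q) ** 2))
    r_v = np.exp(-alphas['v'] * np.sum((v_hat - v) ** 2))
    r_ori = np.exp(-alphas['ori'] * (theta_hat - theta) ** 2)
    # c_hat and c are binary arrays over the four end-effectors (rf, lf, rh, lh).
    r_contact = np.exp(-alphas['c'] * np.sum(np.logical_xor(c_hat, c)))
    r_reg = np.exp(-alphas['reg'] * np.sum(tau ** 2))
    return (weights['q'] * r_q + weights['v'] * r_v + weights['ori'] * r_ori
            + weights['c'] * r_contact + weights['reg'] * r_reg)
```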





Good Alignment with Scene Objects $r_{\text{scene}}$

Figure 2: The $r_{\text{scene}}$ reward terms, namely Distance (left), Alignment (middle), and Center of Mass (right)
This reward encourages good alignment with the overall scene and consists of three sub-terms:
  1. Distance: $r_{\text{dist}}$ ensures that the simulated end-effectors (hands and feet) make contact with the desired object. If $p_l$ is the position of an end-effector and $Z_l$ is the target zone on the target object (in our case, a quarter-sized area at the center of the surface), this reward is calculated as (a sketch of a possible $\operatorname{mindist}$ follows below):
    $r_{\text{dist}} = \hat{c} \, \exp\left(-\alpha_{\text{dist}} \sum_{l \in \{rf,\, lf,\, rh,\, lh\}} \|\operatorname{mindist}(Z_l, p_l)\|^{2}\right)$
  2. Alignment: $r_{\text{align}}$ favors alignment between the character and the object when they are in contact. If $u_{\text{pelvis}}$ is a unit vector along the frontal axis of the pelvis and $u_{\text{obj}}$ is the corresponding unit vector of the object, this reward is calculated as:
    $r_{\text{align}} = \hat{c} \, \exp\left(-\alpha_{\text{align}} \, \|1 - u_{\text{pelvis}} \cdot u_{\text{obj}}\|^{2}\right)$
  3. Center of Mass: $r_{\text{com}}$ indicates the suitability of the character's trajectory. If $d_{ee}$ is the distance between the center of mass and the end-effector at landing time and $d_{obj}$ is the distance between the expected center of mass and the expected landing position, this reward is calculated as:
    $r_{\text{com}} = \exp\left(-\alpha_{\text{com}} \, \|d_{ee} - d_{obj}\|^{2}\right)$
The distance-based reward $r_{\text{dist}}$ is non-zero only when the contact flag is on ($\hat{c} = 1$).
If the simulated character perfectly tracks the reference pose and contact timing, $d_{ee}$ is equal to $d_{obj}$.
Thus, the total scene reward is given by:
$$r_{\text{scene}} = w_{\text{dist}} r_{\text{dist}} + w_{\text{align}} r_{\text{align}} + w_{\text{com}} r_{\text{com}}$$
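The paper's $\operatorname{mindist}$ term reads naturally as the shortest distance from the end-effector to the target zone, which is zero when the end-effector lies inside the zone. A rough sketch for an axis-aligned rectangular zone (an assumption made here purely for illustration):

```python
import numpy as np

def mindist_to_zone(p, zone_min, zone_max):
    """Shortest distance from point p (shape (3,)) to an axis-aligned box [zone_min, zone_max].

    Returns 0 when p is inside the zone, so r_dist is maximal when the
    end-effector touches the target area.
    """
    clamped = np.clip(p, zone_min, zone_max)  # closest point of the zone to p
    return np.linalg.norm(p - clamped)
```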


Training

The value function, say $V$, and the control policy $\pi$ are represented by neural networks. During training, experience tuples $(s_t, a_t, s_{t+1}, r)$ are collected and the networks are updated using Proximal Policy Optimization (PPO).
If the character tumbles and falls, the subsequent tuples are useless for learning, so the simulation is terminated before it reaches the end of the video whenever the character stumbles. The authors monitor the CoM height, collisions with obstacles, and deviations from the expected behavior to decide whether early termination is needed. Early termination, however, causes sample-imbalance issues in the distribution of experience tuples; to deal with this, tuples are drawn uniformly in time.
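A minimal sketch of what such an early-termination check might look like; the signals and thresholds here are illustrative assumptions, not the paper's exact criteria.

```python
def should_terminate(com_height, min_com_height, collided_with_obstacle, pose_error, max_pose_error):
    """Stop a rollout early when the character has clearly failed.

    com_height             : current height of the character's center of mass
    collided_with_obstacle : whether an unwanted collision occurred this step
    pose_error             : deviation from the expected (reference) behavior
    All thresholds are illustrative placeholders.
    """
    if com_height < min_com_height:        # character has fallen
        return True
    if collided_with_obstacle:             # ran into scene geometry
        return True
    if pose_error > max_pose_error:        # drifted too far from the reference
        return True
    return False
```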

Scene Fitting

The computational cost of policy learning depends significantly on the length of the input video and the number of objects in the scene. The authors found that it is not necessary to learn a single policy network over the whole video with all of its objects; instead, they learn a sequence of policy networks over shorter, overlapping time windows. While learning in a window, the configuration of each object is parameterized with respect to the previous object.
The authors train the policy of each window sequentially; once the value and policy networks for every window have been learned, the local arrangement of three consecutive objects can be estimated from the networks by maximizing the cumulative reward in the window.
Thus, the globally consistent scene arrangement maximizes:
$$\begin{aligned} e^* &= \arg\max_{e} \sum_{n=1}^{N} f_n(e) \\ f_n(e) &= V^{\pi^*_n}\!\left(s(t_{n-1}), e\right) - V^{\pi^*_n}\!\left(s(t_{n}), e\right) \end{aligned}$$

where:
  • $N$ is the number of windows
  • $f_n(e)$ is the cumulative reward on the $n$-th window
  • $\pi^*_n$ is the optimal policy of the $n$-th window
  • $V^{\pi^*_n}$ is the value function learnt along with the policy
  • $s(t_n)$ is the initial state of the $n$-th window
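A rough sketch of how this scene-fitting objective could be evaluated once the per-window value networks are trained. The candidate-generation and search strategy here are assumptions for illustration; the paper only specifies the objective being maximized.

```python
import numpy as np

def fit_scene(value_fns, window_start_states, candidate_envs):
    """Pick the object arrangement e that maximizes sum_n f_n(e).

    value_fns           : list of N learned value functions, value_fns[n](state, e) -> float
    window_start_states : list of N+1 states s(t_0), ..., s(t_N) at the window boundaries
    candidate_envs      : iterable of candidate environment parameter vectors e
    """
    best_e, best_score = None, -np.inf
    for e in candidate_envs:
        score = 0.0
        for n, V in enumerate(value_fns):
            # f_n(e) = V^{pi*_n}(s(t_{n-1}), e) - V^{pi*_n}(s(t_n), e)
            score += V(window_start_states[n], e) - V(window_start_states[n + 1], e)
        if score > best_score:
            best_e, best_score = e, score
    return best_e
```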

Conclusion

In this article, we discussed "Human Dynamics from Monocular Video with Dynamic Camera Movements," published at SIGGRAPH Asia 2021, in which the authors overcome the static-view limitation of most previous methods and handle dynamic-view videos.
This allows the camera to pan, tilt, and zoom to track the moving subject. Since no limitations are assumed on camera movements, body translations and rotations observed in the video do not correspond to absolute positions in the reference frame; inference is nevertheless possible because human motion obeys the laws of physics.
