
Human Dynamics From Monocular Video With Dynamic Camera Movements

This article provides an in-depth overview of the paper "Human Dynamics from Monocular Video with Dynamic Camera Movements" by Ri Yu, Hwangpil Park, and Jehee Lee.




In this article, we take an in-depth look into the paper "Human Dynamics from Monocular Video with Dynamic Camera Movements" written by Ri Yu, Hwangpil Park, and Jehee Lee. We will begin with an explanation of pose estimation before outlining our method and diving into the topic of human dynamics.
Let's dive in!

What is Pose Estimation?

Pose Estimation is the general task of estimating the orientations and poses of various objects by localizing their joints (articulations). You'll frequently see Human Pose Estimation in real-world use cases like gaming and animation. However, it can be an incredibly hard problem to tackle. Why? Briefly:
  • Strong occlusions by other body parts
  • Small and barely visible joints lead to invisible key points
  • Variants due to clothing, body type, and lighting
  • A general need for "holistic" (global) context
  • A large space of possible poses, owing to the many degree-of-freedom (DOF) configurations the human body can take
Deep Learning was successfully applied to Pose Estimation by the likes of Jain et al. (2013) and Toshev et al. (2014) using various training paradigms based on Convolutional Neural Networks (CNNs). This improved upon the older "part-based models" (joint-specific models built on heavy pre-processing and feature engineering), which lacked efficiency and global context (they modeled individual joints without reference to the overall pose).
Follow-up work such as the Cascaded Pyramid Network by Chen et al. (2018) improved upon these methods by enabling multi-person pose estimation, following the modern two-stage pipeline of first creating human bounding boxes with a detector and then using a network for keypoint localization (creating a skeleton) within those bounding boxes.
While 2D pose estimation is the task of estimating a 2-dimensional pose $(x, y)$, 3D pose estimation is a more challenging problem that incorporates depth as well, i.e., estimating 3-dimensional poses $(x, y, z)$. A lot of important work in 3D pose estimation relies on global context using methodologies like skeleton fitting, temporal networks, human-object interactions, and adversarial priors.
For a long time, optical motion capture was the standard way to acquire high-quality human motion data, albeit at high cost and with elaborate setup requirements. Videos, on the other hand, are a readily available and extremely low-cost source of human motion.
However, most work in video-based pose estimation assumes that the video is a static shot from a fixed monocular camera, and often exploits this known setup and calibration.

Method

Figure 1: Overview of the System. (Source: Figure 2 from the paper)
The pipeline takes a video clip and a simulated character model as input, along with an environmental model consisting of geometric primitives. This model helps estimate the scene geometry arrangement while reconstructing 3D human interactions with the primitives.
There are a total of five models involved:
  • 2D Pose Estimator: The authors use pre-trained models from the OpenPose framework for predicting 2D poses. Although these models were trained on large datasets, those datasets differ from the dynamic motions we aim to reconstruct (dancing, parkour), so the estimator sometimes fails. To overcome this, the 2D pose estimator is fed $0^{\circ}$, $90^{\circ}$, $180^{\circ}$, and $270^{\circ}$ rotated versions of each frame, and the best pose output is selected based on temporal coherence and confidence values (see the sketch after this list).
  • 3D Pose Estimator: The authors use pre-trained models from VIBE for reconstructing 3D poses. Whenever 3D pose estimation fails, the missing poses are filled in by linear interpolation. The estimated 3D poses are then converted, via inverse kinematics, into body representations in a format suited for physics simulation.
  • Contact Estimator: Since this method relies heavily on learning human dynamics from physics priors, contact is an important source of information for learning interactions. Building on prior work by Rempe et al., who developed networks to detect foot-ground contact from monocular video assuming a static view, the authors modified the system to work with dynamic-view videos. Rendered images are labeled with contacts based on the height and velocity of each foot, and a network is trained on them to estimate contact in a supervised manner. The output of the 2D Pose Estimator is used as the input, and the network predicts a binary value at each foot node (the pose estimator essentially outputs a graph of links and nodes). The hand-object contact information, however, is given manually.
  • Policy Learner: We'll get into this shortly.
  • Scene Geometry Builder: This one too.
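To make the rotation trick for the 2D pose estimator concrete, here is a minimal sketch of the "rotate, detect, and pick the best" step. The `estimate_2d_pose` function is a hypothetical stand-in for an OpenPose call, and the scoring rule that combines confidence with temporal coherence is an illustrative assumption, not the paper's exact criterion.

```python
import numpy as np

def unrotate_keypoints(kps, k, orig_h, orig_w):
    """Map (x, y) keypoints detected in a frame rotated by k*90 deg CCW (np.rot90)
    back into the original frame's coordinate system."""
    x, y = kps[:, 0], kps[:, 1]
    k = k % 4
    if k == 0:
        return kps.copy()
    if k == 1:
        return np.stack([orig_w - 1 - y, x], axis=1)
    if k == 2:
        return np.stack([orig_w - 1 - x, orig_h - 1 - y], axis=1)
    return np.stack([y, orig_h - 1 - x], axis=1)

def best_rotated_pose(frame, prev_pose, estimate_2d_pose, coherence_weight=0.01):
    """Run the 2D pose estimator on 0/90/180/270-degree rotations of a frame and
    keep the detection with the best confidence / temporal-coherence score.

    `estimate_2d_pose(image) -> (keypoints[J, 2], confidences[J])` stands in for an
    OpenPose-style detector; the scoring weights below are illustrative.
    """
    h, w = frame.shape[:2]
    best_score, best_pose = -np.inf, None
    for k in range(4):
        keypoints, conf = estimate_2d_pose(np.rot90(frame, k=k))
        keypoints = unrotate_keypoints(keypoints, k, h, w)
        # Temporal coherence: penalize large jumps from the previous frame's pose.
        jump = np.linalg.norm(keypoints - prev_pose, axis=1).mean() if prev_pose is not None else 0.0
        score = conf.mean() - coherence_weight * jump
        if score > best_score:
            best_score, best_pose = score, keypoints
    return best_pose
```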

Human Dynamics

The dynamics model of our humanoid is an articulated figure consisting of links and joints controlled by Proportional-Derivative (PD) servos. Our aim is to learn a control policy (controller) that mimics the 3D human movements in the video, based on the estimated 2D and 3D poses, contact labels, and object-interaction hints provided earlier in the pipeline.
If $\pi(a_t \mid s_t, e)$ is the control policy, where $a_t$ is the action and $s_t$ is the state of the dynamics model at a given time $t$ in an environment $e$, then the performed action provides a target pose $\bar{q}_t$ for PD control at time $t$ of the form:
$$\bar{q}_t = \hat{q}_t + a_t$$

where $\hat{q}_t$ is a reference pose constructed from the joint orientations given by 3D pose estimation.
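To see how the PD servos use this target pose, here is a minimal sketch of the torque computation. The gains `KP` and `KD` and the function name are illustrative placeholders, not values or code from the paper.

```python
import numpy as np

# Illustrative PD gains; the paper does not specify these values here.
KP, KD = 300.0, 30.0

def pd_torques(q_ref, action, q, q_dot, kp=KP, kd=KD):
    """Compute joint torques that drive the character toward the policy's target pose.

    q_ref  : reference joint angles from 3D pose estimation (q_hat_t)
    action : policy output a_t (offset applied to the reference pose)
    q      : current simulated joint angles
    q_dot  : current simulated joint velocities
    """
    q_bar = q_ref + action          # target pose: q_bar_t = q_hat_t + a_t
    return kp * (q_bar - q) - kd * q_dot
```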
The state is defined as an aggregated vector of the form:
$$s_t = [\phi_t, p_t, R_t, v_t, \omega_t]$$

where:
  • $\phi_t \in [0, 1]$ is the time normalized by the clip length
  • $p_t$ is the positions of the body links
  • $R_t$ is the orientations of the body links, represented as unit quaternions
  • $v_t$ is the linear velocities
  • $\omega_t$ is the angular velocities
The environment $e$ is an aggregated vector consisting of the height, size, and location of objects, which comes in handy during the scene-fitting process.
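As a rough illustration of how the state vector $s_t$ might be assembled from the simulated character (the array shapes and function name are assumptions for illustration, not the authors' implementation):

```python
import numpy as np

def build_state(t, clip_length, link_positions, link_quats, link_lin_vel, link_ang_vel):
    """Concatenate the state s_t = [phi_t, p_t, R_t, v_t, omega_t] into one flat vector.

    link_positions : (L, 3) world positions of the body links
    link_quats     : (L, 4) unit quaternions for the link orientations
    link_lin_vel   : (L, 3) linear velocities
    link_ang_vel   : (L, 3) angular velocities
    """
    phi = np.array([t / clip_length])            # normalized time in [0, 1]
    return np.concatenate([
        phi,
        link_positions.ravel(),
        link_quats.ravel(),
        link_lin_vel.ravel(),
        link_ang_vel.ravel(),
    ])
```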

Quick Recap of Reinforcement Learning

Let's do a quick overview of a discrete-time dynamics system using the deep reinforcement learning paradigm.
We search for a control policy, say $\pi$, that maximizes the discounted cumulative reward $V^{\pi}(s)$. At any given time $t$, the agent in some state $s_t$ takes some action $a_t$. The environment $e$, as a consequence, responds to that action by moving to state $s_{t+1}$ and giving a reward $r_t$. The discounted cumulative reward of the policy, with discount factor $\gamma$, is given by:
$$V^{\pi}(s_t) = \mathbb{E}_{s_0 = s_t,\, a_i \sim \pi} \left[ \sum_{i=0}^{\infty} \gamma^{i} r_i \right], \qquad \gamma \in [0, 1)$$

And our optimal control policy would be:
$$\pi^{*} = \arg\max_{\pi} V^{\pi}$$
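As a tiny worked example of the discounted return defined above (a minimal sketch; the reward values are made up):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_i gamma^i * r_i for a finite list of rewards."""
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total

# Example: three steps of made-up rewards.
print(discounted_return([1.0, 0.5, 0.25]))  # 1.0 + 0.99*0.5 + 0.99**2 * 0.25 ≈ 1.74
```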


Reward Design

Most RL Algorithms follow this general framework, and the key differentiating success factor is the reward design. For this paper, the reward is quite intricate, consisting of many sub-terms. Let's have a closer look:

Good Tracking of Estimated Poses $r_{\text{track}}$

This reward encourages accurate tracking of the estimated poses and consists of five sub-terms:
  1. Pose: $r_q$ favors a good match between the estimated and simulated poses. If $\hat{q}$ is a vector of joint angles of the estimated model and $q$ is a vector of joint angles of the simulated model, this reward is calculated as:
    $r_q = \exp\left(-\alpha_q \, \|\hat{q} - q\|^{2}\right)$
  2. Velocity: $r_v$ favors a good match between the estimated and simulated velocities, where $v$ is a vector of joint velocities computed by finite differences:
    $r_v = \exp\left(-\alpha_v \, \|\hat{v} - v\|^{2}\right)$
  3. Body Orientation: $r_{\text{ori}}$ favors a good match between the body up-vectors estimated from the video and from the simulation. Assuming a steady up direction throughout the video, $\hat{\theta}$ is the angle in the image plane between the view up-vector $v_Y$ (a 2D vector pointing upward in the image plane) and the body up-vector $v_b$ (a 2D vector from the pelvis to the neck obtained from OpenPose); it acts as a hint for the body up-vector in 3D space. $\theta$ is the angle between the y-axis in 3D space and the vector from the character's pelvis joint to its head joint:
    $r_{\text{ori}} = \exp\left(-\alpha_{\text{ori}} \, \|\hat{\theta} - \theta\|^{2}\right)$
  4. Contact: $r_{\text{contact}}$ favors a good match between the estimated contact states $\hat{c}$ and the simulated contact states $c$ (only the four end-effectors, i.e., the hands and feet, are considered). As mentioned earlier, these states are binary flags:
    $r_{\text{contact}} = \exp\left(-\alpha_c \sum_{l \in \{rf,\, lf,\, rh,\, lh\}} \|\operatorname{xor}(\hat{c}_l, c_l)\|^{2}\right)$
  5. Regularization: $r_{\text{reg}}$ minimizes joint torques via standard L2 regularization to prevent excessive forces and unnecessary movements. If $\tau$ is an aggregated vector of torques:
    $r_{\text{reg}} = \exp\left(-\alpha_{\text{reg}} \, \|\tau\|^{2}\right)$
Thus, the total tracking reward is given by:
$$r_{\text{track}} = w_q r_q + w_v r_v + w_{\text{ori}} r_{\text{ori}} + w_{\text{contact}} r_{\text{contact}} + w_{\text{reg}} r_{\text{reg}}$$
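To make the structure of $r_{\text{track}}$ concrete, here is a minimal sketch that combines the five sub-terms. All $\alpha$ scales and weights $w$ are illustrative placeholders, not the values used in the paper.

```python
import numpy as np

def tracking_reward(q_hat, q, v_hat, v, theta_hat, theta, c_hat, c, tau,
                    alphas=dict(q=2.0, v=0.1, ori=5.0, c=1.0, reg=0.001),
                    weights=dict(q=0.5, v=0.1, ori=0.1, c=0.2, reg=0.1)):
    """Combine the five tracking sub-rewards; all alpha/weight values are illustrative."""
    r_q = np.exp(-alphas['q'] * np.sum((q_hat - q) ** 2))
    r_v = np.exp(-alphas['v'] * np.sum((v_hat - v) ** 2))
    r_ori = np.exp(-alphas['ori'] * (theta_hat - theta) ** 2)
    # c_hat and c are binary arrays over the four end-effectors (rf, lf, rh, lh).
    r_contact = np.exp(-alphas['c'] * np.sum(np.logical_xor(c_hat, c)))
    r_reg = np.exp(-alphas['reg'] * np.sum(tau ** 2))
    return (weights['q'] * r_q + weights['v'] * r_v + weights['ori'] * r_ori
            + weights['c'] * r_contact + weights['reg'] * r_reg)
```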





Good Alignment with Scene Objects $r_{\text{scene}}$

Figure 2: The $r_{\text{scene}}$ reward terms, namely Distance (left), Alignment (middle), and Center of Mass (right)
This reward encourages good alignment with the overall scene and consists of three sub-terms:
  1. Distance: $r_{\text{dist}}$ ensures that the simulated end-effectors (hands and feet) make contact with the desired object. If $p_l$ is the position of an end-effector and $Z_l$ is the target zone on the target object (in our case, a quarter-sized area at the center of the surface), this reward is calculated as (a sketch of a possible $\operatorname{mindist}$ follows below):
    $r_{\text{dist}} = \hat{c} \, \exp\left(-\alpha_{\text{dist}} \sum_{l \in \{rf,\, lf,\, rh,\, lh\}} \|\operatorname{mindist}(Z_l, p_l)\|^{2}\right)$
  2. Alignment: $r_{\text{align}}$ favors alignment between the character and the object when they are in contact. If $u_{\text{pelvis}}$ is a unit vector along the frontal axis of the pelvis and $u_{\text{obj}}$ is the corresponding unit vector of the object, this reward is calculated as:
    $r_{\text{align}} = \hat{c} \, \exp\left(-\alpha_{\text{align}} \, \|1 - u_{\text{pelvis}} \cdot u_{\text{obj}}\|^{2}\right)$
  3. Center of Mass: $r_{\text{com}}$ indicates the suitability of the character's trajectory. If $d_{ee}$ is the distance between the center of mass and the end-effector at landing time and $d_{obj}$ is the distance between the expected center of mass and the expected landing position, this reward is calculated as:
    $r_{\text{com}} = \exp\left(-\alpha_{\text{com}} \, \|d_{ee} - d_{obj}\|^{2}\right)$
The distance-based reward $r_{\text{dist}}$ is non-zero only when the contact flag is on ($\hat{c} = 1$).
If the simulated character perfectly tracks the reference pose and contact timing, $d_{ee}$ is equal to $d_{obj}$.
Thus, the total scene reward is given by:
$$r_{\text{scene}} = w_{\text{dist}} r_{\text{dist}} + w_{\text{align}} r_{\text{align}} + w_{\text{com}} r_{\text{com}}$$
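The paper's $\operatorname{mindist}$ term reads naturally as the shortest distance from the end-effector to the target zone, which is zero when the end-effector lies inside the zone. A rough sketch for an axis-aligned rectangular zone (an assumption made here purely for illustration):

```python
import numpy as np

def mindist_to_zone(p, zone_min, zone_max):
    """Shortest distance from point p (shape (3,)) to an axis-aligned box [zone_min, zone_max].

    Returns 0 when p is inside the zone, so r_dist is maximal when the
    end-effector touches the target area.
    """
    clamped = np.clip(p, zone_min, zone_max)  # closest point of the zone to p
    return np.linalg.norm(p - clamped)
```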


Training

The value function, say $V$, and the control policy $\pi$ are represented by neural networks. During training, experience tuples $(s_t, a_t, s_{t+1}, r)$ are collected and the networks are updated using Proximal Policy Optimization (PPO).
If the character tumbles and falls, the subsequent tuples are useless for learning, so the simulation is terminated before it reaches the end of the video whenever the character stumbles. The authors monitor the CoM height, collisions with obstacles, and deviations from the expected behavior to decide whether early termination is needed. Early termination, however, causes sample-imbalance issues in the distribution of experience tuples; to deal with this, tuples are drawn uniformly in time.
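A minimal sketch of what such an early-termination check might look like; the signals and thresholds here are illustrative assumptions, not the paper's exact criteria.

```python
def should_terminate(com_height, min_com_height, collided_with_obstacle, pose_error, max_pose_error):
    """Stop a rollout early when the character has clearly failed.

    com_height             : current height of the character's center of mass
    collided_with_obstacle : whether an unwanted collision occurred this step
    pose_error             : deviation from the expected (reference) behavior
    All thresholds are illustrative placeholders.
    """
    if com_height < min_com_height:        # character has fallen
        return True
    if collided_with_obstacle:             # ran into scene geometry
        return True
    if pose_error > max_pose_error:        # drifted too far from the reference
        return True
    return False
```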

Scene Fitting

The computational cost of policy learning depends significantly on the length of the input video and the number of objects in the scene. The authors found that it is not necessary to learn a single policy network over the whole video with all of its objects; instead, they learn a sequence of policy networks over shorter, overlapping time windows. While learning in a window, the configuration of each object is parameterized with respect to the previous object.
The authors train the policy of each window sequentially; once the value and policy networks for every window have been learned, the local arrangement of three consecutive objects can be estimated from the networks by maximizing the cumulative reward in the window.
Thus, the globally consistent scene arrangement maximizes:
$$\begin{aligned} e^* &= \arg\max_{e} \sum_{n=1}^{N} f_n(e) \\ f_n(e) &= V^{\pi^*_n}\!\left(s(t_{n-1}), e\right) - V^{\pi^*_n}\!\left(s(t_{n}), e\right) \end{aligned}$$

where:
  • $N$ is the number of windows
  • $f_n(e)$ is the cumulative reward on the $n$-th window
  • $\pi^*_n$ is the optimal policy of the $n$-th window
  • $V^{\pi^*_n}$ is the value function learnt along with the policy
  • $s(t_n)$ is the initial state of the $n$-th window
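A rough sketch of how this scene-fitting objective could be evaluated once the per-window value networks are trained. The candidate-generation and search strategy here are assumptions for illustration; the paper only specifies the objective being maximized.

```python
import numpy as np

def fit_scene(value_fns, window_start_states, candidate_envs):
    """Pick the object arrangement e that maximizes sum_n f_n(e).

    value_fns           : list of N learned value functions, value_fns[n](state, e) -> float
    window_start_states : list of N+1 states s(t_0), ..., s(t_N) at the window boundaries
    candidate_envs      : iterable of candidate environment parameter vectors e
    """
    best_e, best_score = None, -np.inf
    for e in candidate_envs:
        score = 0.0
        for n, V in enumerate(value_fns):
            # f_n(e) = V^{pi*_n}(s(t_{n-1}), e) - V^{pi*_n}(s(t_n), e)
            score += V(window_start_states[n], e) - V(window_start_states[n + 1], e)
        if score > best_score:
            best_e, best_score = e, score
    return best_e
```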

Conclusion

In this article, we discussed "Human Dynamics from Monocular Video with Dynamic Camera Movements," published at SIGGRAPH Asia 2021, in which the authors overcome the static-view limitation of most previous methods and handle dynamic-view videos.
This allows the camera to pan, tilt, and zoom to track the moving subject. Since no limitations are assumed on camera movements, body translations and rotations observed in the video do not correspond to absolute positions in the reference frame; inference is nevertheless possible because human motion obeys the laws of physics.
