# Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation

- 3 mins

## Takeaways

• Uses images of the same person taken from multiple cameras over time to learn a latent variable body representation that captures the 3D geometry of the human body, without any 2D or 3D annotations. This representation encodes both 3D pose and appearance, and can be used to predict the 3D pose in a supervised manner without much labelled data.

• The latent representation is disentangled into separate codes for the body’s 3D pose (geometry), its appearance, and the background.

• The latent code is augmented with view-change information: the decoder learns to reconstruct the encoded image from a new perspective. This borrows ideas from Novel View Synthesis (NVS), the task of creating realistic images from previously unseen viewpoints.

• No annotations are required to learn the latent representation. For 3D pose estimation on Human3.6M, the method outperforms fully-supervised methods when fewer annotated images are available, and improves over other semi-supervised methods while using as little as 1% of the annotated data.

### Constructing Geometry-Aware Latent Representation

With,

$L$ – Latent Representation

$L^{3D}, L^{app}, B$ – Latent representations of the body’s 3D pose, its appearance, and the background, respectively.

$U = \{ (I_t^i, I_t^j) \}_{t=1}^{N_u}$ – Set of $N_u$ image pairs taken by cameras $i$ and $j$ at time $t$.

$R^{i \rightarrow j}$ – Rotation matrix from camera i coordinate system to camera j.

$E_{\theta_e}$ and $D_{\theta_d}$ – Encoder and decoder parts of the network, with learnable parameters $\theta_e$ and $\theta_d$ respectively.

Thus, we encode the image $I$ into a latent representation $L = E_{\theta_e}(I)$ and reconstruct it back to $\hat{I} = D_{\theta_d}(L)$ by minimizing $\| I - \hat{I} \|_2$ over the training set $U$.
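This encode/reconstruct/L2 loop can be sketched minimally in numpy. The linear maps below are stand-ins for the real CNN encoder and decoder (which the notes don't specify); the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: flattened image of size d, latent of size k.
d, k = 64, 8

# Stand-ins for the learnable encoder E (theta_e) and decoder D (theta_d).
W_e = rng.normal(size=(k, d)) / np.sqrt(d)
W_d = rng.normal(size=(d, k)) / np.sqrt(k)

def encode(I):
    return W_e @ I          # L = E(I)

def decode(L):
    return W_d @ L          # I_hat = D(L)

I = rng.normal(size=d)      # a stand-in "image"
L = encode(I)
I_hat = decode(L)

# Reconstruction objective: squared L2 distance between I and I_hat,
# which training would minimize over the set U.
loss = np.sum((I - I_hat) ** 2)
```

In the actual model this loss is back-propagated through both networks; here it is only evaluated once to show the data flow.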

##### Encoding Geometry

With $(I_t^i, I_t^j) \in U$ images from different viewpoints $i$ and $j$, and the rotation matrix $R^{i \rightarrow j}$, view-change information can be introduced as an additional input to the encoder and decoder, which are trained to encode $I_t^i$ and resynthesize $I_t^j$.

This models the view change as a 3D rotation: the encoder output is matrix-multiplied by $R^{i \rightarrow j}$ before being passed to the decoder.

Formally, the auto-encoder $A_{\theta _e, \theta _d}$ outputs,

$A_{\theta _e, \theta _d} ( I_t^i, R^{i \rightarrow j} ) = D_{ \theta _d} (R^{i \rightarrow j} L_{i,t}^{3D})$, with $L_{i,t}^{3D} = E_{ \theta _e} (I_t^i)$

and is optimized by minimizing $\| A_{\theta_e, \theta_d} ( I_t^i, R^{i \rightarrow j} ) - I_t^j \|_2$ over the training set $U$.

The decoder doesn’t need to learn how to rotate the input to a new view, only how to decode the 3D latent vector $L^{3D}$. Being a $3 \times N$ matrix, $L^{3D}$ can be understood as a set of $N$ 3D points, and can be mapped by a separate network $F: L^{3D} \rightarrow \mathbb{R}^{3K}$ to the 3D pose space ($K$ body joints) in a semi-supervised setup.
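The key point above, that a view change is just a matrix multiplication applied to the $3 \times N$ latent, can be verified in a few lines. The specific rotation (90° about the z-axis) is only an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)

# L^3D is a 3 x N matrix: N latent 3D points.
N = 16
L3d = rng.normal(size=(3, N))

# Hypothetical rotation from camera i to camera j: 90 degrees about z.
theta = np.pi / 2
R_ij = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                 [np.sin(theta),  np.cos(theta), 0.0],
                 [0.0,            0.0,           1.0]])

# View change = plain matrix multiplication; the decoder only ever
# receives an (already rotated) 3 x N latent.
L3d_rot = R_ij @ L3d
```

Because $R^{i \rightarrow j}$ is orthogonal, the rotated latent keeps the same shape and the norm of every latent point is preserved, which is why the decoder can stay view-agnostic.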

##### Disentangling Appearance

Two frames $I_t$ and $I_{t'}$ of the same subject at different times $t$ and $t'$ are trained simultaneously. Since the differences between the images are caused by 3D pose changes (appearance stays constant over time), the latent representations are swapped: the decoder uses $L_t^{3D}$ and $L_{t'}^{app}$ to resynthesize frame $t$, and $L_{t'}^{3D}$ and $L_t^{app}$ for frame $t'$. This results in $L^{3D}$ encoding pose and $L^{app}$ encoding appearance.
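The swap can be sketched as follows; the concatenating `decode` is a toy stand-in for the real CNN decoder, and the latent sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in latents for two frames t and t' of the same subject:
# a 3 x 16 pose code and a 32-dim appearance code each.
L3d_t,  Lapp_t  = rng.normal(size=(3, 16)), rng.normal(size=32)
L3d_tp, Lapp_tp = rng.normal(size=(3, 16)), rng.normal(size=32)

def decode(L3d, Lapp):
    # Toy decoder: concatenate pose and appearance codes (real model: CNN).
    return np.concatenate([L3d.ravel(), Lapp])

# Swap: frame t is resynthesized from its own pose L3d_t but the
# appearance Lapp_tp taken from t', and vice versa. Since appearance is
# constant across time, reconstruction only succeeds if L^app really
# carries appearance and L^3D really carries pose.
recon_t  = decode(L3d_t,  Lapp_tp)
recon_tp = decode(L3d_tp, Lapp_t)
```

The reconstruction targets for `recon_t` and `recon_tp` are the original frames $I_t$ and $I_{t'}$, which is what forces the disentanglement.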

The encoder-decoder setup now becomes

$A_{\theta_e, \theta_d} ( I_t^i, I_{t'}^k, R^{i \rightarrow j} ) = D_{\theta_d} (R^{i \rightarrow j} L_{i,t}^{3D}, L_{k,t'}^{app})$, with $L_{i,t}^{3D} = E_{\theta_e}(I_t^i)$ and $L_{k,t'}^{app} = E_{\theta_e}(I_{t'}^k)$.

##### Disentangling Background

By constructing a background image $B_j$ for each viewpoint $j$ (the per-pixel median of all images from that viewpoint), a direct connection to $B_j$ is introduced in the decoder, with an additional $1 \times 1$ convolutional layer synthesizing the decoded image.
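The per-pixel median trick works because the moving subject occupies any given pixel in only a minority of frames, so the static background wins the vote at every pixel. A small synthetic sketch (frame sizes and pixel values are made up):

```python
import numpy as np

# Hypothetical stack of T frames (H x W) from one fixed viewpoint j.
T, H, W = 50, 8, 8
frames = np.full((T, H, W), 0.5)            # static background value 0.5

# A moving "person": a bright patch at a different pixel in each frame,
# so no single pixel is covered in more than a few frames.
for t in range(T):
    frames[t, t % H, (2 * t) % W] = 1.0

# Background image B_j = per-pixel median over time; the transient
# foreground values are outvoted at every pixel.
B_j = np.median(frames, axis=0)
```

The recovered `B_j` equals the static background everywhere, even though every frame contains some foreground.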

##### Optimization

The sum of the per-pixel error is minimized over mini-batches $Z$ of triplets $(I_t^i, I_t^j, I_{t'}^k) \in U$ drawn from individual sequences.

That is,

$\min_{\theta_e, \theta_d} \sum_{(I_t^i, I_t^j, I_{t'}^k) \in Z} \| A_{\theta_e, \theta_d} ( I_t^i, I_{t'}^k, R^{i \rightarrow j} ) - I_t^j \|_2$
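Putting the pieces together, the objective for one triplet can be sketched end to end. The split `encode` and concatenating `decode` below are toy stand-ins for the real networks, and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy dimensions: pose latent 3 x N, appearance latent d_app,
# "image" = flattened vector of matching size.
N, d_app = 16, 8
d_img = 3 * N + d_app

def encode(I):
    # Hypothetical split encoder: pose latent (3 x N) and appearance latent.
    return I[: 3 * N].reshape(3, N), I[3 * N :]

def decode(L3d, Lapp):
    # Toy decoder: concatenate pose and appearance codes.
    return np.concatenate([L3d.ravel(), Lapp])

def nvs_loss(I_i, I_j, I_k, R_ij):
    # Pose from view i at time t, appearance from view k at time t',
    # rotated into view j and compared against the real image I_t^j.
    L3d, _ = encode(I_i)
    _, Lapp = encode(I_k)
    I_hat = decode(R_ij @ L3d, Lapp)
    return np.sum((I_hat - I_j) ** 2)      # per-pixel squared error

# One triplet (I_t^i, I_t^j, I_t'^k) from the mini-batch Z.
I_i, I_j, I_k = (rng.normal(size=d_img) for _ in range(3))
R = np.eye(3)                              # identity rotation for the sketch
batch_loss = nvs_loss(I_i, I_j, I_k, R)

# Sanity check: if the target is exactly the resynthesis, the loss is zero.
perfect = decode(R @ encode(I_i)[0], encode(I_k)[1])
```

Summing `nvs_loss` over all triplets in the mini-batch and back-propagating into $\theta_e, \theta_d$ gives the training step described above.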