Uses images of the same person taken from multiple cameras over time to learn a latent variable body representation that captures the 3D geometry of the human body, without any 2D or 3D annotations. This representation encodes both 3D pose and appearance, and can be used to predict 3D pose in a supervised manner with little labelled data.
The latent representation is disentangled into separate representations of the body’s 3D pose, geometry and appearance.
The latent code is augmented with view-change information, and the decoder learns to reconstruct the encoded image from a new perspective. This borrows ideas from Novel View Synthesis (NVS), the task of creating realistic images from previously unseen viewpoints.
No annotations are required to learn the latent representation. For 3D pose estimation on Human3.6M, the method performs better than fully-supervised methods when fewer annotated images are provided, and improves over other semi-supervised methods while using as little as 1% annotated data.
– L – latent representation of the input image.
– L_3D, L_app, B – latent representations of the body’s 3D pose, appearance and background respectively.
– (I_t^i, I_t^j) – set of image pairs, from cameras i and j at time t.
– R_{i→j} – rotation matrix from camera i’s coordinate system to camera j’s.
– E_θ and D_ψ – encoder and decoder parts of the network, with learnable parameters θ and ψ respectively.
Thus, we encode the image I into a latent representation L = E_θ(I) and reconstruct it back to Î = D_ψ(L) by minimizing ‖I − D_ψ(E_θ(I))‖² over the training set.
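The plain auto-encoding objective above can be sketched as follows. This is a minimal NumPy illustration, not the paper’s network: the linear maps `W_enc`/`W_dec` and all dimensions are hypothetical stand-ins for E_θ and D_ψ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a flattened 128-pixel image and a 32-dim latent.
IMG_DIM, LATENT_DIM = 128, 32

# Linear stand-ins for the encoder E_theta and decoder D_psi.
W_enc = rng.normal(scale=0.1, size=(LATENT_DIM, IMG_DIM))  # plays theta
W_dec = rng.normal(scale=0.1, size=(IMG_DIM, LATENT_DIM))  # plays psi

def encode(image):
    """L = E_theta(I): map an image to its latent representation."""
    return W_enc @ image

def decode(latent):
    """I_hat = D_psi(L): reconstruct the image from the latent."""
    return W_dec @ latent

def reconstruction_loss(image):
    """Squared error ||I - D_psi(E_theta(I))||^2 minimized during training."""
    return float(np.sum((image - decode(encode(image))) ** 2))

image = rng.normal(size=IMG_DIM)
loss = reconstruction_loss(image)
```

In training, this scalar loss would be averaged over the training set and back-propagated through both maps.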
With images I_t^i and I_t^j from different viewpoints i and j, and the rotation matrix R_{i→j}, the view-change information can be introduced as an additional input to the encoder and decoder, which are trained to encode I_t^i and resynthesize I_t^j.
This models the view change as a 3D rotation: the encoder output is matrix-multiplied by R_{i→j} before being used as input to the decoder.
Formally, the auto-encoder outputs Î_t^j = D_ψ(R_{i→j} · E_θ(I_t^i)), and is optimized by minimizing ‖I_t^j − Î_t^j‖² over the training set.
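The rotation step can be sketched numerically: the latent is treated as a 3×N point set and left-multiplied by R_{i→j} before decoding. A minimal sketch with an assumed latent size and an example z-axis rotation:

```python
import numpy as np

# L_3D as a 3 x N matrix: N latent 3D points (N is an assumed size).
N = 16
rng = np.random.default_rng(1)
L_3d = rng.normal(size=(3, N))

def rotation_z(angle):
    """Example rotation matrix R_{i->j}, here about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

R_ij = rotation_z(np.pi / 2)

# The decoder receives R_{i->j} @ L_3D instead of L_3D, so it only ever
# decodes latents; it never has to learn the rotation itself.
L_rotated = R_ij @ L_3d
```

Because R_{i→j} is orthonormal, rotating the latent preserves its geometry (distances and norms), which is what makes the "set of 3D points" interpretation consistent.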
The decoder doesn’t need to learn how to rotate the input to a new view, but only how to decode the 3D latent vector L_3D. Being a 3×N matrix, L_3D can be understood as a set of N 3D points, and can be mapped to the 3D pose space with a separate decoder in a semi-supervised setup.
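The separate pose decoder can be sketched as a simple learned map from the latent point set to joint positions. All sizes here are assumptions (N latent points, J joints), and the linear map `W_pose` is a hypothetical stand-in for the small supervised decoder fit on the annotated subset:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed sizes: N latent 3D points, J body joints.
N, J = 16, 17
L_3d = rng.normal(size=(3, N))               # geometry-aware latent (3 x N)
W_pose = rng.normal(scale=0.1, size=(J, N))  # learnable pose-decoder weights

def pose_decoder(latent):
    """Map the 3 x N latent point set to J 3D joint positions (J x 3).

    In the semi-supervised setup only these weights need labelled data;
    the latent itself is learned without any annotations.
    """
    return (latent @ W_pose.T).T  # shape (J, 3)

pose = pose_decoder(L_3d)
```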
Two frames I_t and I_{t'} of the same subject at different times t and t’ are trained simultaneously. Since the differences between the images are caused only by 3D pose changes, the latent representations are partially swapped: the decoder uses L_3D of frame t together with L_app of frame t’ to resynthesize frame t, and vice versa for frame t’. This results in L_3D encoding pose and L_app encoding appearance.
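The swap itself is just a recombination of latent halves before decoding. A toy sketch (the identity encoder and the 6/6 latent split are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def encode_split(image):
    """Stand-in encoder returning the (L_3D, L_app) halves of the latent."""
    latent = image  # identity stand-in for E_theta
    return latent[:6], latent[6:]

frame_t = rng.normal(size=12)   # frame at time t
frame_tp = rng.normal(size=12)  # frame at time t'

L3d_t, Lapp_t = encode_split(frame_t)
L3d_tp, Lapp_tp = encode_split(frame_tp)

# Swap appearance codes across time: frame t is reconstructed from its own
# pose code L_3D(t) but the other frame's appearance code L_app(t').
# Since appearance is time-invariant, the reconstruction can only succeed
# if L_3D carries the pose and L_app carries the appearance.
decoder_input_t = np.concatenate([L3d_t, Lapp_tp])
decoder_input_tp = np.concatenate([L3d_tp, Lapp_t])
```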
The encoder-decoder setup now becomes Î_t^j = D_ψ(R_{i→j} · L_3D(I_t^i), L_app(I_{t'}^i)), where L_3D(·) and L_app(·) denote the pose and appearance parts of the encoder output.
By constructing a background image B (the per-pixel median of all images from a viewpoint), a direct connection to B is introduced in the decoder, with an additional convolutional layer to synthesize the decoded image.
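The background construction is a per-pixel median over a camera’s frames: since the moving person covers any given pixel in only a minority of frames, the median recovers the static background. A toy sketch with assumed frame sizes:

```python
import numpy as np

# Toy sequence: 30 frames of an 8x8 single-channel view (assumed sizes).
background = np.full((8, 8), 0.5)
frames = np.repeat(background[None], 30, axis=0)

# A moving "person" occupies a small patch in the first few frames.
for k in range(10):
    frames[k, 2:4, k % 6:k % 6 + 2] = 1.0

# Per-pixel median over time: each pixel is covered by the foreground in
# only a few of the 30 frames, so the median returns the background value.
B = np.median(frames, axis=0)
```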
The sum of the per-pixel errors is minimized over mini-batch triplets (I_t^i, I_t^j, I_{t'}^i) drawn from individual sequences.