Learning to Reconstruct People


Reconstructing human body shapes from a single image or a video stream is a challenging problem. The reconstruction needs to be accurate enough that automated body measurements agree with the real ones, and practical in the sense that it is fast and uses as few sensors as possible. Interest in this problem is partly driven by the growing demand for applications such as telepresence, virtual and augmented reality, cinematography, virtual try-on, gaming and body health monitoring.

There has been steady research in the area of retrieving human body shape and pose from images and video, especially with the advent of deep neural networks. This progress can be attributed to advances in computing, computer graphics, and deep learning techniques.

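This blog post is divided into sections on 3D morphable models, blend skinning and blend shapes, datasets, tools, and takeaways from the current literature.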

The intention of this blog post is to collect the current research on recovering human shape from images and video in one place, and possibly to provide a primer for someone interested in the field.

Hence, this post will be updated regularly.

Before diving into the research paper summaries, it’s useful to discuss how a “mean template” or a deformable model (the term 3DMM is usually used for a face template model in literature) is built.


3D Morphable Models

Large-scale face model (LSFM)

3D Morphable models (3DMMs) are powerful 3D statistical models, usually of human face shape and texture. A 3DMM is constructed by performing some form of dimensionality reduction, typically PCA, on a training set of meshes, in dense correspondence with one another (vertices are made consistent across all meshes).

Thus, 3DMMs are usually constructed by first establishing group-wise dense correspondences between a set of meshes, and then performing some kind of statistical analysis on the registered data to produce a low-dimensional model.
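To make the "statistical analysis" step concrete, here is a minimal NumPy sketch of building a PCA shape model from meshes that are already in dense correspondence. The function names (`build_pca_shape_model`, `synthesize`) and the array layout are assumptions made for illustration; this is not the LSFM pipeline itself, which also handles registration, pruning of bad fits and texture.

```python
import numpy as np

def build_pca_shape_model(meshes, n_components):
    """Build a linear shape model from registered meshes.

    meshes: array of shape (N, V, 3), N meshes with V corresponding vertices.
    Returns the mean shape (3V,), the principal directions (k, 3V) and the
    standard deviation along each direction (k,).
    """
    n_meshes = meshes.shape[0]
    flat = meshes.reshape(n_meshes, -1)           # one row per mesh: (N, 3V)
    mean = flat.mean(axis=0)
    centred = flat - mean
    # SVD of the centred data; rows of vt are orthonormal principal directions.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    components = vt[:n_components]
    stddev = singular_values[:n_components] / np.sqrt(max(n_meshes - 1, 1))
    return mean, components, stddev

def synthesize(mean, components, stddev, coeffs):
    """Generate a shape instance from coefficients given in units of std dev."""
    flat = mean + (np.asarray(coeffs) * stddev) @ components
    return flat.reshape(-1, 3)
```

Setting all coefficients to zero returns the mean shape, and moving a single coefficient walks along one mode of variation of the training set.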

Establishing dense-correspondence is a challenging problem and usually needs some kind of soft constraint, typically guided by landmarks to perform an optimal similarity alignment between the mesh and the (annotated) template. Non-rigid iterative closest point (NICP) is then usually performed to deform the template so that it takes the shape of the input mesh, with the landmarks acting as a soft constraint. This is susceptible to failure cases because both landmark localization and NICP are non-convex optimization problems that are sensitive to initialization.
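The landmark-guided similarity alignment mentioned above has a closed-form least-squares solution. Below is a small sketch of that step only, using the Umeyama method, under the assumption that corresponding 3D landmarks are already available on both the template and the input mesh. The function name `similarity_align` is made up for this example, and the non-rigid NICP stage that follows is not shown.

```python
import numpy as np

def similarity_align(src, dst):
    """Least-squares similarity transform so that dst ~= scale * R @ src + t.

    src, dst: (K, 3) arrays of corresponding landmarks.
    Returns (scale, R, t) with R a proper rotation matrix.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / src.shape[0]          # cross-covariance (3, 3)
    u, s, vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a rotation, not a reflection.
    d = np.eye(3)
    if np.linalg.det(u) * np.linalg.det(vt) < 0:
        d[2, 2] = -1.0
    rotation = u @ d @ vt
    var_src = (src_c ** 2).sum() / src.shape[0]
    scale = np.trace(np.diag(s) @ d) / var_src
    translation = mu_dst - scale * rotation @ mu_src
    return scale, rotation, translation
```

The result is typically used only as an initialization: the template is placed roughly on top of the scan, and the non-rigid deformation then refines the fit.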

Nevertheless, 3DMMs are powerful priors of 3D shape that can be leveraged in fitting algorithms to reconstruct accurate and complete 3D representations from data-deficient sources (e.g., noisy depth scans). Any new input 3D mesh can thus be projected onto the model subspace by finding the shape vector that generates a shape instance as close as possible to the registered input mesh.
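For a linear (PCA) model with orthonormal components, this projection reduces to a couple of matrix products. The sketch below assumes the input mesh has already been registered to the template, and that `mean` and `components` come from some previously built model; `project_to_model` is an illustrative name, not an API of any library.

```python
import numpy as np

def project_to_model(mesh, mean, components):
    """Project a registered mesh onto a linear shape model.

    mesh: (V, 3) vertices in correspondence with the model template.
    mean: (3V,) mean shape; components: (k, 3V) with orthonormal rows.
    Returns the shape vector (k,) and the reconstructed mesh (V, 3).
    """
    coeffs = components @ (mesh.reshape(-1) - mean)
    reconstruction = (mean + coeffs @ components).reshape(-1, 3)
    return coeffs, reconstruction
```

Because the reconstruction lies in the model subspace, noise and holes that the model cannot explain are largely dropped, which is what makes the model useful as a prior.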

3DMMs also provide a mechanism to encode any 3D shape in a low-dimensional feature space, a compact representation that makes many 3D shape analysis problems tractable.

Blend Skinning and Blend Shapes

(Linear) blend skinning (LBS) is the idea of transforming the vertices of a single mesh by a blend of multiple transforms, i.e., each vertex on the mesh surface is transformed using a weighted influence of its neighboring bones. Blend skinning thus attaches the surface of a mesh to an underlying “skeletal” structure. There has been a lot of research on automatically rigging LBS models: taking a collection of meshes and inferring the bones, joints and blend weights. The problem with this approach is that the resulting models do not span a space of body shapes and often produce unnatural results.
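As a rough illustration of the weighted blend of bone transforms, here is a minimal NumPy sketch of LBS. It assumes the bone transforms are given as 4x4 matrices that map rest-pose coordinates to posed coordinates, and that each row of the weight matrix sums to one; the function name is chosen for this example.

```python
import numpy as np

def linear_blend_skinning(vertices, weights, bone_transforms):
    """Pose a mesh with linear blend skinning.

    vertices: (V, 3) rest-pose vertex positions.
    weights: (V, B) blend weights, each row summing to 1.
    bone_transforms: (B, 4, 4) rest-to-posed transform of every bone.
    Returns the posed vertices, shape (V, 3).
    """
    n_vertices = vertices.shape[0]
    homogeneous = np.concatenate([vertices, np.ones((n_vertices, 1))], axis=1)
    # Per-vertex transform: weighted sum of the bone matrices, shape (V, 4, 4).
    blended = np.einsum("vb,bij->vij", weights, bone_transforms)
    posed = np.einsum("vij,vj->vi", blended, homogeneous)
    return posed[:, :3]
```

Averaging matrices like this is what causes the well-known artifacts near strongly bent or twisted joints, since a blend of two rotations is in general not a rotation.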

In contrast to skeleton subspace deformation methods (aka blend skinning), a pose space deformation (PSD) model defines deformations relative to a base shape, where these deformations are a function of the articulated pose. Pose-dependent deformations can also be described in terms of the coefficients of basis vectors, i.e., by learning a low-dimensional PCA basis for each joint’s deformations. There has been a lot of research on learning efficient, linear, and realistic models from example meshes. “Skinned Multi-Person Linear” (SMPL), however, aims to be a realistic poseable model that covers the space of human shape variation.


SMPL possibly makes the human body model as simple and standard as possible. The SMPL model can realistically represent a wide range of human body shapes, can be posed with natural pose-dependent deformations, exhibits soft-tissue dynamics, is efficient to animate, and is compatible with existing rendering engines.

The resultant model maps shape and pose params to vertices. Given a particular skinning method SMPL’s goal is to learn to correct for limitations of the method to model training meshes. For more details please refer to MPI SMPL2015.

A key component of this model is that the pose blend shapes are formulated as a linear function of the elements of the part rotation matrices. On top of that, with the low polygon count, a simple vertex topology (for both men and women), clean quad structure, and a standard rig, SMPL makes a realistic learned model accessible to animators.
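A small sketch may help to see what "linear in the elements of the rotation matrices" means in practice. The code below converts each joint's axis-angle rotation to a matrix, subtracts the identity (the rest pose) and multiplies the flattened result with a bank of learned offsets. The array `pose_dirs` stands in for the learned pose blend shapes and is filled with random numbers in any test; the layout (9 features per joint, 23 non-root joints) is stated here as an assumption based on the SMPL paper, not read from the released model files.

```python
import numpy as np

def rodrigues(axis_angle):
    """Convert an axis-angle vector (3,) to a rotation matrix (3, 3)."""
    angle = np.linalg.norm(axis_angle)
    if angle < 1e-8:
        return np.eye(3)
    k = axis_angle / angle
    skew = np.array([[0.0, -k[2], k[1]],
                     [k[2], 0.0, -k[0]],
                     [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * skew + (1.0 - np.cos(angle)) * skew @ skew

def pose_blend_offsets(pose, pose_dirs):
    """Pose-dependent vertex offsets, linear in the rotation matrix elements.

    pose: (J, 3) axis-angle rotation of every non-root joint.
    pose_dirs: (9 * J, V, 3) stand-in for the learned pose blend shapes.
    Returns per-vertex offsets of shape (V, 3); zero in the rest pose.
    """
    features = np.concatenate(
        [(rodrigues(joint) - np.eye(3)).ravel() for joint in pose]
    )                                              # (9 * J,)
    return np.tensordot(features, pose_dirs, axes=1)
```

The offsets are nonlinear in the pose angles but linear in the feature vector, which is what allows the blend shapes to be learned with simple linear regression-style training.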

SMPL decomposes body shape into identity-dependent shape and non-rigid pose-dependent shape based on a vertex-based skinning approach that uses corrective blend shapes (blend shape represented as a vector of concatenated vertex offsets).
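This decomposition can be summarized in a few equations. The notation below follows the SMPL paper as far as I recall it, so treat the symbols as a sketch rather than a definitive statement: \bar{T} is the mean template, \beta the shape parameters, \theta the pose parameters (with \theta^{*} the rest pose), S_n and P_n the learned shape and pose blend shapes, K the number of non-root joints, J(\beta) the joint locations, \mathcal{W} the blend weights and W the skinning function.

```latex
\begin{aligned}
B_S(\beta) &= \sum_{n=1}^{|\beta|} \beta_n S_n, \\
B_P(\theta) &= \sum_{n=1}^{9K} \bigl(R_n(\theta) - R_n(\theta^{*})\bigr) P_n, \\
T_P(\beta, \theta) &= \bar{T} + B_S(\beta) + B_P(\theta), \\
M(\beta, \theta) &= W\bigl(T_P(\beta, \theta),\, J(\beta),\, \theta,\, \mathcal{W}\bigr).
\end{aligned}
```

In words: identity-dependent and pose-dependent offsets are added to the template in the rest pose, and the corrected template is then posed with a standard skinning function.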


Training systems capable of solving complex 3D vision tasks most often require large quantities of data. As labeling data is a costly and complex process, it is important to have mechanisms to design machine learning models that can comprehend the 3D world while being trained without much supervision. 3D shape ground truth is usually either limited or hard to obtain. In the case of human shape reconstruction, SMPL tends to be a popular choice since the representation generates high-quality 3D meshes while the system estimates only a handful of parameters i.e., 72 pose params and 10 shape params. This model also allows optimizing directly for the surface by using a 3D per-vertex loss.
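To give a feel for how small the regression target is, the sketch below splits an 82-dimensional prediction into 72 pose values (24 joints times 3 axis-angle values, counting the global orientation) and 10 shape coefficients, and computes a per-vertex loss between two meshes. The ordering of pose before shape and the choice of mean Euclidean distance are assumptions made for this example; papers differ in both.

```python
import numpy as np

NUM_POSE = 72    # 24 joints x 3 axis-angle values (assumed layout)
NUM_SHAPE = 10   # shape coefficients

def split_smpl_params(params):
    """Split a predicted parameter vector into (pose, shape)."""
    params = np.asarray(params)
    if params.shape[-1] != NUM_POSE + NUM_SHAPE:
        raise ValueError("expected 82 parameters in the last dimension")
    return params[..., :NUM_POSE], params[..., NUM_POSE:]

def per_vertex_loss(pred_vertices, gt_vertices):
    """Mean Euclidean distance between corresponding vertices (..., V, 3)."""
    return np.linalg.norm(pred_vertices - gt_vertices, axis=-1).mean()
```

In a training setup, the predicted parameters would be passed through a differentiable SMPL layer to obtain the vertices, so that this loss can be backpropagated to the network.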

Major datasets used are related to either 3D pose estimation or ones with (dense) 2d-3d (image-surface) correspondences.

HumanEva (2009)

HumanEva-I contains calibrated video sequences (4 people, 6 common indoor poses) that are synchronized with 3D body poses. The error metrics for computing error in 2D and 3D pose are also provided. HumanEva-II uses an additional camera, as well as multi-action scenarios including running around a loop performing actions.

Human3.6m (2014)

Human3.6m: A real image showing multiple people in different poses (left), and a matching sample of actors in similar poses (middle) together with their reconstructed 3D poses from the dataset, displayed using a synthetic 3D model (right).

Human3.6M contains 3.6 million frames of 3D human poses (11 people, 17 common indoor poses). The dataset, together with code for the associated large-scale learning models, features, visualization tools, mixed reality augmentations and the evaluation server, is available online.

SURREAL (2017)

SURREAL data generation pipeline

SURREAL (Synthetic hUmans foR REAL tasks), built upon Human3.6m, possibly provides one of the most diverse data generation pipelines for training human shape recovery models. Images (~6.5 million) in SURREAL are rendered from 3D sequences of MoCap data. The SMPL model is used to decompose the body into pose and shape parameters, which are sampled independently and combined with randomly sampled environments to produce an image. The generated RGB images are accompanied by 2D/3D poses, surface normals, optical flow, depth images, and body-part segmentation maps.

UP-3D (2017)

UP-3D (Unite the People) is an “in-the-wild” dataset of high-quality 3D body model fits for multiple human pose datasets, obtained with a human in the loop (annotators only sort fits into good and bad). It was demonstrated that training a pose estimator on the full 91-keypoint dataset helps to improve the state of the art for 3D human pose estimation on the two popular benchmark datasets HumanEva and Human3.6M.

DensePose (2018)

DensePose (by Facebook AI Research) contains about 5 million manually annotated image-to-surface correspondences on 50K COCO images, built via a multi-stage annotation pipeline. DensePose also proposes a variant of Mask R-CNN to densely regress part-specific UV coordinates within every human region.

JTA (2018)

JTA (Joint Track Auto) is a large dataset for pedestrian pose estimation and tracking in urban scenarios, created by exploiting the highly photorealistic video game Grand Theft Auto V developed by Rockstar North. It consists of 512 full-HD videos (256 for training and 256 for testing), each 30 seconds long and recorded at 30 fps.

3DPW (2018)

3DPW is possibly the first in-the-wild dataset with accurate 3D poses for evaluation (60 in-the-wild videos with 2D and 3D pose annotations, including camera poses and scanned models with different clothing variations). Each sequence comes with its corresponding models, and the poses are estimated using IMUs and a handheld 2D video.


Here are a few tools that might come in handy for working with human body recovery models. General-purpose tools like Blender are not listed.


Menpo is an open-source software framework for constructing and fitting visual deformable models written in Python. The Menpo project contains the associated tooling that provides end-to-end solutions for 2D and 3D deformable modeling with support for various training and fitting algorithms for deformable modeling. The framework also comes with a tool for bulk annotation for model training as well as pre-trained landmark localization models.


Announced in May 2019, TensorFlow Graphics is an add-on TensorFlow library with differentiable geometry and graphics layers as well as 3D viewer functionality (TensorBoard 3D).

This Colab Notebook is a great primer on how to use TF-Graphics. Check out https://github.com/tensorflow/graphics for more details and examples.

tensorboard 3d support

Current Literature Takeaways

Human shape from silhouettes using generative HKS descriptors and cross-modal neural networks (2017)

Link to the paper

End-to-end recovery of human shape and pose (2018)

Link to the paper

Neural Body Fitting: Unifying Deep Learning and Model-Based Human Pose and Shape Estimation (2018)

Link to the paper

Learning to estimate 3D human pose and shape from a single color image (2018)

Link to the paper

Video-Based Reconstruction of 3D people models (2018)

Link to the paper

Detailed Human Avatars from Monocular Video (2018)

Link to the paper

BodyNet: Volumetric Inference of 3D Human Body Shapes (2018)

Link to the paper

Learning to Reconstruct People in Clothing from a Single RGB Camera (2019)

Link to the paper

DenseBody: Directly Regressing Dense 3D Human Pose and Shape From a Single Color Image (2019)

Link to the paper

Authors of the respective papers and articles reserve the rights to images and content. Comments, suggestions, and improvements are welcome. Please e-mail me at s.[lastname]@tum.de.
