DeepMVS: Learning Multi-View Stereopsis

- 2 mins

Published in CVPR 2018, this paper presents a learning-based method for multi-view stereo (MVS) that is independent of the order of the input images and uses a specialised encoder-decoder architecture for aggregating features across spatial regions. The paper also introduces the MVS-Synth dataset, a set of photorealistic synthetic sequences for improved disparity prediction.

Link to the paper

Multi-view stereo

Multi-view stereo reconstructs a 3D model from multiple RGB images with no depth information. These images can either be captured simultaneously by different cameras or acquired sequentially by a moving camera. The camera intrinsics and extrinsics are usually obtained via structure-from-motion (SfM) algorithms, and disparity maps are generated via variants of patch-matching algorithms, usually based on photometric-consistency cost functions. Reconstruction is conventionally done by rectifying image pairs, computing matching pixel pairs using cost functions, and finally triangulating.
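The final triangulation step can be sketched with standard linear (DLT) triangulation from two views; the camera matrices and pixel coordinates used here are illustrative, not from the paper:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2: 3x4 camera projection matrices.
    x1, x2: matched 2D pixel coordinates (u, v) in each image.
    Returns the 3D point in Euclidean coordinates.
    """
    # Each observation gives two linear constraints on the homogeneous point.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The point is the null vector of A, i.e. the last right-singular vector.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # homogeneous -> Euclidean
```

With two cameras separated by a horizontal baseline, projecting a known 3D point into both views and feeding the pixel coordinates back in recovers the point.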

After the intrinsics and extrinsics are computed (if needed) with a standard SfM pipeline, the paper's model, DeepMVS, produces disparity maps directly from an arbitrary set of posed images by generating plane-sweep volumes and passing them through a three-part network, described below.

The network is trained over a sequence of photorealistic synthetic images to improve quality of disparity prediction.

Generating plane-sweep volumes

For a reference image, a set of neighbouring images (selected based on matching features) is used to obtain plane sweeps over a predetermined number of disparity levels.

Disparity levels are sampled in fixed steps up to the estimated maximum disparity of the reference image. A stack of plane-sweep volumes is generated, one for each neighbouring image paired with the reference image.
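A minimal sketch of plane-sweep volume generation, assuming fronto-parallel planes and the standard plane-induced homography H(d) = K (R + d·t·nᵀ) K⁻¹; the nearest-neighbour sampling and the parameter names here are simplifications for illustration, not the paper's implementation:

```python
import numpy as np

def plane_sweep_homographies(K, R, t, disparities):
    """One homography per fronto-parallel disparity plane, mapping
    reference pixels into the neighbour view."""
    n = np.array([0.0, 0.0, 1.0])  # fronto-parallel plane normal
    K_inv = np.linalg.inv(K)
    return [K @ (R + d * np.outer(t, n)) @ K_inv for d in disparities]

def sweep_volume(ref_shape, neighbour, K, R, t, num_levels, max_disp):
    """Warp a neighbour image onto each disparity plane of the reference
    view, giving a (D, H, W) plane-sweep volume."""
    H_img, W_img = ref_shape
    disparities = np.linspace(0.0, max_disp, num_levels)
    ys, xs = np.mgrid[0:H_img, 0:W_img]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(H_img * W_img)])
    volume = np.zeros((num_levels, H_img, W_img))
    for i, Hmat in enumerate(plane_sweep_homographies(K, R, t, disparities)):
        warped = Hmat @ pix
        # Nearest-neighbour sampling, clipped at the image border.
        u = np.rint(warped[0] / warped[2]).astype(int).clip(0, W_img - 1)
        v = np.rint(warped[1] / warped[2]).astype(int).clip(0, H_img - 1)
        volume[i] = neighbour[v, u].reshape(H_img, W_img)
    return volume
```

With an identity rotation and zero translation every homography collapses to the identity, so each plane of the volume is just the neighbour image — a useful sanity check.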

Learning Architecture

The architecture consists of:

A patch-matching network: extracts per-pixel features over all plane-swept images.

For each of the N×D plane-swept images, 64-channel features are extracted from the reference image I(r) and from a patch of the plane sweep V(n,d). These features are concatenated and passed through three more convolutional layers, producing 4-channel patch-matching features.
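A shape-level sketch of this step, with random weights standing in for the learned convolutional layers (a 1×1 convolution is written as a per-pixel linear map for brevity; the channel counts follow the text above, everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 8, 8
ref_feat = rng.standard_normal((64, H, W))    # features from reference I(r)
sweep_feat = rng.standard_normal((64, H, W))  # features from plane sweep V(n,d)

# Concatenate along channels, then reduce to 4 channels; a single 1x1
# convolution (per-pixel linear map) stands in for the paper's final
# convolutional layers.
x = np.concatenate([ref_feat, sweep_feat], axis=0)  # (128, H, W)
W1 = rng.standard_normal((4, 128)) * 0.1            # hypothetical weights
match_feat = np.einsum('oc,chw->ohw', W1, x)        # (4, H, W)
```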

An intra-volume feature-aggregation network: propagates features over larger spatial regions, enabling the network to make predictions using non-local information.

A 4×D-channel volume is generated by concatenating the 4-channel patch-matching features across disparity levels, and is passed through a U-Net that incorporates features from a VGG-19 pretrained on ImageNet. This yields an 800-channel volume representing the predicted disparity with respect to one neighbour.

An inter-volume feature-aggregation network: aggregates across the per-neighbour volumes by max-pooling and predicts the final disparity.

During training, the number of neighbouring images N is randomly selected from {1, 2, 3, 4} so the network learns to handle an arbitrary number of volumes; the final disparity is then predicted after element-wise max-pooling across the N aggregated volumes.
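The element-wise max-pooling, which makes the prediction independent of both the number and the order of neighbours, can be sketched as follows (the per-neighbour volumes are random stand-ins for the intra-volume network's outputs):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, H, W = 3, 16, 8, 8  # neighbours, disparity levels, image size

# Hypothetical per-neighbour D-channel outputs of the intra-volume network.
volumes = rng.standard_normal((N, D, H, W))

# Element-wise max over the N neighbour volumes: the result does not depend
# on how many neighbours are used or in what order they arrive.
pooled = volumes.max(axis=0)       # (D, H, W)

# Final per-pixel disparity: the level with the highest pooled score
# (a hard argmax standing in for the network's final prediction layers).
disparity = pooled.argmax(axis=0)  # (H, W)
```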


On the ETH3D benchmark dataset, DeepMVS outperforms DeMoN in the setting of multi-view stereo, and achieves competitive performance with COLMAP, the state-of-the-art among conventional MVS algorithms. DeepMVS is often able to produce correct disparities in poorly textured regions, such as sky, walls, floors, and desk surfaces, where conventional algorithms fail.
