Generative Query Networks


Understanding 3D scenes has been a major challenge in computer vision over the past decade. With the development of neural networks, computing devices have excelled at the task of “recognizing” elements of a scene supplied by their visual sensors when provided with large, labelled datasets. A recent article published by DeepMind, “Neural Scene Representation and Rendering”, introduces Generative Query Networks (GQNs), which learn a representation of the scene itself with no need for such labelled data. This form of representation learning paves the way towards fully unsupervised, autonomous scene understanding.

Scene understanding holds an important position in computer vision because it involves perceiving, analyzing and interpreting dynamic scenes in real time. Given a high-dimensional input, the task is to extract a rich but compact representation that is easily accessible to subsequent processing stages. Typical outputs comprise semantic information or 3D information about the shape and pose of objects and layout elements in the scene. Modern computer vision systems consume large, labelled datasets to learn functions that generate such representations, typically by labelling individual pixels or detecting bounding boxes. As we develop more complex machines that operate in the real world, we want them to fully understand their surroundings autonomously.

Generative Query Networks (GQNs) first take images from different viewpoints and create an abstract description of the scene, learning a representation of the scene itself. Based on this representation, a GQN is able to predict what the scene would look like from a new, arbitrary viewpoint. Generative methods are desirable here because they let artificial systems learn to represent scenes by modelling data that agents can directly obtain while processing the scenes themselves, without recourse to semantic labels that would have to be provided by a human. In contrast, voxel or point-cloud based structure-from-motion methods employ literal representations and therefore typically scale poorly with scene complexity and size; they are also comparatively difficult to apply to non-rigid objects.


In a GQN, an agent navigates a 3D scene and collects images $x^k$ from 2D viewpoints $v^k$, which are collectively called observations, $o = \{(x^k, v^k)\}_{k=1}^{K}$. The agent passes these observations to a GQN composed of two parts: a representation network and a generation network.

The two networks are trained end-to-end. Although the GQN training objective is intractable owing to the presence of latent variables, variational approximations are employed and optimized with stochastic gradient descent.

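The representation network is defined as

$$ r = f(o) = \sum_{k=1}^{K} \psi(x^k, v^k) $$

for a finite set of observations $o = \{(x^k, v^k)\}_{k=1}^{K}$ of images $x^k$ and viewpoints $v^k$, where $\psi$ is one of three possible convolutional networks with different factorization and compositionality characteristics.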

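The generator, prior and inference models are represented as $g(x \mid z, v^q, r)$, $\pi(z \mid v^q, r)$ and $q(z \mid x^q, v^q, r)$ respectively, where $z$ is the latent variable, $x^q$ and $v^q$ are the query image and viewpoint, and $r$ is effectively a summary of the observations, computed by the scene representation network.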

Since the representation and generation networks are trained jointly, gradients from the generation network encourage the representation network to encode each observation independently in such a way that when they are summed element-wise, they form a valid scene representation.
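The element-wise summation described above can be illustrated with a minimal Python sketch. The function `encode_observation` is a hypothetical stand-in for the learned convolutional encoder, and the toy images and viewpoints are invented for illustration; only the aggregation step reflects the text.

```python
import math


def encode_observation(image, viewpoint):
    """Hypothetical stand-in for the learned per-observation encoder.

    A real GQN uses a convolutional network here. This toy version only
    mixes the mean pixel value with the viewpoint angles so that the
    example runs without a deep-learning library.
    """
    mean_pixel = sum(image) / len(image)
    return ([mean_pixel * math.cos(a) for a in viewpoint]
            + [mean_pixel * math.sin(a) for a in viewpoint])


def scene_representation(observations):
    """Encode each (image, viewpoint) pair independently, then sum
    the resulting codes element-wise into one scene representation."""
    codes = [encode_observation(x, v) for x, v in observations]
    return [sum(values) for values in zip(*codes)]


# Toy observations: (flattened image, viewpoint angles). Values are made up.
observations = [
    ([0.1, 0.5, 0.9], [0.0, 1.0]),
    ([0.3, 0.3, 0.3], [0.5, 2.0]),
    ([0.8, 0.2, 0.4], [1.5, 0.2]),
]

r = scene_representation(observations)
r_reversed = scene_representation(list(reversed(observations)))
```

Because the aggregation is a sum, the representation does not depend on the order in which the observations arrive (up to floating-point rounding), and a further observation can be folded in by adding its code to the existing sum.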

The variational posterior density is also parametrised by a sequential neural network, one that shares a subset of its parameters with the generative network. The generator network is represented as a sequence of computational cores with skip-connection pathways.

Inference and Optimization

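The bound for both the representation network and the generation network can be decomposed into a reconstruction likelihood and a regularization term, i.e.,

$$ \mathcal{F} = \mathbb{E}_{z \sim q}\left[-\ln g(x^q \mid z, v^q, r)\right] + \mathrm{KL}\left[\, q(z \mid x^q, v^q, r) \,\|\, \pi(z \mid v^q, r) \,\right], $$

where $g$, $\pi$ and $q$ are the generator, prior and inference models, $(x^q, v^q)$ is the query image and viewpoint, and $r$ is the scene representation.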
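The split into a reconstruction term and a regularization term can be sketched numerically. The snippet below is a simplified illustration, not the paper's implementation: it assumes a single latent step, diagonal Gaussian prior and posterior, and a Gaussian pixel likelihood with a fixed standard deviation. All function names and numbers are hypothetical.

```python
import math


def gaussian_nll(x, mean, std):
    """Negative log-likelihood of pixels x under independent Gaussians
    with the given per-pixel means and a shared standard deviation."""
    return sum(
        0.5 * math.log(2 * math.pi * std ** 2) + (xi - mi) ** 2 / (2 * std ** 2)
        for xi, mi in zip(x, mean)
    )


def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL[q || p] between two diagonal Gaussians, summed over dimensions."""
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p)
    )


def negative_elbo(x_query, x_predicted, pixel_std, posterior, prior):
    """Return (total, reconstruction, regularization) for one query view.

    posterior and prior are (means, standard deviations) tuples.
    """
    reconstruction = gaussian_nll(x_query, x_predicted, pixel_std)
    regularization = gaussian_kl(*posterior, *prior)
    return reconstruction + regularization, reconstruction, regularization


# Toy numbers, invented for illustration.
loss, rec, reg = negative_elbo(
    x_query=[0.2, 0.7],
    x_predicted=[0.25, 0.6],
    pixel_std=0.5,
    posterior=([0.3, -0.1], [0.8, 0.9]),
    prior=([0.0, 0.0], [1.0, 1.0]),
)
```

The regularization term is zero when posterior and prior coincide and grows as the posterior moves away from the prior, which is what keeps the inference model from encoding arbitrary information about the query image.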

Optimization is performed via adaptive gradient descent. Each gradient step is computed by first sampling a mini-batch of scenes from the dataset, each with a random number of observations drawn between a fixed minimum and maximum.
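The sampling step can be sketched as follows. The bounds `min_obs` and `max_obs` are hypothetical parameters, since the exact values are not given in this text, and holding out one extra view per scene as the query is an assumption made for the example.

```python
import random


def sample_minibatch(dataset, batch_size, min_obs, max_obs, rng=random):
    """Sample scenes, each with a random number of context observations.

    dataset: list of scenes, each a list of (image, viewpoint) pairs.
    min_obs, max_obs: hypothetical bounds on the number of observations.
    """
    batch = []
    for scene in rng.sample(dataset, batch_size):
        n_context = rng.randint(min_obs, max_obs)
        # Assumption: draw one extra view to serve as the held-out query.
        views = rng.sample(scene, n_context + 1)
        batch.append({"context": views[:-1], "query": views[-1]})
    return batch


# Toy dataset: 5 scenes with 6 placeholder views each.
toy_dataset = [
    [(f"image_{s}_{k}", (s, k)) for k in range(6)]
    for s in range(5)
]

batch = sample_minibatch(
    toy_dataset, batch_size=3, min_obs=1, max_obs=4, rng=random.Random(0)
)
```

Varying the number of context observations per scene exposes the model to both sparsely and densely observed scenes during training.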

It is experimentally observed that deeper models achieve higher likelihood, and that not sharing the weights of the cores across generation steps improves the overall performance of the network.


To verify the compositionality of shapes, colours and positions, arithmetic operations are performed in representation space: representations are added and subtracted to produce new representations that should modify an object in a predictable fashion. It is verified that scene algebra generates the correct object modifications for a variety of object properties, and is able to recombine properties even across object positions. However, it is observed that scene algebra is not supported across scenes with different sets of views, nor can it add scenes with different objects together.
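The idea of scene algebra can be illustrated with a toy example. Here `toy_representation` is a hypothetical, hand-built code in which colour and shape occupy separate dimensions, so the arithmetic works exactly. A learned GQN representation only approximates this behaviour, and the result would be passed to the generator rather than compared directly.

```python
COLOURS = ["red", "blue", "green"]
SHAPES = ["sphere", "triangle", "cube"]


def toy_representation(colour, shape):
    """Hypothetical hand-built code: one-hot colour followed by one-hot shape."""
    colour_code = [1.0 if c == colour else 0.0 for c in COLOURS]
    shape_code = [1.0 if s == shape else 0.0 for s in SHAPES]
    return colour_code + shape_code


def add(a, b):
    """Element-wise sum of two representations."""
    return [x + y for x, y in zip(a, b)]


def subtract(a, b):
    """Element-wise difference of two representations."""
    return [x - y for x, y in zip(a, b)]


# blue sphere - red sphere + red triangle should behave like a blue triangle.
result = add(
    subtract(toy_representation("blue", "sphere"),
             toy_representation("red", "sphere")),
    toy_representation("red", "triangle"),
)
```

In this toy code the subtraction isolates the colour change and the addition applies it to a different shape, which is the kind of predictable modification the experiment tests for.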

In this experiment, a 9-joint robotic Jaco arm is placed in the middle of the room along with one spherical target object. The reward for the reinforcement learning task is defined as a decreasing function of the distance between the arm and the target (i.e. the closer the arm is to the target, the higher the reward).
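A reward of this kind can be sketched in a few lines. The text only says that the reward decreases with distance, so the exponential form, the `scale` parameter and the function name below are assumptions chosen for illustration.

```python
import math


def reach_reward(hand_position, target_position, scale=1.0):
    """Hypothetical reward that decreases as the arm moves away from the target.

    The exponential form is an assumption; any decreasing function of the
    distance would match the description.
    """
    distance = math.dist(hand_position, target_position)
    return math.exp(-distance / scale)


near = reach_reward((0.1, 0.0, 0.2), (0.0, 0.0, 0.2))
far = reach_reward((1.0, 1.0, 0.2), (0.0, 0.0, 0.2))
```

With this choice the reward lies between 0 and 1 and reaches its maximum when the arm touches the target.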

Scenes are generated by randomly choosing, from a fixed pool of options, the texture of the walls and floor, the colour and texture of the target object, and the joint angles of the arm. Two networks are trained. The first is a GQN pre-trained on scenes of the room containing the Jaco arm. Its representation is then used when training the second network separately on the reinforcement learning task.

Training with the GQN representation is found to be significantly more robust to the choice of hyperparameters. By comparison, an agent that learns its representation from scratch using RL alone is sensitive to that choice.

Random 7x7 mazes are generated using the OpenGL-based DeepMind Lab game engine. Random viewpoints taken from scenes are used to train the GQN and a top-down render is used to verify the uncertainty of the representation. Predicted uncertainty is measured by computing the model’s predicted information gain at each location, averaged over 3 different heading directions. With only a handful of first-person observations, the model is capable of predicting the top-down view with high accuracy, indicating successful integration of egocentric observations. Errors often correspond to the precise points at which corridors connect with rooms.

Shepard and Metzler’s mental rotation objects were originally created for cognitive tests of the ability to visualize a mental image of an object from all directions. In these experiments, each object is composed of 7 randomly coloured cubes that are positioned by a self-avoiding random walk in a 3D grid. The generator is capable of predicting accurate images from arbitrary query viewpoints. This implies that the scene description accurately captures the positions and colours of the multiple parts that make up the object. The model’s predictions are consistent with occlusion, lighting and shading.
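The object construction described above can be sketched as a self-avoiding random walk on an integer grid. This is an illustrative reconstruction rather than the authors' generator: the function names, the restart strategy and the RGB colour sampling are assumptions.

```python
import random

# The six axis-aligned unit steps on a 3D grid.
STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def self_avoiding_walk(n_cubes=7, rng=random, max_restarts=1000):
    """Return n_cubes distinct grid positions, each adjacent to the previous one."""
    for _ in range(max_restarts):
        path = [(0, 0, 0)]
        occupied = {(0, 0, 0)}
        while len(path) < n_cubes:
            x, y, z = path[-1]
            free = [
                (x + dx, y + dy, z + dz)
                for dx, dy, dz in STEPS
                if (x + dx, y + dy, z + dz) not in occupied
            ]
            if not free:
                break  # the walk is trapped, so start over
            step = rng.choice(free)
            path.append(step)
            occupied.add(step)
        if len(path) == n_cubes:
            return path
    raise RuntimeError("could not build a self-avoiding walk")


def random_object(n_cubes=7, rng=random):
    """Pair each cube position with a random RGB colour (an assumed colour model)."""
    return [
        (position, (rng.random(), rng.random(), rng.random()))
        for position in self_avoiding_walk(n_cubes, rng)
    ]


shape = random_object(7, random.Random(1))
```

Picking uniformly among the free neighbours at each step is a simple choice that does not sample all self-avoiding walks with equal probability, but it is enough to produce connected, non-overlapping cube arrangements of the kind described.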


Parallels and Comparisons

GQN learns the representational space, allowing it to express the presence of textures, parts, objects, lights and scenes concisely and at a suitably high level of abstraction.

In contrast, traditional structure-from-motion, structure-from-depth and multi-view geometry techniques prescribe the way in which the 3D structure of the environment is represented (for instance as point clouds, mesh clouds or a collection of pre-defined primitives), whereas GQN enables task-specific fine-tuning of the representation itself.

Other neural approaches (auto-encoders etc.) focus on regularities in colors and patches in image space, but fail to achieve a high-level representation.

On the other hand, GQN can account for uncertainty in scenes with high occlusions.

Also, GQN is not specific to a particular choice of generation architecture and doesn’t require problem-specific engineering in the design of its generator. Alternatives such as generative adversarial networks (GANs) or auto-regressive models could be employed.


A major restriction of GQN (especially when compared with traditional representations) is that the resulting representations are no longer interpretable.

Also, GQN has not been tested on dynamic or interactive scenes. The limited availability of suitable real datasets and the need for controlled analysis meant that GQNs were trained and tested on static, synthetic and non-interactive environments only.

Eslami et al. developed an artificial vision system, the Generative Query Network (GQN): a single architecture to perceive, interpret and represent synthetic scenes without human labelling, paving the way towards fully unsupervised scene understanding, planning and behaviour. Future work on GQNs could include predictive driving for autonomous vehicles, perspective rendering for VR/AR and GQN-based SLAM.


[*] “Neural Scene Representation and Rendering” - This work was done by S. M. Ali Eslami, Danilo J. Rezende, Frederic Besse, Fabio Viola, Ari S. Morcos, Marta Garnelo, Avraham Ruderman, Andrei A. Rusu, Ivo Danihelka, Karol Gregor, David P. Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil Rabinowitz, Helen King, Chloe Hillier, Matt Botvinick, Daan Wierstra, Koray Kavukcuoglu and Demis Hassabis.
