# Generative Query Networks


Understanding 3D scenes has been a major challenge in computer vision over the past decade. With the development of neural networks, computing devices have excelled at the task of “recognizing” elements of a scene supplied by visual sensors, provided they are given large, labelled data-sets. A recent article published by DeepMind, “Neural Scene Representation and Rendering” $^*$, introduces the Generative Query Network (GQN), which learns a representation of the scene itself with no need for such labelled data. This form of representation learning paves the way towards fully unsupervised, autonomous scene understanding.

Scene understanding holds a central position in computer vision because it involves perceiving, analysing and interpreting dynamic scenes in real time. Given a high-dimensional input, the task is to extract a rich but compact representation that is easily accessible to subsequent processing stages. Typical outputs comprise semantic information or 3D information about the shape and pose of objects and layout elements in the scene. Modern computer vision systems consume large, labelled data-sets to learn functions that generate such representations, for example by labelling individual pixels or detecting bounding boxes. As we develop more complex machines that operate in the real world, we want them to understand their surroundings fully and autonomously.

A GQN first uses images taken from different viewpoints to create an abstract description of the scene, learning a representation of the scene itself. Based on this representation, the GQN can predict what the scene would look like from a new, arbitrary viewpoint. Generative methods are desirable for building artificial systems that learn to represent scenes, because they model data that agents can obtain directly while exploring the scenes themselves, without recourse to semantic labels that would have to be provided by a human. In contrast, voxel- or point-cloud-based structure-from-motion methods prescribe their representations up front and therefore typically scale poorly with scene complexity and size; they are also comparatively difficult to apply to non-rigid objects.

### Architecture

In a GQN, an agent navigates a 3D scene $i$ and collects $K$ 2D images $x_i^k$ from viewpoints $v_i^k$, which are collectively called observations, $o_i = \{(x_i^k, v_i^k)\}_{k=1,\dots,K}$. The agent passes these observations to a GQN composed of two parts:

• a representation network $f$ takes the agent’s observations as input and produces a neural scene representation $r$, which encodes information about the underlying scene. Each additional observation $o$ accumulates further evidence about the contents of the scene into $r$; i.e., $r = f(x^{1,\dots,K}, v^{1,\dots,K})$.
• a generation network $g$ then predicts the scene from an arbitrary query viewpoint $v^q$, using stochastic latent variables $z$ to create variability in the outputs where necessary; i.e., $x^q \sim g_\theta(x \mid z, v^q, r)$.
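The two-part pipeline can be sketched end to end with toy stand-ins for the learned networks. Everything below (sizes, random-projection "encoder" and "decoder") is hypothetical; in the real model $\psi$, $f$ and $g$ are deep convolutional/recurrent networks:

```python
import numpy as np

rng = np.random.default_rng(0)

IMG, VIEW, R_DIM = 16 * 16 * 3, 7, 32                    # toy sizes; 7-dim viewpoint
W_psi = rng.standard_normal((R_DIM, IMG + VIEW)) * 0.01  # stand-in encoder weights
W_g = rng.standard_normal((IMG, R_DIM + VIEW + 8)) * 0.01  # stand-in decoder weights

def psi(x, v):
    """Toy per-observation encoder (stand-in for the conv net psi)."""
    return np.tanh(W_psi @ np.concatenate([x.ravel(), v.ravel()]))

def f(observations):
    """Representation network: element-wise sum of per-observation codes."""
    return sum(psi(x, v) for x, v in observations)

def g(r, v_q, z):
    """Toy generator: maps (r, query viewpoint, latent z) to an 'image'."""
    h = np.concatenate([r, v_q.ravel(), z])
    return (W_g @ h).reshape(16, 16, 3)

# collect K = 3 observations of a scene, then predict from a new viewpoint
obs = [(rng.random((16, 16, 3)), rng.random(VIEW)) for _ in range(3)]
r = f(obs)
x_pred = g(r, rng.random(VIEW), rng.standard_normal(8))
print(r.shape, x_pred.shape)   # (32,) (16, 16, 3)
```

The essential interface is small: $f$ folds any number of (image, viewpoint) pairs into one fixed-size vector $r$, and $g$ consumes $r$ plus a query viewpoint and latent sample.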

The two networks are trained end-to-end. Although the GQN training objective is intractable owing to the presence of latent variables, variational approximations are employed and are optimized with stochastic gradient descent.

The representation network $f(x^{1,\dots,M}, v^{1,\dots,M})$ is defined, for a finite set of $M$ observations, as an element-wise sum of per-observation encodings,

$$r = \sum_{k=1}^{M} \psi(x^k, v^k),$$

where $\psi$ is one of three possible convolutional networks with different factorization and compositionality characteristics.
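Because $f$ sums per-observation encodings element-wise, the representation is order-invariant and can be accumulated incrementally. A toy numpy check of both properties, using a hypothetical linear stand-in for the convolutional encoder $\psi$:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((32, 16 * 16 * 3 + 7))  # hypothetical encoder weights

def psi(x, v):
    # linear stand-in for the convolutional encoder psi(x^k, v^k)
    return W @ np.concatenate([x.ravel(), v.ravel()])

obs = [(rng.random((16, 16, 3)), rng.random(7)) for _ in range(4)]

r_forward = sum(psi(x, v) for x, v in obs)
r_reversed = sum(psi(x, v) for x, v in reversed(obs))

# summation makes the scene representation order-invariant
assert np.allclose(r_forward, r_reversed)

# and incremental: a new observation just adds its own encoding
x_new, v_new = rng.random((16, 16, 3)), rng.random(7)
r_updated = r_forward + psi(x_new, v_new)
```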

The generator, prior and inference densities are denoted $g_\theta(x \mid z, v^q, r)$, $\pi_\theta(z \mid v^q, r)$ and $q_\phi(z \mid x^q, v^q, r)$ respectively, where $r = f(x^{1,\dots,M}, v^{1,\dots,M})$ is effectively a summary of the observations and is computed by the scene representation network.

Since the representation and generation networks are trained jointly, gradients from the generation network encourage the representation network to encode each observation independently in such a way that when they are summed element-wise, they form a valid scene representation.

The variational posterior density $q_\phi(z \mid x^q, v^q, r)$ is also parametrised by a sequential neural network, one that shares some of its parameters with the generative network; that is, the shared parameters appear in both $\theta$ and $\phi$. The generator network is represented as a sequence of computational cores with skip-connection pathways.
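The sequential generator can be pictured as a loop over $L$ recurrent cores whose outputs accumulate through a skip-connection ("canvas") pathway. A minimal numpy sketch with a hypothetical single-matrix core (the real cores are convolutional LSTMs and the canvas is image-shaped):

```python
import numpy as np

rng = np.random.default_rng(2)
H, L = 64, 8                        # hidden size and number of generation steps (toy)
W_core = rng.standard_normal((H, H + 32 + 7 + 8)) * 0.1

def core(h, r, v_q, z):
    """One hypothetical recurrent core: update hidden state from (r, v_q, z)."""
    return np.tanh(W_core @ np.concatenate([h, r, v_q, z]))

r, v_q = rng.random(32), rng.random(7)  # scene representation and query viewpoint
h = np.zeros(H)
canvas = np.zeros(H)                 # skip-connection pathway accumulating outputs

for step in range(L):
    z = rng.standard_normal(8)       # one latent sample per generation step
    h = core(h, r, v_q, z)
    canvas = canvas + h              # each core's output is added via the skip path
```

The skip path means every core contributes directly to the final output, rather than only through the recurrent state.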

### Inference and Optimization

The bound, optimized with respect to the parameters $\theta$ of the representation and generation networks and $\phi$ of the inference network, can be decomposed into a reconstruction likelihood and a regularization term:

$$\mathcal{F}(\theta, \phi) = \mathbb{E}_{(x,v)} \Big[ \underbrace{\mathbb{E}_{q_\phi}\big[\log g_\theta(x^q \mid z, v^q, r)\big]}_{\text{reconstruction likelihood}} - \underbrace{\mathrm{KL}\big[\, q_\phi(z \mid x^q, v^q, r) \,\big\|\, \pi_\theta(z \mid v^q, r) \,\big]}_{\text{regularization}} \Big].$$
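Assuming diagonal Gaussian prior, posterior and likelihood (an assumption for this sketch, not a detail stated above), the two terms of the bound have closed forms:

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def gaussian_log_likelihood(x, mu, sigma):
    """log N(x | mu, sigma^2 I): the reconstruction-likelihood term."""
    return -0.5 * np.sum(((x - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2))

rng = np.random.default_rng(3)
x_q = rng.random(12)                          # query image (flattened, toy size)
mu_x = x_q + 0.01 * rng.standard_normal(12)   # generator's predicted mean

mu_q, logvar_q = rng.standard_normal(4), np.zeros(4)   # posterior q_phi
mu_p, logvar_p = np.zeros(4), np.zeros(4)              # prior pi_theta

elbo = gaussian_log_likelihood(x_q, mu_x, 0.1) - gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
```

Maximizing this quantity pushes the generator to reconstruct the query image while keeping the posterior close to the prior.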

Optimization is performed via adaptive gradient descent. Each gradient step is computed by first sampling a mini-batch of scenes from the dataset, each with a random number of observations (between $0$ and $K$).
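The sampling scheme can be sketched as follows; the function name and the held-out-query convention are illustrative, but the key point from the text is the per-scene random context size between $0$ and $K$:

```python
import random

def sample_training_batch(dataset, batch_size, K):
    """Sample scenes; per scene, use a random number of context observations
    (between 0 and K) and hold out one further observation as the query."""
    batch = []
    for scene in random.sample(dataset, batch_size):
        random.shuffle(scene)
        m = random.randint(0, K)              # context size varies per scene
        context, (x_q, v_q) = scene[:m], scene[m]
        batch.append((context, x_q, v_q))
    return batch

# toy dataset: 10 scenes, each with K + 1 = 6 (image, viewpoint) pairs
data = [[(f"x{i}{k}", f"v{i}{k}") for k in range(6)] for i in range(10)]
batch = sample_training_batch(data, batch_size=4, K=5)
```

Training with a varying context size forces the representation network to be useful at every level of partial observability, including the empty context.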

It is experimentally observed that deeper models have higher likelihood, and not sharing weights of cores improves the overall performance of the network across generation steps.

### Experiments

• Scene Algebra

To verify the compositionality of shapes, colours and positions, arithmetic is performed in representation space: representations are added and subtracted to generate new representations that should modify an object in a predictable fashion. It is verified that scene algebra generates the correct object modifications for a variety of object properties, and can recombine properties even across object positions. However, scene algebra is not supported across scenes with different sets of views, nor can it add together scenes with different objects.
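The idea can be illustrated with a toy additive model of representation space. The per-property codes below are random stand-ins (in a real GQN the representations come from the trained network $f$, and compositionality is learned rather than built in):

```python
import numpy as np

rng = np.random.default_rng(4)

# hypothetical per-property codes; a toy scene representation is their sum
codes = {prop: rng.standard_normal(32) for prop in ["red", "blue", "sphere", "cube"]}

def scene(*props):
    return sum(codes[p] for p in props)

r_red_sphere = scene("red", "sphere")
r_blue_sphere = scene("blue", "sphere")
r_red_cube = scene("red", "cube")

# 'scene algebra': swap the colour by adding and subtracting representations
r_predicted_blue_cube = r_red_cube - r_red_sphere + r_blue_sphere
assert np.allclose(r_predicted_blue_cube, scene("blue", "cube"))
```

In the paper the check is done by decoding the arithmetic result with the generator and inspecting the rendered object.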

• Control of Robotic Arm

In this experiment, a 9-joint robotic arm is placed in the middle of a room along with one spherical target object. A reinforcement learning task is defined with reward given by a decreasing function of the distance between the target and the arm (i.e., the closer the arm is to the target, the higher the reward).
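A reward of this form can be written in one line; the exponential shaping below is a hypothetical choice, since the text only specifies a decreasing function of distance:

```python
import numpy as np

def reach_reward(hand_pos, target_pos, scale=1.0):
    """Reward decreases with distance: the closer the arm's hand is to the
    target, the higher the reward (exponential shaping is an assumption)."""
    distance = np.linalg.norm(np.asarray(hand_pos) - np.asarray(target_pos))
    return np.exp(-scale * distance)

assert reach_reward([0, 0, 0], [0, 0, 0]) == 1.0   # maximal reward at the target
assert reach_reward([1, 0, 0], [0, 0, 0]) < 1.0    # farther away -> lower reward
```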

For each scene, the texture of the walls and floor, the colour and texture of the target object, and the joint angles of the arm are chosen randomly from a fixed pool of options. Two networks are trained: the first is a GQN pre-trained on scenes of the room containing the Jaco arm; its representation is then used to train the second network, separately, on the reinforcement learning task.

It is verified that training on the GQN representation is, by comparison, significantly more robust to the choice of hyper-parameters: an agent whose representation is learned from scratch, using only the RL signal, is highly sensitive to that choice.

• Maze Environment (partially observed)

Random 7×7 mazes are generated using an OpenGL-based DeepMind Lab game engine. Random viewpoints taken from scenes are used to train the GQN, and a top-down render is used to assess the uncertainty of the representation. Predicted uncertainty is measured by computing the model’s predicted information gain at each location, averaged over 3 different heading directions. With only a handful of first-person observations, the model predicts the top-down view with high accuracy, indicating successful integration of egocentric observations. Errors often correspond to the precise points at which corridors connect with rooms.
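One simple way to realize "uncertainty at a location, averaged over headings" is to average the predictive per-pixel entropy over the 3 heading directions. The sketch below is a stand-in, not the paper's exact information-gain computation; the `predict` callable and the Bernoulli-pixel assumption are hypothetical:

```python
import numpy as np

def bernoulli_entropy(p, eps=1e-9):
    """Element-wise entropy of Bernoulli pixel probabilities."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def predicted_uncertainty(predict, location, headings=(0.0, 120.0, 240.0)):
    """Average predictive entropy over 3 heading directions at a location
    (a stand-in for the paper's predicted information gain)."""
    return np.mean([bernoulli_entropy(predict(location, h)) for h in headings])

# toy 'model': confident about seen regions, maximally uncertain elsewhere
toy_predict = lambda loc, heading: np.full((8, 8), 0.99 if loc == "seen" else 0.5)

u_seen = predicted_uncertainty(toy_predict, "seen")
u_unseen = predicted_uncertainty(toy_predict, "unseen")
assert u_unseen > u_seen   # unexplored locations should score as more uncertain
```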

• Shepard-Metzler Environment

Shepard and Metzler’s mental-rotation objects were originally created for cognitive tests in which subjects visualize a mental image of an object from all directions. In these experiments, each object is composed of 7 randomly coloured cubes positioned by a self-avoiding random walk in a 3D grid. The generator is capable of predicting accurate images from arbitrary query viewpoints, implying that the scene description accurately captures the positions and colours of the multiple parts that make up the object. The model’s predictions are consistent with occlusion, lighting and shading.
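The object-generation procedure is easy to reproduce: start at the origin and take unit steps on the integer grid, rejecting any step that revisits an occupied cell. A small sketch (function name and seeding are illustrative):

```python
import random

def shepard_metzler_object(n_cubes=7, seed=0):
    """Place n_cubes unit cubes by a self-avoiding random walk on a 3D grid."""
    rng = random.Random(seed)
    moves = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    cubes = [(0, 0, 0)]
    while len(cubes) < n_cubes:
        x, y, z = cubes[-1]
        dx, dy, dz = rng.choice(moves)
        nxt = (x + dx, y + dy, z + dz)
        if nxt not in cubes:            # self-avoiding: never revisit a cell
            cubes.append(nxt)
    return cubes

obj = shepard_metzler_object()
assert len(obj) == 7 and len(set(obj)) == 7   # 7 distinct, connected cubes
```

Each cube would then be assigned a random colour before rendering the scene from sampled viewpoints.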

### Take-aways

• The representation network perceives accurately, and can learn to count, localise and classify objects without any object level labels (as verified from the Shepard-Metzler experiments).
• The generation network is an approximate renderer that is learned from data. It is capable of generating sharp images without any prior specification of the laws of perspective, occlusion, or lighting.
• GQN can represent, measure and reduce uncertainty through the variability of its predictions.
• With compact GQN representations, state-of-the-art deep reinforcement learning agents can learn to complete tasks in a more data-efficient manner compared to model-free baseline agents.

### Parallels and Comparisons

GQN learns the representational space, allowing it to express the presence of textures, parts, objects, lights and scenes concisely and at a suitably high level of abstraction.

In contrast, traditional structure-from-motion, structure-from-depth and multi-view geometry techniques prescribe the way in which the 3D structure of the environment is represented (for instance as point clouds, meshes or collections of pre-defined primitives), whereas GQN enables task-specific fine-tuning of the representation itself.

• With Other Learning-Based Methods

Other neural approaches (e.g., auto-encoders) focus on regularities in colours and patches in image space, but fail to achieve a high-level representation.

On the other hand, GQN can account for uncertainty in scenes with high occlusions.

Also, GQN is not tied to a particular choice of generation architecture and doesn’t require problem-specific engineering in the design of its generator; alternatives such as generative adversarial networks (GANs) or auto-regressive models could be employed.

### Limitations

A major restriction of GQN (especially when compared with traditional representations) is that the resulting representations are no longer interpretable.

Also, GQN is not tested over dynamic or interactive scenes. Limited availability of suitable real datasets and a need for controlled analysis resulted in training and testing of GQNs on static, synthetic and non-interactive environments only.

Eslami et al. developed an artificial vision system, the Generative Query Network (GQN): a single architecture that perceives, interprets and represents synthetic scenes without human labelling, paving the way towards fully unsupervised scene understanding, planning and behaviour. Future applications of GQNs could include predictive driving for autonomous vehicles, perspective rendering for VR/AR, and GQN-based SLAM.

### References:

[*] S. M. Ali Eslami, Danilo J. Rezende, Frederic Besse, Fabio Viola, Ari S. Morcos, Marta Garnelo, Avraham Ruderman, Andrei A. Rusu, Ivo Danihelka, Karol Gregor, David P. Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil Rabinowitz, Helen King, Chloe Hillier, Matt Botvinick, Daan Wierstra, Koray Kavukcuoglu and Demis Hassabis, “Neural Scene Representation and Rendering,” Science, 2018.