Semantic Implicit Neural Scene Representations with Semi-supervised Training | 3DV 2020

Amit Kohli*, Vincent Sitzmann*, Gordon Wetzstein

We demonstrate a representation that jointly encodes shape, appearance, and semantics in a 3D-structure-aware manner.

3DV 2020 Paper - 3 Min Overview

ABSTRACT

Biological vision infers multi-modal 3D representations that support reasoning about scene properties such as materials, appearance, affordance, and semantics in 3D. These rich representations enable humans, for example, to acquire new skills, such as learning a new semantic class, with extremely limited supervision. Motivated by this ability of biological vision, we demonstrate that 3D-structure-aware representation learning leads to multi-modal representations that enable 3D semantic segmentation with extremely limited, 2D-only supervision. Building on emerging neural scene representations, which have been developed for modeling the shape and appearance of 3D scenes supervised exclusively by posed 2D images, we are the first to demonstrate a representation that jointly encodes shape, appearance, and semantics in a 3D-structure-aware manner. Surprisingly, we find that only a few tens of labeled 2D segmentation masks are required to achieve dense 3D semantic segmentation using a semi-supervised learning strategy. We explore two novel applications for our semantically aware neural scene representation: 3D novel view and semantic label synthesis given only a single input RGB image or 2D label mask, as well as 3D interpolation of appearance and semantics.

Citation

A. Kohli, V. Sitzmann, G. Wetzstein, “Semantic Implicit Neural Scene Representations with Semi-supervised Training”, International Conference on 3D Vision (3DV) 2020.

BibTeX
@inproceedings{semantic_srn,
  author    = {A. Kohli and V. Sitzmann and G. Wetzstein},
  title     = {{Semantic Implicit Neural Scene Representations with Semi-supervised Training}},
  booktitle = {International Conference on 3D Vision (3DV)},
  year      = {2020},
}

Overview of results

Our method produces 3D-consistent, multi-view semantic segmentation and appearance renderings of an object given only a single posed RGB image of that object. This capability demonstrates that our model learns a 3D neural scene representation that stores multimodal information about a scene: its appearance and its semantic decomposition. To the best of our knowledge, such a representation is the first of its kind and offers a path toward even richer implicit neural representations of scenes.
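As a rough illustration of how this single-image inference can work, the sketch below shows a PyTorch-style test-time loop with stand-in components: a per-object latent code is optimized so that a (here heavily simplified) renderer reproduces the single posed RGB observation, after which novel views and their semantic labels can be rendered from the same code. TinyRenderer, its sizes, and the optimization hyperparameters are illustrative assumptions, not the released implementation; the real pipeline uses the SRN ray-marching renderer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRenderer(nn.Module):
    # Stand-in for the SRN pipeline: maps (latent code, camera pose) to a
    # flat image of RGB values and per-pixel semantic logits.
    def __init__(self, latent_dim=64, n_pixels=32 * 32, n_classes=6):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(latent_dim + 12, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.rgb_head = nn.Linear(256, n_pixels * 3)
        self.sem_head = nn.Linear(256, n_pixels * n_classes)
        self.n_pixels, self.n_classes = n_pixels, n_classes

    def forward(self, latent, pose):
        feat = self.backbone(torch.cat([latent, pose.flatten(1)], dim=-1))
        rgb = self.rgb_head(feat).view(-1, self.n_pixels, 3)
        sem = self.sem_head(feat).view(-1, self.n_pixels, self.n_classes)
        return rgb, sem

renderer = TinyRenderer()                    # in practice: a pre-trained model
observed_rgb = torch.rand(1, 32 * 32, 3)     # the single posed input image (stand-in)
observed_pose = torch.rand(1, 3, 4)          # its camera pose (stand-in)

latent = torch.zeros(1, 64, requires_grad=True)   # per-object latent code
opt = torch.optim.Adam([latent], lr=1e-2)
for _ in range(200):                         # fit the latent to the one observation
    rgb, _ = renderer(latent, observed_pose)
    loss = F.mse_loss(rgb, observed_rgb)
    opt.zero_grad(); loss.backward(); opt.step()

novel_pose = torch.rand(1, 3, 4)             # any query viewpoint
with torch.no_grad():
    novel_rgb, novel_sem = renderer(latent, novel_pose)
    novel_labels = novel_sem.argmax(dim=-1)  # per-pixel semantic classes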

 

Our proposed method requires only 30 labeled segmentation maps at training time, shown here in their entirety. We are able to use such a small dataset by employing a common representation learning technique: first, we pre-train a Scene Representation Network in an unsupervised manner, using only posed 2D RGB images; we then augment the learned representation by training a linear classifier for semantic segmentation on the 30 labeled segmentation maps. This simple model structure prevents overfitting on such a small training dataset.
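The training recipe can be summarized with a short, self-contained sketch, again with stand-in components rather than the released code: per-pixel features produced by the frozen, pre-trained Scene Representation Network are fed to a single linear layer, and only that layer is fit to the 30 labeled segmentation maps. The feature dimension, class count, and the frozen_srn_features stub are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

feature_dim, n_classes, n_pixels = 256, 6, 32 * 32   # assumed sizes

def frozen_srn_features(n_views):
    # Stand-in for per-pixel features rendered by the frozen, pre-trained SRN
    # at the pixels of each labeled training view.
    return torch.rand(n_views, n_pixels, feature_dim)

linear_head = nn.Linear(feature_dim, n_classes)      # the only trainable part
opt = torch.optim.Adam(linear_head.parameters(), lr=1e-3)

# The 30 labeled segmentation maps (random stand-ins here).
labels = torch.randint(0, n_classes, (30, n_pixels))

for epoch in range(10):
    feats = frozen_srn_features(30)                  # no gradients flow into the SRN
    logits = linear_head(feats)                      # (30, n_pixels, n_classes)
    loss = F.cross_entropy(logits.reshape(-1, n_classes), labels.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()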

 

Due to its multimodal nature, our representation can also take a single posed semantic map of an object and similarly reconstruct the object's full, 3D-consistent semantic segmentation and appearance at arbitrary views. The representation can therefore serve as a generative tool, in addition to reconstructing predetermined quantities. This capability also demonstrates the flexibility and power of neural scene representations that implicitly store inherently entangled modes of information, such as the semantic decomposition and appearance of a simple object, as in our case.
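Continuing the hypothetical inference sketch above (same stand-in renderer and pose), reconstructing from a single posed label mask amounts to swapping the photometric loss for a cross-entropy loss on the rendered semantics while fitting the latent code:

observed_labels = torch.randint(0, 6, (1, 32 * 32))  # the single posed label mask (stand-in)
latent = torch.zeros(1, 64, requires_grad=True)
opt = torch.optim.Adam([latent], lr=1e-2)
for _ in range(200):
    _, sem_logits = renderer(latent, observed_pose)  # renderer and pose as in the sketch above
    loss = F.cross_entropy(sem_logits.reshape(-1, 6), observed_labels.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()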

 

By interpolating the features of the neural representations of a pair of objects, we can smoothly interpolate both the appearance and the semantics of the objects. Note how each semantic part of the first object morphs directly into the corresponding part of the second object. Our model has thus learned a meaningful representation that generalizes its understanding of a semantic part across different objects of the same class. Additionally, this experiment demonstrates a tight coupling between an object's appearance and its semantic segmentation, which is ultimately what we leverage to create a joint representation.
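A hypothetical sketch of this interpolation, reusing the renderer stub from the first code sketch: the two fitted latent codes are blended linearly and decoded at a fixed camera pose, so appearance and semantics morph together because both are read out of the same code.

latent_a = torch.randn(1, 64)        # fitted code for object A (stand-in)
latent_b = torch.randn(1, 64)        # fitted code for object B (stand-in)
pose = torch.rand(1, 3, 4)           # fixed camera for the interpolation sequence

frames = []
for t in torch.linspace(0.0, 1.0, steps=10):
    latent_t = (1 - t) * latent_a + t * latent_b     # linear blend of the two codes
    with torch.no_grad():
        rgb_t, sem_t = renderer(latent_t, pose)      # renderer from the first sketch
    frames.append((rgb_t, sem_t.argmax(dim=-1)))     # appearance and semantics morph together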

Acknowledgements

This project was supported by a Stanford Graduate Fellowship, an NSF CAREER Award (IIS 1553333), a Terman Faculty Fellowship, a Sloan Fellowship, by the KAUST Office of Sponsored Research through the Visual Computing Center CCF grant, and a PECASE from the ARO.

Related Projects

You may also be interested in related projects focusing on neural scene representations and rendering:

  • Chan et al. pi-GAN. CVPR 2021 (link)
  • Kellnhofer et al. Neural Lumigraph Rendering. CVPR 2021 (link)
  • Lindell et al. Automatic Integration for Fast Neural Rendering. CVPR 2021 (link)
  • Sitzmann et al. Implicit Neural Representations with Periodic Activation Functions. NeurIPS 2020 (link)
  • Sitzmann et al. MetaSDF. NeurIPS 2020 (link)
  • Sitzmann et al. Scene Representation Networks. NeurIPS 2019 (link)
  • Sitzmann et al. Deep Voxels. CVPR 2019 (link)