Sem2NeRF: Converting Single-View
Semantic Masks to Neural Radiance Fields
ECCV 2022

Abstract


Image translation and manipulation have gained increasing attention along with the rapid development of deep generative models. Although existing approaches have achieved impressive results, they mainly operate in 2D space. In light of recent advances in NeRF-based 3D-aware generative models, we introduce a new task, Semantic-to-NeRF translation, which aims to reconstruct a 3D scene modelled by NeRF, conditioned on a single-view semantic mask as input. To kick off this novel task, we propose the Sem2NeRF framework. In particular, Sem2NeRF addresses this highly challenging task by encoding the semantic mask into a latent code that controls the 3D scene representation of a pretrained decoder. To further improve the accuracy of the mapping, we integrate a new region-aware learning strategy into the design of both the encoder and the decoder. We verify the efficacy of the proposed Sem2NeRF and demonstrate that it outperforms several strong baselines on two benchmark datasets.
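As a rough illustration of the pipeline the abstract describes, here is a minimal PyTorch sketch: a learned encoder maps the semantic mask to a latent code, which drives a frozen, pretrained 3D-aware generator. Every name and shape here (SemanticEncoder, the toy linear decoder, the 19 CelebAMask-HQ-style classes) is an illustrative stand-in, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """Hypothetical encoder: one-hot semantic mask -> latent style code."""
    def __init__(self, num_classes: int = 19, latent_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_classes, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, mask_onehot: torch.Tensor) -> torch.Tensor:
        return self.net(mask_onehot)

def render(generator: nn.Module, w: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
    """Placeholder for NeRF-style rendering conditioned on latent w and pose."""
    # A real generator would run volume rendering here; this is a toy decoder.
    return generator(torch.cat([w, pose], dim=-1))

encoder = SemanticEncoder()
generator = nn.Linear(256 + 2, 3 * 64 * 64)   # frozen decoder stand-in
for p in generator.parameters():
    p.requires_grad_(False)

mask = torch.randn(1, 19, 128, 128)           # toy one-hot semantic mask
pose = torch.tensor([[0.0, 0.0]])             # (yaw, pitch) of target camera
w = encoder(mask)                             # latent style code
image = render(generator, w, pose).view(1, 3, 64, 64)
```

Only the encoder is trained; keeping the 3D generator frozen is what lets a single 2D mask inherit the generator's learned 3D consistency.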

Video

Model Architecture

Architecture of the Sem2NeRF framework, which converts a single-view semantic mask to a 3D scene represented by NeRF. Specifically, the given semantic mask is partitioned into patches, which are then encoded by a patch-based encoder into a latent style code for a pretrained NeRF-based 3D generator. A region R is randomly sampled to enforce awareness of differences among regions, and an optional latent vector z is included to enable multi-modal synthesis.
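The caption names three ingredients: patch partitioning, region sampling, and the optional z. A hedged sketch of each, with invented shapes and names; the real encoder and its region-aware losses are more involved than this.

```python
import torch
import torch.nn as nn

num_classes, patch = 19, 32
mask = torch.randn(1, num_classes, 128, 128)           # toy class-score mask

# (1) Partition into non-overlapping patches: (B, N_patches, C*patch*patch).
patches = (
    mask.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, 4, 4, p, p)
        .permute(0, 2, 3, 1, 4, 5)
        .reshape(1, 16, num_classes * patch * patch)
)

# (2) Toy patch-based encoder: per-patch embedding, pooled into one style code.
embed = nn.Linear(num_classes * patch * patch, 256)
w = embed(patches).mean(dim=1)                          # style code, (B, 256)

# (3) Randomly sample one semantic region R so its pixels can be weighted
# differently during training (region-aware learning).
region_id = torch.randint(0, num_classes, (1,)).item()
region_mask = mask.argmax(dim=1) == region_id           # pixels belonging to R

# Optional latent z enabling multiple plausible outputs for the same mask.
z = torch.randn(1, 64)
w_multi = torch.cat([w, z], dim=-1)
```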

Results: Free-viewpoint image generation

Free-viewpoint image generation on CelebAMask-HQ. "Output" refers to the generated image that has the same viewing direction as the "Input", and "Overlay" superimposes the "Output" on the "Input" to better demonstrate the mapping accuracy.

Free-viewpoint image generation on CatMask. "Output" refers to the generated image that has the same viewing direction as the "Input", and "Overlay" superimposes the "Output" on the "Input" to better demonstrate the mapping accuracy.
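In both figures, free-viewpoint generation amounts to fixing the encoded latent code and sweeping only the camera pose. A self-contained toy sketch; the linear "generator" and the (yaw, pitch) pose encoding are assumptions, not the paper's renderer.

```python
import math
import torch
import torch.nn as nn

generator = nn.Linear(256 + 2, 3 * 64 * 64)   # toy frozen decoder stand-in
w = torch.randn(1, 256)                       # encoded style code (fixed)

def render(w: torch.Tensor, yaw: float) -> torch.Tensor:
    """Render one view; only the pose changes between calls."""
    pose = torch.tensor([[yaw, 0.0]])          # (yaw, pitch)
    return generator(torch.cat([w, pose], dim=-1)).view(1, 3, 64, 64)

# Sweep the camera around the input's viewing direction (yaw = 0 at index 2).
views = torch.cat([render(w, math.radians(a)) for a in (-30, -15, 0, 15, 30)])

# "Overlay": blend the same-direction output with a colourised input mask to
# eyeball how well semantic boundaries line up with the generated image.
mask_rgb = torch.rand(1, 3, 64, 64)            # toy colourised mask
overlay = 0.5 * views[2:3] + 0.5 * mask_rgb
```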

Results: Semantic Masks Editing

Semantic mask editing on CelebAMask-HQ. The first row shows results for the original semantic masks, while the following rows show results after editing the indicated area, highlighted with a yellow dashed box. Three viewpoints are given for each group, with the first having the same viewing direction as the input.

Semantic mask editing on CatMask. Edited regions are highlighted with a yellow dashed box. The first viewpoint has the same pose as the input.
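Editing only touches the 2D input: labels inside the edited region are rewritten, the mask is re-encoded, and the unchanged rendering path propagates the edit consistently to every viewpoint. A sketch with hypothetical label ids and region coordinates:

```python
import torch
import torch.nn.functional as F

HAIR, HAT = 13, 14                             # hypothetical label ids
mask = torch.randint(0, 19, (1, 128, 128))     # toy semantic label map

# Rewrite labels inside the edited area (the "yellow dashed box").
edited = mask.clone()
region = torch.zeros_like(mask, dtype=torch.bool)
region[:, 20:60, 30:100] = True
edited[region & (mask == HAIR)] = HAT          # e.g. swap hair for a hat

# One-hot encode and feed to the (stand-in) encoder; the rendering path is
# unchanged, so the edit shows up consistently in all rendered viewpoints.
onehot = F.one_hot(edited, 19).permute(0, 3, 1, 2).float()
```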

Results: Multi-Modal Synthesis

Multi-modal synthesis on CelebAMask-HQ. Styles are linearly blended from left to right. The last viewpoint in each group has the same pose as the input.

Multi-modal synthesis on CatMask.
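Multi-modal synthesis keeps the mask-derived code fixed and varies only the optional latent z; the left-to-right blends in these figures correspond to linear interpolation between two sampled z vectors. A sketch with illustrative dimensions:

```python
import torch

# Two random style latents for the same mask, blended linearly left-to-right.
z_a, z_b = torch.randn(1, 64), torch.randn(1, 64)
blended = [torch.lerp(z_a, z_b, t) for t in torch.linspace(0.0, 1.0, steps=5)]

# Each blended z is appended to the mask-derived code w before rendering,
# giving a smooth sweep of appearances that all respect the input layout.
w = torch.randn(1, 256)                        # mask-derived code (toy)
codes = [torch.cat([w, z], dim=-1) for z in blended]
```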

Citation

Acknowledgements

The website template was borrowed from Mip-NeRF.