Sem2NeRF: Converting Single-View Semantic Masks to Neural Radiance Fields
ECCV 2022
- Yuedong Chen Monash University
- Qianyi Wu Monash University
- Chuanxia Zheng Monash University
- Tat-Jen Cham Nanyang Technological University
- Jianfei Cai Monash University
Abstract
Image translation and manipulation have gained increasing attention along with the rapid development of deep generative models. Although existing approaches have brought impressive results, they mainly operate in 2D space. In light of recent advances in NeRF-based 3D-aware generative models, we introduce a new task, Semantic-to-NeRF translation, which aims to reconstruct a 3D scene modelled by NeRF, conditioned on a single-view semantic mask as input. To kick off this novel task, we propose the Sem2NeRF framework. In particular, Sem2NeRF addresses this highly challenging task by encoding the semantic mask into a latent code that controls the 3D scene representation of a pretrained decoder. To further improve the accuracy of the mapping, we integrate a new region-aware learning strategy into the design of both the encoder and the decoder. We verify the efficacy of the proposed Sem2NeRF and demonstrate that it outperforms several strong baselines on two benchmark datasets.
Model Architecture
Architecture of the Sem2NeRF framework, which converts a single-view semantic mask to a 3D scene represented by NeRF. Specifically, the given semantic mask is partitioned into patches, which are then encoded by a patch-based encoder into a latent style code for a pretrained NeRF-based 3D generator. A region R is randomly sampled to enforce awareness of differences among regions, and an optional latent vector z is included to enable multi-modal synthesis
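To make the data flow concrete, below is a minimal PyTorch sketch of the pipeline described above. Every name and dimension here (PatchEncoder, a 512-d style code, 19 semantic classes, 32-pixel patches) is an illustrative assumption, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Toy patch-based encoder: one-hot mask patches -> one style code."""
    def __init__(self, patch_size=32, num_classes=19, style_dim=512):
        super().__init__()
        self.patch_size = patch_size
        # Flatten each one-hot patch and project it to a token embedding.
        self.proj = nn.Linear(num_classes * patch_size ** 2, style_dim)
        self.head = nn.Linear(style_dim, style_dim)

    def forward(self, mask_onehot):
        # mask_onehot: (B, C, H, W) one-hot semantic mask.
        b, c, _, _ = mask_onehot.shape
        p = self.patch_size
        # Partition the mask into non-overlapping p x p patches.
        patches = mask_onehot.unfold(2, p, p).unfold(3, p, p)   # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        tokens = self.proj(patches)                             # (B, N, style_dim)
        # Pool tokens into a single style code (stand-in for the real head).
        return self.head(tokens.mean(dim=1))                    # (B, style_dim)

encoder = PatchEncoder()
mask = torch.zeros(1, 19, 128, 128)
mask[:, 0] = 1.0                    # dummy one-hot mask, all background
style_code = encoder(mask)          # conditions a pretrained NeRF-based generator
z = torch.randn(1, 512)             # optional latent for multi-modal synthesis
# During training, a region R of the mask would additionally be sampled at
# random for the region-aware learning strategy (omitted in this sketch).
print(style_code.shape, z.shape)    # torch.Size([1, 512]) torch.Size([1, 512])
```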
Results: Free-viewpoint image generation
Free-viewpoint image generation on CelebAMask-HQ. "Output" refers to the generated image that shares the viewing direction of the "Input", and "Overlay" shows "Output" overlaid on "Input" to better demonstrate the mapping accuracy
Free-viewpoint image generation on CatMask. "Output" refers to the generated image that shares the viewing direction of the "Input", and "Overlay" shows "Output" overlaid on "Input" to better demonstrate the mapping accuracy
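For reference, an "Overlay" panel of this kind can be produced by alpha-blending a colourised mask over the generated image. The sketch below uses NumPy and Pillow; the random palette and blend weight are assumptions, not the exact settings used for these figures.

```python
import numpy as np
from PIL import Image

def overlay(generated: Image.Image, mask: np.ndarray, alpha: float = 0.4) -> Image.Image:
    """Blend a (H, W) class-id mask on top of an RGB image."""
    rng = np.random.default_rng(0)
    palette = rng.integers(0, 256, size=(mask.max() + 1, 3), dtype=np.uint8)
    mask_rgb = Image.fromarray(palette[mask])          # colourise class ids
    return Image.blend(generated.convert("RGB"), mask_rgb, alpha)

img = Image.new("RGB", (128, 128), "gray")             # stand-in generated image
lbl = np.zeros((128, 128), dtype=np.int64)
lbl[32:96, 32:96] = 1                                  # one foreground region
overlay(img, lbl).save("overlay.png")
```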
Results: Semantic Masks Editing
Semantic mask editing on CelebAMask-HQ. The first row shows the results of the original semantic masks, while the following rows give the results of editing the indicated area, highlighted with a yellow dashed box. Three viewpoints are given for each group, with the first one having the same viewing direction as the input
Semantic mask editing on CatMask. Edited regions are highlighted with a yellow dashed box. The first viewpoint has the same pose as the input
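The editing workflow implied by these figures is: relabel a region of the input mask, re-encode the edited mask, and render novel views from the resulting style code. A small sketch follows, assuming integer class-id masks; the label ids below are hypothetical, not the dataset's actual ids.

```python
import numpy as np

HAIR, HAT = 1, 2   # hypothetical class ids, for illustration only

def edit_region(mask: np.ndarray, box, new_label: int) -> np.ndarray:
    """Relabel every pixel inside box = (y0, y1, x0, x1) to new_label."""
    edited = mask.copy()
    y0, y1, x0, x1 = box
    edited[y0:y1, x0:x1] = new_label
    return edited

mask = np.full((512, 512), HAIR, dtype=np.int64)
edited = edit_region(mask, (0, 128, 128, 384), HAT)
# The edited mask is then fed through the same encoder/generator pipeline
# as the original, and images are rendered from several camera poses.
```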
Results: Multi-Modal Synthesis
Multi-modal synthesis on CelebAMask-HQ. Styles are linearly blended from left to right. The last viewpoint in each group has the same pose as the input
Multi-modal synthesis on CatMask.
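The linear style blending shown above amounts to interpolating between two latent vectors z and decoding each intermediate latent. In the sketch below, `generator` and its call signature are placeholders for the pretrained NeRF-based generator, not a real API.

```python
import torch

def blend_styles(z0: torch.Tensor, z1: torch.Tensor, steps: int = 5):
    """Return latents linearly interpolated from z0 to z1 (inclusive)."""
    weights = torch.linspace(0.0, 1.0, steps)
    return [torch.lerp(z0, z1, w) for w in weights]

z0, z1 = torch.randn(1, 512), torch.randn(1, 512)
for z in blend_styles(z0, z1):
    # image = generator(style_code, z, camera_pose)  # hypothetical call
    pass
```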
Citation
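A BibTeX entry assembled from the title, authors, and venue listed above; the key and field formatting are our own, so please cross-check against the official proceedings.

```bibtex
@inproceedings{chen2022sem2nerf,
  title     = {Sem2NeRF: Converting Single-View Semantic Masks to Neural Radiance Fields},
  author    = {Chen, Yuedong and Wu, Qianyi and Zheng, Chuanxia and Cham, Tat-Jen and Cai, Jianfei},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2022}
}
```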
Acknowledgements
The website template was borrowed from Mip-NeRF.