MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views

NeurIPS 2024

1Monash University,  2VGG, University of Oxford,  3ETH Zurich, 
4University of Tübingen, Tübingen AI Center,  5Nanyang Technological University

TL;DR: MVSplat360 is a feed-forward approach for 360° novel view synthesis of diverse real-world scenes using only sparse observations.

Abstract

We introduce MVSplat360, a feed-forward approach for 360° novel view synthesis (NVS) of diverse real-world scenes, using only sparse observations. This setting is inherently ill-posed due to the minimal overlap among input views and the insufficient visual information they provide, making it challenging for conventional methods to achieve high-quality results. MVSplat360 addresses this by effectively combining geometry-aware 3D reconstruction with temporally consistent video generation. Specifically, it refactors a feed-forward 3D Gaussian Splatting (3DGS) model to render features directly into the latent space of a pre-trained Stable Video Diffusion (SVD) model, where these features then act as pose and visual cues to guide the denoising process and produce photorealistic 3D-consistent views. Our model is end-to-end trainable and supports rendering arbitrary views with as few as 5 sparse input views. To evaluate MVSplat360’s performance, we introduce a new benchmark using the challenging DL3DV-10K dataset, where MVSplat360 achieves superior visual quality compared to state-of-the-art methods on wide-sweeping or even 360° NVS tasks. Experiments on the existing benchmark RealEstate10K also confirm the effectiveness of our model.
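
The core idea above, features rendered from the feed-forward 3DGS model living in the SVD latent space and steering every denoising step, can be illustrated with a minimal PyTorch sketch. The function name, the assumed denoiser signature, and the toy update rule are illustrative placeholders, not the released implementation or SVD's actual scheduler.

import torch

def denoise_with_3dgs_condition(denoiser, rendered_latents, num_steps=25):
    """rendered_latents: (T, C, H, W) latent-space features splatted from the
    coarse 3DGS reconstruction along the target camera trajectory (assumed)."""
    latents = torch.randn_like(rendered_latents)          # start from pure noise
    for t in reversed(range(num_steps)):
        # The rendered features act as pose and appearance cues at every step.
        cond = torch.cat([latents, rendered_latents], dim=1)
        noise_pred = denoiser(cond, t)                    # assumed signature
        latents = latents - noise_pred / num_steps        # toy update, not SVD's scheduler
    return latents                                        # decode with the SVD VAE afterwards

# Tiny usage example with a placeholder network standing in for the SVD UNet.
toy_denoiser = lambda x, t: x[:, :4] * 0.1
feats = torch.randn(14, 4, 32, 32)                        # a 14-frame latent clip
print(denoise_with_3dgs_condition(toy_denoiser, feats).shape)  # torch.Size([14, 4, 32, 32])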

Overview

Overview of our MVSplat360. (a) Given sparse posed images as input, we first match and fuse the multi-view information using a multi-view Transformer and a cost volume-based encoder. (b) Next, a 3DGS representation is constructed to capture the coarse geometry of the entire scene. (c) Since such a coarse reconstruction is imperfect, we further adapt a pre-trained SVD, using features rendered from the 3DGS representation as conditions to achieve 360° novel view synthesis.
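
The three stages (a)-(c) can be summarized in the hedged PyTorch sketch below. Every module is a small stand-in (single convolutions in place of the multi-view Transformer, cost volume, Gaussian rasterizer, and SVD denoiser), so all names, shapes, and the fusion step are assumptions rather than the actual architecture.

import torch
import torch.nn as nn

class MVSplat360Sketch(nn.Module):
    """Toy end-to-end layout of stages (a)-(c); not the real architecture."""

    def __init__(self, feat_dim=64, latent_dim=4):
        super().__init__()
        # (a) stand-in for the multi-view Transformer + cost-volume encoder.
        self.encoder = nn.Conv2d(3, feat_dim, 3, padding=1)
        # (b) per-pixel Gaussian head: center offset (3), opacity (1), scale (3),
        #     rotation quaternion (4), plus latent features to be rendered
        #     into the SVD latent space.
        self.gaussian_head = nn.Conv2d(feat_dim, 3 + 1 + 3 + 4 + latent_dim, 1)
        # (c) stand-in for the pre-trained SVD denoiser, conditioned on the
        #     features rendered from the coarse 3DGS reconstruction.
        self.denoiser = nn.Conv2d(2 * latent_dim, latent_dim, 3, padding=1)

    def forward(self, views, noisy_latents):
        # views: (B, V, 3, H, W) sparse posed images (camera poses omitted here).
        b, v, _, h, w = views.shape
        feats = self.encoder(views.flatten(0, 1)).view(b, v, -1, h, w)
        fused = feats.mean(dim=1)                         # toy multi-view fusion
        gaussians = self.gaussian_head(fused)
        # A real implementation would splat the Gaussians to target viewpoints;
        # here the last feature channels simply stand in for the rendered latents.
        rendered_latents = gaussians[:, -self.denoiser.out_channels:]
        cond = torch.cat([noisy_latents, rendered_latents], dim=1)
        return self.denoiser(cond)                        # one refinement step

model = MVSplat360Sketch()
views = torch.randn(1, 5, 3, 64, 64)                      # as few as 5 input views
noisy = torch.randn(1, 4, 64, 64)                         # SVD-style latent frame
print(model(views, noisy).shape)                          # torch.Size([1, 4, 64, 64])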

Comparisons on DL3DV-10K

We present qualitative comparisons with the following state-of-the-art models:

  • pixelSplat [Charatan et al., CVPR 2024]: The pioneering feed-forward 3D Gaussian Splatting model, which uses a data-driven regression architecture to predict Gaussian centers; it produces poor visual quality in large-scale scenes.
  • MVSplat [Chen et al., ECCV 2024]: A recently published feed-forward 3DGS model that leverages a multi-view Transformer and cost volume, performing well on small-scale scenes. However, it lacks generative capability, limiting its ability to handle disoccluded or unobserved regions in sparse-view large-scale scenes.
  • latentSplat [Wewer et al., ECCV 2024]: Another recent feed-forward novel view synthesis approach that combines 3DGS with a VAE-GAN decoder. Its GAN-based framework improves visual quality, though only to a limited extent.
Figure: Qualitative comparison on the DL3DV-10K dataset.

Acknowledgements

This research is supported by the Monash FIT Start-up Grant. Dr. Chuanxia Zheng is supported by EPSRC SYN3D EP/Z001811/1.

BibTeX

@inproceedings{chen2024mvsplat360,
    title     = {MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views},
    author    = {Chen, Yuedong and Zheng, Chuanxia and Xu, Haofei and Zhuang, Bohan and Vedaldi, Andrea and Cham, Tat-Jen and Cai, Jianfei},
    booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
    year      = {2024},
}