Explicit Correspondence Matching for Generalizable Neural Radiance Fields
arXiv 2023
- Yuedong Chen Monash University
- Haofei Xu ETH Zurich
- Qianyi Wu Monash University
- Chuanxia Zheng University of Oxford
- Tat-Jen Cham Nanyang Technological University
- Jianfei Cai Monash University
Abstract
Model Architecture
MatchNeRF overview. Given \(N\) input images, we extract Transformer features, compute cosine similarities in a pair-wise manner, and finally merge all pair-wise cosine similarities with an element-wise average. I) For an image pair \(I_i\) and \(I_j\), we first extract downsampled convolutional features with a weight-sharing CNN. The convolutional features are then fed into a Transformer to model cross-view interactions with cross-attention. II) To predict the color and volume density of a point on a ray for volume rendering, we project the 3D point onto the 2D Transformer features \(F_i\) and \(F_j\) using the camera parameters, and bilinearly sample the feature vectors \(f_i\) and \(f_j\) at the projected locations. We then compute the cosine similarity \(z = \cos(f_i, f_j)\) between the sampled features to encode the correspondence matching information. III) \(z\) is then used together with the 3D position \(p\) and 2D view direction \(d\) to predict color \(c\) and density \(\sigma\). An additional ray Transformer models cross-point interactions along each ray.
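To make step II concrete, below is a minimal sketch, not the authors' released code, of projecting a batch of 3D points into two views' feature maps, bilinearly sampling the feature vectors, and computing the cosine similarity \(z = \cos(f_i, f_j)\). The helper names (project_to_ndc, sample_features, matching_cue), tensor layouts, camera conventions, and the use of PyTorch are assumptions.

```python
# Minimal sketch of step II: project 3D points into two views' Transformer
# feature maps, bilinearly sample feature vectors, and encode correspondence
# as the cosine similarity z = cos(f_i, f_j). Shapes and helpers are assumed.
import torch
import torch.nn.functional as F


def project_to_ndc(points, K, w2c, image_hw):
    """Project world points (N, 3) to 2D coords normalized to [-1, 1] (grid_sample convention)."""
    h, w = image_hw
    ones = torch.ones_like(points[:, :1])
    cam = (w2c @ torch.cat([points, ones], dim=-1).T).T[:, :3]   # world -> camera
    pix = (K @ cam.T).T                                          # camera -> pixel
    pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)               # perspective divide
    x = 2.0 * pix[:, 0] / (w - 1) - 1.0
    y = 2.0 * pix[:, 1] / (h - 1) - 1.0
    return torch.stack([x, y], dim=-1)                           # (N, 2)


def sample_features(feat, xy):
    """Bilinearly sample a (C, H, W) feature map at (N, 2) normalized coords -> (N, C)."""
    grid = xy.view(1, -1, 1, 2)                                  # (1, N, 1, 2)
    out = F.grid_sample(feat[None], grid, align_corners=True)    # (1, C, N, 1)
    return out[0, :, :, 0].T                                     # (N, C)


def matching_cue(points, feat_i, feat_j, K_i, w2c_i, K_j, w2c_j, image_hw):
    """Cosine similarity between features sampled at the two projected locations."""
    f_i = sample_features(feat_i, project_to_ndc(points, K_i, w2c_i, image_hw))
    f_j = sample_features(feat_j, project_to_ndc(points, K_j, w2c_j, image_hw))
    return F.cosine_similarity(f_i, f_j, dim=-1)                 # (N,), the matching cue z
```

With more than two input views, the per-pair similarities would be merged by element-wise averaging, as described above, before being passed with the position \(p\) and view direction \(d\) to the color and density prediction network.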
Direct inference on in-the-wild unseen scenes
[Three input/output comparison pairs; one input image courtesy of Qianyi, taken at the Queen Victoria Monument]
Direct inference on benchmark test scenes
- DTU: Scan38, View24
- Blender: Materials, View36
- RFF: Leaves, View13
Quantitative comparisons
Comparison with SOTA methods. MatchNeRF performs best for both 3-view and 2-view inputs. The viewpoints nearest to the target view are selected as inputs. By default, we measure over only the foreground or central regions, following MVSNeRF's settings, while \(\dagger\) indicates a more accurate metric measured over the whole image. Default 3-view results are taken from the MVSNeRF paper. We measure MVSNeRF's 3-view whole-image results with its pretrained weights, and retrain it with its released code to report its 2-view results.
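As an illustration of the two evaluation settings above, here is a minimal sketch, assuming NumPy images in [0, 1] and a hypothetical boolean mask fg_mask marking the foreground or central region, of PSNR measured over a restricted region versus the whole image.

```python
# Minimal sketch of masked vs. whole-image PSNR. The mask is a hypothetical
# boolean array marking the foreground / central region; images are in [0, 1].
import numpy as np

def psnr(pred, gt, mask=None):
    """PSNR over the whole image, or only over pixels where mask is True."""
    err = (pred - gt) ** 2
    if mask is not None:
        err = err[mask]                      # restrict to the evaluated region
    mse = float(err.mean())
    return 10.0 * np.log10(1.0 / mse)

# psnr(pred, gt)               -> measured over the whole image (dagger setting)
# psnr(pred, gt, mask=fg_mask) -> measured over the foreground / central region (default)
```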
Citation
Acknowledgements
The website template was borrowed from Mip-NeRF.