MatchNeRF

Explicit Correspondence Matching for Generalizable Neural Radiance Fields

TPAMI 2025
Yuedong Chen, Monash University
Haofei Xu, ETH Zurich
Qianyi Wu, Monash University
Chuanxia Zheng, University of Oxford
Tat-Jen Cham, Nanyang Technological University
Jianfei Cai, Monash University

Abstract

MatchNeRF overview

We present a new generalizable NeRF method that directly generalizes to new, unseen scenarios and performs novel view synthesis with as few as two source views. The key to our approach is explicitly modeled correspondence matching information, which provides a geometry prior for predicting NeRF color and density for volume rendering. The explicit correspondence matching is quantified by the cosine similarity between image features sampled at the 2D projections of a 3D point on different views, which provides reliable cues about the surface geometry. Unlike previous methods that extract image features independently for each view, we model cross-view interactions via Transformer cross-attention, which greatly improves the feature matching quality. Our method achieves state-of-the-art results across different evaluation settings, and our experiments show a strong correlation between the learned cosine feature similarity and volume density, demonstrating the effectiveness of the proposed method.
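To make the cross-view interaction idea concrete, below is a minimal PyTorch sketch (not the released MatchNeRF code) of a cross-view Transformer block: per-view CNN feature tokens first attend within their own view, then attend to the other view via cross-attention, so the resulting features are aware of potential correspondences. The class name CrossViewBlock and the layer sizes are illustrative placeholders.

import torch
import torch.nn as nn

class CrossViewBlock(nn.Module):
    """Illustrative cross-view interaction block: self-attention within a
    view followed by cross-attention to the other view's feature tokens."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x_i: torch.Tensor, x_j: torch.Tensor) -> torch.Tensor:
        # x_i, x_j: flattened feature tokens of two views, shape [B, H*W, C].
        x_i = x_i + self.self_attn(x_i, x_i, x_i, need_weights=False)[0]
        x_i = x_i + self.cross_attn(x_i, x_j, x_j, need_weights=False)[0]
        return x_i + self.ffn(x_i)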

Model Architecture

MatchNeRF architecture
MatchNeRF overview. Given \(N\) input images, we extract Transformer features, compute cosine similarities in a pair-wise manner, and merge all pair-wise similarities with an element-wise average. I) For an image pair \(I_i\) and \(I_j\), we first extract downsampled convolutional features with a weight-sharing CNN. The convolutional features are then fed into a Transformer to model cross-view interactions with cross-attention. II) To predict the color and volume density of a point on a ray for volume rendering, we project the 3D point onto the 2D Transformer feature maps \(F_i\) and \(F_j\) using the camera parameters and bilinearly sample the feature vectors \(f_i\) and \(f_j\) at the projected locations. We then compute the cosine similarity \(z = \cos(f_i, f_j)\) between the sampled features to encode the correspondence matching information. III) \(z\) is then used together with the 3D position \(p\) and 2D view direction \(d\) to predict the color \(c\) and density \(\sigma\). An additional ray Transformer models cross-point interactions along each ray.
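As a concrete illustration of step II), here is a minimal PyTorch sketch under simplifying assumptions: pinhole cameras whose intrinsics \(K\) are already scaled to the feature-map resolution, extrinsics \((R, t)\) mapping world to camera coordinates, and a scalar cosine similarity per view pair merged by element-wise average. Function names such as project_points, sample_features, and matching_cue are hypothetical and not the released MatchNeRF API.

import itertools
import torch
import torch.nn.functional as F

def project_points(pts, K, R, t, h, w):
    # Project 3D world points [N, 3] to grid_sample coordinates in [-1, 1].
    cam = pts @ R.T + t                            # world -> camera frame
    pix = cam @ K.T                                # camera -> homogeneous pixels
    xy = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)  # perspective division (simplified)
    return torch.stack([2 * xy[:, 0] / (w - 1) - 1,
                        2 * xy[:, 1] / (h - 1) - 1], dim=-1)

def sample_features(feat, grid):
    # Bilinearly sample a [C, H, W] feature map at [N, 2] grid coordinates.
    out = F.grid_sample(feat[None], grid[None, :, None, :],
                        mode='bilinear', align_corners=True)
    return out[0, :, :, 0].t()                     # [N, C]

def matching_cue(pts, feats, cams):
    # feats: list of per-view feature maps [C, H, W]; cams: list of (K, R, t).
    sampled = [sample_features(f, project_points(pts, K, R, t,
                                                 f.shape[-2], f.shape[-1]))
               for f, (K, R, t) in zip(feats, cams)]
    # Cosine similarity for every view pair, merged by element-wise average.
    sims = [F.cosine_similarity(sampled[i], sampled[j], dim=-1)
            for i, j in itertools.combinations(range(len(sampled)), 2)]
    return torch.stack(sims, dim=-1).mean(dim=-1, keepdim=True)  # [N, 1]

In MatchNeRF, the merged similarity \(z\) is then consumed together with \(p\) and \(d\) by the color and density decoder, as described in step III) above.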

Direct Inference on In-the-Wild Unseen Scenes

Input and output novel-view pairs rendered directly on unseen in-the-wild scenes. The second example was captured at the Queen Victoria Monument (image courtesy of Qianyi).

Direct Inference on Benchmark Test Scenes

DTU: Scan38, View24
Blender: Materials, View36
RFF: Leaves, View13

Quantitative Comparisons

MatchNeRF comparisons
Comparison with SOTA methods. MatchNeRF performs the best for both 3-view and 2-view inputs. The viewpoints nearest to the target view are selected as inputs. By default, we measure only over the foreground or central regions following MVSNeRF's settings, while † indicates a more accurate metric measured over the whole image. GeoNeRF is retrained without depth supervision for a fair comparison. We measure MVSNeRF's 3-view whole-image results with its pretrained weights, and retrain it with its released code to report 2-view results.

Acknowledgements

This research is supported by the Monash FIT Start-up Grant. Dr. Chuanxia Zheng is supported by EPSRC SYN3D EP/Z001811/1.

BibTeX

@article{chen2025explicit,
  title={Explicit correspondence matching for generalizable neural radiance fields},
  author={Chen, Yuedong and Xu, Haofei and Wu, Qianyi and Zheng, Chuanxia and Cham, Tat-Jen and Cai, Jianfei},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2025},
  publisher={IEEE}
}