Explicit Correspondence Matching
for Generalizable Neural Radiance Fields
arXiv 2023

Abstract

We present a new generalizable NeRF method that directly generalizes to new, unseen scenarios and performs novel view synthesis from as few as two source views. The key to our approach is explicitly modeled correspondence matching information, which provides a geometry prior for the prediction of NeRF color and density in volume rendering. The explicit correspondence matching is quantified with the cosine similarity between image features sampled at the 2D projections of a 3D point on different views, which provides reliable cues about the surface geometry. Unlike previous methods that extract image features independently for each view, we model cross-view interactions via Transformer cross-attention, which greatly improves the feature matching quality. Our method achieves state-of-the-art results under different evaluation settings, and our experiments show a strong correlation between the learned cosine feature similarity and volume density, demonstrating the effectiveness and superiority of the proposed method.
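
As a minimal, hedged illustration of this matching cue (not the released implementation; shapes and variable names are arbitrary placeholders), the per-point cue reduces to a cosine similarity between feature vectors sampled from two views:

```python
import torch
import torch.nn.functional as F

# Feature vectors sampled at the 2D projections of the same 3D points
# in two different source views (placeholder shapes: 1024 points, 128 channels).
f_i = torch.randn(1024, 128)
f_j = torch.randn(1024, 128)

# High cosine similarity indicates consistent appearance across views,
# i.e. a reliable cue that the 3D point lies on a visible surface.
z = F.cosine_similarity(f_i, f_j, dim=-1)  # shape: (1024,)
```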

Model Architecture

MatchNeRF overview. Given \(N\) input images, we extract Transformer features, compute the cosine similarity in a pair-wise manner, and finally merge all pair-wise cosine similarities with an element-wise average. I) For an image pair \(I_i\) and \(I_j\), we first extract downsampled convolutional features with a weight-sharing CNN. The convolutional features are then fed into a Transformer to model cross-view interactions with cross-attention. II) To predict the color and volume density of a point on a ray for volume rendering, we project the 3D point onto the 2D Transformer feature maps \(F_i\) and \(F_j\) using the camera parameters and bilinearly sample the feature vectors \(f_i\) and \(f_j\) at the projected locations. We then compute the cosine similarity \(z = \cos(f_i, f_j)\) between the sampled features to encode the correspondence matching information. III) \(z\) is then used, together with the 3D position \(p\) and 2D viewing direction \(d\), to predict the color \(c\) and density \(\sigma\). An additional ray Transformer is used to model cross-point interactions along a ray.
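
The following PyTorch sketch illustrates stage II under common assumptions: a pinhole camera with intrinsics \(K\) and world-to-camera extrinsics \(R, t\), and a single scalar similarity per point as a simplification. All function and tensor names here are illustrative, not the released MatchNeRF API.

```python
import torch
import torch.nn.functional as F

def sample_view_feature(feat, pts, K, R, t):
    """Project 3D points into one view and bilinearly sample its feature map.

    feat: (C, H, W) Transformer feature map of the view
    pts:  (M, 3) 3D points in world coordinates
    K:    (3, 3) intrinsics; R: (3, 3), t: (3,) world-to-camera pose
    """
    cam = pts @ R.T + t                        # world -> camera coordinates
    uvw = cam @ K.T                            # camera -> homogeneous pixel coords
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)
    H, W = feat.shape[1:]
    # Normalise pixel coordinates to [-1, 1] as required by grid_sample.
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    sampled = F.grid_sample(feat[None], grid[None, :, None, :],
                            mode="bilinear", align_corners=True)  # (1, C, M, 1)
    return sampled[0, :, :, 0].T               # (M, C) sampled feature vectors

def matching_cue(feats, cams, pts):
    """Element-wise average of pair-wise cosine similarities over all views."""
    sims = []
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            f_i = sample_view_feature(feats[i], pts, *cams[i])
            f_j = sample_view_feature(feats[j], pts, *cams[j])
            sims.append(F.cosine_similarity(f_i, f_j, dim=-1))
    return torch.stack(sims).mean(dim=0)       # (M,) matching cue z per 3D point
```

With \(N\) input views, the second function averages over all view pairs, matching the element-wise pair-wise merging described above.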

Direct inference on in-the-wild unseen scenes

Input / output comparisons for three in-the-wild scenes (one input courtesy of Qianyi, taken at Queen Victoria Monument).

Direct inference on benchmark test scenes

DTU: Scan38, View24

Blender: Materials, View36

RFF: Leaves, View13

Quantitative comparisons

Comparison with SOTA methods. MatchNeRF performs the best for both 3- and 2-view inputs. The viewpoints nearest to the target view are selected as input. By default we measure over only the foreground or central regions following MVSNeRF's settings, while \(\dagger\) indicates a more accurate metric measured over the whole image. Default 3-view results are borrowed from MVSNeRF's paper. We measure MVSNeRF's 3-view whole-image results with its pretrained weights, and retrain it with its released code to report 2-view results.
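
For clarity on the two protocols, here is a hedged sketch of PSNR computed over a masked central region versus the whole image; the mask construction is a generic example, not the exact MVSNeRF evaluation script.

```python
import torch

def psnr(pred, gt, mask=None):
    """PSNR between images in [0, 1]; optionally restricted to a boolean mask."""
    if mask is not None:
        pred, gt = pred[mask], gt[mask]
    mse = torch.mean((pred - gt) ** 2)
    return -10.0 * torch.log10(mse + 1e-10)

pred, gt = torch.rand(300, 400, 3), torch.rand(300, 400, 3)
# Central-region metric (default protocol) vs. whole-image metric (dagger protocol).
center = torch.zeros(300, 400, 3, dtype=torch.bool)
center[75:225, 100:300] = True
print(psnr(pred, gt, center), psnr(pred, gt))
```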

Citation

Acknowledgements

The website template was borrowed from Mip-NeRF.