RING-NeRF : Rethinking Inductive Biases for Versatile and Efficient Neural Fields

🎉 RING-NeRF has been accepted to ECCV'24! See you @ Milan! 🎉

Abstract

Recent advances in Neural Fields mostly rely on developing task-specific supervision, which often complicates the models. Rather than developing hard-to-combine, task-specific modules, another generally overlooked approach is to directly inject generic priors on the scene representation (also called inductive biases) into the NeRF architecture. Based on this idea, we propose the RING-NeRF architecture, which includes two inductive biases: a continuous multi-scale representation of the scene and an invariance of the decoder's latent space over the spatial and scale domains. We also design a single reconstruction process that takes advantage of those inductive biases and experimentally demonstrate on-par performance, in terms of quality, with dedicated architectures on multiple tasks (anti-aliasing, few-view reconstruction, SDF reconstruction without scene-specific initialization) while being more efficient. Moreover, RING-NeRF has the distinctive ability to dynamically increase the resolution of the model, opening the way to adaptive reconstruction.

Global Scheme

Overview of RING-NeRF: to render a pixel, the cast cone is sampled with cubes. Depending on the cube's volume, the corresponding LOD of the scene is selected and the latent feature is computed as a weighted sum over the grid hierarchy. The density (or SDF) and color of the cube are first decoded from the latent feature with a tiny MLP and then integrated with the other samples through volume rendering.
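As a rough illustration (not the paper's exact implementation), the per-sample feature lookup can be sketched as a weighted sum over the residual grid hierarchy, where a continuous LOD decides how many levels contribute; `lookup` is a hypothetical placeholder that uses nearest-neighbor access in place of trilinear interpolation:

```python
import numpy as np

def lookup(grid, p):
    """Nearest-neighbor lookup standing in for trilinear interpolation.
    grid: (R, R, R, F) feature volume; p: point in [0, 1]^3."""
    r = grid.shape[0]
    idx = np.clip((np.asarray(p) * r).astype(int), 0, r - 1)
    return grid[tuple(idx)]

def ring_feature(grids, p, lod):
    """Latent feature at a continuous LOD: sum of the residual features of
    all fully active levels, plus a fractionally weighted finest level."""
    full = int(np.floor(lod))
    frac = lod - full
    feat = np.zeros(grids[0].shape[-1])
    for k in range(min(full, len(grids))):
        feat += lookup(grids[k], p)
    if frac > 0 and full < len(grids):
        feat += frac * lookup(grids[full], p)
    return feat
```

Because the contribution of each level is additive, moving the LOD continuously between two integers smoothly blends the finer level in, rather than switching representations abruptly.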

A NeRF Architecture with an LOD Inductive Bias

While a Neural Field based on a single 3D feature grid defines a single mapping function between the MLP-decoder feature space and the 3D scene space, our architecture, with its 3D hierarchical grid linked by residual connections, defines such a function continuously over the scale space. The architecture itself induces, by construction, a notion of level of details in the representation, allowing the scene to be reconstructed at multiple precisions even when the reconstruction is only supervised at the highest level of details.

Interactive comparison (slider): First Grid Only vs. All 8 Grids.

The model illustrated here contains 8 grids of increasing resolution, all trained jointly at maximum precision. Even without specific supervision, any LOD between 1 and 8 can be chosen to reconstruct the scene at the corresponding resolution.

Moreover, our architecture relies on a decoder latent space that is invariant to both the position in the 3D scene and the level of details. This contrasts with other architectures whose latent space is not invariant to position due to the use of positional encoding (e.g. NGLOD), nor invariant to the level of details due to feature concatenation (e.g. Instant-NGP), LOD-based feature modification (e.g. Zip-NeRF), or per-LOD MLP decoders (e.g. PyNeRF). This property makes our architecture particularly suitable for incremental reconstruction in both space and level of details.
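The difference between concatenation-based and summation-based latent spaces can be made concrete with a toy example (feature dimensions here are arbitrary, for illustration only): with concatenation the decoder's input size depends on how many levels are active, while with residual summation it does not.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, num_levels = 4, 8
levels = [rng.standard_normal(feat_dim) for _ in range(num_levels)]

# Concatenation (Instant-NGP style): the latent dimension grows with the
# number of active levels, so the decoder input depends on the LOD.
concat_2 = np.concatenate(levels[:2])
concat_8 = np.concatenate(levels[:8])

# Residual summation (RING-NeRF style): the latent dimension is constant,
# so a single decoder can serve every LOD and every position.
sum_2 = np.sum(levels[:2], axis=0)
sum_8 = np.sum(levels[:8], axis=0)
```

With a constant latent dimension, new levels can be introduced without retraining or resizing the decoder, which is what makes incremental reconstruction over scale possible.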


Distance-Aware Forward Mapping for Fast Anti-Aliasing

Anti-Aliasing Mechanism

We also leverage the properties of our architecture to propose an adapted distance-aware forward mapping that easily solves aliasing issues.

The nerfacto baseline (right) cannot easily handle images of different resolutions and/or varying observation distances. This results in blurred renderings at high resolution and the appearance of aliasing artifacts at low resolution. In comparison, RING-NeRF (left) better adapts to these parameters and produces better renderings at both high and low resolutions (respectively near and far observations).

Fast Training and Rendering

Similarly to Zip-NeRF, our solution uses cone-casting to adjust the level of details depending on the distance of the sample to the camera. However, unlike Zip-NeRF, the LOD is directly captured by our scene representation. Consequently, our solution is much faster than Zip-NeRF for both reconstruction and rendering, since it does not need to perform multi-sampling, while providing equivalent quality.
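The distance-to-LOD mapping can be sketched as follows (a simplified stand-in for the paper's exact formula, with hypothetical parameter names): the cone's footprint grows linearly with distance, and the continuous LOD is chosen so that the grid cell size matches that footprint.

```python
import numpy as np

def lod_from_distance(t, pixel_radius, base_res, num_levels, scale=2.0):
    """Map a sample's cone footprint to a continuous LOD.
    t: distance along the ray; pixel_radius: cone radius at unit distance;
    base_res: resolution of the coarsest grid; scale: per-level growth factor.
    Higher LOD = finer level."""
    radius = pixel_radius * t            # cone radius grows linearly with distance
    cell0 = 1.0 / base_res               # cell size of the coarsest grid
    lod = np.log(cell0 / radius) / np.log(scale) + 1.0
    return float(np.clip(lod, 1.0, num_levels))
```

Near samples thus read from fine levels and far samples from coarse ones, so a single point query per sample already integrates the right amount of detail, with no multi-sampling needed.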

Supervision-Free Anti-Aliasing

Thanks to its LOD inductive bias, RING-NeRF does not rely on supervision to produce different levels of detail. As a consequence, our architecture can generalize to novel observation distances without seeing them during training. This property, although important, is not common among anti-aliasing methods: neither PyNeRF nor Zip-NeRF can render coherent, aliasing-free images at novel observation distances. This is illustrated in the following video, where all three models were trained solely on full-resolution images and renderings were produced at 1/8th resolution. Note that the aliasing artifacts are especially visible on the central object.

Continuous Coarse-To-Fine Optimization for More Robust NeRF Models

NeRF models often lack robustness in harder-than-usual setups, such as reconstruction from few viewpoints or surface-based reconstruction without scene-specific initialization. While we demonstrated that RING-NeRF is inherently more stable than the Instant-NGP-based nerfacto, we also propose an adapted coarse-to-fine training process that brings substantially more robustness. While some works already use a similar progressive regularization with Instant-NGP-based models (e.g. NeuralAngelo), we benefit from RING-NeRF's properties to obtain a more suitable continuous process, which progressively uses more degrees of freedom of the mapping function in the decoder latent space rather than learning new dimensions of it from scratch, as concatenation-based architectures must.
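One way such a continuous coarse-to-fine schedule could look (an illustrative sketch, not the paper's exact schedule): the active LOD grows with training progress, and the finest active level is blended in with a fractional weight instead of being switched on abruptly.

```python
import numpy as np

def level_weights(progress, num_levels):
    """Per-level weights for a continuous coarse-to-fine schedule.
    progress: training progress in [0, 1]. The coarsest level is always
    fully active; finer levels fade in one after another."""
    lod = 1.0 + progress * (num_levels - 1)   # grows from 1 to num_levels
    return np.clip(lod - np.arange(num_levels), 0.0, 1.0)
```

Because the latent space is shared across levels, ramping a level's weight from 0 to 1 simply refines the existing mapping function instead of introducing brand-new decoder inputs mid-training.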

We first demonstrate this increased stability on few-viewpoint setups. Combined with a simple density loss taken from FreeNeRF to reduce near-camera artifacts, it results in more 3D-coherent reconstructions from few views than the adapted nerfacto+ baseline (nerfacto with coarse-to-fine and the same density loss). Note that the unadapted nerfacto does not succeed in creating a coherent geometry, as it mostly overfits in front of the training cameras.

SDF reconstruction is known to be less stable than density-based NeRF reconstruction, as it adds an Eikonal constraint on the model's output and usually requires a scene-specific initialization to converge. This scene-specific initialization becomes an issue in complex environments and incremental setups, where several types of scenes can coexist and are not necessarily known beforehand. Building on the increased stability shown above, we also demonstrate that RING-NeRF can forego this initialization while keeping both a precision close to that of initialization-dependent models and the convergence speed of faster models (e.g. Neus-Facto) that fail without initialization.
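For reference, the Eikonal constraint mentioned above is the standard regularizer that pushes the norm of the SDF gradient toward 1, which any valid signed distance field satisfies; a minimal sketch:

```python
import numpy as np

def eikonal_loss(grad_sdf):
    """Eikonal regularizer: mean squared deviation of the SDF gradient
    norm from 1. grad_sdf: (N, 3) gradients at N sample points."""
    norms = np.linalg.norm(grad_sdf, axis=-1)
    return float(np.mean((norms - 1.0) ** 2))
```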

Resolution Extensibility Property

One unique property of our scene representation is its capacity to be dynamically refined by adding new grid levels without modifying the decoder's weights or the previously trained grids. This resolution extensibility property opens the path to adaptive-resolution models, where the precision used to describe an area depends on the details needed, optimizing both memory consumption and training time.

Here, we showcase this property. We use a grid hierarchy of 6 levels, from a resolution of 16 up to 512. We begin the reconstruction with only the first three levels, train both the grids and the decoder to convergence, and then freeze them. We then train one new grid at a time to convergence, freezing each before adding the next.
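The bookkeeping behind this add-and-freeze loop can be sketched as follows (a toy abstraction with hypothetical names, ignoring the initial jointly-trained levels and the decoder): each new level doubles the resolution, and everything trained so far is frozen before the new level becomes trainable.

```python
import numpy as np

class ExtensibleField:
    """Toy sketch of resolution extensibility: grid levels are appended
    one at a time, and previously trained levels are frozen in place."""

    def __init__(self, base_res, feat_dim):
        self.grids = []          # list of (feature_volume, trainable) pairs
        self.base_res = base_res
        self.feat_dim = feat_dim

    def add_level(self):
        r = self.base_res * 2 ** len(self.grids)   # resolution doubles per level
        # Freeze everything trained so far, then append a fresh trainable level.
        self.grids = [(g, False) for g, _ in self.grids]
        self.grids.append((np.zeros((r, r, r, self.feat_dim)), True))

    def trainable_levels(self):
        return [i for i, (_, trainable) in enumerate(self.grids) if trainable]
```

Since coarser levels and the decoder never change, earlier LODs remain valid renderings of the scene while the new level adds residual detail on top.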

This showcases both the capacity of our model to reconstruct finer details after the decoder's training and its ability to keep the coarser LODs valid.

Citation

Acknowledgements

This publication was made possible by the use of the CEA List FactoryIA supercomputer, financially supported by the Ile-de-France Regional Council.

The website template was borrowed from Michaël Gharbi, Ref-NeRF and nerfies.