LLaVA³: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs

🎉 LLaVA³ has been accepted to AAAI'26! See you in Singapore! 🎉

LLaVA³ empowers the 3D understanding abilities of Vision Language Models (VLMs) through a new paradigm of 2D visual representations of 3D scenes. Our representation relies on an object-centric description of the scene, with each object visually described jointly from a multitude of viewpoints. Reconstructed from multi-view images, this representation enables VLMs to perform various tasks such as 3D Visual Question Answering, 3D Grounding, or 3D Semantic Segmentation.




Abstract

Developing a multi-modal language model capable of understanding 3D scenes remains challenging due to the limited availability of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models (VLMs). As an alternative, we introduce LLaVA³ (pronounced LLaVA Cube), a novel method that improves the 3D scene understanding capabilities of VLMs using only multi-view 2D images, and without requiring any fine-tuning. Inspired by Cubist painters, who represented multiple viewpoints of a 3D object within a single 2D picture, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object. These representations are derived from an intermediate multi-view 3D reconstruction of the scene. Extensive experiments on 3D visual question answering and 3D language grounding show that our approach significantly outperforms previous 2D-based VLM solutions.



Overview of LLaVA³


Overview of LLaVA³. We first reconstruct the 3D scene as a NeRF from multi-view images with an associated LLaVA feature field. We also derive a hierarchical 3D segmentation of our NeRF. For each object, we create an omnidirectional visual description as a set of tokens. After object re-ordering, we can finally feed them to the VLM for 3D interpretation.



What is different from previous 3D scene understanding paradigms?

Our approach differs fundamentally from previous works. While 3D Multi-modal LLMs (3D MLLMs) are often limited by data scarcity, and regular 2D Multi-view VLMs struggle with uneven spatial sampling and a lack of structure, LLaVA³ introduces a structured, object-centric paradigm. Unlike hybrid approaches that rely on "global sampling", which leads to unstructured and redundant "token soup" representations, we propose a "Cubist" approach. We decompose the scene into distinct 3D objects and describe each one through a curated set of omnidirectional visual tokens. This ensures structured, spatially diverse, and semantically rich input for the VLM.



How does LLaVA³ work in practice?


Reconstructing the 3D scene with LLaVA and multi-scale SAM feature fields

We extend the grid-based Nerfacto model to learn a dense 3D LLaVA feature field. Crucially, to capture both the semantics of objects and their spatial relationships, we jointly model view-independent (VI) features and view-dependent (VD) features. The VI features provide stable semantic information invariant to the viewpoint, while the VD features capture spatial relations (like occlusion or relative position) that change with the observer's angle. This results in a rich 3D representation where we can query LLaVA tokens from any point in space and any viewing direction. Additionally, to structure the scene into objects, we learn a 3-level hierarchical feature field supervised by multi-scale SAM masks (object, part, sub-part). We also augment the field with an open-vocabulary semantic branch by supervising each SAM mask with its associated CLIP feature.
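For intuition, a minimal PyTorch sketch of such a two-branch head is shown below. The class name, feature dimensions, and the way the backbone geometry and direction embeddings are obtained are illustrative assumptions, not our actual implementation.

```python
import torch
import torch.nn as nn

class TwoBranchFeatureHead(nn.Module):
    """Illustrative sketch (not the official LLaVA³ code) of a feature head that
    predicts view-independent (VI) and view-dependent (VD) LLaVA-token features
    on top of a grid-based NeRF backbone such as Nerfacto."""

    def __init__(self, geo_dim=15, dir_dim=16, feat_dim=1024, hidden=256):
        super().__init__()
        # VI branch: conditioned only on the geometry embedding -> stable semantics.
        self.vi_head = nn.Sequential(
            nn.Linear(geo_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )
        # VD branch: also conditioned on the encoded viewing direction, so it can
        # capture occlusions and relative spatial relations.
        self.vd_head = nn.Sequential(
            nn.Linear(geo_dim + dir_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, geo_embedding, dir_embedding):
        vi = self.vi_head(geo_embedding)                                   # (N, feat_dim)
        vd = self.vd_head(torch.cat([geo_embedding, dir_embedding], -1))   # (N, feat_dim)
        return vi, vd
```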

Here, we showcase the 3D feature fields associated with the scene, representing both LLaVA tokens and multi-scale hierarchical instance features together with their CLIP semantics.


Extracting an object hierarchy from the multi-scale SAM feature field

Using per-scale HDBSCAN clustering on these SAM-based features, we extract a discrete 3D hierarchical scene graph. To reduce segmentation noise and improve consistency, we refine this hierarchy via heuristics based on the associated multi-scale CLIP feature field. This graph organizes the scene into a coherent hierarchy, allowing us to identify unique objects and their constituent parts.
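The sketch below illustrates the clustering-and-linking idea with the hdbscan Python package: features are clustered independently at each scale, and each fine cluster is attached to the coarse cluster that covers most of its points. The scale names and dictionary layout are assumptions, and the CLIP-based refinement heuristics are omitted.

```python
import numpy as np
import hdbscan  # pip install hdbscan

def build_hierarchy(feats_by_scale, min_cluster_size=20):
    """Sketch only: `feats_by_scale` is assumed to map the scale names
    "object", "part", "subpart" to (num_points, dim) feature arrays sampled
    from the multi-scale SAM feature field at the same 3D points."""
    labels = {
        scale: hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(f)
        for scale, f in feats_by_scale.items()
    }

    def link(child, parent):
        # Attach each child cluster to the parent cluster covering most of its points.
        mapping = {}
        for c in set(child) - {-1}:                 # -1 = HDBSCAN noise label
            parents = parent[child == c]
            parents = parents[parents != -1]
            if parents.size:
                mapping[int(c)] = int(np.bincount(parents).argmax())
        return mapping

    part_to_object = link(labels["part"], labels["object"])
    subpart_to_part = link(labels["subpart"], labels["part"])
    return labels, part_to_object, subpart_to_part
```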

Here, we showcase the object hierarchy extracted from the multi-scale SAM feature field. We can see that the scene is decomposed into distinct 3D segments that are clustered into a hierarchy, accurately grouping sub-parts into parts, and parts into objects.


Creating an omnidirectional visual description for each object

Once we have our objects, we generate an "omni-directional" visual description for each one. Instead of taking a single snapshot, we sample LLaVA tokens from the 3D feature field, distributing the sampling points equally across the object's sub-components to ensure coverage. We adaptively mix view-independent features (for robust semantics) and view-dependent features (aligned with a canonical viewing direction for spatial context). This collection of tokens forms a "Cubist" representation, describing the object from all relevant angles simultaneously.
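A simplified sketch of this sampling step is given below; the `field` and `obj` interfaces, as well as the fixed mixing weight standing in for our adaptive scheme, are illustrative placeholders.

```python
import torch

def omnidirectional_description(field, obj, n_tokens=64, alpha=0.5):
    """Sketch only. Assumes `field.query_vi(points)` and
    `field.query_vd(points, dirs)` return LLaVA-token features, and that `obj`
    exposes a list of sub-part point clouds plus a canonical viewing direction."""
    per_part = max(1, n_tokens // len(obj.subparts))
    tokens = []
    for pts in obj.subparts:                               # (P, 3) surface samples
        idx = torch.randperm(pts.shape[0])[:per_part]      # spread the budget over sub-parts
        samples = pts[idx]
        dirs = obj.canonical_dir.expand(samples.shape[0], 3)
        vi = field.query_vi(samples)                       # stable semantics
        vd = field.query_vd(samples, dirs)                 # viewpoint-aware relations
        tokens.append(alpha * vi + (1.0 - alpha) * vd)     # fixed mix in place of the adaptive one
    return torch.cat(tokens, dim=0)                        # (~n_tokens, feat_dim)
```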

We extract a set of visual tokens from the NeRF for each object. These tokens are not just random points; they are carefully selected to represent the object's geometry and appearance comprehensively for the VLM.


Ordering the objects to create a coherent scene description

Finally, we need to feed these object descriptions to the VLM. Since LLMs process input sequentially, the order matters. We employ a deterministic "radar-inspired" sorting strategy. We compute the centroid of each object and sort them by their polar angle around the scene's center (sweeping from a fixed start point). This provides the VLM with a consistent, geometry-aware sequence of inputs, acting like a structured scan of the room, which significantly aids in spatial reasoning tasks.
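The ordering itself boils down to a few lines; the sketch below assumes each object is given as a point cloud and sweeps counter-clockwise from the +x axis of the scene frame.

```python
import numpy as np

def radar_order(object_points):
    """Sketch of the radar-style ordering: sort objects by the polar angle of
    their centroid around the scene centre (top-down view). The start direction
    (+x axis) and the counter-clockwise sweep are assumptions."""
    centroids = np.stack([pts.mean(axis=0) for pts in object_points])  # (K, 3)
    center = centroids.mean(axis=0)
    rel = centroids[:, :2] - center[:2]                    # project to the ground plane
    angles = np.mod(np.arctan2(rel[:, 1], rel[:, 0]), 2 * np.pi)
    return np.argsort(angles)                              # object indices in sweep order
```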

Here, we showcase the ordered objects in the scene. We can see the "radar sweep" logic in action, organizing the objects into a sequence that preserves their spatial layout context for the language model.



LLaVA³ Results


Visual Question Answering and Grounding

We evaluated LLaVA³ on challenging 3D scene understanding benchmarks. For 3D Visual Question Answering (ScanQA and MSR3D), our structured, object-centric approach significantly outperforms previous 2D-based VLMs and unstructured hybrid methods (like SplatTalk). By explicitly modeling objects and their spatial relations via view-dependent features, LLaVA³ excels at answering complex queries about existence, attributes, and spatial navigation. Because our representation is inherently tied to 3D object segments, LLaVA³ can also perform qualitative grounding by directly predicting the ID of the object mentioned in a user question, allowing for precise retrieval and highlighting of the target object in 3D space.
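As a rough illustration of the grounding mechanism: if each object's token set is tagged with an identifier in the prompt, grounding reduces to parsing that identifier back out of the VLM answer and returning the matching 3D segment. The tag format and `hierarchy` interface below are assumptions, not our exact prompt.

```python
import re

def parse_grounded_object(vlm_answer, hierarchy):
    """Sketch only: expects object tags of the (hypothetical) form <obj_7> in
    the VLM answer and returns the corresponding 3D segment to highlight."""
    match = re.search(r"<obj_(\d+)>", vlm_answer)
    if match is None:
        return None                                        # no object was grounded
    return hierarchy.get_object(int(match.group(1)))       # assumed lookup on the scene graph
```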

In this video, we show examples of VQA and grounding. LLaVA³ correctly identifies objects and spatial relationships that other baselines miss, such as counting cushions accurately or describing the location of objects relative to other furniture. It can also detect objects from question prompts that do not contain the specific object name.




Because we retrieve per-object visual tokens to feed the VLM, we can also perform the same question-driven tasks on specific objects (or parts and sub-parts) or scene regions by feeding the VLM only the tokens of the selected objects. This enables both localized VQA and fine-grained grounding, as shown in the next video.




Omni-types Segmentation

Leveraging our hierarchical SAM-CLIP feature field, LLaVA³ supports a wide range of segmentation tasks beyond VQA, even when the VLM is discarded entirely. This includes 3D Semantic Segmentation, Open-Vocabulary Segmentation, and multi-scale Instance Segmentation. The hierarchy allows us to query the scene at different levels of granularity, detecting whole objects, specific parts, or even sub-parts based on textual descriptions.
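As an example of how open-vocabulary queries can be resolved without the VLM, the hedged sketch below compares a CLIP text embedding against the aggregated CLIP feature of each segment in the hierarchy; the OpenCLIP model choice and the tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F
import open_clip  # assumption: any CLIP implementation matching the field's features works

def open_vocab_segment(segment_clip_feats, query, model_name="ViT-B-16",
                       pretrained="laion2b_s34b_b88k"):
    """Sketch only: `segment_clip_feats` is assumed to be a (num_segments, clip_dim)
    tensor aggregated from the CLIP branch of the feature field, one row per
    object/part/sub-part in the hierarchy."""
    model, _, _ = open_clip.create_model_and_transforms(model_name, pretrained=pretrained)
    tokenizer = open_clip.get_tokenizer(model_name)
    with torch.no_grad():
        text = model.encode_text(tokenizer([query]))       # (1, clip_dim)
    text = F.normalize(text, dim=-1)
    feats = F.normalize(segment_clip_feats, dim=-1)
    scores = (feats @ text.T).squeeze(-1)                  # cosine similarity per segment
    return int(scores.argmax())                            # best-matching segment id
```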


This image showcases multi-scale Instance Segmentation (top row) and Semantic Segmentation (bottom row) results. On the semantic segmentation task in particular, our method produces clean, coherent segmentation maps that respect object boundaries, significantly outperforming methods like LERF or OpenNeRF in mean IoU and accuracy.


Finally, we visualize here Open-Vocabulary Segmentation results. The model can segment specific objects using only the SAM-CLIP feature field, without involving the LLaVA VLM.



Citation

Acknowledgements

This publication was made possible by the use of the CEA List FactoryIA supercomputer, financially supported by the Île-de-France Regional Council.

The website template was borrowed from Michaël Gharbi, Ref-NeRF and nerfies.