Poster Presentation

Poster Session A: Tuesday, August 12, 1:30 – 4:30 pm, de Brug & E‑Hall

On the Origin of 3D Perception in Visual World Models

Wanhee Lee¹, Klemen Kotar¹, Jared Watrous¹, Rahul Mysore Venkatesh¹, Honglin Chen¹, Khai Loong Aw¹, Khaled Jedoui¹, Daniel LK Yamins¹; ¹Stanford University

Presenter: Wanhee Lee

3D perception is fundamental to both biological and artificial vision, enabling navigation, interaction, and scene understanding. However, learning 3D structure in a self-supervised manner remains challenging: fully structured geometric methods impose rigid constraints that limit adaptability to natural videos, while unstructured, data-driven approaches lack geometric consistency and controllability. We propose a hybrid approach that starts with minimal priors and progressively builds structured representations from intermediate cues. Specifically, we extract optical flow from an autoregressive video model, use it to infer depth and subsequently 3D shape, and feed these representations back into the model. Our framework enables 3D understanding from a single image, achieving human-level depth estimation, supporting shape inference beyond visible surfaces, and completing 3D scene representations without explicit supervision. Moreover, the model’s learning trajectory aligns with developmental patterns of depth perception in humans, providing insights into both cognitive and artificial vision. These findings demonstrate that 3D perception can emerge through minimally structured learning in a developmentally plausible way.
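
As a rough illustration of the flow-to-depth-to-shape chain described above, the sketch below uses the classical motion-parallax relation rather than the authors' model, whose details the abstract does not give. It assumes a pinhole camera that translates purely sideways by a known baseline, so that depth is Z = f * b / |u| for horizontal flow u. The function names, the parameters (focal_px, baseline_m, cx, cy), and the example values are all hypothetical.

```python
# Illustrative only: classical motion parallax, NOT the method of the poster.
# Assumption: pure sideways camera translation and a pinhole camera with
# known focal length (pixels) and baseline (meters).

def depth_from_flow(flow_u, focal_px, baseline_m, min_flow=1e-6):
    """Convert horizontal optical flow (pixels) into depth (meters).

    Uses Z = f * b / |u|. Pixels with (near) zero flow are treated as
    infinitely far away.
    """
    depth = []
    for row in flow_u:
        depth.append([
            focal_px * baseline_m / abs(u) if abs(u) > min_flow else float("inf")
            for u in row
        ])
    return depth


def backproject(depth, focal_px, cx, cy):
    """Lift a depth map to 3D points (X, Y, Z) with the pinhole model.

    (cx, cy) is the assumed principal point in pixel coordinates.
    Pixels at infinite depth are skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z == float("inf"):
                continue
            x = (u - cx) * z / focal_px
            y = (v - cy) * z / focal_px
            points.append((x, y, z))
    return points


# Hypothetical 2x2 flow field: larger flow means a closer surface.
flow = [[10.0, 5.0],
        [0.0, 20.0]]
depth = depth_from_flow(flow, focal_px=100.0, baseline_m=0.5)
points = backproject(depth, focal_px=100.0, cx=0.5, cy=0.5)
```

In this toy setting the pixel with the largest flow ends up nearest to the camera, which is the intermediate cue that a depth estimate and then a 3D point set can be built on. A learned model would have to cope with unknown and non-lateral camera motion, which this sketch deliberately ignores.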

Topic Area: Visual Processing & Computational Vision

Extended Abstract: Full Text PDF