Poster Session A: Tuesday, August 12, 1:30 – 4:30 pm, de Brug & E‑Hall
On the Origin of 3D Perception in Visual World Models
Wanhee Lee¹, Klemen Kotar¹, Jared Watrous¹, Rahul Mysore Venkatesh¹, Honglin Chen¹, Khai Loong Aw¹, Khaled Jedoui¹, Daniel LK Yamins¹; ¹Stanford University
Presenter: Wanhee Lee
3D perception is fundamental to both biological and artificial vision, enabling navigation, interaction, and scene understanding. However, learning 3D structure in a self-supervised manner remains challenging: fully structured geometric methods impose rigid constraints that limit adaptability to natural videos, while unstructured, data-driven approaches lack geometric consistency and controllability. We propose a hybrid approach that starts with minimal priors and progressively builds structured representations from intermediate cues. Specifically, we extract optical flow from an autoregressive video model, use it to infer depth and subsequently 3D shape, and feed these representations back into the model. Our framework enables 3D understanding from a single image, achieving human-level depth estimation, supporting shape inference beyond visible surfaces, and completing 3D scene representations without explicit supervision. Moreover, the model’s learning trajectory aligns with developmental patterns of depth perception in humans, providing insights into both cognitive and artificial vision. These findings demonstrate that 3D perception can emerge through minimally structured learning in a developmentally plausible way.
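To make the flow-to-depth-to-shape chain concrete, the following is a minimal sketch and not the authors' method. It assumes a textbook special case that the abstract does not specify: a pinhole camera translating purely sideways, so that horizontal flow magnitude is inversely proportional to depth (Z = f·b / |u|). The function names, the focal length `focal_px`, the camera shift `baseline`, and the principal point `(cx, cy)` are all hypothetical and chosen only for illustration.

```python
import math
from typing import List, Tuple


def depth_from_lateral_flow(
    flow_x: List[List[float]], focal_px: float, baseline: float
) -> List[List[float]]:
    """Convert horizontal optical flow (pixels) to depth.

    Assumption (not from the abstract): the camera moved sideways by
    `baseline` with no rotation, so Z = focal_px * baseline / |u|.
    Pixels with zero flow are treated as infinitely far away.
    """
    depth = []
    for row in flow_x:
        depth.append(
            [focal_px * baseline / abs(u) if abs(u) > 1e-9 else math.inf for u in row]
        )
    return depth


def backproject(
    depth: List[List[float]], focal_px: float, cx: float, cy: float
) -> List[Tuple[float, float, float]]:
    """Lift a depth map to 3D points with a pinhole camera model.

    (cx, cy) is the assumed principal point in pixel coordinates.
    Pixels at infinite depth are skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if math.isinf(z):
                continue
            points.append(((u - cx) * z / focal_px, (v - cy) * z / focal_px, z))
    return points


# Toy 1x3 flow field: larger flow means a closer surface; zero flow is dropped.
flow = [[10.0, 25.0, 0.0]]
depth_map = depth_from_lateral_flow(flow, focal_px=100.0, baseline=0.5)
cloud = backproject(depth_map, focal_px=100.0, cx=0.5, cy=0.0)
```

In this toy setting the pixel with the larger flow ends up nearer to the camera, which is the motion-parallax cue the sketch relies on. The framework described above extracts flow from a learned video model and feeds the resulting representations back into it, which this sketch does not attempt to reproduce.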
Topic Area: Visual Processing & Computational Vision