
Poster Session C: Friday, August 15, 2:00 – 5:00 pm, de Brug & E‑Hall

Seen2Scene: a generative model of fixation-by-fixation scene understanding

Ritik Raina1, Abe Leite1, Alexandros Graikos2, Seoyoung Ahn3, Greg Zelinsky1; 1State University of New York at Stony Brook, 2Stony Brook University, 3University of California, Berkeley

Presenter: Ritik Raina

Human scene understanding dynamically evolves over the course of sequential viewing fixations, from a gist-level understanding to a more detailed comprehension of the scene. Each fixation provides rich visual information about objects and their spatial relationships. We model this incremental process by introducing Seen2Scene, a framework for modeling human scene understanding by controlling the visual inputs available for scene generation. Seen2Scene uses a self-supervised encoder to extract features from fixated scene regions, which guide a pre-trained text-to-image latent diffusion model through a modular adapter framework. As fixations accumulate, the model iteratively refines its visual hypotheses, filling in unseen areas with contextually plausible content. We evaluated Seen2Scene on COCO-FreeView using two experimental conditions: fixation-only conditioning, to isolate the contribution of foveal information, and fixation+gist conditioning, to examine how peripheral scene information integrates with foveal details. Results show that the initial fixations drive the greatest gains in semantic and perceptual fidelity, and that the fixation+gist condition reached high-fidelity scene understanding with the fewest fixations, demonstrating the importance of integrating peripheral gist information with visual details collected foveally.
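
To make the fixation-conditioned pipeline concrete, the sketch below (PyTorch) illustrates one way the described components could fit together: a frozen self-supervised encoder embeds each fixated crop, and an adapter projects the accumulated fixation features into a conditioning sequence that a latent diffusion denoiser would consume via cross-attention. The module names, dimensions, and the stand-in encoder are illustrative assumptions, not the authors' implementation.

# Minimal sketch of incremental fixation conditioning, assuming a frozen
# self-supervised encoder and a learned adapter; the diffusion denoiser
# itself is omitted and only the growing conditioning sequence is shown.
import torch
import torch.nn as nn


class FixationAdapter(nn.Module):
    """Projects per-fixation encoder features into diffusion conditioning tokens."""

    def __init__(self, feat_dim: int = 768, cond_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(feat_dim),
            nn.Linear(feat_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, fixation_feats: torch.Tensor) -> torch.Tensor:
        # fixation_feats: (batch, n_fixations_so_far, feat_dim)
        return self.proj(fixation_feats)  # (batch, n_fixations_so_far, cond_dim)


def incremental_conditioning(encoder, adapter, crops):
    """Accumulates fixation features one crop at a time, yielding the growing
    conditioning sequence available to the denoiser after each fixation."""
    feats = []
    for crop in crops:  # crop: (batch, 3, H, W), in fixation order
        with torch.no_grad():  # encoder is frozen
            feats.append(encoder(crop))
        yield adapter(torch.stack(feats, dim=1))


if __name__ == "__main__":
    # Stand-in encoder: pooled conv features; a real system would use a
    # pretrained self-supervised vision transformer (placeholder assumption).
    encoder = nn.Sequential(
        nn.Conv2d(3, 768, kernel_size=16, stride=16),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
    )
    adapter = FixationAdapter()
    crops = [torch.randn(1, 3, 64, 64) for _ in range(3)]  # three simulated fixations
    for i, cond in enumerate(incremental_conditioning(encoder, adapter, crops), 1):
        print(f"after fixation {i}: conditioning shape {tuple(cond.shape)}")

In this sketch the conditioning sequence simply grows by one token per fixation; how the gist (peripheral) signal is merged with these tokens is left open, since the abstract does not specify that mechanism.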

Topic Area: Object Recognition & Visual Attention

Extended Abstract: Full Text PDF