
Poster Session B: Wednesday, August 13, 1:00 – 4:00 pm, de Brug & E‑Hall

Can Vision Language Models Follow Human Gaze?

Zory Zhang1, Pinyuan Feng2, Bingyang Wang3, Tianwei Zhao4, Qingying Gao4, Suyang Yu5, Ziqiao Ma6, Hokin Deng4, Yijiang Li7, Dezhi Luo8; 1Brown University, 2Columbia University, 3Emory University, 4Johns Hopkins University, 5University of Washington, 6University of Michigan, 7University of California, San Diego, 8University of Michigan - Ann Arbor

Presenter: Zory Zhang

Gaze understanding has been suggested as a precursor to inferring intentions and engaging in joint attention, core capacities for theory of mind, social learning, and language acquisition. As Vision Language Models (VLMs) become increasingly promising for interactive applications, assessing whether they master this foundational socio-cognitive skill becomes vital. Rather than creating a benchmark, we aim to probe the cognitive features of their underlying gaze understanding. We curated a set of images with systematically controlled difficulty and variability, evaluated 111 VLMs on their ability to infer gaze referents, and analyzed their performance using mixed-effects models. Only 20 VLMs performed above chance, and even these achieved low overall accuracy. We further analyzed 4 of these top-tier VLMs and found that their performance declined with increasing task difficulty but varied only slightly with the specific prompt and gazer. While their gaze understanding remains far from mature, these patterns suggest that their inferences are more than mere stochastic parroting. This early progress highlights the need for mechanistic investigations of their underlying emergent inference.
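As a rough illustration of the kind of mixed-effects analysis described above (a sketch, not the authors' actual code), per-condition accuracy could be modeled with fixed effects for difficulty, prompt, and gazer, plus a random intercept for each VLM. The file name and column names (accuracy, difficulty, prompt, gazer, vlm) are assumptions for illustration only.

```python
# Minimal sketch of a mixed-effects analysis of VLM gaze-following accuracy.
# Data schema and file name are hypothetical, not taken from the paper.
import pandas as pd
import statsmodels.formula.api as smf

# One row per VLM x condition, with mean accuracy over trials in that condition.
df = pd.read_csv("gaze_eval_by_condition.csv")

model = smf.mixedlm(
    "accuracy ~ difficulty + C(prompt) + C(gazer)",  # fixed effects
    data=df,
    groups=df["vlm"],                                # random intercept per VLM
)
print(model.fit().summary())
```

A fit like this separates variance attributable to task difficulty from variance due to prompt wording or gazer identity, which is the kind of contrast the abstract reports (strong difficulty effects, only slight prompt and gazer effects).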

Topic Area: Visual Processing & Computational Vision

Extended Abstract: Full Text PDF