
Poster Session B: Wednesday, August 13, 1:00 – 4:00 pm, de Brug & E‑Hall

Can Vision Language Models Follow Human Gaze?

Zory Zhang1, Pinyuan Feng2, Bingyang Wang3, Tianwei Zhao4, Qingying Gao4, Suyang Yu5, Ziqiao Ma6, Hokin Deng4, Yijiang Li7, Dezhi Luo8; 1Brown University, 2Columbia University, 3Emory University, 4Johns Hopkins University, 5University of Washington, 6University of Michigan, 7University of California, San Diego, 8University of Michigan - Ann Arbor

Presenter: Zory Zhang

Gaze understanding has been suggested as a precursor to inferring intentions and engaging in joint attention, core capacities for theory of mind, social learning, and language acquisition. As Vision Language Models (VLMs) become increasingly promising for interactive applications, assessing whether they master this foundational socio-cognitive skill becomes vital. Rather than creating a benchmark, we aim to probe the cognitive features of their underlying gaze understanding. We curated a set of images with systematically controlled difficulty and variability, evaluated 111 VLMs on their ability to infer gaze referents, and analyzed their performance using mixed-effects models. Only 20 VLMs performed above chance, and even these achieved low overall accuracy. We further analyzed 4 of these top-tier VLMs and found that their performance declined with increasing task difficulty but varied only slightly with the specific prompt and gazer. While their gaze understanding remains far from mature, these patterns suggest that their inferences are more than mere stochastic parroting. This early progress highlights the need for mechanistic investigations of their underlying emergent inference.
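As a rough illustration of the kind of mixed-effects analysis described above (a sketch, not the authors' actual code), per-condition accuracy could be modeled with fixed effects for difficulty, prompt, and gazer, plus a random intercept for each VLM. The file name and column names (accuracy, difficulty, prompt, gazer, vlm) are assumptions for illustration only.

```python
# Minimal sketch of a mixed-effects analysis of VLM gaze-following accuracy.
# Data schema and file name are hypothetical, not taken from the paper.
import pandas as pd
import statsmodels.formula.api as smf

# One row per VLM x condition, with mean accuracy over trials in that condition.
df = pd.read_csv("gaze_eval_by_condition.csv")

model = smf.mixedlm(
    "accuracy ~ difficulty + C(prompt) + C(gazer)",  # fixed effects
    data=df,
    groups=df["vlm"],                                # random intercept per VLM
)
print(model.fit().summary())
```

A fit like this separates variance attributable to task difficulty from variance due to prompt wording or gazer identity, which is the kind of contrast the abstract reports (strong difficulty effects, only slight prompt and gazer effects).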

Topic Area: Visual Processing & Computational Vision

Extended Abstract: Full Text PDF