Poster Session A: Tuesday, August 12, 1:30 – 4:30 pm, de Brug & E‑Hall
Modeling the Human Visual System: Comparative Insights from Response-Optimized and Task-Optimized Vision Models, Language Models, and Different Readout Mechanisms
Shreya Saha1, Ishaan Chadha1, Meenakshi Khosla1; 1University of California, San Diego
Presenter: Shreya Saha
Over the past decade, predictive modeling of neural responses in the primate visual system has advanced significantly, driven by diverse deep neural network approaches. These include models optimized for visual recognition, cross-modal alignment through contrastive objectives, neural response prediction from scratch, and embeddings from large language models (LLMs). Additionally, various readout mechanisms, from fully linear to spatial-feature factorized methods, have been developed to map network activations to neural responses. Despite this progress, it remains unclear which approach performs best across different regions of the visual hierarchy. In this study, we systematically compare these methods for modeling the human visual system and propose novel strategies to enhance response predictions. We demonstrate that the choice of readout mechanism significantly impacts prediction accuracy, and we introduce a biologically grounded readout that dynamically adjusts receptive fields based on image content and learns geometric invariances of voxel responses directly from data. This novel readout outperforms factorized methods by 3–23% and standard ridge regression by 7–53%, setting a new benchmark for neural response prediction. Our findings reveal distinct modeling advantages across the visual hierarchy: response-optimized models with visual inputs excel in early to mid-level visual areas, while embeddings from LLMs (leveraging detailed contextual descriptions of images) and task-optimized models pretrained on large vision datasets provide the best fit for higher visual regions. Through comparative analysis, we identify three functionally distinct regions in the visual cortex: one sensitive to perceptual features not captured by linguistic descriptions, another attuned to fine-grained visual details encoding semantic information, and a third responsive to abstract, global meanings aligned with linguistic content. Together, these findings offer key insights into building more precise models of the visual system.
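To make the contrast between readout mechanisms concrete, below is a minimal PyTorch sketch of the two ideas the abstract distinguishes: a standard spatial-feature factorized readout, where each voxel gets a fixed learned spatial mask, versus an image-conditioned readout, where the spatial pooling weights are computed from the stimulus itself so the effective receptive field can shift with image content. All class names, the attention parameterization, and hyperparameters here are illustrative assumptions, not the authors' implementation (which additionally learns geometric invariances of voxel responses, omitted in this sketch).

    # Sketch only: contrasts a fixed factorized readout with a
    # content-dependent one. Names and parameterization are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatialFeatureFactorizedReadout(nn.Module):
        """Factorized ('what/where') readout: one fixed spatial mask and
        one feature-weight vector per voxel, independent of the image."""
        def __init__(self, channels, height, width, n_voxels):
            super().__init__()
            self.spatial = nn.Parameter(torch.randn(n_voxels, height * width) * 0.01)
            self.feature = nn.Parameter(torch.randn(n_voxels, channels) * 0.01)

        def forward(self, feats):                        # feats: (B, C, H, W)
            flat = feats.flatten(2)                      # (B, C, H*W)
            mask = F.softmax(self.spatial, dim=-1)       # (V, H*W), image-independent
            pooled = torch.einsum('bck,vk->bvc', flat, mask)  # (B, V, C)
            return (pooled * self.feature).sum(-1)       # (B, V) voxel predictions

    class ContentDependentReadout(nn.Module):
        """Image-conditioned readout: each voxel's pooling weights come
        from attention between a learned voxel query and keys computed
        from the feature map, so receptive fields adapt per stimulus."""
        def __init__(self, channels, n_voxels, key_dim=64):
            super().__init__()
            self.to_key = nn.Conv2d(channels, key_dim, kernel_size=1)
            self.query = nn.Parameter(torch.randn(n_voxels, key_dim) * 0.01)
            self.feature = nn.Parameter(torch.randn(n_voxels, channels) * 0.01)

        def forward(self, feats):                        # feats: (B, C, H, W)
            keys = self.to_key(feats).flatten(2)         # (B, K, H*W)
            logits = torch.einsum('vk,bkn->bvn', self.query, keys)
            attn = F.softmax(logits / keys.shape[1] ** 0.5, dim=-1)  # (B, V, H*W)
            pooled = torch.einsum('bvn,bcn->bvc', attn, feats.flatten(2))
            return (pooled * self.feature).sum(-1)       # (B, V) voxel predictions

    # Usage with dummy backbone activations (hypothetical shapes):
    feats = torch.randn(2, 256, 14, 14)
    readout = ContentDependentReadout(channels=256, n_voxels=1000)
    pred = readout(feats)                                # (2, 1000)

In the factorized version the softmax mask is the same for every image; in the content-dependent version the mask is recomputed per stimulus, which is one plausible reading of "dynamically adjusts receptive fields based on image content".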
Topic Area: Visual Processing & Computational Vision
Proceedings: Full Text on OpenReview