Visual Processing in Brains and Models I
Contributed Talk Session: Thursday, August 14, 10:00 – 11:00 am, Room C1.03
How to sample the world for understanding the visual system
Talk 1, 10:00 am – Johannes Roth1, Martin N Hebart2; 1Max Planck Institute for Human Cognitive and Brain Sciences, 2Justus Liebig Universität Gießen
Presenter: Johannes Roth
Understanding vision requires capturing the vast diversity of the visual world we experience. How can we sample this diversity in a manner that supports robust, generalizable inferences? Widely used, massive neuroimaging datasets have contributed substantially to our understanding of brain function, but their ability to comprehensively capture the diversity of visual and semantic experiences remains largely untested. More broadly, the factors required for diverse and generalizable datasets have remained unknown. To address these gaps, we introduce LAION-natural, a curated subset of 120 million natural photographs filtered from LAION-2B, and use it as a proxy for the breadth of our visual experience when assessing visual-semantic coverage. Our analysis of CLIP embeddings of these images reveals significant representational gaps in existing datasets, demonstrating that they cover only a restricted subset of the space spanned by LAION-natural. Simulations and analyses of functional MRI data further show that these gaps lead to impaired out-of-distribution generalization. Importantly, our results reveal that even moderately sized stimulus sets can achieve strong generalization if they are sampled from a diverse stimulus pool, and that this diversity matters more than the specific sampling strategy employed. These findings not only highlight limitations of existing datasets for generalizability and model comparison, but also provide clear strategies for future studies to support the development of stronger computational models of the visual system and generalizable inferences.
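To make the coverage idea concrete, the sketch below estimates how well a stimulus set covers a reference image pool in CLIP embedding space. The folder names, the choice of CLIP backbone, and the nearest-neighbour coverage score are illustrative assumptions, not the authors' pipeline.

```python
"""Minimal sketch: score how well a stimulus set covers a reference image pool
in CLIP embedding space. Paths, backbone, and the coverage metric are
illustrative assumptions, not the authors' actual analysis."""
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_images(image_dir: str, batch_size: int = 32) -> torch.Tensor:
    """Return L2-normalised CLIP image embeddings for all JPEGs in a folder."""
    paths = sorted(Path(image_dir).glob("*.jpg"))
    feats = []
    for i in range(0, len(paths), batch_size):
        batch = [Image.open(p).convert("RGB") for p in paths[i:i + batch_size]]
        inputs = processor(images=batch, return_tensors="pt").to(device)
        f = model.get_image_features(**inputs)
        feats.append(torch.nn.functional.normalize(f, dim=-1).cpu())
    return torch.cat(feats)

def coverage(stimulus_emb: torch.Tensor, pool_emb: torch.Tensor, radius: float = 0.3) -> float:
    """Fraction of pool images whose nearest stimulus lies within `radius`
    (cosine distance). Low coverage indicates representational gaps."""
    sims = pool_emb @ stimulus_emb.T            # cosine similarity (unit vectors)
    nearest = 1.0 - sims.max(dim=1).values      # distance to the closest stimulus
    return (nearest <= radius).float().mean().item()

# Hypothetical usage:
# stim = embed_images("stimuli/")        # e.g. an fMRI stimulus set
# pool = embed_images("laion_sample/")   # random sample standing in for LAION-natural
# print(f"coverage = {coverage(stim, pool):.2f}")
```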
Robustness to 3D Object Transformations in Humans and Image-Based Deep Neural Networks
Talk 2, 10:10 am – Haider Al-Tahan1, Farzad Shayanfar2, Ehsan Tousi3, Marieke Mur3; 1Georgia Institute of Technology, 2Shahid Beheshti University of Medical Sciences, 3University of Western Ontario
Presenter: Haider Al-Tahan
Recent work at the intersection of psychology, neuroscience, and computer vision has advocated for the use of more realistic visual tasks in modeling human vision. Deep neural networks have become leading models of the primate visual system. However, their behavior under identity-preserving 3D object transformations, such as translation, scaling, and rotation, has not been thoroughly compared to humans. Here, we evaluate both humans and image-based deep neural networks, including vision-only and vision-language models trained with supervised, self-supervised, or weakly supervised objectives, on their ability to recognize objects undergoing such transformations. Humans (n=220) and models (n=169) were asked to categorize images of 3D objects, generated with a custom pipeline, into 16 object categories recognizable by both. Humans were time-limited to reduce reliance on recurrent processing. We find that both humans and models are robust to translation and scaling, but models struggle more with object rotation and are more sensitive to contextual changes. Humans and models agree on which in-depth object rotations are most challenging -- when humans struggle, models do too -- but humans are more robust and show more consistent category confusions with one another than with any model. By testing model families trained on different amounts of data and with different learning objectives, we show that data richness plays a substantial role in supporting robustness -- potentially more so than vision-language alignment. Our benchmark excludes models trained on video, multiview, or 3D data, but is in principle compatible with such models and may support their evaluation in future work. This study underscores the importance of using naturalistic visual tasks to model human object perception in complex, real-world scenarios, and introduces a benchmark - ORBIT (Object Recognition Benchmark for Invariance to Transformations) - for evaluating and developing computational models of human object recognition. Code and data for ORBIT are available at: https://github.com/haideraltahan/ORBIT.
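A benchmark of this kind ultimately reports robustness per transformation, for example accuracy on transformed renders relative to accuracy on canonical views. The sketch below illustrates that kind of score; the trial format and field names are hypothetical, not ORBIT's actual API.

```python
"""Illustrative robustness score: per-transformation accuracy normalised by
accuracy on canonical (untransformed) views. Data structures are hypothetical."""
from collections import defaultdict

def robustness_by_transform(trials):
    """trials: iterable of dicts with keys 'transform' (e.g. 'rotation_depth'),
    'true_category', and 'predicted_category' (assumed format)."""
    correct, total = defaultdict(int), defaultdict(int)
    for t in trials:
        correct[t["transform"]] += int(t["predicted_category"] == t["true_category"])
        total[t["transform"]] += 1
    acc = {k: correct[k] / total[k] for k in total}
    baseline = acc.get("canonical", 1.0)        # accuracy on untransformed views
    return {k: v / baseline for k, v in acc.items() if k != "canonical"}

# Toy example: a model that is robust to scaling but not to in-depth rotation.
trials = (
    [{"transform": "canonical", "true_category": "chair", "predicted_category": "chair"}] * 90
    + [{"transform": "canonical", "true_category": "chair", "predicted_category": "table"}] * 10
    + [{"transform": "scaling", "true_category": "chair", "predicted_category": "chair"}] * 85
    + [{"transform": "scaling", "true_category": "chair", "predicted_category": "table"}] * 15
    + [{"transform": "rotation_depth", "true_category": "chair", "predicted_category": "chair"}] * 55
    + [{"transform": "rotation_depth", "true_category": "chair", "predicted_category": "lamp"}] * 45
)
print(robustness_by_transform(trials))  # ~0.94 for scaling, ~0.61 for in-depth rotation
```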
Incorporating foveal sampling and integration to model 3D shape inferences
Talk 3, 10:20 am – Stephanie Fu1, tyler bonnen2, Trevor Darrell3; 1University of California, Berkeley, 2Electrical Engineering & Computer Science Department, University of California, Berkeley, 3Electrical Engineering & Computer Science Department, University of California, Berkeley
Presenter: Stephanie Fu
Human vision is inherently sequential. This is largely because of foveal constraints on the retina, which demand that we shift our gaze to collect high-resolution information from throughout the environment. Within the domain of visual object perception, the neural substrates that support these sequential visual inferences have been well characterized: ventral temporal cortex (VTC) rapidly extracts visual features at each spatial location, while medial temporal cortex (MTC) integrates over the sequential outputs of VTC. This neurocomputational motif is absent in contemporary deep learning models of human vision. Not surprisingly, contemporary models approximate the rapid visual inferences that depend on VTC, but not those behaviors that depend on MTC (e.g., novel 3D shape inference). Here we develop a modeling framework that embodies the sequential sampling/integration strategy emblematic of human vision. Given an image, this model first determines relevant locations to attend to, sequentially processes these locations as ‘foveated’ inputs using a VTC-like model, then integrates over these sequential visual features within an MTC-like model. Here we report preliminary results on the design choices that lead to stable model optimization and subsequent model behaviors.
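The sample-and-integrate motif can be sketched as follows: a "VTC-like" encoder processes foveated crops at a handful of fixation points, and an "MTC-like" recurrent module integrates over the resulting feature sequence. The encoder, integrator, and fixation policy used here (ResNet-18, a GRU, and random fixations) are stand-ins, not the authors' architecture.

```python
"""Minimal sketch of a sequential sample-and-integrate model, assuming a
ResNet-18 stand-in for the VTC-like encoder and a GRU for the MTC-like
integrator. Fixations are random for brevity."""
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FoveatedSequenceModel(nn.Module):
    def __init__(self, n_fixations: int = 5, crop_size: int = 64, hidden: int = 256):
        super().__init__()
        self.n_fixations, self.crop_size = n_fixations, crop_size
        backbone = resnet18(weights=None)
        self.vtc = nn.Sequential(*list(backbone.children())[:-1])  # VTC-like feature extractor
        self.mtc = nn.GRU(input_size=512, hidden_size=hidden, batch_first=True)  # MTC-like integrator
        self.readout = nn.Linear(hidden, 1)  # e.g. a same/different 3D shape judgment

    def fixations(self, images: torch.Tensor) -> torch.Tensor:
        """Pick fixation corners; a saliency map or learned policy would go here.
        For brevity we sample random valid crop locations."""
        b, _, h, w = images.shape
        ys = torch.randint(0, h - self.crop_size, (b, self.n_fixations))
        xs = torch.randint(0, w - self.crop_size, (b, self.n_fixations))
        return torch.stack([ys, xs], dim=-1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        locs = self.fixations(images)
        feats = []
        for t in range(self.n_fixations):
            crops = torch.stack([
                img[:, y:y + self.crop_size, x:x + self.crop_size]
                for img, (y, x) in zip(images, locs[:, t])
            ])                                   # 'foveated' glimpse at fixation t
            feats.append(self.vtc(crops).flatten(1))
        seq = torch.stack(feats, dim=1)          # (batch, n_fixations, 512)
        _, h_last = self.mtc(seq)                # integrate over the glimpse sequence
        return self.readout(h_last[-1])

# model = FoveatedSequenceModel()
# logits = model(torch.rand(2, 3, 224, 224))    # two example images
```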
Connectome-Constrained Unsupervised Learning Reveals Emergent Visual Representations in the Drosophila Optic Lobe
Talk 4, 10:30 am – Keisuke Toyoda1, Naoya Nishiura2, Rintaro Kai1, Masataka Watanabe; 1The University of Tokyo, Tokyo Institute of Technology, 2The University of Tokyo
Presenter: Keisuke Toyoda
Understanding how brain structure enables visual processing is a central question in neuroscience. While Drosophila offers a complete connectome, computational models often rely on biologically implausible supervised signals. We address this by building a large-scale autoencoder constrained by the complete Drosophila right optic lobe connectome (~45,000 neurons, FlyWire dataset). Using photoreceptors (R1-R6) as both input and output, the model incorporates anatomical feedforward and feedback loops and was trained unsupervised on naturalistic video stimuli to minimize reconstruction error. Temporal offsets were included to probe predictive capacity. The autoencoder reconstructed photoreceptor inputs with high fidelity. Neurons in deeper layers (medulla, lobula) showed moderate, stable activity under sustained input, consistent with efficient engagement and functional recurrent loops. Temporal offsets improved short-term prediction, indicating that the model learned input dynamics. We demonstrate that a connectome-based autoencoder can learn meaningful visual representations via biologically plausible unsupervised learning. This highlights how anatomical structure shapes emergent function and provides a digital-twin framework for studying visual processing beyond task-specific supervised approaches, suggesting that complex representations can arise from self-organization on detailed neural circuits.
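The core idea of constraining a network with a connectome can be sketched as a recurrent rate model whose weight matrix is masked by the binary connectivity graph, with photoreceptor units serving as both input and reconstruction target. Sizes, dynamics, and the loss below are illustrative toy choices; the real model uses the ~45k-neuron FlyWire optic lobe graph.

```python
"""Minimal sketch of a connectome-constrained autoencoder, assuming a toy
binary connectome and simple tanh rate dynamics (not the authors' exact model)."""
import torch
import torch.nn as nn

class ConnectomeAutoencoder(nn.Module):
    def __init__(self, connectome_mask: torch.Tensor, photoreceptor_idx: torch.Tensor):
        super().__init__()
        n = connectome_mask.shape[0]
        self.register_buffer("mask", connectome_mask.float())  # 1 where a synapse exists
        self.register_buffer("pr_idx", photoreceptor_idx)      # indices of R1-R6-like units
        self.weight = nn.Parameter(0.01 * torch.randn(n, n))   # free synaptic strengths
        self.bias = nn.Parameter(torch.zeros(n))

    def forward(self, stimulus: torch.Tensor) -> torch.Tensor:
        """stimulus: (batch, time, n_photoreceptors). Returns reconstructed
        photoreceptor activity of the same shape."""
        b, t, _ = stimulus.shape
        rate = torch.zeros(b, self.mask.shape[0], device=stimulus.device)
        recon = []
        w = self.weight * self.mask                  # zero out non-existent synapses
        for k in range(t):
            drive = rate @ w.T + self.bias
            drive = drive.index_add(1, self.pr_idx, stimulus[:, k])  # photoreceptor input
            rate = torch.tanh(drive)                 # one step of recurrent dynamics
            recon.append(rate[:, self.pr_idx])       # read reconstruction off the same units
        return torch.stack(recon, dim=1)

# Toy usage: 200 neurons, 6 photoreceptors, random sparse "connectome".
mask = torch.rand(200, 200) < 0.05
pr_idx = torch.arange(6)
model = ConnectomeAutoencoder(mask, pr_idx)
video = torch.rand(4, 20, 6)                         # stand-in for naturalistic input traces
loss = nn.functional.mse_loss(model(video), video)   # reconstruction objective
loss.backward()
```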
Developmental plasticity rules facilitate representation learning in a model of visual ventral stream
Talk 5, 10:40 am – Ariane Delrocq1, Zihan Wu1, Guillaume Bellec2, Wulfram Gerstner1; 1EPFL, 2Technische Universität Wien
Presenter: Ariane Delrocq
Different cortical areas are known to have different critical periods for their most fundamental learning. However, the developmental plasticity rules that give rise to high-level object representations remain unknown. Here, we study a model of the visual ventral stream trained with a generalized Hebbian plasticity rule. The learning rule uses only quantities that are locally available at the synapse, is consistent with recent plasticity experiments in pyramidal neurons, and, unlike the backpropagation algorithm, does not require a detailed feedback architecture. Our model shows that limiting plasticity in time to critical periods of development improves the quality of the learned representations. The model achieves state-of-the-art performance among bio-plausible plasticity models on STL-10, a large image dataset designed for unsupervised learning.
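The combination of local Hebbian updates with critical periods can be sketched as follows: each layer's weights are changed by an Oja-style rule (a stand-in for the authors' generalized Hebbian rule, since only the general idea is given here) and only while that layer's critical period is open. Layer sizes, the schedule, and the inputs are illustrative.

```python
"""Minimal sketch: layer-wise local Hebbian learning gated by critical periods.
The Oja-style rule and the toy hierarchy are assumptions, not the authors' rule."""
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [256, 128, 64, 32]                 # toy "ventral stream" hierarchy
weights = [rng.standard_normal((m, n)) * 0.1
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
# Critical periods: layer l is plastic only during epochs [start, end).
critical_periods = [(0, 20), (10, 40), (30, 70), (60, 100)]

def oja_update(w, pre, post, lr=1e-3):
    """Local Oja-style Hebbian step: uses only pre- and postsynaptic activity."""
    return w + lr * (np.outer(post, pre) - (post ** 2)[:, None] * w)

def forward(x):
    acts = [x]
    for w in weights:
        acts.append(np.tanh(w @ acts[-1]))       # simple rate nonlinearity
    return acts

for epoch in range(100):
    for _ in range(50):                          # toy "images": random input vectors
        acts = forward(rng.standard_normal(layer_sizes[0]))
        for l, w in enumerate(weights):
            start, end = critical_periods[l]
            if start <= epoch < end:             # plasticity gated by the critical period
                weights[l] = oja_update(w, acts[l], acts[l + 1])
```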