Poster Session B: Wednesday, August 13, 1:00 – 4:00 pm, de Brug & E‑Hall
Compositional Meaning in Vision-Language Models and the Brain
Maithe van Noort¹, Luke Korthals¹, Giacomo Aldegheri², Marianne De Heer Kloots¹, Micha Heilbron¹; ¹University of Amsterdam, ²Justus Liebig Universität Gießen
Presenter: Maithe van Noort
What is the role of compositional structure in the alignment of visual and linguistic brain areas with computational semantic embeddings? Vision-language models (VLMs) have shown meaningful alignment to the brain in their representations of semantic structure, for both images and text. However, the extent to which these representations capture compositional structure, i.e., changes in meaning driven by changes in the combinatorial arrangement of parts, remains uncertain. Here we leverage Winoground, a dataset designed to test compositionality in multimodal representations, to compare the compositional structure captured by different model embeddings as well as by fMRI responses collected as part of a larger study on multimodal meaning (2,760 image trials and 2,760 semantically equivalent language trials). In contrast to VLM embeddings, neural representations in the brain show a striking absence of compositional processing (chance-level performance) when evaluated on the Winoground benchmark, despite robust semantic encoding of individual concepts as measured by voxel-activity predictions. This is intriguing, as the distinctions between stimuli in Winoground are trivial for any English-speaking human, highlighting the challenge of identifying the neural substrates of compositional processing. Our targeted dataset and evaluation pipeline lay the foundation for systematic, cross-modal evaluations of compositionality in both artificial and biological neural representations.
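For readers unfamiliar with the benchmark: each Winoground item pairs two captions (C0, C1) with two images (I0, I1) that share the same words in a different combinatorial arrangement, and a representation is scored with the standard text, image, and group metrics. A minimal sketch in Python, assuming pairwise similarity scores s[(caption, image)] have already been computed from some embedding (model or neural):

```python
def winoground_scores(s):
    """Standard Winoground metrics for one item.

    s[(i, j)] is the similarity between caption i and image j, i, j in {0, 1}.
    text score:  given each image, the matching caption scores higher.
    image score: given each caption, the matching image scores higher.
    group score: both hold simultaneously.
    """
    text = s[(0, 0)] > s[(1, 0)] and s[(1, 1)] > s[(0, 1)]
    image = s[(0, 0)] > s[(0, 1)] and s[(1, 1)] > s[(1, 0)]
    group = text and image
    return text, image, group


# Hypothetical similarity scores for illustration only (not from the study):
item = {(0, 0): 0.9, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.8}
print(winoground_scores(item))  # (True, True, True)
```

Chance-level performance, as reported above for the neural representations, corresponds to 25% on text and image scores and 16.7% on the group score when averaged over items.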
Topic Area: Language & Communication
Extended Abstract: Full Text PDF