
Poster Session B: Wednesday, August 13, 1:00 – 4:00 pm, de Brug & E‑Hall

Compositional Meaning in Vision-Language Models and the Brain

Maithe van Noort1, Luke Korthals1, Giacomo Aldegheri2, Marianne De Heer Kloots1, Micha Heilbron1; 1University of Amsterdam, 2Justus Liebig Universität Gießen

Presenter: Maithe van Noort

What is the role of compositional structure in the alignment of visual and linguistic brain areas to computational semantic embeddings? Vision-language models (VLMs) have shown meaningful alignment to the brain in their representations of semantic structure, for both images and text. However, the extent to which these representations capture compositional structure -- i.e., changes in meaning driven by changes to the combinatorial structure of parts -- remains uncertain. Here we leverage Winoground, a dataset designed to test compositionality in multimodal representations, to compare the compositional structure captured by different model embeddings with that in fMRI responses collected as part of a larger study on multimodal meaning (with 2760 image and 2760 semantically equivalent language trials). In contrast to VLM embeddings, neural representations in the brain show a striking absence of compositional processing (chance-level performance) when evaluated on the Winoground benchmark -- despite robust semantic encoding of individual concepts as measured by voxel activity predictions. This is intriguing because the distinctions between stimuli in Winoground are trivial for any English-speaking human, highlighting the challenge of identifying the substrates of compositional processing in the brain. Our targeted dataset and evaluation pipeline lay the foundation for systematic, cross-modal evaluations of compositionality in both artificial and biological neural representations.
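The Winoground benchmark referenced above scores a representation on pairs of images and captions that contain the same words in a different compositional arrangement. A minimal sketch of the standard text, image, and group scores (following the published Winoground metric; the similarity matrix is assumed to be precomputed from the embeddings under study, e.g. cosine similarities of VLM outputs or of neural response patterns):

```python
import numpy as np

def winoground_scores(sim):
    """Compute Winoground text, image, and group scores.

    sim: array of shape (N, 2, 2), where sim[n, i, j] is the similarity
    between image i and caption j of example n. Similarities are assumed
    to be precomputed from whatever representation is being evaluated.
    """
    sim = np.asarray(sim, dtype=float)
    # Text score: for each image, its own caption must beat the foil caption.
    text_ok = (sim[:, 0, 0] > sim[:, 0, 1]) & (sim[:, 1, 1] > sim[:, 1, 0])
    # Image score: for each caption, its own image must beat the foil image.
    image_ok = (sim[:, 0, 0] > sim[:, 1, 0]) & (sim[:, 1, 1] > sim[:, 0, 1])
    # Group score: both conditions must hold (chance level 16.67%).
    group_ok = text_ok & image_ok
    return text_ok.mean(), image_ok.mean(), group_ok.mean()
```

Chance level is 25% for the text and image scores and 16.67% for the group score, which is the baseline against which the abstract's "chance level performance" of neural representations is assessed.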

Topic Area: Language & Communication

Extended Abstract: Full Text PDF