Poster Session A: Tuesday, August 12, 1:30 – 4:30 pm, de Brug & E‑Hall
Many-to-Many, Yet Convergent: Insights into the alignment of Vision and Language Models
Zoe Wanying He1, Sean Trott1, Meenakshi Khosla1; 1University of California, San Diego
Presenter: Zoe Wanying He
The “platonic representation” hypothesis holds that vision and language models converge on a shared conceptual space despite being trained on distinct modalities. Yet, much of the evidence for this hypothesis comes from one-to-one image–caption scenarios, where each image is paired with a single descriptive caption. This setup overlooks a fundamental reality: the mapping between images and language is many-to-many, as neither modality uniquely determines the other. In this work, we show that alignment between vision and language models also persists at a finer grain in such many-to-many contexts. Using a forced-choice “Pick-a-Pic” task, we find that human raters’ preferences for which of two images better matches a caption are mirrored in the learned embedding space across all vision-language model pairs. This evidence challenges the simplistic view of “one image, one caption” alignment and highlights that models capture finer-grained semantic distinctions akin to human preferences. Moreover, we demonstrate that averaging embeddings across multiple images and multiple captions referring to a shared concept yields significantly stronger alignment than individual image–caption pairs. While one might expect averaging to “blur” representational detail, our results reveal the opposite: aggregating multiple views appears to distill a more universal semantic core. Our findings ultimately reinforce the notion of a shared conceptual space across modalities, underscoring the importance of examining many-to-many correspondences to better understand how such models learn, represent, and unify semantic information.
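To make the evaluation logic concrete, below is a minimal sketch of the two measurements the abstract describes: a forced-choice comparison (does the embedding space prefer the same image a human rater picked for a caption?) and concept-level averaging of multiple image and caption embeddings before computing alignment. This is not the authors' pipeline; the random placeholder embeddings, the cosine-similarity metric, and the single hypothetical human label are illustrative assumptions standing in for real model features and Pick-a-Pic annotations.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings: in practice these would come from a vision model
# (for images) and a language model (for captions), brought into a comparable
# space (e.g., via a learned mapping or a jointly trained model).
dim = 512
caption_emb = rng.standard_normal(dim)
image_emb_a = rng.standard_normal(dim)
image_emb_b = rng.standard_normal(dim)

# Forced-choice prediction: the embedding space "prefers" whichever image
# is more similar to the caption.
model_choice = "A" if cosine(caption_emb, image_emb_a) > cosine(caption_emb, image_emb_b) else "B"
human_choice = "A"  # hypothetical human rater label for this caption-image pair
agrees = model_choice == human_choice

# Concept-level averaging: aggregate several images and several captions that
# refer to the same concept, then compare the averaged embeddings.
concept_image_embs = rng.standard_normal((5, dim))    # e.g., 5 images of one concept
concept_caption_embs = rng.standard_normal((4, dim))  # e.g., 4 captions of that concept
avg_image = concept_image_embs.mean(axis=0)
avg_caption = concept_caption_embs.mean(axis=0)
concept_alignment = cosine(avg_image, avg_caption)

print(f"forced-choice: model={model_choice}, human={human_choice}, agree={agrees}")
print(f"concept-level alignment of averaged embeddings: {concept_alignment:.3f}")
```

Aggregating over many forced-choice trials gives a model-human agreement rate, and comparing single-pair similarities against the averaged concept-level similarities is one plausible way to test whether averaging sharpens rather than blurs the shared semantic signal.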
Topic Area: Language & Communication
Extended Abstract: Full Text PDF