Poster Session A: Tuesday, August 12, 1:30 – 4:30 pm, de Brug & E‑Hall

Modeling human visual neural representations by combining vision deep neural networks and large language models

Boyan Rong¹, Emrah Düzel, Radoslaw Martin Cichy¹; ¹Freie Universität Berlin

Presenter: Boyan Rong

Human visual perception involves transforming low-level visual features into high-level semantic representations. While deep neural networks (DNNs) trained on object recognition tasks have been promising models for predicting hierarchical visual processing, they often fail to capture higher-level semantic representations. In contrast, large language models (LLMs) encode rich semantic information that aligns with later stages of visual processing. Here we investigated whether combining vision DNN features with semantic embeddings from LLMs better accounts for the neural dynamics of visual perception in electroencephalography (EEG) data. We demonstrated that the combination significantly improved the prediction of neural responses compared to either model alone. This approach also outperformed multimodal modeling, and model comparison showed that the observed improvement was due to capturing complex information rather than any single factor.
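
To make the modeling approach concrete, the sketch below shows one common way such a combined encoding model can be set up: vision-DNN features and LLM embeddings are concatenated and mapped to EEG responses with cross-validated ridge regression. This is a minimal illustration with synthetic data and hypothetical feature dimensions, not the authors' actual pipeline.

```python
# Minimal sketch of a combined-feature EEG encoding model.
# All shapes and variable names are hypothetical; the abstract does not
# specify the authors' exact features, regression method, or metric.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_stimuli, n_channels, n_times = 200, 64, 100
dnn_feats = rng.standard_normal((n_stimuli, 512))   # e.g., vision-DNN layer activations
llm_feats = rng.standard_normal((n_stimuli, 768))   # e.g., LLM embeddings of image captions
eeg = rng.standard_normal((n_stimuli, n_channels, n_times))  # stimulus-evoked EEG

# Combine the two feature spaces by simple concatenation.
X = np.hstack([dnn_feats, llm_feats])

# Flatten EEG to one response column per channel-time pair.
X_train, X_test, y_train, y_test = train_test_split(
    X, eeg.reshape(n_stimuli, -1), test_size=0.2, random_state=0
)

# Ridge regression with cross-validated regularization maps the combined
# features to all channel-time responses at once.
model = RidgeCV(alphas=np.logspace(-2, 4, 7))
model.fit(X_train, y_train)
pred = model.predict(X_test)

def pearson_per_column(a, b):
    """Pearson correlation between matching columns of a and b."""
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    return (a * b).mean(0)

# Prediction accuracy averaged over channel-time pairs.
print("mean prediction accuracy:", pearson_per_column(pred, y_test).mean())
```

In practice the combined model would be compared against DNN-only and LLM-only fits on the same data, as the abstract describes, to attribute any gain in prediction accuracy to the added feature space rather than to the regression machinery.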

Topic Area: Visual Processing & Computational Vision

Extended Abstract: Full Text PDF