Poster Session B: Wednesday, August 13, 1:00 – 4:00 pm, de Brug & E‑Hall
Demographic Prompting Fails to Bridge the Individual Variability Gap: GPT-4o Aligns with Average but Not Individual Emotional Ratings of Images
Chace Ashcraft¹, Raphael Norman-Tenazas, Ritwik Bose¹, Michael Wolmetz¹, Mattson Ogg¹; ¹Johns Hopkins University Applied Physics Laboratory
Presenter: Chace Ashcraft
Large language models (LLMs) and vision language models (VLMs) have been shown to align closely with human behavior in aggregate, but they tend to align less well with individuals and poorly approximate the variability across cohorts of human agents. We explored aligning models to specific individuals based on their demographic data in an emotion rating task, eliciting ratings along two standard psychological emotion dimensions using the previously human-normed OASIS dataset. We created AI "proxy" participants for the human participants in the original OASIS study by prompting GPT-4o with a human participant's demographic data, then instructing the AI participant to rate a set of images for emotional valence or arousal, reproducing the human paradigm. We found that group-averaged GPT-4o ratings correlated with group-averaged human responses, but the distributions of responses differed. Proxy ratings for specific individuals aligned poorly with those individuals' human ratings, despite the use of their specific demographic data. In general, GPT-4o appears to align fairly well with human emotional responses on average, but further work is needed to capture human variability before VLMs can emulate the behavior of specific individuals.
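The proxy-participant setup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt template, the 1–7 rating scale (the scale used in the OASIS norms), and the function names are assumptions, and the actual study would send the prompt plus an image to GPT-4o rather than build strings locally. A simple Pearson correlation, computed here from scratch, stands in for the group-averaged comparison.

```python
from statistics import mean

def build_proxy_prompt(demographics: dict, dimension: str) -> str:
    """Cast the model as a demographic 'proxy' participant.
    Hypothetical template; the study's exact wording is not given in the abstract."""
    profile = ", ".join(f"{k}: {v}" for k, v in demographics.items())
    return (
        f"You are a study participant with the following profile: {profile}. "
        f"Rate the emotional {dimension} of the image on a scale from 1 to 7, "
        "as this participant would, responding with a single number."
    )

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two rating vectors (e.g., group-averaged
    model ratings vs. group-averaged human ratings per image)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Example: one proxy prompt and a toy per-image comparison
prompt = build_proxy_prompt({"age": 34, "gender": "female"}, "valence")
r = pearson_r([2.1, 4.0, 5.5, 6.2], [2.4, 3.8, 5.1, 6.5])
```

Note that a high correlation between averaged vectors is compatible with the abstract's finding of mismatched distributions: averaging washes out individual variability, which is exactly the gap the study probes.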
Topic Area: Reward, Value & Social Decision Making
Extended Abstract: Full Text PDF