Poster Session B: Wednesday, August 13, 1:00 – 4:00 pm, de Brug & E‑Hall
Demographic Prompting Fails to Bridge the Individual Variability Gap: GPT-4o Aligns with Average but Not Individual Emotional Ratings of Images
Chace Ashcraft¹, Raphael Norman-Tenazas, Ritwik Bose¹, Michael Wolmetz¹, Mattson Ogg¹; ¹Johns Hopkins University Applied Physics Laboratory
Presenter: Chace Ashcraft
Large language models (LLMs) and vision language models (VLMs) have been shown to align closely with human behavior in aggregate, but they tend to align less well with individuals and poorly approximate the variability across cohorts of human agents. We explored aligning models to specific individuals based on their demographic data in an emotion rating task, eliciting ratings along two standard psychological emotion dimensions using the previously human-normed OASIS dataset. We created AI "proxy" participants for the human participants in the original OASIS study by prompting GPT-4o with a human participant's demographic data, then instructing the AI participant to rate a set of images for emotional valence or arousal, reproducing the human paradigm. We found that group-averaged GPT-4o ratings correlated with group-averaged human responses, but the distributions of responses differed. Proxy ratings for specific individuals aligned poorly with those individuals' human ratings, despite the use of their specific demographic data. In general, GPT-4o appears to align fairly well with human emotional responses on average, but further work is needed to capture human variability before VLMs can emulate the behavior of specific individuals.
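The proxy-participant setup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt template, the 1–7 rating scale (the scale used in the OASIS norms), and the function names are assumptions, and the actual study would send the prompt plus an image to GPT-4o rather than build strings locally. A simple Pearson correlation, computed here from scratch, stands in for the group-averaged comparison.

```python
from statistics import mean

def build_proxy_prompt(demographics: dict, dimension: str) -> str:
    """Cast the model as a demographic 'proxy' participant.
    Hypothetical template; the study's exact wording is not given in the abstract."""
    profile = ", ".join(f"{k}: {v}" for k, v in demographics.items())
    return (
        f"You are a study participant with the following profile: {profile}. "
        f"Rate the emotional {dimension} of the image on a scale from 1 to 7, "
        "as this participant would, responding with a single number."
    )

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two rating vectors (e.g., group-averaged
    model ratings vs. group-averaged human ratings per image)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Example: one proxy prompt and a toy per-image comparison
prompt = build_proxy_prompt({"age": 34, "gender": "female"}, "valence")
r = pearson_r([2.1, 4.0, 5.5, 6.2], [2.4, 3.8, 5.1, 6.5])
```

Note that a high correlation between averaged vectors is compatible with the abstract's finding of mismatched distributions: averaging washes out individual variability, which is exactly the gap the study probes.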
Topic Area: Reward, Value & Social Decision Making
Extended Abstract: Full Text PDF