
Poster Session A: Tuesday, August 12, 1:30 – 4:30 pm, de Brug & E‑Hall

Learning task-relevant visual features from large language model (LLM) embeddings

Michelle R. Greene1, Bruce Hansen2; 1Barnard College, 2Colgate University

Presenter: Michelle R. Greene

Complete visual understanding requires an analysis of the environment's features while considering the observer's goals. The visual system is not a passive feature extractor; instead, perception constructs representations tailored to behavioral needs. While identifying task-relevant visual features in complex scenes is challenging, advances in AI offer new opportunities to address this challenge. We asked observers to describe the same set of scenes with two different goals: to provide a general scene description or to describe the possible walking paths through the scene. We converted these descriptions to sentence embeddings and trained convolutional neural networks (CNNs) to predict the embeddings from the images. As a baseline, we included the embeddings of the scenes' basic-level categories. Using deconvolution, we generated activation maps that reveal the image areas containing task-relevant semantic information. All networks generated maps that were distinct from each other and from those of a CNN pretrained for scene classification. The maps generated from the navigation task contained higher activation on the ground plane than the maps from the description task. We validated this image information by showing human participants (N=60) partial image views containing either the top or bottom quarter of activated pixels for each network while they performed a three-alternative forced-choice (3AFC) task requiring either categorization or navigation. Views built from the more highly activated pixels yielded better performance, indicating that task-relevant scene features can be learned directly from written descriptions.
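The core training setup described above (mapping images to sentence embeddings of task-specific descriptions) could be sketched roughly as below. This is a minimal illustration, not the authors' code: the choice of sentence encoder (a sentence-transformers model), the ResNet-18 backbone, the MSE objective, and the data layout are all assumptions made for the example.

```python
# Minimal sketch (not the authors' implementation): train a CNN to regress
# the sentence embeddings of task-specific scene descriptions from images.
# The text encoder, backbone, loss, and hyperparameters are illustrative.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from torchvision import models, transforms
from sentence_transformers import SentenceTransformer
from PIL import Image


class SceneDescriptionDataset(Dataset):
    """Pairs each scene image with the embedding of its written description."""

    def __init__(self, image_paths, descriptions, text_encoder):
        self.image_paths = image_paths
        # Encode all descriptions once (e.g., navigation or general captions).
        self.targets = torch.tensor(
            text_encoder.encode(descriptions), dtype=torch.float32)
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = Image.open(self.image_paths[idx]).convert("RGB")
        return self.transform(img), self.targets[idx]


def build_model(embedding_dim):
    """ResNet backbone with a linear head onto the sentence-embedding space."""
    cnn = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    cnn.fc = nn.Linear(cnn.fc.in_features, embedding_dim)
    return cnn


def train(image_paths, descriptions, epochs=10, lr=1e-4):
    text_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
    dataset = SceneDescriptionDataset(image_paths, descriptions, text_encoder)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    model = build_model(dataset.targets.shape[1])
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # a cosine-based loss is another reasonable choice
    for _ in range(epochs):
        for images, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), targets)
            loss.backward()
            optimizer.step()
    return model
```

In this sketch, one such network would be trained per description condition (general description, navigation description, or basic-level category), and activation-mapping methods such as deconvolution could then be applied to each trained model to localize the image regions driving its embedding predictions.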

Topic Area: Object Recognition & Visual Attention

Extended Abstract: Full Text PDF