Poster Presentation

Poster Session B: Wednesday, August 13, 1:00 – 4:00 pm, de Brug & E‑Hall

Concerns in evaluating hierarchical correlations between human brains and end-to-end automatic speech recognition models

Yi Wang¹, Peter Bell¹; ¹University of Edinburgh

Presenter: Yi Wang

Recent neural encoding studies have compared the human speech perception network with artificial neural network models trained end-to-end (e2e) on automatic speech recognition (ASR), aiming to reveal the temporal dynamics of human speech processing. Multiple studies have reported a prominent correspondence between e2e ASR models and human brains in the hierarchical encoding of linguistic features, from low-level acoustic features to high-level semantic features. Although different types of e2e ASR models have been used to investigate this correspondence, there is no consensus on which model type is most suitable for such investigations. This extended abstract discusses concerns about using three mainstream types of e2e ASR models to evaluate hierarchical correlation with the human speech perception network: the recurrent neural network transducer, the attention-based encoder-decoder model using a tokenizer (i.e. Whisper), and the self-supervised transformer model. We suggest that further caution is required when using these models in hierarchical correlation studies, because each model type has an inherent issue: varying decoding latency, a mismatched context window, and difficulty in disentangling representations, respectively.

Topic Area: Language & Communication

Extended Abstract: Full Text PDF