Poster Session B: Wednesday, August 13, 1:00 – 4:00 pm, de Brug & E‑Hall
Concerns in evaluating hierarchical correlations between human brains and end-to-end automatic speech recognition models
Yi Wang¹, Peter Bell¹; ¹University of Edinburgh
Presenter: Yi Wang
Recent neural encoding studies have compared the human speech perception network with artificial neural network models trained end-to-end (e2e) on automatic speech recognition (ASR), aiming to reveal the temporal dynamics of human speech processing. Multiple studies have reported a prominent correspondence between e2e ASR models and human brains in the hierarchical encoding of linguistic features, from low-level acoustic features to high-level semantic features. Although different types of e2e ASR models have been used to investigate this correspondence, there is no consensus on which model type is most suitable for such investigations. This extended abstract discusses concerns about using three mainstream types of e2e ASR models to evaluate hierarchical correlation with the human speech perception network: the recurrent neural network transducer, the attention-based encoder-decoder model using a tokenizer (e.g. Whisper), and the self-supervised transformer model. We suggest that further caution is required when using these models in hierarchical correlation studies, because of issues inherent in each model type: varying decoding latency, a mismatched context window, and difficulty in disentangling representations, respectively.
Topic Area: Language & Communication