Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features
Date
2022
Publisher
IEEE
Abstract
The detection and classification of emotional states in speech involve the analysis of audio signals and text transcriptions. There are complex relationships among the extracted features at different time intervals that must be analyzed to infer the emotions in speech. These relationships can be represented as spatial, temporal and semantic tendency features. In addition to the emotional features that exist in each modality, the text modality contains semantic and grammatical tendencies in the uttered sentences. In deep learning-based models, spatial and temporal features have typically been extracted sequentially using convolutional neural networks (CNN) followed by recurrent neural networks (RNN), an approach that may be weak not only at detecting the separate spatial-temporal feature representations but also at capturing the semantic tendencies in speech. In this paper, we propose a deep learning-based model named the concurrent
spatial-temporal and grammatical (CoSTGA) model that concurrently learns spatial, temporal and semantic
representations in the local feature learning block (LFLB), which are fused into a latent vector that forms the input to the global feature learning block (GFLB). We also investigate the performance of multi-level feature
fusion compared to single-level fusion using the multi-level transformer encoder model (MLTED) that we
also propose in this paper. The proposed CoSTGA model applies multi-level fusion: first at the LFLB level, where similar features (spatial or temporal) are separately extracted from a modality, and second at the GFLB level, where the spatial-temporal features are fused with the semantic tendency features. The proposed
CoSTGA model uses a combination of dilated causal convolutions (DCC), bidirectional long short-term
memory (BiLSTM), transformer encoders (TE), multi-head and self-attention mechanisms. Acoustic and
lexical features were extracted from the interactive emotional dyadic motion capture (IEMOCAP) dataset.
The proposed model achieves a weighted accuracy of 75.50%, an unweighted accuracy of 75.82%, a recall of 75.32%, and an F1 score of 75.57%. These results imply that concurrently learning spatial-temporal features together with semantic tendencies in a multi-level approach improves the model's effectiveness and robustness.
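
The sketch below illustrates the kind of concurrent feature extraction and multi-level fusion described in the abstract: parallel spatial (dilated causal convolution), temporal (BiLSTM) and semantic (transformer encoder) branches whose pooled outputs are concatenated and passed to a global classifier. It is a minimal illustration assuming PyTorch; the class names, layer sizes, pooling and concatenation-based fusion are assumptions for clarity, not the authors' implementation.

```python
# Illustrative sketch only: concurrent spatial, temporal and semantic branches
# fused into a latent vector (LFLB -> GFLB idea from the abstract).
# All hyperparameters and module names are assumptions.
import torch
import torch.nn as nn


class DilatedCausalConv(nn.Module):
    """1-D dilated convolution with left-only padding so the output stays causal."""

    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))  # pad only the past side
        return torch.relu(self.conv(x))


class CoSTGASketch(nn.Module):
    def __init__(self, acoustic_dim=40, lexical_dim=300, hidden=128, n_classes=4):
        super().__init__()
        # Concurrent local feature learning branches (LFLB-like)
        self.spatial = DilatedCausalConv(acoustic_dim, hidden)           # spatial (DCC)
        self.temporal = nn.LSTM(acoustic_dim, hidden // 2,
                                batch_first=True, bidirectional=True)    # temporal (BiLSTM)
        self.semantic = nn.TransformerEncoder(                           # semantic (TE)
            nn.TransformerEncoderLayer(d_model=lexical_dim, nhead=4, batch_first=True),
            num_layers=1)
        self.sem_proj = nn.Linear(lexical_dim, hidden)

        # Global feature learning block (GFLB-like): fuse the concatenated latents
        self.gflb = nn.Sequential(
            nn.Linear(hidden * 3, hidden), nn.ReLU(), nn.Linear(hidden, n_classes))

    def forward(self, acoustic, lexical):
        # acoustic: (batch, time, acoustic_dim); lexical: (batch, tokens, lexical_dim)
        spat = self.spatial(acoustic.transpose(1, 2)).mean(dim=2)  # pool over time
        temp, _ = self.temporal(acoustic)
        temp = temp.mean(dim=1)
        sem = self.sem_proj(self.semantic(lexical).mean(dim=1))
        fused = torch.cat([spat, temp, sem], dim=-1)               # second-level fusion
        return self.gflb(fused)                                    # emotion logits


# Example with random tensors standing in for acoustic and lexical features.
model = CoSTGASketch()
logits = model(torch.randn(2, 100, 40), torch.randn(2, 20, 300))
print(logits.shape)  # torch.Size([2, 4])
```

In this sketch the branches run in parallel rather than CNN-then-RNN in sequence, which is the concurrency contrast the abstract draws; the actual CoSTGA model additionally uses multi-head and self-attention mechanisms not shown here.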
Keywords
Emotion recognition, Spatial features, Temporal features, Semantic tendency features, Multi-head attention