Browsing by Author "Alwin, Poulose"
Now showing 1 - 2 of 2
Results Per Page
- ItemAttention-Based Multi-Learning Approach for Speech Emotion Recognition With Dilated Convolution(IEEE, 2022-11-21) Samuel, Kakuba; Alwin, PouloseThe success of deep learning in speech emotion recognition has led to its application in resource-constrained devices. It has been applied in human-to-machine interaction applications like social living assistance, authentication, health monitoring and alertness systems. In order to ensure a good user experience, robust, accurate and computationally efficient deep learning models are necessary. Recurrent neural networks (RNN) like long short-term memory (LSTM), gated recurrent units (GRU) and their variants that operate sequentially are often used to learn time series sequences of the signal, analyze long-term dependencies and the contexts of the utterances in the speech signal. However, due to their sequential operation, they encounter problems in convergence and sluggish training that uses a lot of memory resources and encounters the vanishing gradient problem. In addition, they do not consider spatial cues that may exist in the speech signal. Therefore, we propose an attention-based multi-learning model (ABMD) that uses residual dilated causal convolution (RDCC) blocks and dilated convolution (DC) layers with multi-head attention. The proposed ABMD model achieves comparable performance while taking global contextualized long-term dependencies between features in a parallel manner using a large receptive field with less increase in the number of parameters compared to the number of layers and considers spatial cues among the speech features. Spectral and voice quality features extracted from the raw speech signals are used as inputs. The proposed ABMD model obtained a recognition accuracy and F1 score of 93.75% and 92.50% on the SAVEE datasets, 85.89% and 85.34% on the RAVDESS datasets and 95.93% and 95.83% on the EMODB datasets. The model’s robustness in terms of the confusion ratio of the individual discrete emotions especially happiness which is often confused with emotions that belong to the same dimensional plane with it also improved when validated on the same datasets
- ItemDeep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features(IEEE, 2022) Samuel, Kakuba; Alwin, Poulose; Dong, Seog Han; Senior Member, IeeeThe detection and classification of emotional states in speech involves the analysis of audio signals and text transcriptions. There are complex relationships between the extracted features at different time intervals which ought to be analyzed to infer the emotions in speech. These relationships can be represented as spatial, temporal and semantic tendency features. In addition to emotional features that exist in each modality, the text modality consists of semantic and grammatical tendencies in the uttered sentences. Spatial and temporal features have been extracted sequentially in deep learning-based models using convolutional neural networks (CNN) followed by recurrent neural networks (RNN) which may not only be weak at the detection of the separate spatial-temporal feature representations but also the semantic tendencies in speech. In this paper, we propose a deep learning-based model named concurrent spatial-temporal and grammatical (CoSTGA) model that concurrently learns spatial, temporal and semantic representations in the local feature learning block (LFLB) which are fused as a latent vector to form an input to the global feature learning block (GFLB). We also investigate the performance of multi-level feature fusion compared to single-level fusion using the multi-level transformer encoder model (MLTED) that we also propose in this paper. The proposed CoSTGA model uses multi-level fusion first at the LFLB level where similar features (spatial or temporal) are separately extracted from a modality and secondly at the GFLB level where the spatial-temporal features are fused with the semantic tendency features. The proposed CoSTGA model uses a combination of dilated causal convolutions (DCC), bidirectional long short-term memory (BiLSTM), transformer encoders (TE), multi-head and self-attention mechanisms. Acoustic and lexical features were extracted from the interactive emotional dyadic motion capture (IEMOCAP) dataset. The proposed model achieves 75.50% and 75.82% of weighted and unweighted accuracy, 75.32% and 75.57% of recall and F1 score respectively. These results imply that concurrently learned spatial-temporal features with semantic tendencies learned in a multi-level approach improve the model’s effectiveness and robustness.