Using Deep Learning for Speech Emotion Recognition

To enhance Speech Emotion Recognition (SER) through Deep Convolutional Neural Networks (DCNNs), we used the fast Continuous Wavelet Transform (fCWT). fCWT makes CWT-based time-frequency analysis feasible in real time and provides improved results over the commonly used Short-Time Fourier Transform (STFT)-based SER systems. We show that fCWT-DCNN-based SER holds promise for real-time, high-accuracy SER across a wide range of applications.

Intelligent Environments (IE) hinge on effective two-way communication between humans and the environment, which requires high-quality signals to perform effectively and fluently. Here, speech is often considered the most natural form of communication. Consequently, speech emotion recognition (SER) has become increasingly significant due to the critical role of paralinguistic aspects such as pitch, volume, and intonation in revealing a speaker's emotional state. SER is valuable in fields like Human-Computer Interaction, Human-Robot Interaction, and Intelligent Environments, as it facilitates the understanding of human emotions for more natural and fluent interactions. However, it requires real-time processing of speech signals with utmost accuracy.

Time-Frequency Representations (TFR) provide a popular approach for processing and analyzing speech data for SER, as they capture both the temporal and spectral features of speech. Techniques such as the Short-Time Fourier Transform (STFT) and the Continuous Wavelet Transform (CWT) are often employed for this time-frequency analysis. Although there is typically a trade-off between analysis quality and computational burden, both factors hold equal significance when it comes to real-time SER.

An audio signal analysed by a Continuous Wavelet Transform (CWT) to extract its dynamic frequency content.
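To make the contrast between the two TFRs concrete, the sketch below computes both for a synthetic chirp. It is a minimal illustration using scipy's STFT and PyWavelets' CWT as stand-ins; it is not the fCWT implementation nor our actual processing pipeline, and the sampling rate, window length, and frequency grid are arbitrary assumptions.

```python
# Minimal sketch: STFT vs. CWT time-frequency representations of a test signal.
import numpy as np
import pywt
from scipy import signal

fs = 16_000                                       # assumed sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)
x = signal.chirp(t, f0=100, f1=4_000, t1=1.0)     # test tone sweeping 100 Hz -> 4 kHz

# STFT: a single fixed window, hence one fixed time-frequency resolution everywhere.
f_stft, t_stft, Zxx = signal.stft(x, fs=fs, nperseg=512, noverlap=384)
stft_tfr = np.abs(Zxx)                            # magnitude spectrogram

# CWT: the analysis window is rescaled per frequency (Multi-Resolution Analysis):
# fine time resolution at high frequencies, fine frequency resolution at low ones.
freqs_hz = np.geomspace(100, 4_000, 64)           # analysis frequencies
scales = pywt.central_frequency("morl") / (freqs_hz / fs)
cwt_coeffs, cwt_freqs = pywt.cwt(x, scales, "morl", sampling_period=1 / fs)
cwt_tfr = np.abs(cwt_coeffs)                      # scalogram

print(stft_tfr.shape, cwt_tfr.shape)
```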

The availability of SER databases has enabled deep learning (DL) to gain increasing success in speech processing. Many studies have applied DL to SER, with varying success, and in most cases TFRs are used as input. However, TFRs face a fundamental resolution limitation caused by the uncertainty principle of signal processing. The STFT, typically used in TFR-based systems, offers only a sub-optimal compromise within this limitation and often results in information loss. Despite the capacity of Deep Convolutional Neural Networks (DCNNs) to detect abstract, low-level features in the data, a high-quality TFR is still desired.
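For reference, the limitation in question is the Gabor-Heisenberg bound: for any analysis window, the product of its temporal and spectral spread has a fixed lower limit, so time and frequency resolution cannot both be made arbitrarily fine.

```latex
% Gabor-Heisenberg uncertainty relation for any time-frequency atom,
% with \sigma_t and \sigma_f the temporal and spectral standard deviations.
\sigma_t \, \sigma_f \ \ge \ \frac{1}{4\pi}
```

The STFT fixes the window, and thus the trade-off, once for all frequencies, whereas the CWT rescales its window with frequency and so distributes the same bound differently across the spectrum.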

The Continuous Wavelet Transform (CWT) offers a solution by using different resolutions for different frequencies, also known as Multi-Resolution Analysis (MRA). However, its real-time computation on edge computing devices was challenging until a recent advancement in TFR: the fast Continuous Wavelet Transform (fCWT) [1]. We propose that the MRA of fCWT offers advantages over the STFT for DCNN-based SER. To validate this hypothesis, we replicated a previous study by Xia et al. [2], utilizing both STFT and fCWT for comparison. As in the prior study, we apply transfer learning to the ImageNet-pretrained Deep Convolutional Neural Network (DCNN) architecture AlexNet and use the same databases: eNTERFACE05 [3] and EMO-DB [4].
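A minimal sketch of this transfer-learning setup is shown below, using torchvision's ImageNet-pretrained AlexNet. The class count and the choice to freeze the feature extractor are illustrative assumptions, not necessarily the exact configuration of [2].

```python
# Sketch: adapt an ImageNet-pretrained AlexNet to emotion classification
# (assumed setup, not necessarily the exact configuration used by Xia et al. [2]).
import torch.nn as nn
from torchvision import models

num_emotions = 7                                  # e.g. the seven EMO-DB classes
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Swap the final ImageNet classifier (1000 classes) for an emotion classifier.
model.classifier[6] = nn.Linear(model.classifier[6].in_features, num_emotions)

# One common option: freeze the convolutional feature extractor and fine-tune only the head.
for p in model.features.parameters():
    p.requires_grad = False
```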

eNTERFACE05 is an audiovisual dataset comprising 1,293 English utterances from 44 actors, totaling roughly 68 minutes of speech covering six emotions: anger, disgust, fear, happiness, sadness, and surprise. EMO-DB is an audio-only dataset containing 535 German utterances by 10 actors, amounting to approximately 25 minutes of speech. This dataset contains seven emotions: anger, disgust, fear, happiness, sadness, boredom, and a neutral state. Given the relatively small size of the eNTERFACE05 and EMO-DB datasets, we adopted data augmentation (DA) strategies to increase their size and enhance model robustness. The strategy called Random Circular Shift (RCS) was compared with a reference DA method, White Gaussian Noise (WGN).

With RCS, a TFR is circularly shifted along the time axis by a randomly selected offset, looping any excess back to the beginning. In our study, we used RCS5, creating five new training samples for each entry. To match the number of entries generated by RCS5, we used five levels of WGN, with signal-to-noise ratios of 10, 15, 20, 25, and 30 dB.
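Both augmentations are simple to express. The sketch below shows how RCS (on a TFR) and WGN at a target SNR (on the waveform) could be implemented; it follows the descriptions above under our own assumptions and is not the original augmentation code.

```python
# Sketch of the two data-augmentation strategies (assumed implementations).
import numpy as np

def random_circular_shift(tfr: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Circularly shift a TFR (freq x time) along the time axis by a random offset;
    content pushed past the end wraps around to the beginning."""
    shift = rng.integers(0, tfr.shape[1])
    return np.roll(tfr, shift, axis=1)

def add_wgn(x: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    """Add white Gaussian noise to a waveform at the requested signal-to-noise ratio."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
    return x + noise

rng = np.random.default_rng(0)
tfr = rng.random((64, 300))                               # dummy 64-bin, 300-frame TFR
rcs5 = [random_circular_shift(tfr, rng) for _ in range(5)]            # RCS5
wgn5 = [add_wgn(rng.standard_normal(16_000), snr, rng) for snr in (10, 15, 20, 25, 30)]
```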

We employ leave-one-speaker-out and leave-one-speaker-group-out cross-validation strategies for the EMO-DB and eNTERFACE05 datasets, respectively. For eNTERFACE05, we leave out a group of five speakers per fold, resulting in nine folds; for EMO-DB, one speaker is left out per fold, generating ten folds. The validation and test sets each comprise one fold.
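Speaker-independent splitting of this kind can be expressed with scikit-learn's grouped cross-validation, as sketched below. The data shapes, labels, and the mapping of utterances to speakers are placeholders, and the grouping of eNTERFACE05 speakers into nine groups is our own assumption of how the folds could be built.

```python
# Sketch: leave-one-speaker(-group)-out cross-validation using scikit-learn.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.random((535, 4096))                       # dummy flattened TFRs, one per EMO-DB utterance
y = rng.integers(0, 7, size=535)                  # emotion labels
speakers = rng.integers(0, 10, size=535)          # EMO-DB: 10 speakers -> 10 folds

logo = LeaveOneGroupOut()
for fold, (train_idx, test_idx) in enumerate(logo.split(X, y, groups=speakers)):
    # For eNTERFACE05, 'speakers' would instead hold one of nine speaker-group IDs,
    # so that each fold leaves out a group of five speakers.
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test utterances")
```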

The results of the evaluation can be found in Table 1. For EMO-DB, fCWT only outperformed STFT with RCS5, whereas STFT surpassed fCWT both without DA and with WGN. With eNTERFACE05, fCWT surpassed STFT when RCS5 and WGN were applied as DA, but STFT outperformed fCWT when no DA was utilized. Overall, the fCWT- and STFT-based TFRs each yielded the best model performance an equal number of times. Therefore, our hypothesis that fCWT-trained DCNNs would surpass STFT-trained DCNNs for SER could not be confirmed. Whether STFT's fixed resolution or CWT's MRA is better for SER remains an open question. The fixed input size required by the DCNN might explain why fCWT did not perform better, as the image downscaling applied to the TFR could have led to information loss. A full-scale fCWT TFR, which preserves the MRA, might contain more paralinguistic information and could therefore improve model performance.

Table 1: Accuracies of the fast Continuous Wavelet Transform (fCWT)- and Short-Time Fourier Transform (STFT)-trained Deep Convolutional Neural Network (DCNN) models on the EMO-DB [4] and eNTERFACE05 [3] databases, per Data Augmentation (DA) strategy: Random Circular Shift (RCS) and White Gaussian Noise (WGN).
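To illustrate the suspected bottleneck mentioned above: the TFRs must be downscaled to the DCNN's fixed input resolution (224×224 pixels for torchvision's AlexNet). The resizing below is a hedged sketch under our own assumptions, not the exact preprocessing used in the study.

```python
# Sketch of the downscaling step suspected of discarding fine-grained fCWT detail
# (assumed preprocessing, not the authors' exact resizing procedure).
import numpy as np
import torch
import torch.nn.functional as F

tfr = np.random.rand(512, 3_000)                  # e.g. a high-resolution fCWT scalogram
x = torch.from_numpy(tfr).float()[None, None]     # shape (1, 1, 512, 3000)
x_small = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
x_rgb = x_small.repeat(1, 3, 1, 1)                # AlexNet expects a 3-channel image
print(x_rgb.shape)                                # torch.Size([1, 3, 224, 224])
```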

The results can be difficult to interpret, potentially due to the training strategy adopted from Xia et al. [2]. Figure 1a depicts a rapid decrease in training loss, while the validation loss reaches a plateau after a few epochs, indicating overfitting. There is also significant oscillation in the validation loss. These problems could stem from three aspects of the training strategy: 1) a high learning rate, 2) a suboptimal optimizer, and 3) an inadequately sized validation set. As a result, there is significant variance in the average model performance, making it necessary to exercise caution when comparing models. Figures 1b, 1c, and 1d depict the impact of modifying the training strategy by adopting a smaller learning rate, using the Adam optimizer, and enlarging the validation set, respectively. In each case, the models overfit the training set more slowly. Figures 1b and 1c also show reduced oscillations in the validation loss. However, Figure 1d continues to display substantial oscillation and a decrease in model performance, likely because the larger validation set leaves a smaller training set, which in turn reduces the model's generalizability. These observations suggest that alternative training strategies and hyperparameter tuning might lead to increased stability and improved model performance.
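The three modifications amount to small changes around an otherwise standard training loop. The sketch below shows where each would sit in generic PyTorch code; the stand-in model, dummy data, batch size, and validation fraction are placeholder assumptions, not the original setup.

```python
# Sketch of the adjusted training strategy: smaller learning rate, Adam optimizer,
# and a larger validation split, wrapped in assumed boilerplate.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 7))   # stand-in for AlexNet
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # changes 1 and 2

data = TensorDataset(torch.randn(200, 1, 64, 64), torch.randint(0, 7, (200,)))
n_val = int(0.2 * len(data))                                 # change 3: larger validation set
train_set, val_set = random_split(data, [len(data) - n_val, n_val])
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
val_loader = DataLoader(val_set, batch_size=16)

for epoch in range(5):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in val_loader) / len(val_loader)
    print(f"epoch {epoch}: validation loss {val_loss:.3f}")
```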

Figure 1: Cross-entropy loss during Deep Convolutional Neural Network (DCNN) training and validation using different model training strategies. A) displays the original training strategy, B) a learning rate (lr) of 0.0001, C) the Adam optimizer, and D) a larger validation set. In all cases, the other hyperparameters are held constant, using the original training strategy as reference. An arbitrary cross-validation fold is plotted that used the Short-Time Fourier Transform (STFT) without data augmentation on the EMO-DB database.

Deep learning's main vulnerability, a lack of data, is highlighted once again. This is a common situation in IE contexts, where users and environments are inherently diverse. Securing reliable, high-quality signals for real-time processing could enable the use of models that are not dependent on large data volumes. Fine-tuning fCWT's hyperparameters might offer a solution to this issue.

References

  1. Arts LPA, van den Broek EL. The fast Continuous Wavelet Transformation (fCWT) for real-time, high-quality, and noise-resistant time-frequency analysis. Nature Computational Science. 2022;2(1):47-58. doi:10.1038/s43588-021-00183-z
  2. Xia S, Fourer D, Audin L, Rouas JL, Shochi T. Speech Emotion Recognition using Time-frequency Random Circular Shift and Deep Neural Networks. In: Frota S, Vigário M, editors. Speech Prosody 2022. Lisbon, Portugal; Baixas, France: International Speech Communication Association (ISCA); 2022. p. 585-9. doi:10.21437/SpeechProsody.2022-119
  3. Martin O, Kotsia I, Macq B, Pitas I. The eNTERFACE'05 Audio-Visual Emotion Database. In: 22nd International Conference on Data Engineering Workshops (ICDEW'06). Los Alamitos, CA, USA: IEEE Computer Society; 2006. p. 8. doi:10.1109/ICDEW.2006.145
  4. Burkhardt F, Paeschke A, Rolfes M, Sendlmeier WF, Weiss B. A database of German emotional speech. In: Trancoso I, editor. Proceedings of Interspeech 2005. Lisbon, Portugal; Baixas, France: International Speech Communication Association (ISCA); 2005. p. 1517-20. doi:10.21437/Interspeech.2005-446