Marathi Speech Emotion Recognition Using Deep Learning Techniques


Akhilesh Ketkar
Divyansh Mishra
Madhur Nirmal
Faizan Mulla
Vaibhav Narawade

Abstract

This project proposes a deep learning system for recognizing emotion from speech. The goal is to classify a speech signal into one of five emotions: anger, boredom, fear, happiness, and sadness. The Marathi-language dataset was constructed from snippets of numerous Marathi movies and TV shows and includes 20 audio samples for anger, 19 for boredom, 5 for fear, and 11 for happiness. The proposed system first transforms the speech signal from the time domain to the frequency domain using the Discrete Time Fourier Transform (DTFT). Data augmentation is then performed, consisting of noise injection, stretching, shifting, and pitch scaling of the speech signal. Next, five features are extracted: Mel Frequency Cepstral Coefficients (MFCC), Zero Crossing Rate (ZCR), Chroma STFT, Mel Spectrogram, and root mean square (RMS) value. These features are then fed to a Convolutional Neural Network (CNN). Experimental findings support the effectiveness of the proposed CNN-based system: the model's accuracy on the test data is 80.33%, and its F1 scores for anger, boredom, fear, happiness, and sadness are 0.85, 0.83, 0.50, 0.62, and 0.84, respectively.
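
As a rough illustration of the augmentation and feature-extraction stages described in the abstract, the following minimal sketch implements two of the four augmentations (noise injection and shifting) and two of the five features (ZCR and RMS) using only the Python standard library. It is not the authors' implementation: the abstract does not name a library or any parameter values, so the synthetic sine-tone input, the function names, the sample rate, the noise level, and the frame and hop lengths are all assumptions chosen for illustration. Stretching, pitch scaling, MFCC, Chroma STFT, and the Mel Spectrogram are omitted here; in practice they would typically come from an audio library.

```python
import math
import random


def make_tone(freq_hz=220.0, sample_rate=8000, duration_s=0.5):
    """Synthetic stand-in for a speech clip (assumption: a pure sine tone)."""
    n = int(sample_rate * duration_s)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def inject_noise(signal, noise_level=0.02, seed=0):
    """Add zero-mean Gaussian noise scaled by the signal's peak amplitude."""
    rng = random.Random(seed)
    peak = max(abs(x) for x in signal)
    return [x + rng.gauss(0.0, noise_level * peak) for x in signal]


def shift(signal, offset):
    """Circularly shift the signal by `offset` samples (positive = delay)."""
    offset %= len(signal)
    if offset == 0:
        return list(signal)
    return signal[-offset:] + signal[:-offset]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def rms(frame):
    """Root mean square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def frame_features(signal, frame_len=400, hop=200):
    """Return one (ZCR, RMS) pair per overlapping frame of the signal."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        feats.append((zero_crossing_rate(frame), rms(frame)))
    return feats


# Each original clip yields several training examples after augmentation.
clean = make_tone()
augmented = [clean, inject_noise(clean), shift(clean, 800)]
features = [frame_features(clip) for clip in augmented]
```

Each entry of `features` is a per-frame feature sequence for one variant of the clip. In a pipeline like the one described, such sequences (together with the remaining features) would be stacked into the input array for the CNN.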


