A Deep Attention Model for Environmental Sound Classification from Multi-Feature Data
Abstract
1. Introduction
2. Related Work
2.1. Traditional Methods
2.2. Deep Learning
3. Time–Frequency Attention Mechanism Model Based on Multi-Feature Parameters
3.1. Feature Extraction
3.2. Residual Model
4. Experiment and Analysis
4.1. Datasets
4.2. Model Training
4.3. Comparative Experiment
4.3.1. Experimental Results of Feature Map Parameters
4.3.2. Experimental Result Comparison of Network Models
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
| Feature Map Types | ESC-10 (%) | ESC-50 (%) | UrbanSound8K (%) |
|---|---|---|---|
| Log-Mel spectrogram | 92.75 | 83.25 | 80.47 |
| Phase spectrogram | 68.25 | 62.50 | 53.18 |
| Time–frequency spectrogram | 90.25 | 80.75 | 78.15 |
| Phase + time–frequency spectrogram | 81.75 | 72.00 | 68.26 |
| Log-Mel + time–frequency spectrogram | 93.75 | 85.25 | 81.32 |
| Log-Mel + phase spectrogram | 95.00 | 86.75 | 81.63 |
| Log-Mel + phase + time–frequency spectrogram | 97.25 | 89.00 | 83.45 |
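The best row above combines all three feature maps. Assuming the three spectrograms are resized to a common frequency–time grid (the feature-map sizes reported later, e.g. (64, 216, 3), suggest they are stacked as channels), a minimal numpy sketch of that stacking, with random arrays standing in for the real feature maps:

```python
import numpy as np

# Stand-ins for the three per-clip feature maps (log-Mel, phase, and
# time-frequency spectrograms), assumed resized to a common 64x216 grid.
rng = np.random.default_rng(0)
log_mel = rng.standard_normal((64, 216))
phase = rng.standard_normal((64, 216))
time_freq = rng.standard_normal((64, 216))

# Stack along a trailing channel axis to form the (bands, frames, channels)
# network input described by the feature-map sizes in the tables.
features = np.stack([log_mel, phase, time_freq], axis=-1)
print(features.shape)  # (64, 216, 3)
```

The 3-channel layout lets an image-style convolutional network consume all three representations in a single forward pass.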
| Sampling Frequency (kHz) | ESC-10 (%) | ESC-50 (%) | UrbanSound8K (%) |
|---|---|---|---|
| 8 | 95.25 | 86.50 | 81.31 |
| 16 | 95.25 | 87.25 | 80.76 |
| 44.1 | 97.25 | 89.00 | 83.45 |
| 48 | 96.00 | 88.50 | 79.94 |
| Frame Length (samples) | Frame Shift (samples) | Number of Mel Filters | Feature Map Size | Network Accuracy on ESC-50 (%) |
|---|---|---|---|---|
| 1024 | 512 | 40 | (40, 431, 3) | 86.25 |
| 1024 | 512 | 64 | (64, 431, 3) | 86.25 |
| 2048 | 512 | 64 | (64, 431, 3) | 86.75 |
| 2048 | 1024 | 64 | (64, 216, 3) | 88.25 |
| 4096 | 2048 | 64 | (64, 108, 3) | 84.00 |
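The best-performing row (frame length 2048, frame shift 1024, 64 Mel filters at 44.1 kHz) produces a 64×216 log-Mel map for a 5-second clip such as those in ESC-50. A minimal numpy sketch of that extraction, with an illustrative triangular mel filterbank (library implementations such as librosa differ in normalization details), reproduces the feature-map size from the table:

```python
import numpy as np

SR, N_FFT, HOP, N_MELS = 44100, 2048, 1024, 64  # best row of the table

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale (illustrative only).
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sr / 2.0, n_bins)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = (fft_freqs - lo) / (ctr - lo)
        down = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(up, down))
    return fb

def log_mel(y):
    # Centered framing: pad n_fft//2 on both sides, then slide by HOP,
    # so a clip of n samples yields 1 + n // HOP frames.
    y = np.pad(y, N_FFT // 2, mode="reflect")
    n_frames = 1 + (len(y) - N_FFT) // HOP
    window = np.hanning(N_FFT)
    frames = np.stack([y[i * HOP:i * HOP + N_FFT] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (frames, bins)
    mel = mel_filterbank(SR, N_FFT, N_MELS) @ power.T  # (mels, frames)
    return 10.0 * np.log10(np.maximum(mel, 1e-10))     # log scale in dB

clip = np.random.default_rng(0).standard_normal(5 * SR)  # 5 s, like an ESC-50 clip
print(log_mel(clip).shape)  # (64, 216)
```

The frame count follows directly from the parameters: 5 s × 44,100 Hz = 220,500 samples, and 1 + 220500 // 1024 = 216 frames, matching the (64, 216, 3) size reported above.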
| Time–Frequency Attention Mechanism Insertion Position | ESC-10 (%) | ESC-50 (%) | UrbanSound8K (%) |
|---|---|---|---|
| None | 95.25 | 84.25 | 80.35 |
| Model input | 97.25 | 89.00 | 83.45 |
| Between residual blocks | 95.75 | 86.25 | 80.35 |
| Model output | 95.25 | 85.75 | 81.26 |
| Model input, output, and between residual blocks | 92.50 | 82.25 | 78.66 |
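The exact attention module is not reproduced here; as a rough illustration of the idea the table evaluates, the sketch below reweights a (frequency, time) feature map by separate softmax attention vectors over the frequency and time axes. The scores are simply axis means of the map itself, whereas the paper's module would learn them:

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along one axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tf_attention(feat):
    """Reweight a (freq, time) map by frequency- and time-attention vectors.

    Illustrative sketch only: scores are axis means of the input map,
    standing in for the learned weights of the paper's attention module.
    """
    freq_w = softmax(feat.mean(axis=1), axis=0)  # (freq,), sums to 1
    time_w = softmax(feat.mean(axis=0), axis=0)  # (time,), sums to 1
    return feat * freq_w[:, None] * time_w[None, :]

feat = np.random.default_rng(0).standard_normal((64, 216))
out = tf_attention(feat)
print(out.shape)  # (64, 216)
```

Applied at the model input, such a module emphasizes informative frequency bands and time frames before any convolution sees the data, which is consistent with the input position performing best in the table.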
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Guo, J.; Li, C.; Sun, Z.; Li, J.; Wang, P. A Deep Attention Model for Environmental Sound Classification from Multi-Feature Data. Appl. Sci. 2022, 12, 5988. https://0-doi-org.brum.beds.ac.uk/10.3390/app12125988