Lightweight Deep Learning Emotion Recognition System Based on Facial and Speech Features

Date

2024

Abstract

In response to the shortage of elderly-care workers caused by population aging, this study proposes a lightweight emotion recognition model that integrates facial expressions and speech and can be deployed on a companion robot (Zenbo Junior II). Most recent human emotion recognition techniques are built on convolutional neural networks (CNNs) and have achieved excellent results. However, these state-of-the-art techniques do not take computational cost into account, so they cannot run on devices with limited computing power, such as companion robots. This study therefore uses the lightweight GhostNet model for facial emotion recognition and a lightweight one-dimensional convolutional neural network (1D-CNN) for speech emotion recognition, and integrates the predictions of the two modalities using the geometric mean. The proposed model achieves 97.56% and 82.33% accuracy on the RAVDESS and CREMA-D datasets, respectively. While maintaining this accuracy, the number of parameters is compressed to 0.92M and the number of floating-point operations is reduced to 0.77G, tens of times fewer than known state-of-the-art techniques. Finally, the model was deployed on Zenbo Junior II. A comparison of the computational intensity of the model with that of the hardware shows that the model runs more smoothly on this hardware, with inference times of only 1500 ms and 12 ms for the facial and speech emotion recognition models, respectively.
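The geometric-mean fusion of the two modalities can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the function name `fuse_geometric_mean`, the `eps` floor, the renormalization step, and the example probabilities below are all assumptions made for illustration, not the author's implementation.

```python
import math


def fuse_geometric_mean(face_probs, speech_probs, eps=1e-12):
    """Fuse two per-class probability lists with an element-wise geometric mean.

    Hypothetical helper: the record only states that a geometric mean is used.
    The eps floor and the renormalization are assumptions of this sketch.
    """
    if len(face_probs) != len(speech_probs):
        raise ValueError("both modalities must score the same set of classes")
    # Element-wise geometric mean; eps keeps an exact zero in one modality
    # from wiping out a class entirely.
    fused = [
        math.sqrt(max(f, eps) * max(s, eps))
        for f, s in zip(face_probs, speech_probs)
    ]
    # Renormalize so the fused scores sum to 1 again.
    total = sum(fused)
    return [value / total for value in fused]


# Made-up softmax outputs for three emotion classes (not data from the study).
face = [0.70, 0.20, 0.10]
speech = [0.10, 0.60, 0.30]

fused = fuse_geometric_mean(face, speech)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted)
```

Compared with an arithmetic mean, the geometric mean penalizes classes on which the two modalities disagree strongly, because a low score from either model pulls the fused score down.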

Keywords

Deep Learning, Bimodal Emotion Recognition, Lightweight Models, Convolutional Neural Networks, Companion Robots
