Transformer encoder-based multilevel representations with fusion feature input for speech emotion recognition
【Abstract】 To improve the accuracy of speech emotion recognition (SER), the feasibility of applying the Transformer to SER is explored. The log Mel-scale spectrogram fused with its first-order differential feature serves as the input, from which the Transformer extracts hierarchical speech representations; the effects of varying the number of attention heads and the number of Transformer-encoder layers on recognition accuracy are analyzed. The results show that, on the ABC, CASIA, DES, EMODB, and IEMOCAP speech emotion databases, the proposed model improves accuracy by 13.98%, 8.14%, 24.34%, 8.16%, and 20.90%, respectively, over a Transformer that takes Mel-frequency cepstral coefficients (MFCC) as input. The proposed model also outperforms recurrent neural network (RNN), convolutional neural network (CNN), and other Transformer-based models.
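The fused input described in the abstract (a log Mel-scale spectrogram combined with its first-order differential) can be sketched as follows. This is a minimal NumPy illustration assuming the fusion is a channel-wise stack of the spectrogram and a regression-style delta; the record does not specify the paper's exact delta window or fusion scheme, so both are illustrative choices here:

```python
import numpy as np

def delta(feat, width=2):
    """First-order regression delta over a +/-width window (HTK-style).
    feat: (T, n_mels) log-Mel spectrogram; edges are padded by repetition."""
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, width + 1))
    T = feat.shape[0]
    out = np.zeros_like(feat)
    for i in range(1, width + 1):
        out += i * (padded[width + i:width + i + T] - padded[width - i:width - i + T])
    return out / denom

def fuse(logmel):
    """Stack the log-Mel spectrogram and its delta as two input channels."""
    return np.stack([logmel, delta(logmel)], axis=0)  # (2, T, n_mels)

logmel = np.random.default_rng(0).standard_normal((100, 64))  # (frames, mel bins)
fused = fuse(logmel)
print(fused.shape)  # (2, 100, 64)
```

A two-channel stack keeps the static and dynamic features aligned frame-by-frame; concatenating along the feature axis instead would give a `(T, 2 * n_mels)` input, and either layout can feed a Transformer after a linear projection.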
【Key words】 speech emotion recognition; transformer; multi-head attention mechanism; fusion feature
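The abstract's study of attention-head and encoder-layer counts can be illustrated with PyTorch's built-in Transformer encoder. The hyperparameter values, the mean-pooling over frames, and the linear classifier head below are all hypothetical placeholders; the record only states that the number of heads and layers was varied:

```python
import torch
import torch.nn as nn

# Hypothetical hyperparameters -- the record does not give the paper's values.
d_model, n_heads, n_layers, n_classes = 64, 4, 2, 5

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True),
    num_layers=n_layers,
)
classifier = nn.Linear(d_model, n_classes)

x = torch.randn(8, 100, d_model)        # (batch, frames, feature dim)
h = encoder(x)                          # hierarchical speech representation
logits = classifier(h.mean(dim=1))      # pool over time, then classify
print(logits.shape)  # torch.Size([8, 5])
```

Sweeping `n_heads` and `n_layers` over a grid and comparing validation accuracy reproduces the kind of analysis the abstract describes.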
- 【Source】 Journal of Southeast University (English Edition), 2023, No. 1
- 【CLC number】 TN912.34
- 【Downloads】 91