Transformer encoder-based multilevel representations with fusion feature input for speech emotion recognition

【Author】 He Zhengran; Shen Qifan; Wu Jiaxin; Xu Mengyao; Zhao Li (School of Information Science and Engineering, Southeast University; School of Electronic Science and Engineering, Southeast University; School of Computer Science and Software Engineering, University of Stirling)

【Corresponding Author】 Zhao Li

【Affiliation】 School of Information Science and Engineering, Southeast University; School of Microelectronics, Southeast University; School of Computer Science and Software Engineering, University of Stirling

【Abstract】 To improve the accuracy of speech emotion recognition (SER), the feasibility of applying the Transformer to SER is explored. The log Mel-scale spectrogram and its first-order differential feature are fused as the input, and a Transformer encoder is used to extract hierarchical speech representations; the effects of varying the number of attention heads and the number of Transformer encoder layers on recognition accuracy are analyzed. The results show that, on the ABC, CASIA, DES, EMODB, and IEMOCAP speech emotion databases, the accuracy of the proposed model exceeds that of a Transformer with Mel-frequency cepstral coefficients (MFCCs) as the input feature by 13.98%, 8.14%, 24.34%, 8.16%, and 20.9%, respectively. The proposed model also outperforms recurrent neural network (RNN), convolutional neural network (CNN), Transformer-based, and other models.
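The abstract outlines a concrete pipeline: fuse the log Mel-scale spectrogram with its first-order delta, feed the frame sequence to a Transformer encoder, and vary the number of attention heads and encoder layers. Below is a minimal sketch of that pipeline, assuming librosa and PyTorch (the record names neither library); the function and class names, sample rate, model width, mean pooling, and class count are hypothetical choices, not the paper's reported configuration, and positional encoding is omitted for brevity.

    # Illustrative sketch only; all hyperparameter values are assumptions.
    import numpy as np
    import librosa
    import torch
    import torch.nn as nn

    def fused_logmel_input(wav_path, n_mels=64):
        """Fuse the log Mel-scale spectrogram with its first-order delta."""
        y, sr = librosa.load(wav_path, sr=16000)                  # sample rate assumed
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        log_mel = librosa.power_to_db(mel)                        # (n_mels, frames)
        delta = librosa.feature.delta(log_mel)                    # first-order difference
        fused = np.concatenate([log_mel, delta], axis=0)          # (2*n_mels, frames)
        return torch.from_numpy(fused.T).float()                  # (frames, 2*n_mels)

    class TransformerSER(nn.Module):
        """Transformer encoder over the frame sequence, mean-pooled, then classified."""
        def __init__(self, n_feats=128, d_model=256, n_heads=4, n_layers=4, n_classes=5):
            super().__init__()
            self.proj = nn.Linear(n_feats, d_model)               # map features to model width
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               batch_first=True)
            # n_heads and n_layers are the two knobs the paper varies.
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.cls = nn.Linear(d_model, n_classes)

        def forward(self, x):                                     # x: (batch, frames, n_feats)
            h = self.encoder(self.proj(x))                        # hierarchical representations
            return self.cls(h.mean(dim=1))                        # pool over time, classify

    # Shape check with a dummy batch: 2 clips, 300 frames, 64 mels + 64 deltas.
    model = TransformerSER()
    print(model(torch.randn(2, 300, 128)).shape)                  # torch.Size([2, 5])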

【Fund】 The Key Research and Development Program of Jiangsu Province (No. BE2022059-3)
  • 【Source】 Journal of Southeast University (English Edition), 2023, Issue 01
  • 【CLC Number】 TN912.34
  • 【Downloads】 91