基于深度学习的中文唇语识别系统设计

The Design of Chinese Lip-reading System Based on Deep Learning

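【Author】 Xiao Qi (肖琦)

【Advisor】 Lu Yuanyao (鲁远耀)

【Author information】 North China University of Technology (北方工业大学), Information and Communication Engineering, 2022, Master's thesis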

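【Abstract (Chinese version)】 With the continuing development of deep learning, human society is gradually entering an era of artificial intelligence. Human-computer interaction, a branch of artificial intelligence, has also made great progress in recent years, and lip-reading, as one form of human-computer interaction, is attracting growing attention. However, most lip-reading research concentrates on English corpora, and Chinese lip-reading has received little attention. The main reasons are that Chinese is less influential than English, Chinese lip-reading technology started relatively late, influential Chinese lip-reading datasets are lacking, and recognition accuracy for Chinese words is not yet satisfactory. To address these problems, this thesis designs and develops a deep-learning-based Chinese lip-reading system and builds its own Chinese lip-reading dataset. The aims are to fill gaps in Chinese lip-reading research, enrich Chinese lip-reading datasets, broaden the influence of Chinese lip-reading, and give the resulting system practical value so that it can eventually serve people in China in their daily lives. The thesis adopts a ResNet50 residual network with the convolutional block attention module (CBAM) and an Attention-based gated recurrent unit (GRU) network, fuses the two adaptively, trains the fused model, and packages it into the Chinese lip-reading system. Comparative experiments against 11 CNN-RNN fusion models commonly used in lip-reading verify that the proposed model has the best performance and the highest stability. The specific work is as follows:
(1) Preprocessing the input video. A semi-random strategy that extracts a fixed number of frames is applied to the input video to obtain consecutive frames containing the key information. Face detection and lip localization are then performed on these frames to segment a continuous lip-movement frame sequence, and this sequence of lip images is used as one group of inputs.
(2) Improving the CNN to strengthen spatial feature extraction from individual lip images. Based on comparative experiments, ResNet50 is chosen to extract spatial features from lip images, and its convolutional block (ResBlock) is modified in a novel way by incorporating CBAM, which strengthens its computational capability during convolution and improves feature-extraction performance.
(3) Improving the RNN to enhance temporal feature extraction from consecutive lip images. A GRU is chosen and an Attention mechanism is added to it, which helps assign more weight to key frames, suppress interference from redundant information, and improve the extraction of temporal features.
(4) Designing and implementing the Chinese lip-reading system. The networks from the two preceding steps are fused adaptively to build an encoder-decoder CNN-RNN fusion network that processes the continuous lip-movement image sequence. PyQt5 is used for page design and functional layout to build the complete system. The trained model is packaged into the system, and experiments on the self-built Chinese lip-reading dataset show that the system can accurately recognize the Chinese digits "zero" (零) through "nine" (九) and ten common Chinese words. Compared with other lip-reading systems, the proposed system offers better stability and higher accuracy.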
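Step (1) above names a semi-random fixed-frame extraction strategy but does not define it. The sketch below shows one plausible reading, offered only as an assumption: the video is split into equal-width temporal segments and one frame index is drawn at random from each segment, so the number of frames is fixed and temporal order is preserved. The function name `semi_random_frame_indices` and its parameters are hypothetical and do not come from the thesis.

```python
import random


def semi_random_frame_indices(total_frames, num_samples, rng=None):
    """Return num_samples frame indices in increasing order.

    Assumed interpretation (not confirmed by the thesis): the frame range
    [0, total_frames) is cut into num_samples contiguous segments and one
    index is drawn uniformly at random from each segment.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    if total_frames < num_samples:
        raise ValueError("video has fewer frames than requested samples")
    rng = rng or random.Random()

    indices = []
    for k in range(num_samples):
        # Integer segment bounds; 'end' is exclusive. Because
        # total_frames >= num_samples, every segment holds at least one frame.
        start = k * total_frames // num_samples
        end = (k + 1) * total_frames // num_samples
        indices.append(rng.randrange(start, end))
    return indices


if __name__ == "__main__":
    # Example: choose 10 frames from a hypothetical 75-frame clip.
    print(semi_random_frame_indices(75, 10, random.Random(0)))
```

Drawing one frame per segment keeps coverage of the whole utterance while still varying the exact frames between epochs. The selected indices would then be passed to whatever face-detection and lip-cropping step the pipeline uses.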

【Abstract】 With the development of deep learning, human society has entered an era of artificial intelligence. Human-computer interaction, one of the technologies within artificial intelligence, has made great progress in recent years, and lip-reading, a form of human-computer interaction, has attracted increasing attention. However, most lip-reading research is carried out on English corpora, and Chinese lip-reading has seldom been addressed. This is mainly because Chinese is less influential than English, research on Chinese lip-reading started late, influential Chinese lip-reading datasets are lacking, and the recognition accuracy for Chinese words is not ideal. To address these problems, this thesis designs a Chinese lip-reading system based on deep learning. It aims to narrow the gap in the field of Chinese lip-reading, enrich Chinese lip-reading datasets, increase the influence of Chinese lip-reading, and make the system practical enough to serve Chinese users in the future. Our system uses ResNet50 with CBAM as the CNN and a GRU with Attention as the RNN, and fuses the two models. We then apply the fused model in our lip-reading system. Compared with 11 CNN-RNN fusion neural networks commonly used in the field of lip-reading, the deep learning network model designed in this thesis has the best performance and the highest stability. The specific contributions of this thesis are as follows:
(1) Preprocessing the original input video. We use a semi-random fixed-frame extraction strategy to obtain continuous video frames containing key information from the input video. We then perform face detection and lip localization on the extracted single-frame images to segment continuous lip-movement frame sequences, and use these continuous image sequences as a set of inputs.
(2) Improving the CNN for spatial feature extraction from images. In the CNN part, we select ResNet50 as the convolutional neural network for image feature extraction, and we improve its ResBlock in a novel way by adding CBAM. This enhances its ability to capture small differences between similar-sounding words in Chinese pronunciation and improves the performance of feature extraction during convolution.
(3) Improving the RNN for temporal feature extraction from images. In the RNN part, we choose a GRU with Attention, which helps to extract features across consecutive lip-motion images. Because the moments before and after the current one influence it during lip-reading, we assign more weight to key frames, which makes the features more representative.
(4) Building a Chinese lip-reading system. The deep learning networks used in the two preceding steps are fused, and the CNN-RNN fusion network is designed in an encoder-decoder form to process continuous lip-movement image sequences. By packaging the trained deep learning network model into the designed system and running experiments on the self-built dataset, we demonstrate that our Chinese lip-reading model can accurately recognize the Chinese numbers 0-9 and ten Chinese words. Compared with other lip-reading systems, our system has better stability and higher recognition accuracy.
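Contribution (3) above describes assigning more weight to key frames through an Attention mechanism on the GRU outputs, without giving the scoring function. The following is a minimal sketch of one common form, assumed here for illustration only: each per-frame hidden state is scored by a dot product with a query vector, the scores are normalized with a softmax, and the context vector is the weighted sum of the hidden states. The name `attention_pool` and the use of a single query vector are assumptions, not the thesis's stated design.

```python
import math


def attention_pool(hidden_states, query):
    """Weight per-frame hidden states and return (weights, context).

    hidden_states: list of T vectors, each of length D (e.g. GRU outputs).
    query: vector of length D; in a trained model this would be learned.
    Dot-product scoring is an assumed choice, not taken from the thesis.
    """
    if not hidden_states:
        raise ValueError("hidden_states must not be empty")

    # One scalar relevance score per frame.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]

    # Numerically stable softmax over the time axis.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the hidden states.
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return weights, context


if __name__ == "__main__":
    # Three hypothetical 2-dimensional frame features; the third aligns best
    # with the query and therefore receives the largest weight.
    frames = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
    weights, context = attention_pool(frames, [1.0, 0.0])
    print(weights, context)
```

In a full model the context vector would feed the classifier over the digit and word classes. Frames with low weights contribute little to it, which matches the abstract's description of suppressing redundant frames.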
