Research on Human Behavior Recognition Based on Spatial-Temporal Sequence Residual Network Fusion

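【Author】 Wang Kai (王凯)

【Advisor】 Gou Xinke (缑新科)

【Author Information】 Lanzhou University of Technology, Control Theory and Control Engineering, 2021, Master's thesis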

【Abstract】 The application and development of human behavior recognition technology have both theoretical and practical significance. In this field, convolutional neural networks currently achieve better recognition results than traditional methods, but they have limitations. When the input is a long video sequence, the network expresses the correlation among temporal features poorly, cannot extract temporal feature information effectively, and cannot capture the dependencies among human behavior features across the sequence, so important feature information is easily lost. This thesis studies an algorithm that fuses a three-dimensional residual convolutional neural network with a non-local convolutional network in order to extract features with dependencies from long video sequences and thereby improve the accuracy of human behavior recognition. Two methods are proposed.

1. First, a temporal segment network is used to sparsely sample the long video sequence, which reduces the amount of redundant information in the convolutional network and avoids the degradation of model performance caused by "inter-class differences"; the TV-L1 algorithm is used to extract, from adjacent RGB video frames, optical flow frames that express the temporal nature of the behavior. Second, the three-dimensional residual convolutional network is fused with the non-local convolutional network to build a new behavior recognition network structure; following the feature fusion principle of two-stream convolutional neural networks, the segmented RGB video frames and the stacked optical flow frames are fed into the convolutional networks separately for fusion training. Finally, the trained network model is validated. The analysis shows that the model has good spatial feature extraction and feature representation capabilities and can capture feature dependencies in long sequences, which improves the recognition accuracy for human behavior.

2. While retaining sparse sampling and the two-stream three-dimensional residual convolutional network, the second method uses RGB video frames in place of optical flow frames, so as to avoid the considerable time needed to compute optical flow. The input dimensions of the spatial network and of the temporal network are changed, the features extracted by the temporal network are laterally connected to the spatial network, and a non-local convolutional neural network is combined for feature recognition. Classification is finally performed by the softmax function, completing video-based human behavior recognition in an end-to-end manner.

Both algorithms are trained and validated on the UCF101 and HMDB51 datasets, and the experimental results are compared with those of other human behavior recognition algorithms; the comparison shows that the spatial-temporal sequence residual network fusion model is efficient in behavior recognition. Finally, an end-to-end human behavior recognition system is designed around the proposed model. The system is built according to its workflow and a software framework is developed, providing functions such as video reading, video frame computation, and real-time testing, and realizing visual recognition of human behavior.
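The sparse sampling step mentioned in method 1 can be illustrated with a short sketch. The abstract does not state the exact sampling rule, so the code below assumes the common segment-based scheme: the video is divided into equal-length segments and one frame index is drawn from each (randomly during training, the segment center during testing). The function name `sample_segment_indices` and its parameters are hypothetical and are not taken from the thesis.

```python
import random


def sample_segment_indices(num_frames, num_segments, train=True, seed=None):
    """Return one frame index per segment (assumed segment-based sparse sampling).

    The video is split into `num_segments` roughly equal segments. In training
    mode a random frame is drawn from each segment; otherwise the center frame
    of each segment is used.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")

    rng = random.Random(seed)
    indices = []
    for k in range(num_segments):
        start = k * num_frames // num_segments        # inclusive
        end = (k + 1) * num_frames // num_segments    # exclusive
        if end <= start:
            # Video has fewer frames than segments: reuse a valid frame.
            indices.append(min(start, num_frames - 1))
        elif train:
            indices.append(rng.randrange(start, end))
        else:
            indices.append((start + end - 1) // 2)
    return indices


if __name__ == "__main__":
    # A 300-frame clip reduced to 3 representative frames.
    print(sample_segment_indices(300, 3, train=False))
```

With this kind of sampling, the number of frames passed to the network depends on the number of segments rather than on the video length, which is how sparse sampling reduces redundant input for long sequences.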
