
基于深度学习的视频人体动作识别

Video Based Human Action Recognition with Deep Learning

【Author】 Wang Yunfeng (王云峰)

【Supervisors】 Li Houqiang; Zhou Wengang

【Author Information】 University of Science and Technology of China, Information and Communication Engineering, 2018, Master's degree

【Abstract (translated from Chinese)】 With the dramatic increase in computing power and the emergence of large-scale labeled image and video datasets, deep learning has achieved great success on many computer vision tasks (e.g., image classification, semantic segmentation, and object detection). In video action recognition, deep architectures based on multi-stream networks and 3D convolutional neural networks have achieved state-of-the-art performance. However, because commonly used network structures contain no module that explicitly learns contextual information and visual attributes in videos, these deep learning algorithms insufficiently model the regions and cues that matter most for action recognition. Moreover, when an action occurs within a video is not fixed, and how to effectively direct the network's attention to the regions that contain action remains an open problem in video action recognition, with relatively little existing work. To address these problems, this thesis makes the following contributions. First, we propose a two-stream convolutional neural network with a semantic attention model, which incorporates the learning of contextual information into the network and improves action recognition in videos. Multi-stream ConvNets are a widely used class of deep learning methods for video action recognition: they first learn features of several domains or modalities separately and then aggregate the information through feature fusion. Videos, however, also contain rich contextual and semantic information that, if exploited properly, can effectively aid human action recognition. On top of a basic multi-stream network, we add a semantic attention module built from contextual information: an object detection algorithm produces context region proposals, which are fed into an ROI-pooling layer and included in network training; the response maps of these regions are then passed through fully connected layers, and a weighted sum yields the final recognition probability. Second, we propose a 3D convolutional network with visual attribute mining, which learns a video representation for action recognition and resolves the misclassification of videos whose spatial and temporal patterns are both very similar. 3D ConvNets, which learn spatial and temporal information jointly, are widely used in video understanding and analysis. Although they achieve excellent performance in video action recognition, the lack of explicit learning of visual attributes means a 3D ConvNet cannot distinguish certain video classes whose overall spatial appearance and temporal motion patterns are alike. To solve this problem, we propose an algorithm that improves 3D convolution through the mining of Visual Attributes: mature object detection algorithms and techniques from natural language processing discover useful visual attributes in videos; the attributes are then associated with the videos, and the network is trained to recognize them. Finally, we propose a generalized attentional pooling model, which performs action recognition with a convolutional network containing an attention module, increasing the network's expressive power and broadening its applicable scenarios. In a video, an action is a pattern or motion that lasts for some time and whose onset is uncertain, while most video clips contain no action; an attention model can therefore discover the segments that contain action and the spatial locations where the action occurs. Based on this, we propose Generalized Attentional Pooling (GAP), which approximates second-order pooling with low-rank nonlinear operations; as an attention model, our method further improves recognition performance when combined with the human keypoint data provided in the dataset. Experiments show that our method is highly complementary to human keypoint recognition. Through these three studies, this work investigates, within the common deep-learning-based video analysis framework, the weighting of semantic information in videos, the mining of visual attributes, and attention models for video. These three experiments verify the feasibility and effectiveness of explicitly learning the key content of videos.
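The semantic attention pipeline described in the abstract (context proposals → ROI pooling → fully connected scoring → weighted addition of region responses) can be sketched at its final fusion step. The following is a minimal numpy illustration, not the thesis code: the function names are hypothetical, and the choice of a softmax over region attention logits is an assumption made for the sketch.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_context_scores(global_logits, region_logits, region_attn_logits):
    """Combine whole-frame logits with attention-weighted context regions.

    global_logits:      (C,)   logits from the base two-stream network.
    region_logits:      (R, C) per-class logits for R detected context
                               regions (after ROI pooling + FC layers).
    region_attn_logits: (R,)   unnormalized attention over the regions.
    Returns final class probabilities of shape (C,).
    """
    attn = softmax(region_attn_logits)       # (R,) region weights
    context = attn @ region_logits           # (C,) weighted sum of regions
    return softmax(global_logits + context)  # weighted addition, then softmax
```

With zero global logits and identical region logits, the attention weights are uniform and the output reduces to a uniform class distribution, which is a quick sanity check of the fusion step.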

【Abstract】 With the boost of computation power from GPUs and the appearance of large-scale labeled visual data, deep learning has achieved state-of-the-art performance on many tasks in computer vision. In the field of video-based action recognition, multi-stream ConvNets and 3D ConvNets have achieved promising results. However, since there is no explicit modeling of contexts and cues in videos, it is hard for a ConvNet to exploit this useful information. Besides, since the temporal location of an action in a video is uncertain, it is important to use an attention model to focus on the most informative duration of the video. To solve these issues, we present three works here. First, we propose a semantic attention model based on a multi-stream framework, aiming to exploit contexts and cues in videos to improve action recognition. Multi-stream ConvNets are a type of deep learning model widely used in action recognition tasks: by first learning the features of several modalities separately and then combining them in different ways, information from multiple domains is merged. However, videos also contain informative regions and objects that can help recognize the action. We first use object detection methods to find candidate objects and cues, then add these contexts to the ConvNet through an ROI-pooling layer, and finally use fully connected and softmax layers to weight the responses from the context regions and determine the action in the video. Second, we propose a visual-attribute-based 3D convolutional neural network to recognize actions in videos. Another popular ConvNet architecture for action recognition is the 3D ConvNet, which expands the convolution and pooling kernels to three dimensions in order to jointly learn spatial and temporal structure in videos. Although I3D, the state-of-the-art 3D ConvNet, achieves very high accuracy on video action recognition benchmarks, its lack of explicit learning of visual attributes makes it difficult for I3D to distinguish videos that are similar in both spatial and temporal patterns. To solve this problem, we propose a visual attribute mining and classification structure to learn attributes in videos explicitly. Together with the I3D ConvNet, it improves recognition accuracy on the UCF101 and HMDB51 datasets. Finally, we propose a generalized attentional pooling model to recognize actions. In a video, an action is a motion pattern that lasts for an uncertain period of time, while most video clips contain no action; an attention model can therefore discover the active segments and the spatial locations of actions in the video. Based on this, we propose a generalized attentional pooling model that uses low-rank nonlinear operations to approximate second-order pooling; as an attention model, our method further improves recognition performance after incorporating human-body keypoint data. Experiments show that our method is highly complementary to human keypoint recognition. In summary, this work studies the weighting of semantic information in videos, the mining of visual attributes, and attention models for video under a common deep-learning-based video analysis framework. Through these three experiments, this thesis verifies the feasibility and effectiveness of explicitly learning the key content of videos.
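The low-rank approximation behind generalized attentional pooling can be made concrete. Full second-order pooling of a flattened feature map X (N locations × f channels) forms the f×f matrix XᵀX and applies a per-class classifier matrix W; under a rank-1 factorization W ≈ a bᵀ, the class score Tr(Wᵀ XᵀX) = aᵀXᵀXb collapses into the dot product of two per-location attention maps, Xa (class-specific, top-down) and Xb (class-agnostic, bottom-up), so the f×f matrix is never materialized. A minimal numpy sketch under those assumptions, with hypothetical names:

```python
import numpy as np

def generalized_attentional_pooling(X, a, b):
    """Class score via a rank-1 approximation of second-order pooling.

    X: (N, f) flattened spatial feature map (N locations, f channels).
    a: (f,)   class-specific ("top-down") attention weights.
    b: (f,)   class-agnostic ("bottom-up") attention weights.
    Equivalent to Tr(W.T @ X.T @ X) with W = outer(a, b), but computed
    as the dot product of two per-location attention maps.
    """
    top_down = X @ a    # (N,) class-specific attention per location
    bottom_up = X @ b   # (N,) saliency per location
    return top_down @ bottom_up  # scalar class score
```

Computing one score this way costs O(N·f) instead of the O(N·f²) of explicit second-order pooling, which is the efficiency argument for the low-rank form.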

  • 【Classification Number】TP391.41; TP181
  • 【Citations】6
  • 【Downloads】870