Research on Image Captioning Algorithms Based on Deep Learning
【Author】 Zhang Lei (张磊)
【Supervisor】 Wang Shuang (王爽)
【Author Information】 Xidian University, Pattern Recognition and Intelligent Systems, 2020, Master's degree
【Abstract】 Image captioning, as the intersection of computer vision and natural language processing, has broad prospects in many directions such as image retrieval and scene understanding. The task is to identify the objects in an image and the relationships between them, and to describe them in natural, fluent language. In recent years, with the rise of deep learning, the field has developed rapidly: generated captions have become increasingly accurate, and caption styles have grown more diverse. This thesis studies the image captioning problem, surveys the current state of the field, and proposes several captioning methods based on the encoder-decoder architecture. The main work is as follows:

(1) An image captioning method based on a triple LSTMs (Tri-LSTMs) model. Captions generated by most current algorithms are too coarse to describe the details in an image. To address this, a Tri-LSTMs model with a cascading attention mechanism is first built; it uses both semantic attributes and image features to guide caption generation, effectively enriching the guidance information. An image-text self-retrieval model based on a bidirectional margin loss function is then designed to supervise the training of the Tri-LSTMs model. For each caption generated by Tri-LSTMs, the retrieval model computes the matching degree of matched and mismatched image-caption pairs; feeding these scores back to the captioning model yields more discriminative captions. The method attends to finer image details while preserving caption accuracy.

(2) An image captioning method based on a double preface LSTMs (DP-LSTMs) model. To address the decoder's insufficient modeling of long-term dependencies within the encoder-decoder architecture, a two-layer preface attention mechanism is introduced. At each time step, weights are assigned to the decoder states of all preceding steps: states more relevant to the current step receive larger weights and thus exert greater influence on it. The decoder therefore no longer depends only on the previous step's state, which effectively strengthens long-term dependencies, mitigates the information loss caused by overly long sequences, and improves the quality of the generated captions.

(3) A remote sensing image captioning method based on a triple semantic LSTMs (TS-LSTMs) model. A semantic attribute database is first established, and a classifier predicts, for each remote sensing image, the probability that each semantic attribute appears in it. The attributes most relevant to the image, i.e. those with the highest predicted probabilities, are selected; their word-vector representations are extracted and fed into the decoder's input layer, output layer, and other positions as guidance information for caption generation. The method produces accurate captions on multiple remote sensing captioning datasets and leads current remote sensing captioning methods on multiple metrics.
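The thesis does not give implementation details for the bidirectional margin loss used by the self-retrieval model in method (1). As a minimal illustrative sketch (function name, cosine-similarity scoring, and the hinge formulation are assumptions, not the thesis's exact design), a triplet-style ranking loss applied in both retrieval directions could look like:

```python
import numpy as np

def bidirectional_margin_loss(img_emb, cap_emb, margin=0.2):
    """Hinge ranking loss in both directions: image->caption and caption->image.

    img_emb, cap_emb: (B, d) embeddings where row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    cap_emb = cap_emb / np.linalg.norm(cap_emb, axis=1, keepdims=True)
    scores = img_emb @ cap_emb.T          # (B, B); [i, j] = sim(image i, caption j)
    pos = np.diag(scores)[:, None]        # matched-pair similarities
    # push every mismatched pair's score below the matched score minus the margin
    cost_i2c = np.maximum(0.0, margin + scores - pos)    # image -> caption direction
    cost_c2i = np.maximum(0.0, margin + scores - pos.T)  # caption -> image direction
    np.fill_diagonal(cost_i2c, 0.0)       # matched pairs incur no cost
    np.fill_diagonal(cost_c2i, 0.0)
    return cost_i2c.sum() + cost_c2i.sum()
```

Feeding this loss back during training penalizes captions that retrieve the wrong image, which is one way the retrieval signal can make captions more discriminative.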
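The preface attention in method (2) weights all earlier decoder states rather than only the last one. A minimal single-layer sketch (the bilinear scoring matrix `W` and the function names are illustrative assumptions; the thesis uses a two-layer variant) might be:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def preface_attention(history, h_t, W):
    """Attend over all previous decoder states instead of only the last one.

    history: (t, d) stacked decoder states h_1..h_t from earlier time steps
    h_t:     (d,) current decoder state
    W:       (d, d) learned scoring matrix (illustrative parameterization)
    Returns a (d,) context vector: a relevance-weighted sum of past states.
    """
    scores = history @ (W @ h_t)   # relevance of each past state to the current step
    weights = softmax(scores)      # larger weight = stronger influence on this step
    return weights @ history       # context vector fed back into the decoder
```

Because the context vector mixes in every earlier state, information from the start of a long sequence can still reach the current step directly, which is the long-term-dependency benefit the abstract describes.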
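For method (3), the attribute-selection step reduces to taking the top-k attributes by predicted probability and looking up their word vectors. A sketch under assumed names (the attribute vocabulary, the random stand-in word vectors, and both function names are hypothetical; in the thesis the attributes come from the semantic attribute database and the vectors from trained embeddings):

```python
import numpy as np

# Hypothetical attribute vocabulary with random stand-in word vectors
attributes = ["airplane", "runway", "building", "tree", "river"]
word_vecs = {a: np.random.default_rng(i).standard_normal(8)
             for i, a in enumerate(attributes)}

def top_k_attributes(probs, k=3):
    """Pick the k attributes the classifier deems most likely present in the image."""
    order = np.argsort(probs)[::-1][:k]   # indices sorted by descending probability
    return [attributes[i] for i in order]

def guidance_vectors(probs, k=3):
    """Word vectors of the selected attributes, to be fed into decoder layers."""
    return np.stack([word_vecs[a] for a in top_k_attributes(probs, k)])
```

The resulting `(k, d)` matrix is what would be injected at the decoder's input and output layers as guidance information.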
【Key words】 image captioning; semantic attributes; self-retrieval; attention mechanism;