Key Region Awareness for Image-Guided Story Ending Generation
【Author】 Li Zhigang (李志刚)
【Advisor】 Huang Qingbao (黄清宝)
【Author Information】 Guangxi University, Electrical Engineering, 2023, Master's
【Abstract】 Image-guided story ending generation (IgSEG) is a multimodal generation task: given a story context and an ending-related image, the model must generate a one-sentence ending that is consistent with both the logic of the context and the content of the image. The task has considerable research and practical value. Existing models have achieved some success by fusing global image features with the story context through an attention mechanism. However, they do not model the logical relationship between the story context and individual image regions, and they ignore high-level semantic information in the image, such as visual sentiment. As a result, the generated ending may be inconsistent with the given text and image in logic or sentiment. In this thesis, we therefore aim to generate endings that are both logical and emotionally expressive by locating the key image region that conforms to the development trend of the story context and by introducing the sentiment features of the image.

To this end, we propose a graph-based Key Region Awareness (KRA) model for IgSEG, which consists mainly of a graph matching mechanism, a knowledge extractor and filter, and an image sentiment extractor. The graph matching mechanism obtains the features of the key image region that conforms to the development trend of the story context. Specifically, we design a knowledge extractor and filter to obtain a knowledge graph for each keyword in the story context, and use a scene graph parser to construct a scene graph for the input image. By comparing the keyword knowledge graphs with the image scene graph one by one, in order, we obtain a scene subgraph. The image region corresponding to this subgraph is the key region, and the features in the subgraph serve as its image features. In addition, an image sentiment extractor obtains the sentiment features of this region. The KRA model can thus capture both the content and the sentiment features of the key region and generate more logical and emotionally expressive story endings. Experimental results show that the KRA model outperforms the baseline models.

Our analysis identified two limitations of the KRA model: 1) its graph matching mechanism is a mechanical comparison process that is neither flexible nor effective enough; 2) it uses only the features of the key region and ignores the rest of the image, so the image information is not fully exploited. We therefore improve on KRA and propose a Multi-Granularity feature Fusion (MGF) model for IgSEG. In the MGF model, we first obtain image sentiment features and treat them as part of the global image features. We then design a scene subgraph selector that captures the key region by selecting the scene subgraph most relevant to the story context. Finally, we fuse textual and visual features at the object, region, and global levels. The MGF model can thus make full use of the given textual and visual information and generate logical, emotionally expressive, and informative story endings. Experimental results show that the MGF model achieves the best performance to date on both automatic and human evaluation metrics.
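
The sequential comparison between keyword knowledge graphs and the image scene graph can be illustrated with a toy sketch. The abstract does not specify the matching rule, so the following makes several assumptions: each keyword knowledge graph is reduced to a set of related concept names, the scene graph is a list of (subject, relation, object) triples, and a triple counts as a match when its subject or object appears among the concepts. The function name `match_scene_subgraph` and all data are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def match_scene_subgraph(
    keyword_graphs: Dict[str, Set[str]],
    scene_graph: List[Triple],
) -> List[Triple]:
    """Keep scene-graph triples that touch a concept from any keyword graph.

    Assumption: keywords are visited in story order (dict insertion order),
    so the subgraph is accumulated following the progression of the context.
    """
    subgraph: List[Triple] = []
    for keyword, neighbours in keyword_graphs.items():
        # The keyword itself plus its knowledge-graph neighbours.
        concepts = {keyword} | neighbours
        for triple in scene_graph:
            subj, _, obj = triple
            if (subj in concepts or obj in concepts) and triple not in subgraph:
                subgraph.append(triple)
    return subgraph


# Toy inputs, invented for illustration only.
keyword_graphs = {
    "dog": {"pet", "animal", "leash"},
    "park": {"grass", "tree", "bench"},
}
scene_graph = [
    ("dog", "on", "grass"),
    ("man", "holding", "leash"),
    ("car", "near", "building"),
]

key_subgraph = match_scene_subgraph(keyword_graphs, scene_graph)
# Objects whose image regions would together form the "key region".
key_objects = sorted({node for s, _, o in key_subgraph for node in (s, o)})
```

In this toy case the triple about the car is discarded because neither of its nodes relates to a story keyword. A rule-based comparison of this kind is rigid, which is in the spirit of the limitation that the abstract gives as motivation for replacing graph matching with a scene subgraph selector in the MGF model.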
- 【Online Publication Contributor】 Guangxi University 【Online Publication Year/Issue】 2025, Issue 01
- 【Classification Number】 TP391.41