Research on Relation Extraction Based on Distantly Supervised Data
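【Author】 Wang Haitao (王海涛);
【Advisor】 Chen Wenliang (陈文亮);
【Author information】 Soochow University (苏州大学), Computer Science and Technology, 2021, Master's degree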
【Abstract】 Relation extraction is a key step in knowledge graph construction and has significant research value and application prospects. As a subtask of information extraction, it aims to extract the relations between two or more entities from text. Depending on the number of entities involved, it is further divided into binary relation extraction and n-ary relation extraction. Supervised relation extraction is widely used because of its strong performance, but it still suffers from a shortage of labeled data. Distant supervision can quickly generate large amounts of labeled data, but such data inevitably contains labeling errors; errors in the test data in particular lead to misleading evaluation when models are compared. In addition, given the breakthroughs of pre-trained language models on other natural language processing tasks, how to use them more effectively in binary relation extraction needs further study, and document-level text poses considerable challenges for n-ary relation extraction. To address these problems, the main contents of this thesis are as follows.

(1) Constructing relation extraction data by distant supervision. Based on Chinese Baidu Encyclopedia data, this thesis uses distant supervision to construct a large-scale relation extraction dataset (IPRE) oriented to binary relations between people. To counter the misleading evaluation caused by labeling errors in distantly supervised data, it proposes manually annotating the development and test sets, which yields high-quality evaluation data at a relatively low cost. Furthermore, taking n-ary relations in the financial domain as an example, it describes how to construct an n-ary relation extraction dataset by distant supervision, and introduces and analyzes a related public dataset.

(2) Binary relation extraction based on distant supervision. In light of the characteristics of the relation scheme in the IPRE data, this thesis proposes a relation extraction method based on a language model and two-step classification. The main idea is to encode the text with a pre-trained language model and to use a two-step classification strategy for model training and prediction. Different text encoding schemes are designed for the sentence-level and bag-level relation extraction tasks so as to take full advantage of the pre-trained language model's strength in text representation. In addition, entity-pair features are used to strengthen the language model's awareness and understanding of the entities whose relation is being extracted. Experimental results show that the proposed method is more effective than the baseline models.

(3) N-ary relation extraction based on distant supervision. In this thesis, an n-ary relation is regarded as a kind of event, so n-ary relation extraction is treated as event extraction. Because entities in document-level text are often scattered across multiple sentences, this thesis proposes a prior-information-enhanced event extraction method that can handle document-level event extraction effectively. The main idea is to decompose document-level event extraction into three subtasks: event detection (event type identification), event argument extraction, and event table filling. The events mentioned in the text are identified first; the corresponding event arguments are then extracted from each sentence; finally, an event table filling strategy produces the document-level extraction result. During event argument extraction, the results of event detection are used as prior information, and different pre-trained language models are tried, which improves argument extraction performance. The method was validated on the dataset of a public evaluation task, and the experimental results show that it is effective.

In summary, this thesis studies relation extraction based on distant supervision from the perspectives of datasets, binary relations, and n-ary relations, and has achieved some preliminary results. We hope that this research will contribute to the development of relation extraction and other natural language processing tasks.
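The distant supervision step mentioned in part (1) can be illustrated with a minimal sketch. It is not the thesis's IPRE pipeline: the toy knowledge base `KB`, the function names `distant_label` and `group_into_bags`, and the naive substring matching of entity names are all assumptions made for illustration. The sketch shows how aligning knowledge-base pairs with sentences yields labels automatically, and why some of those labels are wrong.

```python
from collections import defaultdict

# Hypothetical toy knowledge base: (head, tail) -> relation. Not taken from IPRE.
KB = {
    ("Alice", "Bob"): "spouse",
    ("Alice", "Carol"): "parent",
}

def distant_label(sentences, kb, na_label="NA"):
    """Label every ordered entity pair that co-occurs in a sentence.

    A pair found in the knowledge base receives its relation regardless of
    what the sentence actually says; any other pair receives `na_label`.
    Entities are matched by plain substring search (an assumption).
    """
    entities = sorted({e for pair in kb for e in pair})
    instances = []
    for sent in sentences:
        present = [e for e in entities if e in sent]
        for head in present:
            for tail in present:
                if head != tail:
                    relation = kb.get((head, tail), na_label)
                    instances.append((sent, head, tail, relation))
    return instances

def group_into_bags(instances):
    """Group sentences by entity pair, as used in bag-level extraction."""
    bags = defaultdict(list)
    for sent, head, tail, _relation in instances:
        bags[(head, tail)].append(sent)
    return dict(bags)

sentences = [
    "Alice married Bob in 2010.",
    "Alice and Bob attended the same school.",  # does not express "spouse"
    "Alice took Carol to the park.",
]

instances = distant_label(sentences, KB)
bags = group_into_bags(instances)

for sent, head, tail, relation in instances:
    print(f"{head} -> {tail}: {relation:7s} | {sent}")
```

The second sentence is labeled `spouse` even though it says nothing about marriage. This is the kind of labeling error that motivates manually annotating the development and test sets.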
【Key words】 Distant Supervision; Dataset; Relation Extraction; Event Extraction;
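The three-subtask decomposition described in part (3) of the abstract can be sketched as a small pipeline. This is only a rough illustration under stated assumptions. Keyword triggers and regular expressions stand in for the pre-trained language models used in the thesis. The `EquityPledge` schema, its role names, and the first-value-wins filling rule are all hypothetical, since the abstract does not specify the actual table filling strategy.

```python
import re

# Hypothetical event schema: event type -> argument roles (illustrative only).
SCHEMA = {"EquityPledge": ["pledger", "pledgee", "shares"]}

# Hypothetical trigger words standing in for a learned event-type classifier.
TRIGGERS = {"EquityPledge": ["pledge"]}

# Hypothetical per-type role patterns standing in for a learned argument tagger.
PATTERNS = {
    "EquityPledge": {
        "pledger": re.compile(r"shareholder (\w+)"),
        "pledgee": re.compile(r"to (\w+ Bank)"),
        "shares": re.compile(r"([\d,]+) shares"),
    }
}

def detect_event_types(document):
    """Subtask 1: decide which event types the document mentions."""
    text = " ".join(document).lower()
    return [t for t, words in TRIGGERS.items() if any(w in text for w in words)]

def extract_arguments(sentence, event_type):
    """Subtask 2: sentence-level argument extraction.

    The detected event type acts as prior information: only the roles of
    that type are searched for in the sentence.
    """
    found = {}
    for role, pattern in PATTERNS[event_type].items():
        match = pattern.search(sentence)
        if match:
            found[role] = match.group(1)
    return found

def fill_event_table(document, event_type):
    """Subtask 3: merge sentence-level arguments into one document-level record.

    Assumed strategy: the first value found for a role is kept.
    """
    record = {role: None for role in SCHEMA[event_type]}
    for sentence in document:
        for role, value in extract_arguments(sentence, event_type).items():
            if record[role] is None:
                record[role] = value
    return record

def extract_events(document):
    """Run the three subtasks in order on a list of sentences."""
    return {t: fill_event_table(document, t) for t in detect_event_types(document)}

document = [
    "The company announced that shareholder Zhang pledged part of his holdings.",
    "A total of 1,200,000 shares were pledged to Horizon Bank.",
]
events = extract_events(document)
print(events)
```

In this example the pledger appears in the first sentence and the pledgee and share count in the second, so only the table filling step produces a complete record. This mirrors the abstract's point that arguments in document-level text are scattered across sentences.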