历史知识图谱的实体关系挖掘方法

Entity Relation Mining Method in Historical Knowledge Graph

【Author】 Zhang Fan (张帆)

【Advisor】 Wang Xiaolong (王晓龙)

【Author information】 Harbin Institute of Technology, Computer Science and Technology, 2019, Master's thesis

【Abstract】 As the Internet grows, so does the volume of data it holds. Most of that data is stored as text, so extracting structured information from text effectively is an important problem. Entity relation extraction, a key component of information extraction, turns unstructured natural language text into structured form and underpins natural language applications such as question answering systems and knowledge graphs. Traditional relation extraction methods, however, usually require manually annotated data and hand-selected features before training, and defining the relation types requires help from domain experts. This costs considerable labor and time, so obtaining entity relations at lower cost becomes especially important. To address these problems, this thesis uses distant supervision, deep learning, and natural language processing techniques to design two algorithms for entity relation mining in the history domain. The historical data were collected from Baidu Baike, Wikipedia, textbooks, and general-purpose knowledge graphs. No public dataset with high coverage of relation types yet exists for relation mining in the history domain, and manually predefined relation types tend to be biased and incomplete. To address this, the thesis proposes a rule-matching method for historical entity relation extraction that extracts relation indicator words from unstructured text, which avoids predefining relation types by hand. Special syntactic processing for historical text and a logistic regression model are added to improve the accuracy of the extracted relation triples. To address the high cost of manual annotation, distant supervision is used to label the training data automatically, but distant supervision introduces two problems of its own: intra-sentence noise and labeling errors. To solve them, the thesis proposes a fused relation extraction model based on the shortest dependency path (SDP), BiGRU, and APCNNs. The SDP filters intra-sentence noise and shortens the sentence, effectively addressing the intra-sentence noise problem. APCNNs contribute sentence-level attention and piecewise max pooling, which weaken the effect of wrong labels on relation extraction. BiGRU is added at the vector representation stage to learn the context of each word, giving the model more features and improving its accuracy. Experiments show that the fused SDP, BiGRU, and APCNNs model performs well on the historical training corpus built by distant supervision.
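The techniques named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: every function name, the toy knowledge base, the example sentences, and the hand-written dependency edges are assumptions made for illustration. The attention score is also simplified to a plain dot product with a hypothetical relation query vector, and the segment boundaries in the pooling step are one common convention among several.

```python
import math
from collections import deque


def distant_label(sentences, kb):
    """Distant supervision: any sentence mentioning both entities of a
    knowledge-base pair is labeled with that pair's relation (noisy by design)."""
    bags = {}
    for sent in sentences:
        for (head, tail), relation in kb.items():
            if head in sent and tail in sent:
                bags.setdefault((head, tail, relation), []).append(sent)
    return bags


def shortest_dependency_path(edges, source, target):
    """Breadth-first search over an undirected dependency graph; the tokens
    on the returned path are kept, everything else is treated as noise."""
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, []).append(dep)
        graph.setdefault(dep, []).append(head)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return []


def piecewise_max_pool(features, head_pos, tail_pos):
    """Split one filter's per-token outputs at the two entity positions and
    take the max of each of the three segments (0.0 for an empty segment)."""
    first, second = sorted((head_pos, tail_pos))
    segments = [features[:first + 1],
                features[first + 1:second + 1],
                features[second + 1:]]
    return [max(seg) if seg else 0.0 for seg in segments]


def attention_pool(sentence_vectors, relation_query):
    """Sentence-level attention over one bag: softmax of dot-product scores,
    then a weighted average, so poorly matching sentences count for less."""
    scores = [sum(a * b for a, b in zip(vec, relation_query))
              for vec in sentence_vectors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(sentence_vectors[0])
    pooled = [sum(w * vec[i] for w, vec in zip(weights, sentence_vectors))
              for i in range(dim)]
    return weights, pooled


# Hypothetical toy data: the second sentence mentions both entities but does
# not express the relation, which is exactly the labeling noise described.
kb = {("Li Shimin", "Tang dynasty"): "emperor_of"}
sentences = [
    "Li Shimin was the second emperor of the Tang dynasty.",
    "Li Shimin was born before the Tang dynasty was founded.",
    "The Tang dynasty ended in 907.",
]
bags = distant_label(sentences, kb)

# Hand-written (head, dependent) edges standing in for a real parser's output.
edges = [("emperor", "Li Shimin"), ("emperor", "was"), ("emperor", "second"),
         ("emperor", "dynasty"), ("dynasty", "Tang"),
         ("Li Shimin", "born"), ("born", "598")]
path = shortest_dependency_path(edges, "Li Shimin", "dynasty")
```

In this sketch the bag for the pair holds both the correct and the noisy sentence; attention pooling is the step meant to down-weight the noisy one, while the dependency path drops tokens such as "born" and "598" that lie off the route between the two entities.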

  • 【Classification number】TP391.1
  • 【Citation count】1
  • 【Download count】138