面向裁判文书的关系抽取算法和数据增强研究
Research on Relation Extraction Algorithm and Data Augmentation for Judgment Documents
【Author】 Zhu Xu (朱旭);
【Author information】 Tianjin University, Master of Engineering (professional degree), 2022, Master's thesis
【摘要】 随着人工智能的不断发展,人工智能辅助司法已成为研究热门。通过自然语言处理技术对裁判文书进行关系抽取是一项重要任务,该任务对信息化司法系统和知识图谱的构建具有重要意义。目前已有一些研究针对裁判文书开展关系抽取工作,但是由于精标注数据集资源较少和文书信息存在实体重叠的三元组,面向裁判文书的关系抽取算法仍存在改良空间。为此,本文从裁判文书人物属性信息数据集的构建、关系抽取算法和数据增强三个方面开展研究工作,主要成果包括:(1)针对裁判文书篇幅长的特点,设计了一种构建裁判文书人物属性信息数据集的方法,其包括五个步骤:预处理、分句、文本分类、分段和信息标注。利用该方法和少量的裁判文书完成数据集的构建工作。(2)针对裁判文书人物属性信息存在实体重叠三元组的现象,提出一种基于二阶段分层联合标注的关系抽取模型。模型采用一种先标注主体、再标注多个关系下对应客体的分层联合结构,结合改进的半指针-半标注序列标注方式和二阶段客体抽取方法实现三元组的抽取。新模型在四个数据集上的实验结果显示,与对比模型相比,新模型的F1值均有提升,证明了新模型的优越性。(3)针对裁判文书精标注数据集资源较少的问题,提出一种基于注意力机制的数据增强方法。首先将关系抽取数据集按照文本分类数据集的格式进行转化;然后利用该数据集对文本分类模型进行训练;最后使用文本分类模型获取文本中每个词语的注意力权重,根据注意力权重提取句子中对句意影响程度较深的词语,并将其按序连接组成新文本。实验结果显示所提方法能够提升关系抽取模型的性能,证明了所提方法有助于解决数据集资源较少的问题。
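Contribution (2) of the abstract describes labeling the subject first and then labeling the objects under each relation, using a half-pointer, half-tagging scheme. The following is a minimal sketch of how such a scheme is commonly decoded, not the thesis's actual model: it assumes the network has already produced per-token start and end probabilities, and it does not reproduce the "improved" tagging variant or the two-phase object extraction, whose details the abstract does not give. The names `decode_spans`, `decode_triples`, the 0.5 threshold, and the example relation labels are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Span = Tuple[int, int]  # inclusive token indices (start, end)
Probs = Tuple[Sequence[float], Sequence[float]]  # (start_probs, end_probs)


def decode_spans(start_probs: Sequence[float], end_probs: Sequence[float],
                 threshold: float = 0.5) -> List[Span]:
    """Pair each start position above the threshold with the nearest
    end position at or after it that is also above the threshold."""
    spans: List[Span] = []
    for i, p_start in enumerate(start_probs):
        if p_start < threshold:
            continue
        for j in range(i, len(end_probs)):
            if end_probs[j] >= threshold:
                spans.append((i, j))
                break
    return spans


def decode_triples(tokens: Sequence[str],
                   subj_probs: Probs,
                   obj_probs: Dict[Span, Dict[str, Probs]],
                   threshold: float = 0.5) -> List[Tuple[str, str, str]]:
    """Hierarchical decoding: subjects first, then objects per relation.

    obj_probs[subject_span][relation] holds the object start/end
    probabilities conditioned on that subject (assumed precomputed).
    """
    def text(span: Span) -> str:
        return " ".join(tokens[span[0]:span[1] + 1])

    triples = []
    for s_span in decode_spans(subj_probs[0], subj_probs[1], threshold):
        for relation, (o_start, o_end) in obj_probs.get(s_span, {}).items():
            for o_span in decode_spans(o_start, o_end, threshold):
                triples.append((text(s_span), relation, text(o_span)))
    return triples


# Toy example: one subject shared by two triples (overlapping entities).
tokens = ["Zhang", "San", "born", "1980", "lives", "Tianjin"]
subj_probs = ([0.9, 0.1, 0.0, 0.0, 0.0, 0.0], [0.1, 0.8, 0.0, 0.0, 0.0, 0.0])
obj_probs = {
    (0, 1): {
        "birth_year": ([0.0, 0.0, 0.0, 0.9, 0.0, 0.0], [0.0, 0.0, 0.0, 0.9, 0.0, 0.0]),
        "residence": ([0.0, 0.0, 0.0, 0.0, 0.0, 0.7], [0.0, 0.0, 0.0, 0.0, 0.0, 0.9]),
    }
}
triples = decode_triples(tokens, subj_probs, obj_probs)
print(triples)
```

Because each relation has its own pair of start/end taggers conditioned on the subject, the same subject can appear in several triples, which is how this family of schemes accommodates overlapping entities.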
【Abstract】 With the continued development of artificial intelligence, AI-assisted justice has become an active research topic. Relation extraction from judgment documents using natural language processing is an important task that supports the construction of judicial information systems and knowledge graphs. Some studies have already applied relation extraction to judgment documents, but because fine-labeled datasets are scarce and the documents contain triples with overlapping entities, relation extraction algorithms for judgment documents still leave room for improvement. This thesis therefore studies three aspects: the construction of a dataset of person attribute information from judgment documents, the relation extraction algorithm, and data augmentation. The main contributions are: (1) Because judgment documents are long, this thesis designs a method for constructing a dataset of person attribute information from judgment documents. The method has five steps: preprocessing, sentence segmentation, text classification, paragraph segmentation, and information labeling. With this method, the dataset is built from a small number of judgment documents. (2) To handle triples with overlapping entities in person attribute information, this thesis proposes a relation extraction model based on two-phase hierarchical joint labeling. The model uses a hierarchical joint structure that first labels the subject and then labels the corresponding objects under each of multiple relations, and it extracts triples by combining an improved half-pointer, half-tagging sequence labeling scheme with a two-phase object extraction method. Experiments on four datasets show that the new model achieves a higher F1 score than the comparison models on every dataset, demonstrating its advantage. (3) To address the scarcity of fine-labeled datasets of judgment documents, this thesis proposes a data augmentation method based on the attention mechanism. First, the relation extraction dataset is converted into the format of a text classification dataset. Second, that dataset is used to train a text classification model. Finally, the text classification model is used to obtain the attention weight of each word; the words that most strongly influence the sentence meaning are selected according to these weights and concatenated in their original order to form a new text. Experimental results show that the proposed method improves the performance of the relation extraction model, indicating that it helps mitigate the scarcity of labeled data.
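The final step of the augmentation method in contribution (3), selecting high-attention words and joining them in order, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis's implementation: the attention weights are taken as given (in the thesis they come from a trained text classification model), and the function name `augment_by_attention`, the `num_keep` parameter, and the example sentence and weights are hypothetical. The abstract does not say how many words are kept.

```python
from typing import Sequence


def augment_by_attention(tokens: Sequence[str],
                         weights: Sequence[float],
                         num_keep: int,
                         separator: str = " ") -> str:
    """Keep the num_keep tokens with the highest attention weight and
    join them in their original order to form a new, shorter text."""
    if len(tokens) != len(weights):
        raise ValueError("tokens and weights must have the same length")
    if num_keep <= 0:
        raise ValueError("num_keep must be positive")
    # Rank positions by weight (highest first), then restore text order.
    ranked = sorted(range(len(tokens)), key=lambda i: weights[i], reverse=True)
    kept = sorted(ranked[:num_keep])
    return separator.join(tokens[i] for i in kept)


# Hypothetical weights; a real run would read them from the classifier.
tokens = ["the", "defendant", "Zhang", "was", "born", "in", "Tianjin"]
weights = [0.02, 0.20, 0.30, 0.03, 0.25, 0.05, 0.15]
new_text = augment_by_attention(tokens, weights, num_keep=4)
print(new_text)  # defendant Zhang born Tianjin
```

For Chinese text, `separator=""` would join the selected words without spaces. One design point the abstract leaves open is whether the entity mentions of each labeled triple are always retained; a practical variant would presumably need to keep them so that the original labels remain valid for the new text.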
【Key words】 Natural Language Processing; Judgment Documents; Relation Extraction; Data Augmentation;
- 【Online publication contributor】 Tianjin University 【Online publication year and issue】 2025, Issue 03
- 【Classification No.】 D926.13; TP391.1