Research on Chinese Prepositional Phrase Identification Based on Multi-layer Conditional Random Fields (基于多层CRFs的汉语介词短语识别研究)
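【Author】 Zhang Jie (张杰)
【Advisor】 Guo He (郭禾)
【Author information】 Dalian University of Technology, Computer Application Technology, 2013, Master's thesis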
【Abstract】 Prepositional phrases are an important phrase type in Chinese and account for a large proportion of Chinese text. Identifying them correctly simplifies sentence structure, narrows the set of candidate main verbs, and makes syntactic parsing easier. This thesis therefore proposes a method for Chinese prepositional phrase identification based on Conditional Random Fields (CRFs), and applies transformation-based error-driven learning to correct the CRF output. The identification task is cast as a sequence labeling problem, for which CRFs are well suited. By analyzing the structural characteristics of prepositional phrases, six effective features are selected for the CRF model, and the feature template is chosen by an incremental learning procedure to optimize model performance. Because identification is less accurate when a sentence contains more than one prepositional phrase, a multi-layer method is proposed: prepositional phrases are identified one layer at a time, from right to left, and each identified phrase is replaced with a special symbol, which simplifies the sentence structure and shortens the sentence. To improve the results further, transformation-based error-driven learning is used to revise the output of the CRF-based identifier. Experiments were run with the single-layer CRF model, with the multi-layer CRF model, and with error-driven learning added, and they show that the multi-layer approach is effective. In five-fold cross-validation on more than 7,000 prepositional phrases from the People's Daily 2000 corpus, the method achieves precision, recall, and F1 of 91.45%, 91.39%, and 91.42%, respectively. After correction with transformation-based error-driven learning, these rise to 91.98%, 91.92%, and 91.96%. The results can be applied to syntactic parsing, machine translation, and related fields.
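The sequence-labeling formulation and the layer-by-layer replacement described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the BIO tag set (`B-PP`, `I-PP`, `O`), the placeholder symbol `<PP>`, the function names, and the example sentence are all assumptions made for illustration, and the CRF model itself is omitted. The sketch only shows how phrase spans map to per-token tags, how an identified phrase might be collapsed into a single symbol, and how span-level precision, recall, and F1 are computed.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode phrase spans as one BIO tag per token (assumed tag scheme)."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        tags[start] = "B-PP"
        for i in range(start + 1, end):
            tags[i] = "I-PP"
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode BIO tags back into phrase spans; stray I-PP tags are ignored."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if start is not None and tag != "I-PP":
            spans.append((start, i))  # the open phrase ends before token i
            start = None
        if tag == "B-PP":
            start = i
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def collapse_span(tokens: List[str], span: Span, placeholder: str = "<PP>") -> List[str]:
    """Replace one identified phrase with a single placeholder token."""
    start, end = span
    return tokens[:start] + [placeholder] + tokens[end:]


def precision_recall_f1(gold: List[Span], pred: List[Span]) -> Tuple[float, float, float]:
    """Span-level scores: a predicted phrase counts only if both boundaries match."""
    correct = len(set(gold) & set(pred))
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical example: "he / at / Beijing / for / company / work"
tokens = ["他", "在", "北京", "为", "公司", "工作"]
gold_spans = [(1, 3), (3, 5)]  # "在 北京" and "为 公司"
tags = spans_to_bio(len(tokens), gold_spans)

# Collapse phrases from right to left so the indices of the
# phrases further left stay valid after each replacement.
simplified = tokens
for span in sorted(gold_spans, reverse=True):
    simplified = collapse_span(simplified, span)
```

Here `tags` holds one label per token, which is the form a sequence labeler such as a CRF would be trained to predict, and `simplified` is the shortened sentence in which each phrase has become a single symbol. Exact-boundary matching is one common scoring convention; the abstract does not state which matching criterion the thesis uses.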