基于条件随机场的自动分词技术的研究

Study of Automatic Segmentation Technique Based on Conditional Random Fields

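【Author】 陈晴 (Chen Qing)

【Advisor】 姚天顺 (Yao Tianshun)

【Author information】 Northeastern University (东北大学), Computer System Architecture, 2005, Master's thesis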

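【Abstract (translated from the Chinese 摘要)】 With the advance of technology and the emergence of massive amounts of information, information processing has become an indispensable part of modern development. To extract useful knowledge from this information, machines must be able to "read" information expressed in human language, and the word is the smallest meaningful unit of language that can be used independently. Identifying words is therefore the first step in understanding natural language; only after this step can information be processed more deeply, up to the point of machines understanding human language. Our laboratory's research on machine translation and natural language processing depends to a large extent on sequence labeling and segmentation techniques such as word segmentation, both to limit the propagation of errors and to support deeper study.

Conditional random fields (CRFs), proposed in recent years, are conditional probability models for labeling and segmenting sequence data; they are undirected graphical models that compute the conditional probability of the output nodes given the input nodes. They do not require the strict independence assumptions of "generative" models such as the hidden Markov model (HMM), and they overcome the label bias problem of maximum entropy Markov models and other "non-generative" models. The model makes it easy to incorporate arbitrary features of the input sequence, or features inherent to the language itself: not only the transition and emission features of a traditional HMM sequence model, but also other information such as word-formation rules, domain features, and dictionary information.

This thesis systematically introduces the definition of conditional random fields, their model structure, feature functions, parameter estimation, and training methods. It applies CRFs to automatic Chinese word segmentation and obtains better results than models previously used for sequence labeling and segmentation, experimentally confirming the advantages of CRFs for these tasks. A large number of experiments were also carried out with features added incrementally, in which CRFs showed excellent performance.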

【Abstract】 With the development of technology and the growth of mass information, information processing has become one of the most important areas of technological development in today's world. To extract useful knowledge from this mass of information, machines must be able to "understand" information expressed in human languages, and words are the smallest language elements that can be used independently and carry real meaning. Identifying the words is the first step in understanding natural language; only after this step is achieved is it possible to process the information in depth, and even to make machines understand human languages. The research on machine translation and natural language processing in our lab depends heavily on sequence labeling and segmenting techniques, such as word segmentation, in order to limit the propagation of errors and to support deeper research.

Conditional Random Fields (CRFs), a recently introduced conditional probabilistic model for labeling and segmenting sequential data, are undirected graphical models that calculate the conditional probability of the output nodes given the input nodes. They relax the strong independence assumptions that generative models such as the Hidden Markov Model (HMM) must make, and overcome the label-bias problem exhibited by the Maximum Entropy Markov Model and other non-generative models. The model can easily incorporate arbitrary features of the input sequence as well as features inherent in the language itself, so we can introduce not only the transition and emission features of traditional HMM modeling, but also other information such as word-formation rules, domain features, and lexicon information.

This thesis systematically introduces the definition of CRFs, the structure of the CRF model, feature functions, parameter estimation, and training methods. Applying CRFs to automatic Chinese word segmentation, we obtained better performance than the models previously used for sequence labeling and segmenting, and verified the advantages of the CRF model in sequence labeling and segmenting by experiments.
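
To make the model described in the abstract concrete, the following is a minimal sketch of a linear-chain CRF used as a character tagger for word segmentation. It is not the thesis's implementation: the two-tag scheme (`B` for a character that begins a word, `I` for one that continues it), the hand-set toy weights, and all function names are assumptions chosen for illustration. It shows the two ideas the abstract names: the sequence score is a sum of emission and transition feature weights, and the probability is normalized over whole tag sequences rather than per position.

```python
import itertools
import math

TAGS = ["B", "I"]  # assumed scheme: B = begins a word, I = inside a word

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def score(weights, chars, tags):
    """Unnormalized log-score: sum of emission and transition feature weights."""
    total = 0.0
    for i, (c, t) in enumerate(zip(chars, tags)):
        total += weights.get(("emit", c, t), 0.0)
        if i > 0:
            total += weights.get(("trans", tags[i - 1], t), 0.0)
    return total

def log_partition(weights, chars):
    """Forward algorithm: log of the summed exp-scores of all tag sequences."""
    alpha = {t: weights.get(("emit", chars[0], t), 0.0) for t in TAGS}
    for c in chars[1:]:
        alpha = {
            t: weights.get(("emit", c, t), 0.0) + logsumexp(
                [alpha[p] + weights.get(("trans", p, t), 0.0) for p in TAGS])
            for t in TAGS
        }
    return logsumexp(list(alpha.values()))

def prob(weights, chars, tags):
    """Conditional probability p(tags | chars), normalized over whole sequences."""
    return math.exp(score(weights, chars, tags) - log_partition(weights, chars))

def viterbi(weights, chars):
    """Highest-scoring tag sequence under the model."""
    best = {t: (weights.get(("emit", chars[0], t), 0.0), [t]) for t in TAGS}
    for c in chars[1:]:
        new = {}
        for t in TAGS:
            s, path = max(
                ((best[p][0] + weights.get(("trans", p, t), 0.0), best[p][1])
                 for p in TAGS),
                key=lambda sp: sp[0])
            new[t] = (s + weights.get(("emit", c, t), 0.0), path + [t])
        best = new
    return max(best.values(), key=lambda sp: sp[0])[1]

def segment(chars, tags):
    """Turn a B/I tag sequence back into a list of words."""
    words = []
    for c, t in zip(chars, tags):
        if t == "B" or not words:
            words.append(c)
        else:
            words[-1] += c
    return words

# Hypothetical hand-set weights; a real system would learn them from data.
WEIGHTS = {
    ("emit", "我", "B"): 2.0,
    ("emit", "爱", "B"): 2.0,
    ("emit", "北", "B"): 2.0,
    ("emit", "京", "I"): 2.0,
    ("trans", "I", "I"): -1.0,
}
CHARS = list("我爱北京")

if __name__ == "__main__":
    tags = viterbi(WEIGHTS, CHARS)
    print(tags, segment(CHARS, tags), prob(WEIGHTS, CHARS, tags))
```

Extra information of the kind the abstract mentions, such as dictionary matches or word-formation rules, would enter this sketch as additional keys in the weight table, without changing the scoring or normalization code.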

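  • 【Online publication contributor】 Northeastern University (东北大学)
  • 【Online publication year/issue】 2005, Issue 07
  • 【Classification number】 TP391.1
  • 【Times cited】 55
  • 【Downloads】 1319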