Research and Implementation of Chinese Lexical Analysis Technology (中文词法分析技术的研究与实现)

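【Author】 Zhang Huipeng (张会鹏)

【Advisor】 Liu Ting (刘挺)

【Author information】 Harbin Institute of Technology, Computer Science and Technology, 2006, Master's degree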

【Abstract】 Chinese lexical analysis is a foundational task in Chinese information processing, and the quality of its output directly affects the performance of higher-level applications. This thesis studies three lexical analysis tasks in depth (Chinese word segmentation, part-of-speech tagging, and verb subdivision, that is, fine-grained verb classification) and implements a practical lexical analysis system named IRLAS. Authoritative evaluations and practical use show that IRLAS is a high-precision, high-quality, and highly reliable lexical analysis system.

Segmentation ambiguity and unknown word identification are widely recognized as the two main difficulties in Chinese word segmentation. This thesis adopts a word-class-based segmentation probability model, which groups words into a number of classes and brings these classes into a unified probabilistic framework. Choosing the segmentation path with the maximum probability eliminates most segmentation ambiguities. For unknown word identification, the thesis adopts a role-tagging method, which makes full use of the context of unknown words and turns unknown word identification into a role sequence tagging problem. Once the parameters of a hidden Markov model (HMM) over roles have been trained, the Viterbi algorithm finds the optimal role sequence, which completes the identification of unknown words.

Part-of-speech tagging and verb subdivision provide richer grammatical information for higher-level applications; for example, a parser can use part-of-speech information to identify syntactic relations. Part-of-speech tagging is a typical application of the HMM, and this thesis uses an HMM for part-of-speech tagging and achieves high accuracy. Verb subdivision resembles part-of-speech tagging: it assigns finer-grained classes to the verbs identified by part-of-speech tagging. Based on the characteristics of verb subdivision, the thesis proposes an improved HMM method for classifying verbs automatically, and a comparison with a maximum entropy method shows that it is very effective. Embedding verb subdivision into a syntactic parsing system also effectively improves the precision of syntactic parsing.
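
As an illustration of the HMM decoding step that the abstract relies on for both role tagging and part-of-speech tagging, the following is a minimal sketch of the Viterbi algorithm in Python. It is not the IRLAS implementation: the function name `viterbi`, the three-tag set (`r` pronoun, `v` verb, `n` noun), the `floor` smoothing parameter, and every probability value are assumptions made up for this toy example. A real tagger would estimate these parameters from an annotated corpus.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable hidden-state sequence for `observations`.

    Probabilities missing from the tables are replaced by `floor` so that
    logarithms stay finite (a crude stand-in for real smoothing).
    """
    if not observations:
        return []

    def lg(p):
        return math.log(max(p, floor))

    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: lg(start_p.get(s, 0.0)) + lg(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor that maximizes path score plus transition score.
            prev = max(states, key=lambda p: best[t - 1][p] + lg(trans_p[p].get(s, 0.0)))
            score = best[t - 1][prev] + lg(trans_p[prev].get(s, 0.0))
            best[t][s] = score + lg(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy model (all values are hypothetical, chosen only for illustration).
STATES = ["r", "v", "n"]
START_P = {"r": 0.6, "n": 0.3, "v": 0.1}
TRANS_P = {
    "r": {"r": 0.05, "v": 0.80, "n": 0.15},
    "v": {"r": 0.10, "v": 0.10, "n": 0.80},
    "n": {"r": 0.05, "v": 0.45, "n": 0.50},
}
EMIT_P = {
    "r": {"他": 1.0},
    "v": {"研究": 0.5},
    "n": {"研究": 0.3, "语言": 0.5},
}

# "他 研究 语言" (he / studies / language); "研究" can be a noun or a verb.
print(viterbi(["他", "研究", "语言"], STATES, START_P, TRANS_P, EMIT_P))  # ['r', 'v', 'n']
```

In this toy model the ambiguous word "研究" is tagged as a verb because the pronoun-to-verb transition outweighs the noun reading. The same decoder applies to role tagging if the states are roles rather than part-of-speech tags.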

  • 【Classification number】TP391.1
  • 【Citation count】37
  • 【Download count】996