现代汉语通用分词系统的技术与实现
Technology and Implement of General-purpose Word Segmentation System in Modern Chinese
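【Author】 Luo Zhiyong (罗智勇);
【Advisor】 Song Rou (宋柔);
【Author information】 Beijing University of Technology, Computer Application Technology, 2002, Master's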
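【Abstract (translated from the Chinese original)】 Automatic word segmentation is foundational to Chinese information processing: any application that processes Chinese at the word level depends on a segmentation system. The key difficulties of automatic segmentation are resolving segmentation ambiguity and recognizing out-of-vocabulary words. This thesis first describes the ambiguity-resolution and proper-name recognition techniques used in the General-purpose Word Segmentation System for Modern Chinese (GPWS). For ambiguity resolution, it proposes a practical strategy that combines a segmentation rule base with dynamic correction driven by an ambiguity knowledge base. For proper-name recognition, it proposes an integrated, fast method covering proper names (personal names, place names, business trade names, company names, organization names and the like, transliterated names included). Tests on large-scale real-world corpora show that the accuracy of ambiguity resolution and the precision and recall of proper-name recognition all reach a high level. Second, the thesis briefly analyzes the difficulties of building a general-purpose segmentation system, describes how GPWS addresses them, and gives evaluation criteria for general-purpose segmentation systems. It also introduces the concept of an interactive segmentation system and presents a simple interactive method, which achieved good results.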
【Abstract】 Word segmentation is the basis of Chinese information processing. Any natural language processing system that works above the character level needs a built-in word segmentation module. Disambiguation and the recognition of unknown words are the most important points in designing a word segmentation system. In this paper, we first introduce a practical disambiguation strategy. We then put forward an integrated, fast recognition strategy for proper nouns in a modern Chinese word segmentation system, covering Chinese person names, Chinese place names, translated foreign names, and corporation and organization names; it successfully resolves the conflicts among these proper nouns and ordinary words. Large-scale tests on a real corpus show that both strategies achieve high performance and precision in disambiguation and proper-noun recognition. In the last part of the paper, we introduce the General-purpose Word Segmentation System in Modern Chinese (GPWS) and analyse the criteria for evaluating a general-purpose segmentation system: comprehensiveness, extensibility, adaptiveness, and interactiveness, in addition to precision. We also introduce an interactive strategy that provides alternative solutions, giving applications more choices without compromise. Large-scale tests on a real corpus show that interaction between word segmentation and upper-level applications contributes substantially to reducing errors in the original system.
【Key words】 Chinese information processing; general-purpose word segmentation; disambiguation; interactive strategy;
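The abstract does not give GPWS's algorithms, so the following is only a minimal illustrative sketch of the general idea behind an interactive strategy: when a sentence is ambiguous, hand the calling application more than one candidate segmentation instead of a single forced answer. It assumes a swapped-in textbook technique (comparing forward and backward maximum matching), not the rule base and ambiguity knowledge base described in the thesis. The lexicon and all function names are hypothetical.

```python
from typing import List, Set

# Hypothetical toy lexicon; a real system would load a large dictionary.
LEXICON: Set[str] = {"研究", "研究生", "生命", "命", "的", "起源"}
MAX_WORD_LEN = max(len(w) for w in LEXICON)


def forward_max_match(text: str) -> List[str]:
    """Greedy left-to-right segmentation, longest lexicon word first."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in LEXICON:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text: str) -> List[str]:
    """Greedy right-to-left segmentation, longest lexicon word first."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(MAX_WORD_LEN, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in LEXICON:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def candidate_segmentations(text: str) -> List[List[str]]:
    """Return one segmentation if both directions agree, else both.

    Disagreement is treated as a signal of ambiguity (an assumption of
    this sketch); the caller chooses between the alternatives.
    """
    fwd = forward_max_match(text)
    bwd = backward_max_match(text)
    return [fwd] if fwd == bwd else [fwd, bwd]
```

With this toy lexicon, calling `candidate_segmentations("研究生命的起源")` yields two alternatives, because the forward pass greedily takes 研究生 while the backward pass takes 生命. An upper-level application could then pick between them using its own context.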
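- 【Online publication contributor】 Beijing University of Technology 【Online publication year/issue】 2002, Issue 02
- 【Classification code】 TP391.1
- 【Citation count】 9
- 【Download count】 312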