基于神经网络的中文分词算法的研究
Research on Chinese Word Segmentation Algorithm Based on Neural Network
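【Author】 Zhang Xiaomiao (张晓淼)
【Advisor】 Zhang Li (张利)
【Author information】 Dalian University of Technology, Control Theory and Control Engineering, 2006, Master's thesis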
【Abstract】 Chinese is written as continuous sentences with no spaces between words, which easily leads to misreadings of a sentence. This causes great difficulty for tasks such as information retrieval and machine translation: irrelevant results are returned, relevant documents are not found, and translations are inaccurate. Solving these problems requires segmenting the words in a sentence accurately.
After an in-depth analysis of the language phenomena that occur in everyday communication, newspapers and magazines, this thesis summarizes the grammatical phenomena underlying the typical ambiguities common in daily life and builds a part-of-speech code library for part-of-speech encoding. On this basis, it uses the self-organizing and self-learning abilities of neural networks to segment ambiguous strings governed by different rules accurately. The selected sample space covers essentially all typical types of ambiguous strings. Before training, the grammar rules contained in a string are converted into a data form that the neural network can accept: each word in the string is encoded according to the part-of-speech code library. When interpreting the output, segmentation points are determined from the output node values obtained through extensive training. Characters, words or abstract grammar rules are thus mapped to input neurons through their codes, and segmentation patterns are mapped to output neurons, which gives a conversion from input and output logical concepts to input and output patterns. Training on a large amount of data lets the network learn the grammar rules contained in ambiguous strings and thereby segment words accurately. In addition, the BP (backpropagation) algorithm is improved by adding a momentum term to the weight correction to adjust the learning rate, which speeds up convergence and clearly improves the segmentation results.
After training a three-layer BP network on a large number of samples, the experiments show that the algorithm reaches 93.13% training accuracy and 92.50% test accuracy on ambiguous strings, and achieves the expected segmentation results on general corpus samples not seen in training. The method provides a new way of converting input and output logical concepts into input and output patterns, overcomes the problem that training is impossible when the number of possible character combinations is unbounded, and gives good results when applied to word segmentation.
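The abstract names three mechanical steps without giving details: encoding words by part-of-speech code, reading segmentation points off the output nodes, and adding a momentum term to the BP weight update. The sketch below illustrates each step under stated assumptions. The code table `POS_CODES`, the input width, the 0.5 threshold, and all function names are hypothetical and do not come from the thesis. The update rule shown is the classical momentum form, delta(t) = -lr * grad + alpha * delta(t-1), which may differ from the thesis's exact formulation.

```python
# Hypothetical part-of-speech code library; the thesis's actual codes are not given.
POS_CODES = {"n": 0.1, "v": 0.2, "a": 0.3, "d": 0.4, "p": 0.5}


def encode_pos(pos_tags, width=4):
    """Map a POS-tag sequence to a fixed-width input vector, zero-padded."""
    codes = [POS_CODES[tag] for tag in pos_tags[:width]]
    return codes + [0.0] * (width - len(codes))


def decode_cuts(outputs, threshold=0.5):
    """Treat output node i above the threshold as 'cut after unit i' (assumed scheme)."""
    return [i for i, y in enumerate(outputs) if y > threshold]


def momentum_step(w, grad, prev_delta, lr=0.1, alpha=0.5):
    """Classical momentum update: delta = -lr * grad + alpha * prev_delta."""
    delta = -lr * grad + alpha * prev_delta
    return w + delta, delta


def descend(steps, alpha):
    """Minimise the toy error E(w) = w**2 / 2 from w = 1.0; its gradient is w."""
    w, delta = 1.0, 0.0
    for _ in range(steps):
        w, delta = momentum_step(w, grad=w, prev_delta=delta, alpha=alpha)
    return w


if __name__ == "__main__":
    print(encode_pos(["n", "v"]))        # input vector for a noun + verb string
    print(decode_cuts([0.9, 0.2, 0.7]))  # indices of the predicted cut points
    print(abs(descend(20, alpha=0.0)))   # distance from the minimum, plain gradient descent
    print(abs(descend(20, alpha=0.5)))   # distance from the minimum, with momentum
```

On this one-dimensional toy error surface, the run with momentum ends closer to the minimum than plain gradient descent after the same number of steps, which is the kind of convergence speed-up the abstract attributes to the momentum term. It says nothing about the accuracy figures reported for the real network.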
【Keywords】 Chinese Word Segmentation; Natural Language Understanding; Ambiguity; Neural Network; BP Network
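- 【Online publication contributor】 Dalian University of Technology 【Online publication issue】 2006, No. 04
- 【Classification number】 TP391.1
- 【Citation count】 24
- 【Download count】 1119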