中文专有名词识别的研究

Study on Recognition of Chinese Proper Noun

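【Author】 Mao Tingting (毛婷婷)

【Advisor】 Huang Degen (黄德根)

【Author information】 Dalian University of Technology, Computer Software and Theory, 2006, Master's thesis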

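【Abstract (translated from the Chinese)】 Automatic recognition of Chinese proper nouns is a key technique for improving the accuracy of Chinese word segmentation systems, and the main subject of this thesis is the study and implementation of effective methods for recognizing them automatically. Building on a close study of existing recognition methods, the thesis establishes a recognition model based on the support vector machine (SVM) and proposes four improved algorithms: a combined SVM and probabilistic-statistics algorithm, a modified SVM-K-nearest-neighbor (KNN) algorithm, a modified SVM algorithm, and a clustered SVM algorithm. Analysis of the SVM's recognition results shows that, as with other classifiers, most misclassified samples lie near the separating hyperplane. In the combined SVM and probabilistic-statistics algorithm, samples near the separating hyperplane are therefore recognized by a probabilistic-statistical method, while samples far from it are still classified by the SVM. In the modified SVM-KNN algorithm, the distance from a sample to the SVM's optimal hyperplane is computed in feature space; when that distance exceeds a given threshold the sample is classified by the SVM, and otherwise by a modified KNN method. Applying different methods to samples that are distributed differently in the space improves the SVM's recognition performance. Recognition with the modified SVM-KNN algorithm revealed that the training set is imbalanced, which degrades the classification performance of the conventional SVM. A modified SVM algorithm is therefore proposed, which corrects the conventional SVM by shifting the hyperplane. To eliminate the classification errors that arise when the two classes in the training set contain unequal numbers of samples, a clustered SVM algorithm is also adopted: the training set is clustered with a kernel-based K-means algorithm, which reduces the imbalance of the data, and the SVM is then trained on the clustered training set to obtain the model. Taking the characteristics of Chinese proper nouns into account, the thesis first assigns a class label and a part-of-speech tag to every character in the training corpus, extracts the attributes of the feature vectors, and converts them to a binary representation, from which the training set is built. A proper noun recognition model is then built for each of the four algorithms; each model labels every character of the test corpus, and proper nouns are identified from the resulting labels. The experimental results show that the combined SVM and probabilistic-statistics algorithm, the modified SVM-KNN algorithm, the modified SVM algorithm, and the clustered SVM algorithm all outperform the conventional SVM and achieve high precision and recall. Of the four, the hybrid model combining SVM with probabilistic statistics gives the best recognition results.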
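The routing rule described above (trust the SVM far from the hyperplane, hand samples near it to a second classifier) can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the thesis's implementation: it assumes a linear hyperplane `(w, b)` that has already been obtained from a trained SVM, it uses plain majority-vote KNN rather than the modified KNN, and the names `signed_distance`, `knn_label`, `hybrid_predict`, the threshold value, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def knn_label(train, x, k=3):
    """Majority label among the k training points closest to x (plain KNN)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def hybrid_predict(w, b, train, x, threshold=1.0, k=3):
    """Use the hyperplane when x is far from it, fall back to KNN when near."""
    d = signed_distance(w, b, x)
    if abs(d) > threshold:
        return 1 if d > 0 else -1
    return knn_label(train, x, k)


# Toy data (hypothetical): the hyperplane is the line x1 = 0, and a few
# negative samples sit just on its positive side.
W, B = (1.0, 0.0), 0.0
TRAIN = [
    ((0.2, 0.0), -1), ((0.3, 0.1), -1), ((0.4, -0.1), -1),
    ((-2.0, 0.0), -1), ((2.0, 0.0), 1), ((3.0, 0.0), 1),
]

if __name__ == "__main__":
    for point in [(5.0, 0.0), (-4.0, 0.0), (0.25, 0.0)]:
        print(point, hybrid_predict(W, B, TRAIN, point))
```

For the point `(0.25, 0.0)` the hyperplane alone would answer with the positive class, but because the point lies within the threshold the decision is delegated to the nearest neighbors. In the thesis the distance is computed in the SVM's feature space, so a kernelized version would replace `signed_distance` with the SVM decision value.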

【Abstract】 Chinese proper noun recognition is an important technique for improving the accuracy of word segmentation. The main task of this thesis is to study and implement an effective approach to extracting proper nouns from Chinese texts. Based on research into and analysis of current identification methods for Chinese proper nouns, the thesis sets up a model based on the support vector machine (SVM) to identify Chinese proper nouns and presents four methods for improving the performance of SVMs: an algorithm combining SVM with a statistical method, a modified SVM and K-nearest-neighbor (KNN) algorithm, a modified SVM algorithm, and a cluster SVM algorithm. Analysis of the classification results obtained by SVM alone shows that the test samples it misclassifies are mostly near the decision plane. To increase the accuracy of SVM, a hybrid model combining SVM with a statistical approach is proposed: in the region near the decision plane, the statistical method is used to classify samples instead of SVM, and in the region far from the decision plane, SVM is used. A modified SVM-KNN classifier, which combines SVM with a modified KNN, is built in the same way, so that different classifiers handle test samples with different spatial distributions. To fit the unbalanced data, the modified KNN classifier adjusts classic KNN. Because the training set is unbalanced (the negative samples are significantly outnumbered by the positive ones), which worsens the performance of SVM, a modified SVM classifier for identifying Chinese proper nouns is proposed; an algorithm called boundary movement is used to modify SVM. A cluster SVM algorithm is also proposed to reduce the classification mistakes caused by the imbalance between the numbers of the two kinds of samples in the training set. In this algorithm, the training set is clustered using kernel-based K-means clustering, and a model is then learned by applying the SVM algorithm to the clustered training set. In this thesis, first, according to the characteristics of Chinese proper nouns, words in the texts are segmented and assigned part-of-speech (POS) tags, and a training set is constructed by extracting feature vectors. Second, four Chinese proper noun recognition models are set up based on the above four methods. Lastly, the final identification results of the testing

  • 【Classification code】 TP391.43
  • 【Times cited】 4
  • 【Downloads】 499