
Research on K Nearest Neighbor Algorithm Based on Class Division and Neighbor Selection

【Author】 Wang Fang

【Supervisor】 Zhang Desheng

【Author information】 Xi'an University of Technology, Mathematics, 2020, Master's degree

【Abstract】 In the era of big data, how to mine useful information from massive data has become a focus of wide public concern, and data mining technology provides an effective way to address this problem. Data classification is an important data mining method, and the k-nearest neighbor (kNN) classification algorithm is widely used for its simplicity and ease of operation. At the same time, kNN is sensitive to the choice of k, susceptible to imbalanced data, and relies on overly simple distance measures. Building on the basic kNN algorithm and the local-mean-based kNN algorithm, this thesis improves kNN with techniques such as swarm intelligence optimization and sparse representation, overcoming some defects of the original algorithms and thereby improving classification performance. The specific research contents and results are as follows:

1. To address the basic grasshopper optimization algorithm's tendency to fall into local optima and its low convergence accuracy, an improved grasshopper optimization algorithm is proposed. A chaotic opposition-based learning initialization strategy generates a better initial population; a natural-exponential decay strategy balances the algorithm's exploration and exploitation; and a Gaussian mutation strategy overcomes the tendency to fall into local optima. Experiments on 10 benchmark test functions show that the improved algorithm attains higher convergence efficiency and solution accuracy. Further, to address the distance-weighted kNN algorithm's heavy dependence on the distance measure and the randomness of its generated weights, a distance-weighted kNN algorithm based on the improved grasshopper optimization algorithm is proposed: the optimization algorithm and the distance measure are used to generate a set of optimal neighbor weights, which are then applied to the voting process. Experiments on 6 data sets from the UCI repository show that the algorithm is less sensitive to changes in k and improves classification accuracy.

2. In the kNN algorithm every attribute has the same influence on classification, so weakly correlated features can cause new data to be misclassified; moreover, when faced with imbalanced data sets and outliers, the traditional majority voting rule misclassifies samples to varying degrees. To address these problems, an improved kNN algorithm based on mutual information and local means is proposed. First, attributes are weighted by their mutual-information correlation with the class; second, a comprehensive class-division strategy is built from local means and class contributions. Ten-fold cross validation on 5 UCI data sets shows that the improved algorithm achieves high accuracy and strong stability across different types of data sets.

3. The multi-local-means-based k-harmonic nearest neighbor algorithm assigns the same weight to all attributes, ignoring differences in their contributions, and selects neighbor samples solely by distance ranking, without fully considering the neighborhood distribution of the samples. To address these problems, a harmonic nearest neighbor algorithm based on attribute weights and sparse coefficients is proposed. First, a comprehensive attribute weight defined from mutual information and gain ratio is used to weight the distance formula; second, exploiting the strong pattern-recognition ability of sparse coefficients, a two-step neighbor selection strategy is established to pick the best neighbor samples. Experiments on 12 standard data sets and 2 noisy data sets from the UCI and KEEL repositories, with comparisons against 6 classical algorithms, show that the improved algorithm achieves high classification accuracy on the basis of good robustness.
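The first contribution weights kNN's voting step with neighbor weights tuned by the improved grasshopper optimizer. A minimal sketch of the voting side alone might look like the following; the GOA weight search itself is omitted, the function name is illustrative, and the inverse-distance default merely stands in for the thesis's optimized weights:

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=5, weights=None):
    """Distance-weighted k-NN vote.

    `weights` is a per-neighbor weight vector (the thesis derives an
    optimal one via an improved grasshopper optimization algorithm);
    here inverse-distance weights serve as a simple default.
    """
    d = np.linalg.norm(X_train - x, axis=1)        # Euclidean distances
    idx = np.argsort(d)[:k]                        # k nearest neighbors
    if weights is None:
        w = 1.0 / (d[idx] + 1e-12)                 # inverse-distance weights
    else:
        w = np.asarray(weights)[:k]
    votes = {}
    for label, wi in zip(y_train[idx], w):
        votes[label] = votes.get(label, 0.0) + wi  # weighted vote tally
    return max(votes, key=votes.get)
```

Because closer neighbors carry larger votes, the prediction degrades more gracefully as k grows than plain majority voting does, which matches the abstract's claim of reduced sensitivity to k.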
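The second contribution combines mutual-information attribute weighting with a local-mean class rule. A rough sketch under stated assumptions: the histogram-based MI estimate and the "nearest class-local-mean" rule below are simple stand-ins for the thesis's comprehensive class-division strategy, and all names are illustrative:

```python
import numpy as np

def mi_weights(X, y, bins=5):
    """Estimate I(feature; class) per attribute via histogram counts
    (a simple stand-in for the thesis's MI-based attribute weighting)."""
    n, p = X.shape
    w = np.zeros(p)
    classes, y_idx = np.unique(y, return_inverse=True)
    py = np.bincount(y_idx) / n
    for j in range(p):
        # discretize the feature into bins, then sum p(b,c)*log(p(b,c)/(p(b)p(c)))
        b = np.digitize(X[:, j], np.histogram_bin_edges(X[:, j], bins=bins)[1:-1])
        for bv in np.unique(b):
            pb = np.mean(b == bv)
            for ci in range(len(classes)):
                pbc = np.mean((b == bv) & (y_idx == ci))
                if pbc > 0:
                    w[j] += pbc * np.log(pbc / (pb * py[ci]))
    return w / w.sum()

def local_mean_predict(X, y, x, k=3, w=None):
    """Assign x to the class whose local mean (centroid of its k nearest
    neighbors under the attribute-weighted distance) is closest to x."""
    w = np.ones(X.shape[1]) if w is None else w
    best, best_d = None, np.inf
    for c in np.unique(y):
        Xc = X[y == c]
        d = np.sqrt(((Xc - x) ** 2 * w).sum(axis=1))  # weighted distance
        m = Xc[np.argsort(d)[:k]].mean(axis=0)        # class-local mean
        dm = np.sqrt((((m - x) ** 2) * w).sum())
        if dm < best_d:
            best, best_d = c, dm
    return best
```

Down-weighting low-MI attributes shrinks the influence of weakly correlated features, and comparing against per-class local means rather than raw vote counts is what gives robustness to imbalance and outliers.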
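The third contribution builds on the multi-local-means k-harmonic nearest neighbor baseline. The sketch below shows only that baseline; the thesis's attribute weighting and sparse-coefficient two-step neighbor selection are omitted, and the function name is an assumption:

```python
import numpy as np

def khnn_predict(X, y, x, k=3):
    """Multi-local-means k-harmonic nearest neighbor baseline.

    For each class, build k local mean vectors (the i-th is the centroid
    of the i nearest same-class samples) and score the class by the
    harmonic mean of the distances from x to those means.
    """
    best, best_score = None, np.inf
    for c in np.unique(y):
        Xc = X[y == c]
        d = np.linalg.norm(Xc - x, axis=1)
        nn = Xc[np.argsort(d)[:k]]                       # k nearest in class c
        means = np.cumsum(nn, axis=0) / np.arange(1, k + 1)[:, None]
        dm = np.linalg.norm(means - x, axis=1)
        score = k / np.sum(1.0 / (dm + 1e-12))           # harmonic mean distance
        if score < best_score:
            best, best_score = c, score
    return best
```

The harmonic mean is dominated by the smallest distances, so a single outlying local mean cannot veto an otherwise close class; the thesis's two-step sparse selection further refines which samples enter the local means.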
