节点文献

基于半监督佳点集和Leader的K-means聚类算法研究

Research on K-means Clustering Algorithm Based on Semi-Supervised Good Point Set and Leader

【作者】 张娟

【导师】 张燕平;

【作者基本信息】 安徽大学 , 计算机应用技术, 2011, 硕士

【摘要】 信息技术与互联网的飞速发展,数据库应用规模、范围和深度的不断扩大,人们生产和收集数据的能力的不断提高,导致现实生活中各个领域的数据量以前所未有的速度海量增长着。面对如此庞杂的海量数据,如何找出这些大规模数据之间的内在关联性,从而提取出有用的信息,以建立供人们所用的知识资源,一直是研究者们的热点课题。数据挖掘是指从大量数据中发现隐藏的、有效的、新颖的、对决策有潜在价值的和最终可被理解的模式的过程,其在现实生活的许多领域都有着广泛的应用。聚类分析是数据挖掘三大领域之一,业已被广泛研究了几十年,至今不论在理论还是方法上都取得了丰硕的研究成果。其中以基于划分方法中的K-means聚类算法最为经典。K-means聚类算法的思想简单易行,而且时间复杂性接近线性,同时对大规模数据的挖掘具有高效性和可伸缩性。然而该算法存在着固有的缺陷:如算法对初始中心点敏感;聚类结果易陷入局部最优;算法适用于数值型数据和一般只能发现球状簇等。本文主要研究和分析了经典的K-means聚类算法,给出其优缺点和现有的一些改进方法。针对上述谈到的K-means聚类算法的不足,在聚类算法被研究的这几十年,许多学者都给出了相应的改进方法和策略,尤其针对前两种缺陷的改进算法举不胜举。而本文也意在探讨K-means算法的初始中心敏感性,并结合了半监督学习、Leader方法和佳点集理论,提出两种新的初始中心选取方法。论文所做的主要工作包括:1、基于半监督和Leader方法,提出了一种新的选取K-means聚类算法初始中心的方法,即S_SLK算法。利用监督信息来改善无监督学习的性能,结合能够保持数据对象本身分布特性的Leader方法优化了K-means聚类算法的初始中心,并改善了由此导致的聚类结果不够稳定的缺陷。2、运用佳点集理论能够得到比随机选取更好的点的优点,再次结合Leader方法,提出一种新的改进K-means的聚类算法。佳点集理论和Leader方法的结合方式从两种算法来体现,分别称为KLG和KGL算法。3、将改进的KLG和KGL算法分别与传统算法和文献中的算法做了相应的比较,并尝试了在K-means算法中仅引入佳点集理论或Leader方法后的效果,同时与KLG和KGL算法做了比较,实验结果和一系列的比较结果表明,改进后的算法具有一定的可行性和有效性,且最终可得出KGL算法优于其他几种算法。

【Abstract】 With the rapid development of the information technology and the internet, the data base application has been enlarging in term of dimension, area and depth, as well as the capacity of the production and collection of data have been improving, this will lead to the accumulation of a large number of data in various fields of real life. How to find the intrinsic relationship between these large-scale data, so that the hidden information can be extracted and knowledge resources can be built, this has been a hot topic.Data mining is the procedure of extracting of implicit, valid, novel and potentially valuable knowledge and ultimately understandable patterns or knowledge from large amount of data, which is widely applied in many areas in real life. Clustering analysis is one of the three major areas of data mining, which is widely studied for several decades and it has been achieved a mass of theories and methods. While the K-means clustering based on the partitioning method, is the most classical algorithm.K-means clustering algorithm is easy to achieved, scalable and high efficient for disposing big data set, as well as the complexity of time is nearly linear. However, there are inherent shortcomings of this algorithm, for example, 1)it is very sensitive to initial conditions,2)often gets trapped in local minimum,3)it is adapt to numerical data and has only the best capability to capture clusters inhyperspherical shape.In this paper, in-depth study and analysis of the topical clustering of K-means summed up its strengths, weaknesses and some improvement methods in recent years. For the shortcomings of the K-means, many of the relative improvement methods and strategies have been given by the scholars, especially for the defects of the 1) and 2). This paper focus on the sensitive of the K-means clustering algorithm to the initial value and combine with the semi-supervised learning, Leader approach and Good Point Set theory, presenting two new initial centers selection algorithms.The summary of work: Based on Semi-Supervised and Leader method, it is proposed that a new method of selecting the initial K-means clustering centers, i.e. S_SLK algorithm. This method that improves the instability of the clustering results with randomly selecting initial centers, which used supervised information to improve the performance of unsupervised learning and combined with the Leader approach, which could keep the distribution characteristics of the data object.A new improved K-means clustering algorithm was proposed, which adopted the theory of Good Point Set and Leader method. The theory of Good Point Set can produce points better than randomly selected’s. In this article, the binding mode of the Good Point Set and Leader approach is reflected from the two algorithms, called KLG and KGL algorithm.For the improved KLG and KGL algorithm, we conducted a large number of experiments to show those effectiveness and feasibility. At the same time, we took the experimental results to compare with the traditional algorithm and the literature algorithm. Experiments and the result of comparison show the improved algorithms significantly outperform the traditional and other initialization methods. And finally we come to the KGL algorithm is better than other algorithms which listed.

  • 【网络出版投稿人】 安徽大学
  • 【网络出版年期】2012年 04期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络