Research on Feature Selection and Clustering Algorithms in Data Mining

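【Author】 Li Hong (李红)

【Advisor】 Yang Yuansheng (杨元生)

【Author information】 Dalian University of Technology, Computer Application Technology, 2010, Master's thesis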

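【Summary】 As data acquisition techniques become more precise, the dimensionality of the resulting data keeps growing, and the data contain noise and irrelevant information, so simple statistical methods can no longer meet the needs of knowledge discovery. Faced with this situation of "extremely rich data but relatively scarce information", data mining has gradually shown its advantages and become a powerful and effective analysis tool. Feature selection and clustering are two major research areas in data mining. Feature selection aims to extract useful information from massive data and thereby improve the efficiency with which the data are used; clustering gives an overall evaluation of the raw data without interference from human factors. In recent years, genetic-algorithm-based feature selection methods and the affinity propagation (AP) clustering algorithm have received wide attention. This thesis first proposes a new genetic-algorithm-based feature selection method that builds on the multi-population agent genetic algorithm by changing the encoding scheme and incorporating the ensemble idea. While retaining the advantages of the original algorithm, the method effectively avoids the drawback that the original algorithm's result contains too many features, and it ranks the features by importance according to their selection frequency, which makes it easier to pick important features for analysis. Inspired by the multi-population agent genetic algorithm, the thesis also proposes a chain-like multi-population genetic algorithm, which increases population diversity by constructing a new population structure and improving the selection strategy. Second, for clustering, the thesis proposes a feature-weighted clustering method based on affinity propagation. Weighting the features takes into account the different roles that different features play in clustering, so that the clustering result reflects the information in the data better than the traditional method. In feature selection experiments on liver disease data obtained by metabolomics, the short-encoding-based multi-population agent genetic algorithm and the chain-like multi-population genetic algorithm effectively avoid selecting too many features while improving classification accuracy. In clustering experiments on data sets from UCI, the feature-weighted clustering algorithm based on affinity propagation improves clustering accuracy to varying degrees compared with the original affinity propagation algorithm.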
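The ranking of features by selection frequency mentioned above can be illustrated with a minimal sketch. It assumes that several independent runs of a feature selector have each returned a subset of feature indices; the function name `rank_by_selection_frequency` and the sample subsets in `runs` are hypothetical and are not taken from the thesis, whose genetic algorithm itself is not reproduced here.

```python
from collections import Counter


def rank_by_selection_frequency(selected_subsets, n_features):
    """Rank feature indices by how often they were selected across runs."""
    counts = Counter()
    for subset in selected_subsets:
        # Count each feature at most once per run.
        counts.update(set(subset))
    # Most frequently selected first; ties are broken by feature index.
    return sorted(range(n_features), key=lambda f: (-counts[f], f))


# Illustrative subsets from five hypothetical runs (made-up data).
runs = [[0, 3, 7], [3, 7], [1, 3], [3, 7, 9], [0, 3]]
ranking = rank_by_selection_frequency(runs, n_features=10)
print(ranking[:3])  # the three most frequently selected features
```

Features that are never selected keep a count of zero and fall to the end of the ranking, so a cutoff on the ranked list yields a short set of candidate features.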

【Abstract】 With the improvement of data acquisition techniques, the dimensionality of data keeps growing, and the data contain noise and redundant information. Simple statistical methods can no longer satisfy the needs of knowledge discovery. In a situation where data are extremely rich but information is relatively scarce, data mining is gradually showing its superiority and becoming a powerful and effective analysis tool. Feature selection and cluster analysis are two main fields in data mining. The aim of feature selection is to extract useful information and improve accuracy. Cluster analysis aims to give an overall evaluation of the data without the interference of human factors. In recent years, genetic algorithm-based feature selection and affinity propagation (AP) clustering have received widespread attention. In this thesis, we present a new genetic algorithm-based feature selection method that builds on the multi-population agent genetic algorithm by changing the encoding strategy and incorporating the ensemble idea. This method not only keeps the advantages of the multi-population agent genetic algorithm but also reduces the number of features in the result. The selection frequency of each feature gives an ordering of feature importance, which is conducive to choosing important features for analysis. Inspired by the multi-population agent genetic algorithm, we also propose a chain-like multi-population genetic algorithm for feature selection, which improves the diversity of the population by constructing a new population structure and a new selection strategy. For clustering, we propose a feature-weighted clustering method based on affinity propagation. By taking into account the different roles that different features play in clustering, it reflects the information in the data more accurately than the traditional method. In feature selection experiments on liver disease data, the short encoding-based multi-population genetic algorithm and the chain-like multi-population genetic algorithm avoid the shortcoming that too many features are contained in the result and improve classification accuracy. In the clustering experiments, three public data sets from UCI were used. The experimental results on the three data sets show that the feature-weighted AP method achieves higher accuracy than traditional AP clustering.
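To make the idea of feature weighting in affinity propagation concrete, the sketch below builds a weighted similarity matrix. Affinity propagation commonly uses the negative squared Euclidean distance as the similarity between two points; the weighted form used here, s(i, k) = -sum_j w_j (x_ij - x_kj)^2, is an assumption for illustration, since the abstract does not state how the thesis defines or learns its weights. The function name `weighted_similarity` and the sample data are hypothetical.

```python
def weighted_similarity(points, weights):
    """Return s[i][k] = -sum_j weights[j] * (points[i][j] - points[k][j])**2.

    Assumed weighting scheme for illustration; not the thesis's definition.
    """
    n = len(points)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            sim[i][k] = -sum(
                w * (a - b) ** 2
                for w, a, b in zip(weights, points[i], points[k])
            )
    return sim


# Made-up data: the second feature is down-weighted to a quarter.
points = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]]
weights = [1.0, 0.25]
sim = weighted_similarity(points, weights)
print(sim[0][1], sim[0][2])
```

With these weights, a difference of 4 in the second feature counts as much as a difference of 2 in the first. In affinity propagation the diagonal entries are normally replaced by a separately chosen preference value before clustering, and a matrix of this kind could be passed to any implementation that accepts precomputed similarities.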

【Key words】 Feature Selection; Genetic Algorithm; Clustering; Data Mining