Research on Reduct Technology and Feature Selection Algorithm in Data Mining
【Author】 Liu Hui
【Supervisor】 Li Xiongfei
【Author Information】 Jilin University, Computer Software and Theory, 2006, Master's thesis
【Abstract (translated from the Chinese)】 Reduction and feature selection are both important techniques in data mining: researchers have long sought fast reduction algorithms and ways to remove irrelevant and redundant attributes. This thesis studies and discusses both problems. After an overview of data mining, it proposes a Boolean function reduction algorithm based on properties of prime numbers, then surveys prior work on feature selection, and finally proposes two feature selection algorithms.

Chapter 2 presents the prime-based Boolean function reduction algorithm. Its main idea is to represent Boolean variables with primes and to replace logical operations with arithmetic ones. The thesis redefines the Boolean system and uses the axioms and rules of the new system to describe the laws of the original one, so the prime-based Boolean system is complete. Applied to rough-set attribute reduction, the algorithm reduces space consumption while improving time efficiency.

Chapter 3 surveys feature selection, summarizing existing results in the field, describing some classical algorithms, and analyzing their strengths and weaknesses.

Chapter 4 first introduces the contingency table and its relation to the Pearson chi-square statistic, then reviews commonly used relevance measures. It goes on to propose a feature selection algorithm based on the chi-square statistic. The main idea is to compute the chi-square value between each feature and the class attribute: the larger the value, the stronger the correlation. Features are partitioned by their correlation with the class into three grades, strongly relevant, relevant, and weakly relevant. The weakest feature of the strongly relevant subset sets the threshold for judging relevance, and the strongest feature of the weakly relevant subset sets the threshold for judging independence. Redundant features are deleted from the strongly relevant subset, yielding a reduced strongly relevant subset; features in the relevant subset that are independent of every feature in the reduced strongly relevant subset are then selected; the weakly relevant subset is deleted entirely. Finally, the NIPS2003 datasets serve as experimental data: the results are reported and submitted to the NIPS2003 website, whose platform evaluates each metric.

Chapter 5 proposes the concept of correlation probability, obtained by converting the chi-square value into a confidence level of independence. It then introduces potential deviation (the thesis's term for a feature's deviation in position between two ranked lists), together with positive-, zero-, and negative-potential-deviation features, and proposes a feature selection algorithm based on potential deviation whose main step is to delete the negative-potential-deviation features. Since the algorithm requires discrete attributes, each feature is first discretized in an unsupervised manner with the k-means algorithm. Finally, the experimental results on the NIPS2003 datasets are submitted to the same website.
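The prime encoding just described lends itself to a short illustration. Below is a minimal Python sketch, not the thesis's implementation: it assumes the classical absorption law (x + xy = x) as the reduction step and turns it into divisibility tests on the encoded pairs; all names (`PRIMES`, `encode`, `absorbs`, `reduce_dnf`) are hypothetical.

```python
# Minimal sketch of the prime encoding of basic conjunctions described above.
# Each Boolean variable x_i is mapped to a distinct prime p_i. A basic
# conjunction is stored as an ordered pair (pos, neg): pos is the product of
# the primes of its positive literals, neg the product of the primes of its
# complemented literals, and 1 stands for an empty side.

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]  # one prime per variable

def encode(pos_vars, neg_vars):
    """Encode a basic conjunction over variable indices as a pair of
    integer products."""
    pos = neg = 1
    for i in pos_vars:
        pos *= PRIMES[i]
    for i in neg_vars:
        neg *= PRIMES[i]
    return pos, neg

def absorbs(c1, c2):
    """True if conjunction c1 absorbs c2 (c2's literal set contains c1's):
    each component of c1 divides the corresponding component of c2."""
    return c2[0] % c1[0] == 0 and c2[1] % c1[1] == 0

def reduce_dnf(conjunctions):
    """Drop every conjunction absorbed by another one (x + xy = x),
    using arithmetic in place of logical manipulation."""
    return [c for c in conjunctions
            if not any(c != d and absorbs(d, c) for d in conjunctions)]

# Example: x0*x1 + x0*x1*~x2 reduces to x0*x1, i.e. to [(2*3, 1)].
terms = [encode({0, 1}, set()), encode({0, 1}, {2})]
print(reduce_dnf(terms))  # -> [(6, 1)]
```

Divisibility is the arithmetic counterpart of literal-set containment here, which is what lets the sketch replace logical comparisons with integer operations, in the spirit of the algorithm.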
【Abstract】 Data mining, as a multidisciplinary subject, has developed rapidly in recent years. It draws on databases, statistics, artificial intelligence, machine learning, and other fields; its major task is to extract valuable knowledge and obtain more usable information.

Knowledge is the basis of artificial intelligence, and symbolic reasoning is a main method for seeking solutions. In rough set theory, attribute reduction algorithms are based on the discernibility matrix, and reduction of Boolean functions is their central operation; Boolean functions must also be reduced in digital circuit design. Boolean manipulation and Boolean reduction are thus applied in many fields.

After a study of existing algorithms for the reduction of Boolean functions, Chapter 2 introduces a reduction algorithm based on primes. Its main idea is to represent each Boolean variable by a prime and to denote a basic conjunction by an ordered pair of integers: the first element, the product of the primes corresponding to the positive variables in the conjunction, denotes the meet of those variables; the second element, the product of the primes corresponding to the complemented variables, denotes the meet of their complements. If a conjunction contains no positive variables, or no complemented variables, the corresponding element of the pair is set to 1. Boolean calculation is thereby replaced by arithmetic manipulation, which saves memory and improves efficiency. The Boolean system is redefined, the laws of the original system are described by axioms and rules of the new one, and the new Boolean system is proved complete. Applied to rough sets, the prime-based reduction algorithm proves efficient.

As data, and especially feature dimensionality, keep growing, data mining suffers from the curse of dimensionality, which is what brought feature selection about. Real-world datasets contain many features that are irrelevant or redundant with respect to the target concept; such attributes can considerably slow data mining algorithms, and redundant features in particular worsen predictive accuracy. Feature selection is therefore one of the important tasks in data mining: it can speed up a mining algorithm and improve its predictive accuracy.

After a survey of feature selection, two algorithms involving the χ² statistic are proposed: a feature selection algorithm based on the χ² statistic (FSBC) and a feature selection algorithm based on potential deviation (FSBPD). The first uses the χ² value directly; the second transforms it into a significance level of independence.

Chapter 4 first introduces the crosstab (contingency table) and discusses the relation between the Pearson χ² statistic and the crosstab. It then presents FSBC, whose main idea is to compute the Pearson χ² statistic of each feature against the class via a crosstab, as sketched below; the larger the value, the stronger the correlation. According to their correlation with the class, all features with nonzero χ² values are partitioned into three subsets: a strongly relevant set, a relevant set, and a weakly relevant set. The weakest feature in the strongly relevant set provides the threshold of correlation, and the strongest feature in the weakly relevant set provides the threshold of independence.
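The following is a minimal sketch of the χ² computation FSBC is built on: form the crosstab of a discrete feature against the class, compute the expected counts under independence, and sum the Pearson statistic. The function name `pearson_chi2` and the toy data are illustrative, not the thesis's code, and crosstabs without empty rows or columns are assumed.

```python
import numpy as np

def pearson_chi2(feature, labels):
    """Pearson chi-square statistic of a discrete feature against the
    class, computed from their crosstab (contingency table)."""
    rows, cols = sorted(set(feature)), sorted(set(labels))
    # Observed contingency table O.
    obs = np.array([[sum(1 for f, c in zip(feature, labels) if f == r and c == k)
                     for k in cols] for r in rows], dtype=float)
    # Expected counts under independence: E_ij = (row_i total)(col_j total)/N.
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    return ((obs - exp) ** 2 / exp).sum()

# Score features against the class; FSBC would then take the weakest value in
# the strongly relevant group as its correlation threshold and the strongest
# value in the weakly relevant group as its independence threshold.
X = {"f1": [0, 0, 1, 1, 0, 1], "f2": [1, 0, 1, 0, 1, 0]}
y = [0, 0, 1, 1, 0, 1]
scores = {name: pearson_chi2(col, y) for name, col in X.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # f1 scores higher
```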
The three sets are treated differently. Redundant features are deleted from the strongly relevant set, yielding a reduced strongly relevant set in which the features are pairwise independent. The relevant set is then filtered: any feature Fi in it that is correlated with some feature in the reduced strongly relevant set is deleted, yielding a reduced relevant set. The weakly relevant set is deleted completely. The selected feature subset is the union of the reduced strongly relevant set and the reduced relevant set. FSBC follows the filter model, which is more efficient than the wrapper model and independent of any particular mining algorithm, so the method is broadly applicable; it has no parameters that are difficult to choose, and its results are objective.

Chapter 5 proposes Correlation Probability (CP). The χ² value of each feature against the class is computed via a crosstab and transformed into a significance level of independence, in terms of which CP is described. Potential Deviation (PD) is also presented: the features of a subset are ordered in two lists, by correlation with the class and by redundancy with a reference feature, and the PD of a feature Fi is its deviation in position between the two lists. By this definition, features divide into positive-PD, zero-PD, and negative-PD features: positive-PD features are more correlated with the class than they are redundant with the reference feature, zero-PD features are exactly as correlated as they are redundant, and negative-PD features are more redundant. A feature selection algorithm based on PD (FSBPD) is presented, whose major step is to delete the negative-PD features; a sketch follows below. Its input features must be discrete, so all features are first discretized via k-means; because the task is feature selection, the discretization is unsupervised and aims to keep the distribution of the discretized data as close as possible to the original. FSBPD can remove redundant features, but it cannot guarantee the removal of irrelevant features, which it does not address. Like FSBC, it is parameter-free and based on the filter model.

The experimental results are presented at the end of Chapters 4 and 5. They indicate that both feature selection algorithms select good subsets, although some shortcomings remain for further study. In FSBC, χ² values are comparable only when their degrees of freedom agree; how to compare values without this restriction deserves further consideration, and since the accuracy of the results depends on the two thresholds, how to choose them accurately also needs theoretical discussion. FSBPD removes redundant features effectively but cannot always remove irrelevant ones, so how to set a threshold for removing irrelevant features should likewise be investigated.
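To make potential deviation concrete, here is a minimal sketch under stated assumptions: relevance is χ² against the class, redundancy is χ² against a chosen reference feature, and a feature's PD is its rank in the redundancy ordering minus its rank in the relevance ordering, so relatively-more-redundant features get negative PD and are dropped. The helper names and the sign convention are assumptions; the abstract does not fully specify them.

```python
import numpy as np

def chi2(a, b):
    """Pearson chi-square of two discrete sequences via their crosstab
    (same computation as in the FSBC sketch above)."""
    ra, rb = sorted(set(a)), sorted(set(b))
    obs = np.array([[sum(x == u and y == v for x, y in zip(a, b))
                     for v in rb] for u in ra], dtype=float)
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    return ((obs - exp) ** 2 / exp).sum()

def potential_deviation(X, y, reference):
    """PD of each feature: its rank when sorted by redundancy with the
    reference minus its rank when sorted by relevance to the class, so a
    feature that is relatively more relevant than redundant gets PD > 0."""
    names = [n for n in X if n != reference]
    by_relevance = sorted(names, key=lambda n: -chi2(X[n], y))
    by_redundancy = sorted(names, key=lambda n: -chi2(X[n], X[reference]))
    return {n: by_redundancy.index(n) - by_relevance.index(n) for n in names}

def fsbpd(X, y, reference):
    """FSBPD's major step: drop every negative-PD feature."""
    return [n for n, d in potential_deviation(X, y, reference).items() if d >= 0]
```

In the thesis's setting, continuous features would first be discretized with unsupervised k-means before the crosstabs are formed; that step, and the choice of reference feature, are left out of the sketch.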
- 【Online Publication Submitter】 Jilin University 【Online Publication Issue】 2006, No. 09
- 【CLC Number】 TP311.13
- 【Cited By】 16
- 【Downloads】 559