Correlation-based Interesting Association Rules Mining (基于统计相关性的有趣关联规则的挖掘)
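【Author】 Zhang Xinxia (张新霞)
【Advisor】 Wang Yaoqing (王耀青)
【Author information】 Wuhan University of Science and Technology, Control Theory and Control Engineering, 2002, Master's degree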
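【Summary】 In recent years, with the development of computer and information technology and the widespread use of databases, data mining has become a research focus in machine learning, artificial intelligence, databases, and related fields. Within it, the broad commercial use of association rules has made association rule mining one of the most active research directions in data mining. Association rule mining can be divided into two subproblems: generating the large (frequent) itemsets, and generating strong association rules. For the first subproblem, algorithmic complexity is the bottleneck, because the number of frequent itemsets grows exponentially with the number of items. Fortunately, many efficient mining algorithms have been proposed, all of which prune the huge search space using a quality measure of the rules against a minimum threshold. The second subproblem has received less study, mainly because one requirement has been overlooked: the strong rules generated must also be interesting. Association rule mining discovers a large number of rules from large databases, and selecting the interesting ones is an important part of knowledge discovery. Most current algorithms use support and confidence to bound the strength of rules. In practice, however, support and confidence alone are not enough, because strong rules mined at great computational cost are not necessarily useful or interesting to the user, and some are even misleading. Since the goal is to find rules that interest the user and aid decision making, this thesis first analyzes why many mined rules are uninteresting or misleading. To handle independence and negative correlation among the items of an itemset, it introduces the concept of statistical correlation from probability theory, defines an interestingness measure RI on that basis, and incorporates it into the support-confidence framework of association rule mining. RI constrains the generation of rules the user is not interested in, so that the mined rules are more interesting and useful. Alongside a theoretical and intuitive analysis of the measure, the thesis gives an algorithm design and an example that verifies its effectiveness. Interestingness is a relative, domain-dependent concept, so the measure defined here does not apply in every situation. Although the thesis discusses an objective interestingness measure, it still depends on the domain to some degree: some domains seek rules whose items are positively correlated, others rules whose items are negatively correlated. For the former, rules with RI greater than 1 are interesting and are kept; for the latter, rules with RI less than 1 are interesting. Because the thesis discusses rule interestingness in the context of market basket data, rules with positively correlated items are the ones needed.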
【Abstract】 Recently, our capabilities for both generating and collecting data have been increasing rapidly. The explosive growth in data and databases has generated an urgent need for new techniques and tools that can intelligently and automatically transform the processed data into useful information and knowledge. Consequently, data mining has become a research area of increasing importance. Since its introduction, association rule mining has become one of the core data mining tasks and has attracted tremendous interest among data mining researchers and practitioners. The task of mining association rules consists of two main steps. The first involves finding the set of all frequent itemsets. The second involves generating and testing all high-confidence rules among those itemsets. For the first step, computational complexity is the bottleneck, because the number of frequent itemsets increases exponentially with the number of items. Fortunately, some efficient algorithms have been presented in the literature, and most can prune the huge search space using a quality measure of the rule against a minimum threshold. For the second step, an important requirement is that the mined rules be interesting to the user. However, association rule mining algorithms tend to produce a huge number of rules, most of which are of no interest to the user. In this paper, we first analyze some problems in association rule mining. We then introduce the concept of statistical correlation and define a rule interestingness measure based on it. What we are interested in during mining are rules with strong item correlation, so the interestingness measure introduced in this paper serves as a constraint that filters out rules whose items are independent or negatively correlated. Combined with support and confidence, it lets us find only the interesting or useful rules in a data set.
We show the rationality of the measure from two aspects, theoretical and intuitive, and describe an algorithm for mining interesting rules. Finally, an example is given to show the effectiveness of the algorithm. Interestingness is a relative and domain-specific concept, so the method introduced in this paper is not applicable in all conditions, even though the measure is objective. For example, in some fields positive correlation is considered interesting, while in others negative correlation is preferred. Our discussion is set in the context of market basket data, so rules whose items are positively correlated are interesting and useful for market decisions.
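The idea of adding a correlation constraint to the support-confidence framework can be illustrated with a minimal Python sketch. The abstract does not give the formula for its interestingness measure (named RI in the Chinese abstract), so the sketch assumes RI is the standard correlation ratio P(A and B) / (P(A) P(B)), which equals 1 for independent items, exceeds 1 for positive correlation, and falls below 1 for negative correlation. The function names, thresholds, and toy basket data are hypothetical and are not taken from the thesis.

```python
# Toy market-basket data (hypothetical): 20 transactions.
transactions = (
    [frozenset({"tea", "coffee", "sugar"})] * 4
    + [frozenset({"tea"})] * 1
    + [frozenset({"coffee"})] * 12
    + [frozenset({"coffee", "sugar"})] * 2
    + [frozenset({"milk"})] * 1
)


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, ri) for the rule antecedent -> consequent.

    ri is ASSUMED to be P(A and B) / (P(A) * P(B)); the thesis may define
    its RI measure differently.
    """
    a, c = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)          # count of A
    n_c = sum(1 for t in transactions if c <= t)          # count of B
    n_ac = sum(1 for t in transactions if (a | c) <= t)   # count of A and B
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    ri = (n_ac * n) / (n_a * n_c) if n_a and n_c else 0.0
    return support, confidence, ri


def is_interesting(metrics, min_support, min_confidence, min_ri=1.0):
    """Keep a rule only if it is strong AND its items are positively correlated."""
    support, confidence, ri = metrics
    return support >= min_support and confidence >= min_confidence and ri > min_ri


tea_coffee = rule_metrics(transactions, {"tea"}, {"coffee"})
tea_sugar = rule_metrics(transactions, {"tea"}, {"sugar"})
print("tea -> coffee:", tea_coffee, is_interesting(tea_coffee, 0.1, 0.7))
print("tea -> sugar: ", tea_sugar, is_interesting(tea_sugar, 0.1, 0.7))
```

In this toy data both rules have the same support (0.2) and confidence (0.8), so a support-confidence filter alone cannot tell them apart. Under the assumed ratio, tea -> coffee scores below 1 because coffee is bought in 90% of all baskets anyway, so the rule is rejected as misleading, while tea -> sugar scores well above 1 and is kept.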
【Key words】 data mining; association rule; support; confidence; interestingness; correlation;
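- 【Online publication contributor】 Wuhan University of Science and Technology 【Online publication year and issue】 2002, Issue 02
- 【Classification number】 TP311.12
- 【Times cited】 3
- 【Downloads】 218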