节点文献

基于投影数组和加权FP-tree的频繁项集挖掘算法研究

Research on Frequent Itemsets Algorithm Based on Projection Array and FP-tree

【作者】 曹海燕

【导师】 何海涛;

【作者基本信息】 燕山大学 , 计算机软件与理论, 2010, 硕士

【摘要】 频繁项集挖掘是数据挖掘领域中一个比较关键的问题。然而,从大型稠密数据集中挖掘频繁项集存在三个主要的瓶颈问题:第一,算法的挖掘效率不是很高;第二,产生的频繁项集的数量太多;第三,没有采用合理的约束思想,不能有效的挖掘用户兴趣模式。本文针对这些问题,将研究重点放在频繁项集挖掘算法上,其研究成果可广泛应用于客户购买行为模式预测、序列分析和软件安全分析等领域。首先,本文提出了基于投影数组的频繁项集挖掘算法MFIPA。基于垂直和水平混合数据格式,通过交集操作找到与单个频繁项共同发生的项集,产生投影数组PArray;然后,通过单个频繁项与其投影的非空子集合并及深度优先搜索策略的使用,挖掘所有的频繁项集。其次,为了减少频繁项集的数量,设计了一个新颖的频繁闭项集挖掘算法FCIL-Mine。基于投影数组,首先提出了频繁闭项集框架数据结构FCIL,该框架主要是用来存储频繁闭项集的一些信息。然后,通过哈希检测和包含检测剪枝策略的使用,进而挖掘所有的频繁闭项集。最后,提出了一个基于加权FP-tree及长度递减支持度约束的加权频繁项集挖掘算法LWFI-Mine。该算法可以有效的挖掘满足用户兴趣的项集。首先通过扫描数据库,构造数据结构加权FP-tree。然后提出加权最小有效扩展性质WSVE及基于此性质的三种剪枝策略:事务剪枝、结点剪枝和路径剪枝,缩小了FP-tree的搜索空间,进而挖掘所有满足约束的频繁项集。本文使用C++语言对上述算法进行实现,采用稀疏的人工数据集T40I10D100K和稠密的真实数据集Connect进行频繁项集挖掘实验研究。

【Abstract】 Frequent itemsets mining is a crucial problem in the field of data mining. But there are three main difficult problems when mining frequent itemsets from large dense database. First, the efficiency of algorithms is not very high; Second, large numbers of frequent itemsets will be generated; Third, few algorithms refer to the reasonable constraint method, so they can’t mine interesting patterns efficienctly. To resolve these problems, this paper has mainly focused on the research of algorithms for mining frequent itemsets. These algorithms can be widely used in customer’s purchasing behavior pattern forecast, sequence analysis and software security analysis and so on.Firstly, a new frequent itemsets mining algorithm MFIPA is proposed based on projection array. Making use of data horizontally and vertically, those itemsets that co_occurence with single frequent items are found by computing intersection, and projection array PArray is constructed. Then all of the frequent itemsets can be mined by connecting the single frequent item with every nonempty subsets of its projection and using depth-first search strategy.Secondly, to decrease the number of frequent itemsets, a novel algorithm called FCIL-Mine for mining frequent closed itemsets is proposed in this paper. Firstly, a new data structure frequent closed itemsets lattice FCIL is proposed based on PArray, which is used for maintaining the essential information of frequent closed itemsets. Then, all of the frequent closed itemsets can be found by using hash table check and subsume-check pruning strategy.Lastly, a novel algorithm LWFI-Mine for mining weighted frequent itemsets with length decreasing support constraint is presented, which can mine interesting patterns with users’requirement efficienctly. Firstly, a data structure wighted FP-tree is constructed by scanning database. Then the notion of the weighted smallest valid extension (WSVE) property is proposed, based on WSVE, transaction pruning, node pruning and path pruning are applied to reduce the search space of FP-tree. Finally, all of frequent itemsets with constraint can be mined.Our algorithm is written in C++, sparse synthetic dataset T40I10D100K and dense real dataset Connect are adopted for mining frequent itemsets in the experiments.

  • 【网络出版投稿人】 燕山大学
  • 【网络出版年期】2012年 02期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络