节点文献
结构数据挖掘与处理的若干问题的研究
【作者】 王晨;
【导师】 施伯乐;
【作者基本信息】 复旦大学 , 计算机软件与理论, 2005, 博士
【摘要】 目前,数据挖掘及其应用已经渗透到多个学科,并在人工智能与机器学习、数据库、模式识别、生物信息学、神经计算等领域取得了丰硕的成果。同时,数据挖掘也不仅是科学家的兴趣所在,更多地得到了政府、工业界的密切关注。通过引进数据挖掘的能力,可以大大提高生产力,取得社会的更大进步。世界上许多国家和地区的政府及工业界都希望掌握数据挖掘技术,提升国家和企业的科技含量,并最终取得领先的地位。 随着研究的深入,越来越多的问题呈现在我们面前,也提出了更高的要求。当前,复杂类型数据的挖掘需求上升,专家学者开始关注这方面的新应用和理论研究,并试图利用无结构化数据挖掘方面的经验和方法论来帮助解决新问题。而针对结构数据的挖掘与处理就是本文所致力研究的问题。 本文针对结构数据挖掘与处理目前存在的几个关键问题进行了研究,包括提高半结构化数据挖掘的速度与效率、提高图结构数据挖掘的可量测性和处理速度、约束条件下的图结构数据挖掘的方法、图结构数据索引技术。本文的创造性研究成果主要有: (1) 提出了4个频繁子树挖掘算法,分别是Chopper、XSpanner、ESMiner、ISMiner。分别采取了序列增长技术和最右路径增长技术,挖掘嵌入式子树和导出式子树模式。实验结果表明这些算法的运行效率良好,在性能上优于目前已提出的子树挖掘算法。 (2) 提出了一个新颖的子图索引结构ADI,并将其应用于频繁子图挖掘过程中,形成了图挖掘算法ADI-Mine和图挖掘应用系统GraphMiner。实验结果表明,ADI在一定程度上避免了子图同构判断的巨大代价,提高了算法的效率和可量测性。通过与目前世界上认可的最快的图挖掘算法gSpan比较,ADI-Mine无论从可量测性上还是从时间效率上,都大大优于对方。在此基础上,还提出了将ADI移植到其他图挖掘算法中的想法,进一步提高效率。 (3) 总结了目前常用的图约束条件,并根据其特性将约束分成若干类别,最后提出了带约束的图挖掘算法CabGin。实验证明,通过聚集挖掘焦点,不仅可以减少噪声结果对分析造成的影响,还可以提高挖掘效率。
【Abstract】 Recently, data mining and its applications have already come into many disciplines and achieved plentiful fruits in diversified fields, including artificial intelligence and machine learning, database, pattern recognition, bioinformatics, neural computing, and so on. It not only appeals scientists but also catches the attention from governments and industries. The governments, industrial communities, and academic fields are so keen on mastering data mining techniques that they have invested a large deal of money and energy on the corresponding research. Therefore, the progress of data mining will promote the development of science and society.With the progress of data mining techniques, more and more questions have been presented. The demand of mining on complex data is rising now. Experts have paid attention to these fields and tried to solve the problems by virtue of the experience of unstructured data mining like frequent itemsets mining. In this paper, I do the research on structured data mining and processing.In this dissertation, 4 problems standing in need of solutions are investigated, which includes improving the efficiency of semi-structured data mining, promoting the scalability of structured data mining, mining graph data with constraints, and indexing graph database. The main contributions of the dissertation are summarized as follows:Firstly, 4 algorithms, Chopper, XSpanner, ESMiner and ISMiner, have been proposed. Those algorithms mines frequent induced and embedded subtrees by virtue of method of pattern growth and rightmost path growth respectively. Experimental results show that the algorithms perform better than those algorithms presented ago like TreeMiner and FREQT.Secondly, a novel graph indexing structure of ADI is proposed. It is embedded into graph mining algorithm to improve the scalability. Experimental results show that ADI-Mine perform better than others like gSpan, the best graph mining algorithm before. Based on it, I continue to present the ideas on transplanting the ADI indexing structure into other graph mining algorithms for improving their efficiency and scalability.
【Key words】 Data Mining; Semi-structured and structured; Frequent subtree; Induced subtree; Embedded subtree; Pattern growth; Monotonic constraint; Anti-monotonic constraint; Non-monotonic constraint; Indexing; Querying; Weblog; XML; Social network; Genetic sequence;