The Research and Realization of Clustering Algorithm in Data Streams Mining (数据流挖掘中聚类算法的研究与实现)
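【Author】 Zhang Fan (张帆);
【Advisor】 Feng Xiulan (冯秀兰);
【Author information】 Beijing Forestry University, Computer Application Technology, 2012, Master's thesis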
【Abstract】 With the rapid development of Internet and distributed computing technologies, a new type of data, the data stream, has emerged. It appears widely in Internet information monitoring, banking and securities analysis, wireless sensor networks, weather forecasting, and meteorological monitoring. Traditional data mining techniques apply only to static, small data sets and cannot easily be extended to data streams, which are fast, unbounded, and continuously changing. Research on data stream theory and its applications has therefore become an important direction in data stream mining, particularly data stream clustering algorithms and problems such as network intrusion detection. This thesis first reviews the state of research on data stream clustering algorithms in China and abroad, together with the strengths and weaknesses of static data mining algorithms and of data stream clustering algorithms, laying the groundwork for the algorithm presented later. Then, building on an in-depth study of existing data stream clustering algorithms, it implements a density-based data stream clustering algorithm, ρ-Stream. Based on the characteristics of the Minkowski distance and the cosine similarity measure, two new concepts are introduced, frequency and data summary information, and a method for measuring the similarity of multi-attribute data is proposed. ρ-Stream stores nodes and pointers in a tree structure and a dynamic hash table to address the algorithm's time and space complexity. To address parameter setting in data stream clustering, a density threshold function is proposed so that the algorithm can run within a fixed amount of memory. To improve the running efficiency of the offline-layer clustering algorithm, a method based on in-memory sampling is proposed for discovering clusters. Finally, based on the characteristics of data streams, a network intrusion detection framework suited to data stream clustering algorithms is designed, and its dictionary of anomalous data is extended in real time through background machine learning. Experiments on the KDD CUP 1999 data set confirm the superiority of the proposed algorithm and show that it achieves the expected results.
【Key words】 Data Stream Mining; Data Stream Clustering Algorithm; ρ-Stream Algorithm;
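The abstract names two standard measures, the Minkowski distance and cosine similarity, as the starting point for the thesis's multi-attribute similarity method. As a reminder of what those two textbook measures compute, here is a minimal Python sketch. It is not the thesis's own similarity measure: the frequency and data summary concepts are not described in enough detail here to reproduce, and the function names `minkowski_distance` and `cosine_similarity` are illustrative choices, not names taken from the thesis.

```python
import math


def minkowski_distance(x, y, p=2.0):
    """Textbook Minkowski distance of order p between two numeric vectors.

    p=1 gives the Manhattan distance and p=2 the Euclidean distance.
    Illustrative only; not the similarity measure proposed in the thesis.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def cosine_similarity(x, y):
    """Textbook cosine similarity: the cosine of the angle between x and y."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)
```

The Minkowski distance depends on the magnitude of the difference between records, while cosine similarity depends only on their direction, which is presumably why a method for multi-attribute data would consider the characteristics of both.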