Research on Data Stream Clustering Based on Grid and Density (基于网格和密度的数据流聚类研究)

【Author】 Li Xiaopeng (李晓鹏)

【Advisor】 Zhang Xiaofang (张晓芳)

【Author Information】 Huazhong University of Science and Technology, Computer Software and Theory, 2012, Master's thesis

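【Abstract (translated from Chinese)】 Dynamic application environments such as network intrusion detection, real-time monitoring systems, and user clickstreams on the web continuously produce data streams that are time-ordered, massive, rapidly changing, and potentially unbounded, so mining data streams has become both important and of practical value. Cluster analysis is a central problem in data mining and has been studied extensively. However, the data stream model differs from that of a traditional data set, which brings new requirements and challenges. A review of traditional clustering methods shows that existing data stream clustering algorithms such as CluStream are based on k-means. Such algorithms are ill-suited to finding clusters of arbitrary shape and cannot handle outliers; they also require the value of k and a user-specified time window. Grid- and density-based clustering methods, by contrast, have many properties that suit data stream processing and make stream clustering comparatively easy to implement. Therefore, building on a study and improvement of traditional grid- and density-based clustering algorithms, and starting from the dynamic nature of the data to be processed during clustering, this thesis studies grid- and density-based data stream clustering and proposes GDCLUS, a density-based method for clustering data streams. The algorithm uses an online component to map each incoming data record to a grid cell, while the offline component clusters the grid cells mainly using the idea of a minimum spanning tree. It adopts a density decay technique to capture the dynamic changes of the data stream; by discovering the complex relationship among the decay factor, data density, and cluster structure, the algorithm can generate and adjust clusters effectively in real time. In addition, an improved pyramid framework is applied to data filtering in the online component, which makes clustering of high-rate data streams more feasible without reducing clustering quality. Experimental results show that the algorithm achieves excellent quality and efficiency, discovers clusters of arbitrary shape, and accurately identifies the evolving characteristics of real-time data streams. Finally, with practical data stream applications in mind, the algorithm's performance was tested, and experiments on the KDDCup99 network intrusion detection data set verified its feasibility.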

【Abstract】 Applications such as network intrusion detection, real-time monitoring systems, and user clickstream data on the web continuously generate time-ordered, large-scale, fast-changing, and potentially infinite data streams, so research on data stream mining has become important and useful. Clustering, a very important problem in data mining, has been widely studied. However, the data stream model differs from the traditional data set, which raises new demands and challenges. This thesis studies traditional clustering methods and finds that existing data stream clustering algorithms such as CluStream are based on the k-means algorithm. Those algorithms are not suitable for finding clusters of arbitrary shape and cannot handle outliers. Furthermore, they need the value of k and a user-specified time window. Clustering methods based on grid and density, in contrast, have many features that apply to data stream processing and make data stream clustering easy to realize. Thus, this thesis studies and improves traditional grid- and density-based algorithms and, considering the dynamic nature of the data sets, proposes GDCLUS, a density-based method for clustering data streams. The algorithm uses an online component to map every input data record to a grid, while the offline component clusters the grids using the idea of a minimum spanning tree. The algorithm uses a density decay technique to capture the dynamic changes of the data stream. By discovering the relationship among the decay factor, data density, and cluster structure, it can effectively generate and adjust clusters in real time. Furthermore, an improved pyramidal time framework is used to select data in the online component, improving the space and time efficiency of clustering; this makes high-rate data stream clustering more feasible without reducing clustering quality. The experimental results show that the algorithm has good quality and efficiency, can discover clusters of arbitrary shape, and can detect the evolving features of the data stream. Finally, this thesis tests the algorithm's performance in a real data stream application area and conducts experiments on the KDDCup99 data set used for network intrusion detection, verifying the feasibility of the algorithm.
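
The online/offline split described in the abstract can be illustrated with a minimal sketch. It is not the GDCLUS implementation: the abstract gives no formulas, so the exponential decay rule, the names `DecayedGrid` and `cluster_dense_cells`, and the parameters `cell_width`, `decay`, and `threshold` are all assumptions made for illustration. In particular, the offline step here substitutes a plain union-find grouping of face-adjacent dense cells (the connected components of a spanning forest) for the thesis's minimum-spanning-tree procedure, whose details are not stated.

```python
import math
from collections import defaultdict


class DecayedGrid:
    """Online component (sketch): map records to grid cells, keep decayed densities."""

    def __init__(self, cell_width=1.0, decay=0.5):
        self.cell_width = cell_width  # assumed uniform cell size
        self.decay = decay            # assumed decay factor in (0, 1)
        self.cells = {}               # cell -> (density, time of last update)

    def cell_of(self, point):
        # Integer grid coordinates of the cell containing the point.
        return tuple(math.floor(x / self.cell_width) for x in point)

    def insert(self, point, t):
        # Decay the stored density up to time t, then count the new record.
        cell = self.cell_of(point)
        density, last = self.cells.get(cell, (0.0, t))
        self.cells[cell] = (density * self.decay ** (t - last) + 1.0, t)

    def density_at(self, cell, t):
        # Density of a cell as seen at time t, without modifying state.
        density, last = self.cells.get(cell, (0.0, t))
        return density * self.decay ** (t - last)


def cluster_dense_cells(grid, t, threshold):
    """Offline component (sketch): group face-adjacent dense cells via union-find."""
    dense = [c for c in grid.cells if grid.density_at(c, t) >= threshold]
    parent = {c: c for c in dense}

    def find(c):
        # Find the representative of c, halving the path along the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for c in dense:
        for axis in range(len(c)):
            # Checking only the +1 neighbor covers each adjacent pair once.
            neighbor = c[:axis] + (c[axis] + 1,) + c[axis + 1:]
            if neighbor in parent:
                parent[find(c)] = find(neighbor)

    clusters = defaultdict(set)
    for c in dense:
        clusters[find(c)].add(c)
    return list(clusters.values())
```

In this sketch, a cell that stops receiving records loses density over time and eventually falls below `threshold`, so it drops out of the clusters on the next offline pass; that is one simple way a decay factor can let the cluster structure track an evolving stream.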

  • 【Classification Code】TP311.13; TP393.08
  • 【Citation Count】3
  • 【Download Count】189