
An Algorithm Based on Partition for Outlier Detection


【Authors】 SUN Huan-Liang (1,2), BAO Yu-Bin (1,+), YU Ge (1), ZHAO Fa-Xin (1), WANG Da-Ling (1)

【Affiliations】 (1) School of Information Science and Engineering, Northeastern University, Shenyang 110006, China; (2) School of Information and Control Engineering, Shenyang Jianzhu University, Shenyang 110015, China

【Abstract】 Outliers are data objects that do not conform to the general behavior of the data. Partition-based methods divide the space occupied by the data points into a set of non-overlapping hyper-rectangular cells by splitting every dimension into intervals of equal length, assign each data object to its cell, and then use per-cell statistics to find outliers. Most real-life datasets are highly skewed, so partitioning produces a large number of empty cells, which degrades the performance of such algorithms. To address this, an index structure called CD-Tree (cell dimension tree) is proposed for indexing only the non-empty cells. To optimize the structure of the CD-Tree and to guide the partitioning of the data, the concept of partition-based skew of data (SOD) is introduced to measure the degree of data skew. A new outlier detection algorithm is then designed on top of CD-Tree and SOD. Experimental results on real-life datasets show that, compared with the Cell-based algorithm, the proposed algorithm is significantly more efficient and can effectively handle a larger number of dimensions.
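
The partition step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: a plain dictionary keyed by cell coordinates stands in for the CD-Tree, SOD is not modeled because the abstract does not define it, and the outlier rule (a cell whose own count plus the counts of its adjacent cells is at most `max_count`) is a simplified assumption. The names `cell_of`, `sparse_cell_outliers`, `bins`, and `max_count` are hypothetical.

```python
from collections import defaultdict
from itertools import product


def cell_of(point, mins, widths):
    """Map a point to the integer coordinates of the cell containing it."""
    return tuple(int((x - lo) // w) for x, lo, w in zip(point, mins, widths))


def sparse_cell_outliers(points, bins, max_count):
    """Flag points whose cell and adjacent cells together hold few points.

    Only non-empty cells are stored (in a dict). This is an assumed
    stand-in for the paper's CD-Tree, not that structure itself.
    """
    dims = len(points[0])
    mins = [min(p[d] for p in points) for d in range(dims)]
    maxs = [max(p[d] for p in points) for d in range(dims)]
    # Equal-width intervals per dimension; fall back to 1.0 if a range is zero.
    widths = [((hi - lo) / bins) or 1.0 for lo, hi in zip(mins, maxs)]

    cells = defaultdict(list)
    for p in points:
        # Clamp so that points on the upper boundary land in the last cell.
        key = tuple(min(c, bins - 1) for c in cell_of(p, mins, widths))
        cells[key].append(p)

    offsets = list(product((-1, 0, 1), repeat=dims))
    outliers = []
    for key, members in cells.items():
        # Count points in this cell and in every adjacent cell.
        # dict.get is used so that empty neighbours are never materialized.
        total = sum(
            len(cells.get(tuple(k + o for k, o in zip(key, off)), ()))
            for off in offsets
        )
        if total <= max_count:
            outliers.extend(members)
    return outliers, len(cells)


pts = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0), (0.2, 0.8), (1.5, 0.3), (10.0, 10.0)]
outliers, n_cells = sparse_cell_outliers(pts, bins=5, max_count=1)
print(outliers)  # [(10.0, 10.0)]
print(n_cells)   # 2 non-empty cells out of 25
```

In this toy input only 2 of the 25 grid cells are occupied, which shows why storing only non-empty cells matters on skewed data. The neighbour scan visits 3^d cells per non-empty cell, so the sketch is only practical in low dimensions.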

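【Funding】 National Natural Science Foundation of China; Teaching and Research Award Program for Outstanding Young Teachers in Higher Education Institutions of the Ministry of Education of China; Natural Science Foundation of Liaoning Province; Key Research Program of the Education Department of Liaoning Province
  • 【Source】 Journal of Software (软件学报), 2006, No. 5
  • 【Classification code】 TP18
  • 【Times cited】 58
  • 【Downloads】 676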