
离群检测及其优化算法研究

Research on Outlier Detection and Its Optimal Algorithms

【Author】 Yang Peng (杨鹏)

【Advisor】 Zhu Qingsheng (朱庆生)

【Author information】 Chongqing University, Computer Science and Technology, 2010, Ph.D.

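【Abstract (translated from Chinese)】 In a dataset, outliers are data patterns that are abnormally isolated relative to the bulk of ordinary data. Outliers are often discarded as noise, but in practice data carrying important information frequently turn out to be outliers. Outlier detection uses techniques from statistics, machine learning, intelligent computing, and visualization to find the outliers in a dataset so that users can analyze and handle them. Because outliers may carry important knowledge, outlier detection is widely applied in areas such as preventing telecommunications and credit card fraud, medical insurance, market analysis, and weather forecasting, and research on it has both academic and practical significance. Faced with increasingly complex, large, high-dimensional datasets, however, finding and handling abnormal behavior quickly and effectively remains a challenging problem. This thesis applies clustering and classification methods to discover abnormal objects in datasets and studies optimization algorithms related to outlier detection. We propose outlier detection methods based on spectral clustering and on RBF artificial neural networks, and, for high-dimensional datasets, we define the concept of a key outlying attribute subset and implement attribute reduction to optimize outlier detection. The main work and results are as follows:

① The basic principles and typical algorithms of spectral clustering are analyzed and studied fairly comprehensively, and the properties of spectral clustering are used to cluster complex datasets. An improved spectral clustering algorithm based on random walks is proposed. It introduces a density-sensitive distance measure to compute the similarity between objects more precisely, and it determines the optimal number of clusters automatically by computing the relevant eigenvalues of the stochastic matrix. The stable clustering obtained with this algorithm is a prerequisite for effective outlier detection.

② Spectral clustering is applied to outlier detection for the first time, and its feasibility is proved by defining an extended multiway cut and piecewise constant eigenvectors. An outlier detection algorithm based on spectral clustering is proposed: it first clusters the dataset, then computes an outlier factor for the objects in every cluster and identifies outliers from that value. During spectral clustering, an adjacency matrix built from shared neighbors yields a relatively sparse matrix whose eigenvectors can be computed quickly with the Lanczos algorithm.

③ An outlier detection model is built with an RBF artificial neural network. The model uses subtractive clustering to select the hidden node centers effectively and to train faster. During training, an adjustment term is added to the conventional error function to eliminate fluctuations of the hidden-layer nodes. An outlier degree is defined for each input sample; once the network output is determined, samples whose actual output deviates severely from the expected output can be identified as outliers according to this degree.

④ To address the low efficiency of finding outliers in large high-dimensional datasets, concepts from rough set theory are introduced and an outlier detection method based on attribute reduction is proposed. If the outlier partition obtained on some attribute subset is sufficiently similar to the one obtained on the full attribute set, outlier detection on such a dataset can be performed directly on that subset (the key outlying attribute subset). An efficient method for finding key outlying attribute subsets is also proposed, and experiments verify its effectiveness.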
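
The abstract does not state how the cluster count is derived from the eigenvalues of the stochastic matrix. One common heuristic for this kind of selection is the eigengap rule, sketched below in Python with NumPy. This is an illustration under stated assumptions, not the thesis's algorithm: a plain Gaussian affinity on Euclidean distance stands in for the density-sensitive distance measure, and the function name `estimate_num_clusters` and the parameters `sigma` and `max_k` are hypothetical.

```python
import numpy as np

def estimate_num_clusters(X, sigma=1.0, max_k=10):
    """Guess the number of clusters with the eigengap heuristic.

    Assumption: Gaussian affinity on Euclidean distance (NOT the
    density-sensitive measure described in the abstract).
    Returns (k, eigenvalues sorted in descending order).
    """
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Affinity matrix W with no self-loops.
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    degrees = W.sum(axis=1)
    # The random-walk matrix P = D^-1 W has the same eigenvalues as the
    # symmetric matrix S = D^-1/2 W D^-1/2, which is cheaper and more
    # stable to decompose.
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]
    # k is the position of the largest drop among the leading eigenvalues.
    top = eigvals[:max_k + 1]
    gaps = top[:-1] - top[1:]
    return int(np.argmax(gaps)) + 1, eigvals
```

For well-separated groups, the transition matrix is nearly block diagonal, so it has one eigenvalue close to 1 per group followed by a clear drop. The heuristic becomes unreliable when clusters overlap or when `sigma` is poorly matched to the data scale.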

【Abstract】 An outlier in a dataset is an observation or data pattern that is considerably dissimilar to or inconsistent with the remainder of the data. In most cases, outliers are discarded as noise. In some real-life applications, however, the objects that carry important information turn out to be outliers. Outlier detection aims to find outliers in a dataset by using statistics, machine learning, intelligent computing, visualization, and other techniques, so that they can be analyzed and studied further. Since rare events may contain important knowledge, outlier detection has many useful applications, such as preventing telecommunications and credit card fraud, medical insurance, market analysis, and weather forecasting. The study of outlier detection is therefore significant for both research and practice. How to find and handle anomalies efficiently and effectively in large high-dimensional datasets is a challenging problem. In this thesis we focus on finding anomalies in datasets with clustering and classification methods, and on the implementation and optimization of key techniques for outlier detection. We propose outlier detection methods based on spectral clustering and on an RBF neural network, and we use rough sets to implement attribute reduction that speeds up outlier detection. The main results are as follows:

① The basic theory and typical algorithms of spectral clustering are analyzed thoroughly, and spectral methods are used to cluster complex datasets. An improved algorithm based on random walks is proposed. It introduces a density-sensitive distance metric to calculate the similarity between objects more accurately, and it selects the optimal number of clusters automatically from the eigenvalues of the stochastic matrix. The stable clustering obtained with this algorithm is a prerequisite for effective outlier detection.

② Spectral clustering is applied to outlier detection for the first time, and its feasibility is proved by defining an extended multicut and piecewise constant eigenvectors. An outlier detection algorithm based on spectral clustering is proposed: it first partitions the dataset, then calculates the outlying factor of the objects in each cluster and identifies the outliers according to these values. In the spectral clustering process, an adjacency matrix built from shared neighborhoods yields a sparse matrix whose leading eigenvectors can be computed efficiently with the Lanczos method.

③ An outlier detection model is constructed with an RBF neural network. It uses the subtractive clustering algorithm to select the hidden node centers and thereby trains faster. During network training, a regularization term is added to the traditional error function to minimize the variance of the hidden-layer nodes. By defining a degree of outlierness for each input sample, we can identify as outliers the samples whose actual output deviates seriously from the expected output, provided the network output is determined.

④ To address the inefficiency of finding outliers in large high-dimensional datasets, an attribute-reduction-based detection method is proposed that draws on rough set concepts. By defining an outlying partition similarity, we can mine outliers on a key outlying attribute subset rather than on the full attribute set, as long as the outlying partitions produced on the two are sufficiently similar. An efficient method for finding the key outlying attribute subset is proposed, and experimental results confirm its effectiveness.
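
The idea in item ④, detecting outliers on a small attribute subset when it reproduces the outlier partition of the full attribute set, can be illustrated with a short Python sketch. Everything concrete here is an assumption made for illustration: the abstract does not define the outlying partition similarity, so Jaccard similarity of the flagged sets is used as a stand-in; a simple k-nearest-neighbour distance score replaces the thesis's detectors; no rough set machinery is used; and the exhaustive search over subsets is not the efficient search method the thesis proposes. The names `knn_outliers`, `partition_similarity`, and `find_key_subset` are hypothetical.

```python
import numpy as np
from itertools import combinations

def knn_outliers(X, n_outliers, k=3):
    """Flag the n_outliers points with the largest mean distance
    to their k nearest neighbours (a stand-in detector)."""
    dists = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    np.fill_diagonal(dists, np.inf)          # ignore self-distance
    scores = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    return frozenset(np.argsort(scores)[-n_outliers:].tolist())

def partition_similarity(a, b):
    """Jaccard similarity of two outlier sets (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def find_key_subset(X, n_outliers, threshold=0.8, k=3):
    """Return the first smallest attribute subset whose outlier set is
    at least `threshold`-similar to the one found on all attributes.

    Exhaustive search by increasing subset size: exponential in the
    number of attributes, so suitable for small illustrations only.
    """
    X = np.asarray(X, dtype=float)
    n_attrs = X.shape[1]
    reference = knn_outliers(X, n_outliers, k)
    for size in range(1, n_attrs + 1):
        for attrs in combinations(range(n_attrs), size):
            found = knn_outliers(X[:, list(attrs)], n_outliers, k)
            if partition_similarity(found, reference) >= threshold:
                return attrs, found
    return tuple(range(n_attrs)), reference
```

Once such a subset is found, later detection runs only need the selected columns, which is where the efficiency gain on high-dimensional data would come from. How well the subset generalizes to new data depends on the detector and the similarity threshold, neither of which is specified in the abstract.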

  • 【Online publication contributor】 Chongqing University
  • 【Online publication year and issue】 2011, Issue 07