基于集成机器学习模型的无监督异常检测方法研究
Research of Unsupervised Anomaly Detection Methods Based on Ensemble Machine Learning Model
【Author】 张佳;
【Supervisor】 李智勇;
【Author's Basic Information】 Hunan University (湖南大学), Computer Science and Technology, 2020, Doctoral dissertation
【摘要】 随着大数据时代的到来,人们不再为数据匮乏而感到困扰,反而越来越关注数据的质量问题并开始探讨从大量数据中提取最有价值信息的方法设计与理论研究。作为该系列研究的重大研究课题之一,异常检测侧重于检测和识别数据集中与大部分样本存在显著差异的异常样本,已成为在网络安全的入侵检测、机器设备的故障检测、医疗图像的癌变细胞识别、金融行业的信用卡欺诈检测等多个领域的热门研究话题。目前大多数的异常检测研究专门针对某个领域的特定异常类而设计,因此无法同时实现对不同领域的多种异常类的有效检测,从而具有较差的泛化能力。事实上,在实际应用场景中大部分异常类并不能事先获取,在其检测过程中甚至会出现多种未知的新异常。算法的泛化能力在异常检测中显得尤为重要,设计一种具有更高泛化性能的算法以适应于不同领域不同类异常的识别与检测是一项具有重要意义的任务。集成学习结合多个算法的优势来取得比单个算法更好的泛化性能,该技术在传统机器学习分类和聚类问题中展现了非常好的效果并已被验证可有效提升算法的泛化能力。而集成学习应用于异常检测的技术(简称“异常集成”)目前仍处在发展阶段,异常检测中数据类别极端不平衡和数据标签缺失是阻碍其发展的主要原因。现有研究成果通过使用集成的思想来提高某个或多个异常检测算法的泛化性能,将异常集成当成一个简单的结果组合问题,忽略了检测模型的训练过程,从而仅有有限的泛化能力。为了进一步提升异常集成算法的泛化性能,本文重点关注异常集成中基本检测模型的训练过程,从集成数据准备、集成模型训练、集成模型组合以及集成学习框架四个方面进行了系统的研究与分析。本文主要创新点概括如下:

1)从数据源中选取最具代表性的正常样本集作为异常集成组件的训练数据是确保算法鲁棒性的必要手段。本文提出了一种基于集成的联合训练方法以实现样本预处理和异常评分的多次迭代优化。该方法为异常检测构建了一个涵盖样本权重计算和样本异常评估的优化模型。其中,后者得到的异常分值可用于指导前者样本权重的计算(即高异常概率的样本赋予小的权重),前者生成的具有不同权重大小的数据样本集可有效避免后者训练过程因异常样本干扰而产生的性能衰退。首先,为目标函数设计了一个基于先验知识的正则项以辅助样本的权重计算。其次,为尽可能实现异常样本具备比正常样本更高的异常分值,提出了一种基于异常分值的铰链损失函数。最后,提出了一种交替迭代方法对该集成模型进行优化求解。在多个异常检测数据集上的实验结果表明本文的方法相对于流行的算法有很大的泛化性能提升。

2)在模型训练过程中考虑算法多样性需求是构建一个好的集成算法的有效途径。本文提出了一种基于多样性感知的序列集成方法,通过提升模型多样性来提高算法的异常检测效果。该方法将集成多样性分为两个部分:样本多样性和模型多样性。对于样本多样性,使用了子采样技术以实现样本初级阶段的多样性生成。对于模型多样性,设计了一种基于集成的优化模型以进一步提高集成组件的多样性。此外,提出了一种无监督的多样性度量方法以实现多样性量化评估,设计了一种异常剪枝策略以消除训练过程可能的伪异常样本。通过对样本多样性和模型多样性的同步提升,基本检测器模型能够取得更好的多样性和准确性以构建更优的异常集成算法。在多个数据集上与多种算法的对比实验中本文的方法展现了更好的异常检测效果。

3)在模型组合过程中实现对多个集成组件结果的合理分配是提高集成算法最终性能的关键技术。本文提出了一种基于双层集成的无监督异常检测算法,可进一步提升算法的泛化性能并减少由子空间采样造成的信息损失。该方法提出的两层组合策略包括两个组成部分:内部集成和外部集成。第一层是内部集成用于减少信息损失,第二层是外部集成用于增强泛化能力。此外,为了实现第一层模型的再训练,设计了一种多样性损失函数。为了确保第二层组合的有效性,提出了一种新的加权组合策略。通过采用基于双层组合的学习策略,本文的方法无论是在高维和低维数据集亦或是大样本和小样本数据集中均表现出不同程度的泛化性能提升。

4)在学习框架中实现对数据预处理技术、模型训练技巧和模型组合策略三个部分的联合优化是进一步提高异常集成算法泛化性能的必要条件。本文设计了一种基于积极模型的无监督序列集成框架以实现这三个组成部分在统一学习框架中的同步或迭代优化,并提出了一种基于非度量局部异常评分的自适应集成方法来实例化该框架。首先,采用基于卡方分布的样本采样方法来初始化参考模型。其次,提出了一种基于加权马氏距离的非度量异常评估方法,具体通过计算多个特征子集的局部距离的加权和来近似全局距离,以获得模型训练阶段最终的异常分值。最后,设计了一种基于异常排序的自适应组合策略以有效组合多个集成组件的结果。从多组对比实验可知,本文的方法不仅在常见的静态数据集上呈现了显著的泛化性能提升,而且在最新的动态数据集上也展现了一定的发展潜力。

总的来说,本文重点关注集成的四个重要组成部分,即集成数据准备、集成模型训练、集成模型组合和集成学习框架,深入分析了每个组成部分面临的挑战和存在的不足,设计了一系列异常集成算法,提出了一种通用的序列集成框架,并取得了良好的异常检测性能。本文的研究思路以及所获得的相关研究成果对于该领域未来的深入研究有很好的参考价值。
【Abstract】 With the arrival of the big data era, people are no longer troubled by a lack of data; instead, there is growing concern about data quality, and researchers have begun to study method design and theory for extracting the most valuable information from massive data. Anomaly detection, one of the major topics in this line of research, focuses on detecting and recognizing anomalous samples that differ significantly from the majority of a dataset. It has been widely applied in many domains, such as intrusion detection in network security, fault detection of machine equipment, cancerous cell identification in medical images, and credit card fraud detection in the financial industry. However, most anomaly detection studies are designed for the specific anomaly classes of a particular domain; they cannot effectively detect multiple kinds of anomalies across domains and therefore have poor generalization ability. In practice, most anomaly classes cannot be obtained in advance, and novel, previously unseen anomalies may appear during detection, so generalization ability is particularly important for anomaly detection, and designing a method that can handle different anomaly classes in different domains is a task of great significance. Ensemble learning combines the advantages of multiple algorithms to achieve better generalization than any single algorithm; it has performed very well in traditional machine learning problems such as classification and clustering and has been shown to effectively improve the generalization ability of the underlying methods. The application of ensemble learning to anomaly detection (called anomaly ensemble) is, however, still at an early stage, mainly because of two obstacles: the absence of data labels and the extreme imbalance of data categories. Existing studies use ensemble learning to improve the generalization performance of one or more anomaly detection methods, but they treat the anomaly ensemble merely as a simple result combination problem and ignore the training phase of the detection models, which leads to limited generalization ability. To further improve the generalization performance of anomaly ensemble methods, this paper focuses on the training phase of the base detection models and conducts a systematic study from four aspects, namely ensemble data preparation, ensemble model training, ensemble model combination, and the ensemble learning framework. The main contributions of this paper are summarized as follows:

1) Selecting the most representative normal samples from the original data source as training data for each ensemble component is necessary to ensure the robustness of the algorithm. This paper proposes an ensemble-based joint training method that iteratively optimizes sample preprocessing and anomaly scoring. The method builds an optimization model covering both sample weight computation and sample abnormality evaluation: the anomaly scores produced by the latter guide the weight computation of the former (samples with high abnormal probability receive small weights), and the weighted sample set produced by the former prevents the performance degradation of the latter that would otherwise be caused by abnormal samples interfering with training. First, a prior-knowledge-based regularization term is designed for the objective function to assist the computation of sample weights. Second, an anomaly-score-based hinge loss is added to the objective so that abnormal samples obtain higher anomaly scores than normal samples. Finally, an alternating iterative method is proposed to optimize this ensemble model. Experimental results on multiple anomaly detection datasets show a substantial improvement in generalization performance over popular algorithms.
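To make the joint training idea in contribution 1 more concrete, the following is a minimal sketch of the alternating pattern it describes: fit a reference model with per-sample weights, score samples against it, and down-weight high-scoring samples before the next refit. The Mahalanobis-distance scorer, the exponential weighting rule, and the name `joint_training_sketch` are illustrative assumptions; the thesis's actual objective additionally uses a prior-knowledge regularizer and an anomaly-score-based hinge loss, which are not reproduced here.

```python
import numpy as np

def joint_training_sketch(X, n_iters=5, eps=1e-6):
    """Alternate between weighted model fitting and anomaly scoring.

    Illustrates only the 'score, then down-weight likely anomalies,
    then refit' loop; the thesis's prior-knowledge regularizer and
    anomaly-score-based hinge loss are not reproduced here.
    """
    n, d = X.shape
    w = np.ones(n) / n                                  # uniform initial sample weights
    for _ in range(n_iters):
        # (a) fit a weighted reference model: weighted mean and covariance
        mu = np.average(X, axis=0, weights=w)
        Xc = X - mu
        cov = (w[:, None] * Xc).T @ Xc / w.sum() + eps * np.eye(d)
        # (b) score every sample by squared Mahalanobis distance to the model
        inv_cov = np.linalg.inv(cov)
        scores = np.einsum('ij,jk,ik->i', Xc, inv_cov, Xc)
        # (c) samples with high anomaly scores receive small weights for the next refit
        w = np.exp(-scores / (np.median(scores) + eps))
        w /= w.sum()
    return scores, w
```

For an (n, d) array `X`, `scores, w = joint_training_sketch(X)` returns per-sample anomaly scores and the final sample weights; higher scores indicate more likely anomalies.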
2) Taking the diversity requirement into account during model training is an effective way to build a good ensemble algorithm. This paper proposes a diversity-aware sequential ensemble method that improves anomaly detection by strengthening model diversity. Ensemble diversity is divided into two parts: sample diversity and model diversity. For sample diversity, subsampling is used to generate diversity at the initial stage; for model diversity, an ensemble-based optimization model is designed to further improve the diversity of each ensemble component. In addition, an unsupervised diversity measure is proposed for the quantitative evaluation of diversity, and an anomaly pruning strategy is designed to remove pseudo-abnormal samples during training. By improving sample diversity and model diversity jointly, the base detectors achieve better diversity and accuracy, and in comparative experiments with various algorithms on multiple datasets the proposed method shows better anomaly detection results.

3) Reasonably combining the results of multiple ensemble components during model combination is a key technique for improving the final ensemble performance. This paper proposes a bi-level ensemble-based unsupervised anomaly detection method that further improves generalization performance and reduces the information loss caused by subspace sampling. The two-level combination strategy consists of two components, an internal ensemble and an external ensemble: the first level, the internal ensemble, is used to reduce information loss, and the second level, the external ensemble, is used to enhance generalization ability. In addition, a diversity loss function is designed to retrain the first-level models, and a novel weighted combination strategy is proposed to ensure the effectiveness of the second-level combination. With this two-level learning strategy, the proposed method shows varying degrees of improvement in generalization performance on both high- and low-dimensional datasets as well as on datasets with both large and small sample sizes.
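The two-level combination in contribution 3 can be illustrated with a small sketch: scores from subspace detectors within the same group are first averaged (internal ensemble, reducing subspace information loss), and the group-level scores are then combined with data-dependent weights (external ensemble). The agreement-with-consensus weighting used below is a common heuristic standing in for the thesis's own weighted combination strategy, and the function name and arguments are hypothetical.

```python
import numpy as np

def bi_level_combination_sketch(score_matrix, group_sizes):
    """Two-level score combination: inner averaging, outer weighting.

    score_matrix : (n_samples, n_detectors) scores, detectors arranged in
        consecutive groups of sizes `group_sizes`; each group's members are
        assumed to be trained on different feature subspaces.
    The consensus-agreement weighting below is a common heuristic, not the
    thesis's actual weighted combination strategy.
    """
    # z-normalise every detector so the scores are comparable
    z = (score_matrix - score_matrix.mean(axis=0)) / (score_matrix.std(axis=0) + 1e-9)

    # level 1 (internal ensemble): average subspace detectors within each group
    group_scores, start = [], 0
    for size in group_sizes:
        group_scores.append(z[:, start:start + size].mean(axis=1))
        start += size
    G = np.column_stack(group_scores)                    # (n_samples, n_groups)

    # level 2 (external ensemble): weight each group by agreement with the consensus
    consensus = G.mean(axis=1)
    weights = np.array([np.corrcoef(G[:, k], consensus)[0, 1] for k in range(G.shape[1])])
    weights = np.clip(weights, 0.0, None)
    weights /= weights.sum() + 1e-9
    return G @ weights                                   # final per-sample anomaly score
```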
4) Jointly optimizing the data preprocessing technique, the model training scheme, and the model combination strategy within one learning framework is a necessary condition for further improving the generalization performance of anomaly ensembles. This paper designs an eager-model-based unsupervised sequential ensemble framework that optimizes these three components synchronously or iteratively in a unified framework, and proposes a non-metric local-anomaly-score-based adaptive ensemble method to instantiate it. First, a Chi-square-distribution-based sampling method is used to initialize the reference model. Second, a weighted-Mahalanobis-distance-based non-metric anomaly evaluation method is proposed, in which the weighted sum of local distances over multiple feature subsets approximates the global distance and forms the final anomaly score during model training (a sketch of this scoring idea is given after the abstract). Finally, an anomaly-ranking-based adaptive combination strategy is designed to effectively combine the results of the ensemble components. Multiple comparative experiments show that the proposed method not only achieves a significant improvement in generalization performance on common static datasets but also shows promising potential on recent dynamic datasets.

In general, this paper focuses on four important components of the anomaly ensemble, namely ensemble data preparation, ensemble model training, ensemble model combination, and the ensemble learning framework; analyzes in depth the challenges and shortcomings of each component; designs a series of anomaly ensemble algorithms; proposes a general sequential ensemble framework; and achieves good anomaly detection performance. The research ideas and results of this paper provide a useful reference for future in-depth research in this field.
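As a rough illustration of the non-metric scoring idea in contribution 4, the sketch below approximates a global Mahalanobis-style distance by a weighted sum of local Mahalanobis distances computed on random feature subsets. The uniform subset weights, the plain mean/covariance reference model, and the function name are assumptions made purely for illustration; in the thesis the reference model is initialized via Chi-square-distribution-based sampling and the weights are learned.

```python
import numpy as np

def subset_mahalanobis_score_sketch(X, n_subsets=10, subset_size=None, seed=0, eps=1e-6):
    """Weighted sum of local Mahalanobis distances over random feature subsets.

    Uniform subset weights and the plain sample mean/covariance stand in for
    the learned weights and the Chi-square-initialized reference model of the
    thesis; this only illustrates the 'local distances approximate the global
    distance' idea.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    subset_size = subset_size or min(d, max(2, d // 2))
    weight = 1.0 / n_subsets                                    # uniform weight per subset

    scores = np.zeros(n)
    for _ in range(n_subsets):
        idx = rng.choice(d, size=subset_size, replace=False)    # random feature subset
        Xs = X[:, idx]
        mu = Xs.mean(axis=0)
        cov = np.cov(Xs, rowvar=False) + eps * np.eye(subset_size)
        inv_cov = np.linalg.inv(cov)
        diff = Xs - mu
        local = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)   # local squared distance
        scores += weight * local                                # weighted accumulation
    return scores
```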
【Key words】 Anomaly detection; Ensemble learning; Unsupervised learning; Joint training; Diversity; Generalization ability; Bi-level ensemble; Sequential ensemble;